
Why AI systems should get more things wrong


Adam Jones

TLDR: Human-in-the-loop systems only work if humans are genuinely evaluating system recommendations. For powerful systems with high recommendation acceptance rates, we should intentionally insert bad recommendations to both check and strengthen oversight systems. There’s lots of work that can be done on this right now.

I previously wrote about human-in-the-loop systems: what they are, what they solve, and what they don’t.

In short, human-in-the-loop systems have humans oversee AI recommendations and make the final decision. This can help well-intentioned actors build safer systems, provided the human reviewer is able to oversee decisions effectively.

One challenge to effective oversight is that, as AI systems improve, almost all recommendations will be approved. Reviewers will then likely become complacent and just start approving everything. For example, an operative who carried out human reviews for AI-suggested military strikes explained: “At first, we did checks to ensure that the machine didn’t get confused. But at some point we relied on the automatic system [...] I would invest 20 seconds for each target at this stage, and do dozens of them every day. I had zero added value as a human, apart from being a stamp of approval.”

How can we fix this?

Mandating a minimum percentage of rejected decisions would reduce reviewer complacency. If humans are used to seeing and rejecting mistakes, they’re more likely to review recommendations properly. Determining the exact percentage necessary would be useful further work: I’d imagine it sits somewhere around 5-10%. The right percentage might also differ between types of decisions, and there are other methods[1] that could work instead.

Injecting intentionally bad recommendations could achieve this percentage. These are recommendations we know to be bad, and that the human reviewer should reject. They could be used to ‘top up’ the decisions to review so that they meet the minimum percentage of rejected decisions. This keeps human reviewers in the habit of rejecting recommendations, so they are less likely to become complacent.
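
To make this concrete, here's a minimal sketch (in Python) of how a review queue could top itself up with known-bad recommendations whenever the recent rejection rate falls below the floor. Everything here is hypothetical - the names (`draw_known_bad_recommendation`, `next_item_for_review`), the 5% floor, and the injection logic are illustrations rather than a calibrated design:

```python
import random

MIN_REJECTION_RATE = 0.05  # e.g. the 5-10% floor discussed above


def draw_known_bad_recommendation() -> dict:
    """Stub: in practice this would pull from a pool of pre-vetted,
    realistic bad recommendations."""
    return {"action": "example known-bad recommendation"}


def rejection_rate(recent_reviews: list[bool]) -> float:
    """Fraction of recent reviews that ended in rejection (True = rejected)."""
    if not recent_reviews:
        return 0.0
    return sum(recent_reviews) / len(recent_reviews)


def next_item_for_review(ai_recommendation: dict, recent_reviews: list[bool]) -> dict:
    """Return the next item to show the reviewer, injecting a known-bad
    recommendation when the observed rejection rate drops below the floor."""
    shortfall = MIN_REJECTION_RATE - rejection_rate(recent_reviews)
    if shortfall > 0 and random.random() < shortfall:
        # Flagged so an accidental approval is never actioned (see below).
        return {"payload": draw_known_bad_recommendation(), "injected": True}
    return {"payload": ai_recommendation, "injected": False}
```

The `injected` flag is what later lets the system guarantee that an approved bad recommendation is never actually carried out.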

Bad recommendations that get approved highlight specific areas for improvement. Patterns in which bad recommendations are approved can point to problems with the review process, policies that need clarification, or reviewers who might need more training.

Bad recommendations can help verify the efficacy of human oversight. For example, organisations could publish the percentage of bad recommendations that are caught by human reviewers. This could show others that their review systems work, without divulging the sensitive contents of these systems.[2] Doing this well would likely require standards around how to measure and present this information - but even if it’s not possible to directly compare systems, it should at least be possible to show whether a specific oversight system is improving or worsening.

Bad recommendations can be used to measure the effect of changes, which enables better research into improving oversight systems. Being able to track the percentage of bad recommendations caught allows for experimenting with ways to make our systems safer. For example, the effect of changes to user interfaces, safety culture initiatives, or additional training could all be measured.
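
As an illustration, here's a sketch of how the catch rate for injected bad recommendations could be tracked and compared before and after a change, using a standard two-proportion z-test. The figures are invented purely for the example, and whether this particular test is appropriate would depend on the experimental setup:

```python
from math import erf, sqrt


def catch_rate(caught: int, injected: int) -> float:
    """Share of injected bad recommendations that reviewers correctly rejected."""
    return caught / injected


def two_proportion_p_value(caught_a: int, n_a: int, caught_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in catch rates between two periods."""
    p_a, p_b = caught_a / n_a, caught_b / n_b
    pooled = (caught_a + caught_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Convert |z| to a two-sided p-value via the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))


# Invented figures: 400 injections before a user interface change, 400 after.
before = {"caught": 310, "injected": 400}
after = {"caught": 352, "injected": 400}

print(f"catch rate before: {catch_rate(**before):.0%}, after: {catch_rate(**after):.0%}")
print(f"p-value for the difference: {two_proportion_p_value(310, 400, 352, 400):.3f}")
```

The point is not the particular statistical test, but that injected bad recommendations give you a number you can track and compare over time.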

Bad recommendations should be realistic. They should closely mimic the types of errors or problematic outputs the AI system might genuinely produce. This ensures that human reviewers are able to catch the kinds of mistakes they're likely to encounter in real scenarios. Obviously-fake recommendations would defeat the purpose, as they’d be too easy to spot: in some sense, it should be worrying if your staff can catch 100% of bad recommendations.

Bad recommendations should never be actioned. To be clear: if a reviewer makes a mistake and approves a known bad decision, the system should not actually carry out that bad plan.

Won’t this create a bunch of extra work?

This is an objection I’ve heard to this idea. There are usually two scenarios:

The case where oversight is being done well: Once deployed, it will be 5-10% extra work for the reviewers (and less if there are more true rejections). This is a small price to pay for being able to continuously improve the oversight measures, and to verify and demonstrate the effectiveness of your oversight system. There is also some work to develop such systems, for example updating the software to generate and inject these bad examples. Making this easy is valuable future work.

The case where oversight is being done poorly: This would require significant effort to improve the system so that human review is effective. In this case, that is kind of the point. Once improved, we’re back to the case where oversight is being done well.

One neat thing is that it’s difficult for organisations to object that this proposal creates extra work beyond the percentage of cases injected - because if it does, that presumably means their existing human oversight systems are just not good.

Won’t this be stressful for the operatives?

Having a system that is in some sense testing operatives might add a little stress to their jobs.

However, some stress about getting important things wrong is appropriate. Operatives probably should worry that they might be bombing innocent people, and try very hard to avoid this. If the worry of being called out by the system is greater than this, then maybe something has gone wrong.

Additionally, how stressful this is depends heavily on how organisations handle people approving bad recommendations. In line with best practices from fields like aviation and nuclear power, mistakes should be treated as opportunities for system-wide learning and improvement.

What can you do to help?

If you thought the above was a good idea, these are things that you (yes you!) could try and help with.

Come up with a name for this thing: This approach to improving human-in-the-loop systems needs a descriptive name to aid in its adoption and discussion. A good name would capture the essence of intentionally inserting bad recommendations to strengthen oversight.[3] Some ideas I’ve come up with:
  • intentionally bad AI outputs to test oversight systems (probably the best, but very verbose)
  • AI oversight challenge injection
  • AI oversight complacency prevention
  • red-teaming AI oversight systems (a bit vague, and maybe some confusion with general AI red-teaming)

Test it out empirically: As someone once said, “In theory, theory and practice are the same. In practice, they are not.” You could design and run an experiment comparing the effectiveness of human reviewers with and without injected bad recommendations. The domain and the kinds of people involved also seem likely to affect the results, so understanding when it works and how it can be improved would be very useful. Long-term studies could also assess how this method affects reviewer complacency and system safety over time.

Figure out how we can easily create convincing bad recommendations: Developing methods to generate realistic, challenging bad recommendations is crucial for this approach's success. This research could explore:

  • using machine learning (particularly GANs) to create convincing errors
  • developing statistical techniques to check that injected errors are not too obvious
  • analysing patterns in real AI mistakes, so we can replicate them
  • developing a toolkit for domain experts to craft plausible incorrect outputs
  • identifying best practices for red teams creating bad recommendations

Understand the risk of human complacency: Psychology research into the factors that lead to reviewer complacency is likely to help design effective countermeasures. Researchers could investigate how factors like the true negative rate, time pressure, repetition, and trust in AI systems affect reviewer complacency. Key outputs could include what the minimum percentage of rejected decisions should be, and what other features of the decision process are necessary to ensure good oversight.

Make it easy to adopt: The easier this proposal is for people to adopt, the more likely they are to adopt it voluntarily, and the less they will fight regulations that attempt to enforce it. Work here could include:

  • writing easy-to-follow, evidence-based guides to implementing effective human oversight
  • building open-source tools that make implementing this easy, or at least provide a simple reference implementation to copy
  • implementing this at your company, and then giving talks that explain how you did it to others who can replicate it at their organisations

Advocate for its adoption: Concretely this might involve:

  • writing this up as a directly implementable standard
  • collaborating with the AI Standards Lab or standard-setting organisations to get this adopted into common AI standards
  • putting pressure on organisations using human oversight to implement this voluntarily

There are likely many other things you could do to help with this, and there are many other neglected proposals in AI governance. Go be a person who actually does things.

Footnotes

  1. For example, in low-risk cases, filtering for riskier decisions might be appropriate. This is only suitable for decisions where full human-in-the-loop review is not necessary, given that under this model some decisions will be acted upon without any human oversight.

    Here, a model could predict the likelihood a human reviewer would reject the recommendation. If this is above a certain threshold, it is then reviewed by a human.

    This is similar to how many fraud models in finance work: they approve transactions below some risk threshold, and require different levels of review for those above the threshold (where you might need to complete 2-factor authentication, or call your bank).
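
    As a sketch of what that routing could look like (with `predict_rejection_probability` standing in for whatever model estimates how likely a reviewer is to reject the recommendation, and the threshold picked arbitrarily):

    ```python
    REVIEW_THRESHOLD = 0.05  # hypothetical; would be tuned per decision type


    def predict_rejection_probability(recommendation: dict) -> float:
        """Stub: a real system would use a trained model here."""
        return recommendation.get("estimated_risk", 0.0)


    def route(recommendation: dict) -> str:
        """Send likely-to-be-rejected recommendations to a human; auto-approve the rest."""
        if predict_rejection_probability(recommendation) >= REVIEW_THRESHOLD:
            return "human_review"
        return "auto_approve"
    ```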

  2. This is particularly important in the national security space. I could imagine seeing these figures in an IPCO annual report.

  3. If someone else has come up with this idea and given it a name, let me know! I did try searching online, asking a few AI assistants, and asking a few friends in AI governance, and none seemed to have heard of it.