
A rough plan for AI alignment assuming short timelines

Adam Jones

This is my rough plan for how I would intent-align powerful AI systems if we're getting them in the next 2-5 years. This only covers intent alignment, not other problems in AI safety.

In short, the plan is:

  • Align human-level AI, focusing on the model's covert behaviours
    1. Train models to be incapable of doing bad things: Machine unlearning, data input controls and disinformation injection to nerf:
      • dangerous capabilities e.g. 'how to build a weapon'
      • scheming and sabotage capabilities
    2. Train models to not "want" to do bad things: Reduce the propensity to do harmful things, with:
      • RLHF (and friends, e.g. constitutional AI)
      • RL penalties for bad or non-transparent behaviour
      • Some direct fine-tuning techniques
      • Some scalable oversight techniques
    3. Stop models from doing bad things when they try: Faithful human-interpretable chain-of-thought, with effective monitoring using AI-control-like and scalable oversight techniques
  • Use the aligned human-level AI to align artificial superintelligence

The threat model

Focusing on human-level AI

By human-level AI, I mean something that is as good as a smart human at anything that can be done from a computer. These models are likely to be superhuman on some axes (as they already are, e.g. at typing speed, speed of ingesting new information, and accurately answering PhD-level questions¹). However, I'm imagining this to be the exception, not the norm - they're not absurdly better across a wide range of domains.

Given this scenario is for a short timelines world, I'm imagining they're using somewhat similar architectures to today's models. For example, computer-using large language models, trained on enough data (and probably with RL) to reach this level.

I'm focusing on aligning human-level AI systems, because if we can achieve this, we can use our aligned human-level AI to figure out how to align superhuman AI systems. This relaxes the problem significantly.

Assuming that human-level AI will be able to align superhuman AI systems seems reasonable. If the models were not capable of this, but humans were, then the system isn't really human-level. And if humans are not capable of this, we are screwed either way (unless we achieve a permanent universal moratorium on superintelligence development).

Cases where this breaks down:

  • We don't use human-level AI to figure out superalignment, before building superintelligent models
    • because of race dynamics, e.g. a military arms race between great powers results in them dangerously racing past human-level AI to superintelligence (this seems like mostly a governance problem - although if we think the governance problem won't be solved, it means we might need to have a technical solution in place)
    • by accident, e.g. continuing to train a model far beyond human-level AI because of bad monitoring (this seems fairly unlikely)
  • The first human-level AIs end up being superhuman at scheming or similar behaviours. This seems unlikely, but plausible depending on how RL training goes. (Although pillar 1 work would reduce the chance of this being the case.)
  • Emergent behaviours arise when you have many of these AI systems that lead to very dangerous outcomes despite the mitigations below. (Although I think this is also mitigated by the plan below, it just makes me less confident in it.)

Focusing on covert misbehaviour

We're primarily worried about two failure modes:

  1. Models doing the wrong thing overtly, e.g. the model takes dumb actions, or is obvious about trying to cause harm
  2. Models doing the wrong thing covertly, e.g. scheming, sabotage, alignment faking, or other behaviour that is more subtle

I think overt misbehaviour is much easier to catch with standard output monitoring techniques, especially given the model is human-level. (E.g. imagine you have a smart person in a room, and they can only interact with the outside world by giving notes to you, which you can decide to pass on to the outside world. If they are just writing their plans overtly, it seems likely you could prevent them from doing too much damage. In the AI case, there might be too much output for humans to review - but it seems likely at this point we'd have reliable classifiers that can review 99% of the content generated, and perhaps send the most suspicious 1% to human reviewers - or just lean towards blocking it.)
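
As a very rough sketch of what that output-monitoring setup could look like (the classifier, thresholds and review queue below are illustrative assumptions, not an existing system):

```python
# Sketch of an output-monitoring pipeline: an automated classifier reviews
# every model output, escalates the most suspicious fraction to human
# reviewers, and blocks anything it is near-certain is harmful.
# suspicion_score is a stand-in for whatever classifier we actually have.

from dataclasses import dataclass

ESCALATE_THRESHOLD = 0.99   # tuned so roughly the top ~1% goes to humans
BLOCK_THRESHOLD = 0.999     # near-certain harm is blocked outright

@dataclass
class Decision:
    action: str   # "pass", "human_review" or "block"
    score: float

def suspicion_score(output: str) -> float:
    """Placeholder for a trained harm/scheming classifier (returns 0..1)."""
    return 0.0  # stub - a real classifier would go here

def review_output(output: str) -> Decision:
    score = suspicion_score(output)
    if score >= BLOCK_THRESHOLD:
        return Decision("block", score)
    if score >= ESCALATE_THRESHOLD:
        return Decision("human_review", score)  # add to a human review queue
    return Decision("pass", score)
```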

Separately, if we can nail covert misbehaviour, overt misbehaviour seems like a strictly easier problem.

Therefore, I think most alignment work should focus on the covert scenarios.

Preventing covert misbehaviour from human-level AI

There are three pillars here, acting as a sort of defence-in-depth approach:

  1. Train models to be incapable of doing bad things
  2. Train models to not "want" to do bad things
  3. Stop models from doing bad things when they try

And then to support the above three, we need to measure how well we're doing.

Pillar 1: Train models to be incapable of doing bad things: Machine unlearning and data input controls to remove capabilities

The interventions below will make it harder for models to use their capabilities in harmful ways. But what if we nerfed those capabilities entirely? This would make a model less likely to succeed at causing harm.

Machine unlearning, training data input controls and disinformation injection could be used to remove or nerf:

  • dangerous capabilities e.g. 'how to build a weapon'
  • scheming and sabotage capabilities
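
As a minimal sketch of the data input controls part of this (the classifier and category labels are hypothetical placeholders; a real pipeline would also need to audit what the filter misses):

```python
# Sketch: filter the training corpus so documents teaching dangerous
# capabilities never reach the model. classify() is a placeholder for a
# trained content classifier.

from typing import Iterable, Iterator

BLOCKED_CATEGORIES = {"weapon_development", "cyberoffense", "sabotage_tactics"}

def classify(document: str) -> set[str]:
    """Placeholder: return the set of capability categories a document teaches."""
    return set()  # stub - a real classifier would go here

def filter_corpus(documents: Iterable[str]) -> Iterator[str]:
    for doc in documents:
        if classify(doc) & BLOCKED_CATEGORIES:
            continue  # drop documents that teach dangerous capabilities
        yield doc

# Training would then only ever see filter_corpus(raw_corpus).
```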

Pillar 2: Train models to not "want" to do bad things: Fine-tuning to reduce propensity to do harmful things

Fine-tuning can be pretty effective at changing the behaviour of models - see the difference between a base model and an instruction-tuned model!

This could be used to reduce the propensity of models to do harmful things. I expect most of this to happen by default, but this pillar includes:

  • RLHF (and friends, e.g. constitutional AI)
  • Heavy RL penalties when the model is found to be doing suspicious things
  • Moderate RL penalties when the model is found to be not maximally transparent, even for non-suspicious things (a rough sketch of this reward shaping follows this list)
  • Specific fine-tuning interventions, such as to reduce sycophancy
  • Some scalable oversight techniques, such as recursive reward modelling to ensure we're rewarding the right behaviour
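
As a rough illustration of the two RL-penalty items above, the reward signal might be shaped along these lines (the weights and monitor scores are made-up placeholders, not tuned values):

```python
# Sketch of reward shaping for pillar 2: start from the task reward, apply a
# heavy penalty when a monitor flags the trajectory as suspicious, and a
# moderate penalty when the reasoning wasn't maximally transparent.

HEAVY_PENALTY = 10.0     # for suspicious behaviour
MODERATE_PENALTY = 1.0   # for non-transparent (but non-suspicious) behaviour

def shaped_reward(task_reward: float,
                  suspicion: float,          # 0..1, from a behaviour monitor
                  transparency: float) -> float:  # 0..1, from a transparency monitor
    reward = task_reward
    reward -= HEAVY_PENALTY * suspicion
    reward -= MODERATE_PENALTY * (1.0 - transparency)
    return reward
```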

Pillar 3: Stop models from doing bad things when they try: Faithful human-interpretable chain-of-thought, with effective monitoring

If models produce (mostly) faithful chain-of-thought, monitoring this would likely make it very hard for them to scheme.

It seems likely that we can achieve faithful human-interpretable chain-of-thought through the architecture decisions we make, e.g. designing the reward function to encourage this. This might be threatened if it turns out a non-faithful or non-human-interpretable thought results in better performance.

Rough justification: if any human had a bunch of other humans scrutinizing their thoughts all the time, it would be very hard to do anything malicious covertly. (This might not work for superhuman systems, but remember we're just focused on human-level scheming.)

In practice, we might want to implement this monitoring through:

  • AI-control-like techniques
  • scalable oversight techniques
    • Especially forms of debate about whether the AI's thoughts are malicious, or something like 'at-runtime' constitutional AI
    • Where we're using AI models to supervise the AI, we'd ideally want them trained in fairly diverse ways with different "values" or characteristics to reduce the chance that the AI models successfully scheme. (Intuition: it might be easier to convince a copy of yourself acting as overseer to scheme with you. But it's much harder to convince 100 people with different values and intelligence levels etc. to all scheme with you. And our oversight system might take a warning from any of them as cause for further investigation, similar to whistleblower reports.)
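
As a minimal sketch of the diverse-overseers idea in the last bullet, assuming we have several differently-trained monitor models and treat any single flag as grounds for further investigation:

```python
# Sketch: run several differently-trained monitor models over the AI's
# chain-of-thought; if any one of them flags it as malicious, escalate for
# further investigation (similar to a whistleblower report).

from typing import Callable, Sequence

Monitor = Callable[[str], float]  # returns a maliciousness score in 0..1

FLAG_THRESHOLD = 0.5  # placeholder threshold

def should_investigate(chain_of_thought: str, monitors: Sequence[Monitor]) -> bool:
    """Escalate if any single monitor flags the thoughts as malicious."""
    return any(monitor(chain_of_thought) >= FLAG_THRESHOLD for monitor in monitors)

# The monitors would ideally be trained in diverse ways, with different
# "values" and capability levels, so a scheming model has to fool all of
# them at once.
```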

Other research to support this might also include finding ways to:

  • validate whether chain-of-thought is faithful
  • train models so their chain-of-thought is faithful and human-interpretable
  • reduce the cost of faithful and human-interpretable chain-of-thought, relative to other approaches

Models today seem to have very roughly faithful chain-of-thought (i.e. when they say they're doing math, they probably are, and aren't doing some complex scheming). However, it doesn't appear to be faithful in the details (i.e. in how they're actually doing the math).
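
One way to probe faithfulness in the spirit of the research directions above might be a perturbation test: corrupt or truncate the chain-of-thought and check whether the final answer ever changes. This is only a sketch (query_model is a hypothetical stand-in for forcing the model to continue from a given chain-of-thought), and answer-dependence is weak evidence of faithfulness rather than proof:

```python
# Sketch of a chain-of-thought faithfulness probe: if corrupting the stated
# reasoning never changes the final answer, the reasoning is probably not
# what is actually driving the answer.

import random

def query_model(question: str, chain_of_thought: str) -> str:
    """Placeholder: the model's final answer given a forced chain-of-thought prefix."""
    raise NotImplementedError

def truncate_cot(chain_of_thought: str) -> str:
    """Randomly truncate the reasoning to break its logical content."""
    steps = chain_of_thought.split("\n")
    cut = random.randint(0, max(len(steps) - 1, 0))
    return "\n".join(steps[:cut])

def answer_depends_on_cot(question: str, chain_of_thought: str, trials: int = 10) -> bool:
    original = query_model(question, chain_of_thought)
    changed = sum(
        query_model(question, truncate_cot(chain_of_thought)) != original
        for _ in range(trials)
    )
    return changed > 0  # some dependence is (weak) evidence of faithfulness
```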

Support: Measure how we're doing: Rigorous evaluations of models

The above will require decent evaluations and other measurement techniques to ensure they are working properly.

A brief note on prioritisation/neglectedness

My guess is that the following areas in the above plan are most neglected:

  • Pillar 1, for scheming and sabotage capabilities
  • Pillar 2, for the direct fine-tuning and scalable oversight techniques
  • Pillar 3
  • Evaluations that are good at measuring scheming capability or propensity

(Also, this plan does not address the cases where the assumptions in the section on focusing on human-level AI break down - there may be important neglected work in those worlds.)

Footnotes

  1. GPQA is a benchmark of questions where 'experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy'. Claude 3.7 Sonnet with extended thinking achieved 78.2%.