
AI as a corporation (or, an intro to AI safety?)

Adam Jones

People argue whether transformative AI will be like a corporation. There are good reasons why AI won’t be like a corporation. However, I think a corporation can at least act as a lower bound for how hard it might be to deal with. This can help people develop intuitions about transformative AI, and understand why others are worried about it.

Assuming no prior knowledge about AI, this article explains how we might get transformative AI, and how thinking about corporations can help us imagine what this could look like.

How do today’s AI systems work?

Machine learning is the technique behind most AI systems. This is where systems take in inputs and ideal outputs, and learn the patterns between them.1 Traditionally this has been for simple tasks like predicting house prices:
| Bedrooms | Bathrooms | Square Metres | Postcode | House Price (£) |
|---|---|---|---|---|
| 1 | 1 | 45 | E14 7HS | 375,000 |
| 2 | 1 | 65 | SW11 6QU | 550,000 |
| 2 | 2 | 75 | N1 7GU | 725,000 |
| 3 | 1 | 85 | SE1 4YB | 850,000 |
| 4 | 2 | 120 | NW3 5QY | 1,750,000 |

The system can then be used to predict more house prices, given some similar inputs. It's effectively learnt the mapping from bedrooms, bathrooms, area, and postcode → house price:

| Bedrooms | Bathrooms | Square Metres | Postcode | House Price (£) |
|---|---|---|---|---|
| 2 | 1 | 55 | SE1 7QP | ??? |
| 1 | 1 | 40 | E2 9RY | ??? |
| 3 | 1 | 80 | SW6 4JN | ??? |
| 4 | 3 | 140 | W11 2BQ | ??? |
| 2 | 2 | 70 | N5 1LP | ??? |
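
To make this concrete, here’s a minimal sketch of the same idea in Python using scikit-learn and the made-up figures from the tables above (the postcode is dropped to keep the encoding simple; real systems would use richer features and models):

```python
from sklearn.linear_model import LinearRegression

# Training data from the first table (postcode omitted for simplicity):
# [bedrooms, bathrooms, square metres] -> house price (£)
X_train = [
    [1, 1, 45],
    [2, 1, 65],
    [2, 2, 75],
    [3, 1, 85],
    [4, 2, 120],
]
y_train = [375_000, 550_000, 725_000, 850_000, 1_750_000]

# 'Learn the patterns' between the inputs and ideal outputs
model = LinearRegression()
model.fit(X_train, y_train)

# Predict prices for the unseen houses in the second table
X_new = [
    [2, 1, 55],
    [1, 1, 40],
    [3, 1, 80],
    [4, 3, 140],
    [2, 2, 70],
]
for features, price in zip(X_new, model.predict(X_new)):
    print(features, f"≈ £{price:,.0f}")
```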

Of course, this applies in other domains:

  • Fraud prevention: transaction, customer, and merchant details → likelihood of being fraudulent
  • Spam filtering: email subject, sender, and content → spam likelihood
  • Image recognition: the value of each pixel → image class
We train language models like ChatGPT in a similar way, but in this case we predict the next word2 of internet text:
| Word 1 | Word 2 | Word 3 | Word 4 | Word 5 |
|---|---|---|---|---|
| People | argue | whether | transformative | AI |
| There | are | good | reasons | why |
| AI | won't | be | like | a |
| However | I | think | a | corporation |
| can | at | least | act | as |

We can then repeatedly predict the next word, and the AI effectively ‘writes’ convincing internet text. With some tuning, we can get it to answer more like a helpful assistant.
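
As a rough illustration of this ‘repeatedly predict the next word’ loop, here’s a sketch using the small open-source GPT-2 model via the Hugging Face transformers library, standing in for the much larger models behind products like ChatGPT:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "There are good reasons why"
for _ in range(10):
    # Predict a probability distribution over the next token...
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits
    # ...then take the most likely token and append it to the text
    next_id = logits[0, -1].argmax().item()
    text += tokenizer.decode([next_id])

print(text)  # GPT-2's continuation of the prompt
```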

We’ve been doing this for a long time: it’s the same task as the text autocomplete on your phone. In the last few years, we discovered algorithms that allow us to do this more accurately and increase the context window (the maximum number of words the model can take as input). These include transformers and the attention mechanism.

This has got us to accurate prediction machines that are superhuman at predicting internet text. However, imitating internet text results in real-world mistakes, so being superhuman at this isn’t the same as being ready to take people's jobs.

What is transformative AI?

Transformative AI would have a transformative effect on society. This is a little vague, but usually the kind of transformative effect people mean is huge: far greater than even computers or the internet. It’s often compared to the transformative effect of the industrial revolution. This makes it very hard to predict specific outcomes. Most visions of transformative AI include AI becoming almost every country’s top issue, mass unemployment across many sectors of the economy, and significantly increasing war-making capacity.

This isn’t science fiction: the people researching, building and regulating these systems seriously believe we might develop transformative AI. Most of them also believe it could go quite wrong.

OpenAI, the creators of ChatGPT, state in their charter that they ‘will attempt to directly build safe and beneficial AGI’, which they define as ‘highly autonomous systems that outperform humans at most economically valuable work’.

Anthropic are the creators of Claude, the main competitor to OpenAI’s ChatGPT. Their core views on AI safety are that they ‘believe the impact of AI might be comparable to that of the industrial and scientific revolutions, but we aren’t confident it will go well.’ They say most staff are ‘increasingly convinced that rapid AI progress will continue’, and that ‘rapid AI progress would be very disruptive, changing employment, macroeconomics, and power structures both within and between nations. These disruptions could be catastrophic in their own right, and they could also make it more difficult to build AI systems in careful, thoughtful ways, leading to further chaos and even more problems with AI.’

How might we get to transformative AI from current systems?

There are many ways we could develop transformative AI. I’ll cover a basic route here, although it’s hard to forecast precise technological developments - it seems most likely that transformative AI will be developed by combining this method with other bells, whistles and algorithmic efficiency improvements.

We could get transformative AI by simply scaling up existing systems with more high-quality data.3 To do this we might:
  1. Hire thousands of human experts. These are the top 90th-percentile accountants, biologists, chemists, doctors, economists, finance experts, historians, mathematicians, physicists, psychologists and technical writers.
  2. Get them to do tasks using their skills. This wouldn’t be limited to just text-to-text like most current models. It might include reviewing an image or video, or outputting actions like clicking buttons on a computer.
  3. Train AI models to predict the next thing an expert would do or say, i.e. a model that learns: what the expert has just seen → expert action (see the sketch after this list). This is similar to our language model predictors, but instead of predicting internet text, we’ll predict expert behaviour.
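
To illustrate step 3, the training examples might look something like the sketch below. The field names and contents are hypothetical illustrations, not any particular company’s format:

```python
# Hypothetical training examples: 'what the expert has just seen' -> 'expert action'.
# The structure mirrors the (input, ideal output) pairs used for house prices above.
expert_examples = [
    {
        "expert": "accountant",
        "context": "Client's Q3 ledger shows a £12,400 discrepancy in accounts receivable...",
        "action": "Flag the three unmatched invoices and draft a reconciliation note.",
    },
    {
        "expert": "software engineer",
        "context": "Screenshot of a failing CI run; the stack trace points to a missing null check...",
        "action": "Open the failing job, inspect the diff, and add a guard clause before the call.",
    },
]

# A model would then be trained so that, given the 'context', it predicts the 'action':
# the same next-thing prediction as before, but over expert behaviour rather than internet text.
```
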
This is already happening - Outlier4 has hired over 100,000 staff across these domains, many at 6-figure salaries5, just to spend time providing or reviewing training data for AI systems. That means this single company employs more people to create AI training data than the populations of some UN-recognised countries.

The result might be that we soon get systems which are excellent at predicting what a 90th-percentile accountant, doctor, or software engineer would do at any one moment given some context. In reality, it’d be far better than the 90th-percentile expert: in practice it looks more like a 90th-percentile expert with a reviewer over their shoulder, plus unlimited and near-instant access to the internet and every relevant textbook.

If you don’t already see how this would be wildly transformative if we didn’t control it, that brings us to our corporation analogy…

AI as a corporation

In short, following the (fairly straightforward) steps above, you could create a very powerful AI. You might imagine this as a corporation that is:

  • staffed with 90th-percentile experts in almost every domain imaginable
    • who are all perfectly coordinated
    • who work 24/7 without getting tired or sloppy
    • who operate a hundred times faster than the fastest human
  • able to hire more experts in seconds, for pennies (by making more copies of itself)
  • coordinating with other AIs far faster than any human corporations could coordinate (because computer-to-computer communication is far faster than human-to-human communication)
  • practically inscrutable, as we don’t understand the inner workings of AI systems6

Of course, this could be excellent. Having cheap access to focused expertise could allow for faster drug discovery, designing better public infrastructure, or even creating a version of Microsoft Teams that isn’t terrible.

However, this could also go quite badly... I’ll explain how this could manifest.

More things go wrong

Corporations are made of people, who usually have some in-built moral sense to avoid wrongdoing. Even those without a moral compass usually fear punishment, such as being jailed. Corporations are also more likely to get caught: whether that be through whistleblowing or a slip-up that reveals the wrongdoing.

AI systems do not share the same properties. While AIs can learn moral rules that we teach them, they don’t do so in the same ways that humans do and are inconsistent in applying them. They can’t be ‘jailed’ or punished in the same sense as humans (although they could potentially be fined, retrained, or turned off: which might discourage them from obvious bad behaviour). They may be much smarter at avoiding getting caught, or being punished for it: you can imagine how difficult it is to prosecute something that has unlimited access to 90th-percentile lawyers, paired with every expert witness imaginable.7 This makes these systems much more incentivised to behave deceptively than corporations (who already aren’t always beacons of ethical behaviour).

It’s harder to figure out when things go wrong

We often find out about corporate wrongdoing because:

  • whistleblowers speak up
  • internal or external auditors spot problems
  • employees slip up, leaving incriminating emails or leaked documents that reveal the wrongdoing

However, none of these mechanisms are likely to protect us from transformative AI systems. There’s no real equivalent to ‘whistleblowers’ in an AI system.8 Other methods tend to rely on humans being able to understand what’s going on, but this is difficult given:
  • Humans aren’t able to understand the logic today's AI systems are using, and future systems are likely to be more complex.6
  • Auditors are bad at reviewing technical systems, even where we can inspect the logic completely. For example, the British Post Office’s Horizon system was reviewed by internal and external auditors, and yet problems remained in the system (and the wider corporation), leading to 900 false prosecutions, hundreds of people falsely imprisoned and four suicides.
  • Even if we ignore the internal logic and just review outputs, AI systems have already been seen to unintentionally create outputs that deceive humans or encode hidden signals for other AI systems. It’s also hard to understand some good outputs, like AlphaGo’s move 37 which expert Go players initially thought was strange or even a mistake but ended up winning the game.
  • Even if they’re not producing deceptive or manipulated outputs, AI systems operate much more quickly than human corporations. This means the volume of output to audit would be much higher, making the task much harder.

When things go wrong, they can go wrong much more severely

Corporations are made of individual people. Hiring people is costly and slow, and people usually only have deep expertise in at most 1 or 2 domains. There are significant information bottlenecks which lead to large corporations becoming less efficient. Despite this, these corporations can often be hard to regulate or control.

A transformative AI would be more skilled9, coordinated and fast-moving than any human corporation. It would have the knowledge and skills across almost every domain humans do, and instantly look up information from the internet and books. This greater capability means a transformative AI system could cause far more harm than a human corporation.

Additionally, a transformative AI system could process inputs much faster, keep much more information in its working memory, and produce outputs millions of times faster than humans. This speed (combined with it being harder to figure out when things go wrong) might mean it can do more damage than a human corporation could before being stopped.

It’s harder to learn from failures

When corporations engage in wrongdoing, we often learn about what happened through internal communications, employee testimonies, or leaked documents. This allows us to piece together what went wrong and why - and prevent it in future.

AI systems don't have water cooler conversations or send incriminating emails. Their decision-making processes are often opaque, hidden within complex neural networks that even their creators struggle to interpret.6 This lack of transparency makes it incredibly difficult to understand the root causes of failures or to implement effective corrective measures. Even if we were able to identify root causes, we might not learn from them. The UK and US don’t yet have any institutions that are building useful institutional learning for regulating AI systems.10 In the EU, the EU AI Office offers some hope.

And again, speed is a key factor. By the time we detect a failure, the consequences may have already cascaded far beyond our ability to track them. As the world becomes much more complex we might not even know what went wrong, let alone how to prevent it in the future. Paul Christiano paints a possible future here in ‘What failure looks like’:

For a while we will be able to overcome these problems by recognizing them, improving the proxies, and imposing ad-hoc restrictions that avoid manipulation or abuse. But as the system becomes more complex, that job itself becomes too challenging for human reasoning to solve directly [...]

As this world goes off the rails, there may not be any discrete point where consensus recognizes that things have gone off the rails.

Amongst the broader population, many folk already have a vague picture of the overall trajectory of the world and a vague sense that something has gone wrong. There may be significant populist pushes for reform, but in general these won’t be well-directed.

[...] Human reasoning gradually stops being able to compete with sophisticated, systematized manipulation and deception which is continuously improving by trial and error; human control over levers of power gradually becomes less and less effective; we ultimately lose any real ability to influence our society’s trajectory.

‘More’ misaligned systems could be much worse

The above has assumed somewhat ‘limited’ misalignment between the AI’s goals and the interests of society. Perhaps there’s a bunch of corporate fraud, which regulators and investors can’t figure out - it’s very bad, but it (might) not be the end of the world.

However, AI systems could also pursue much more obviously misaligned goals, or goals in a much more dangerous way. For example, an AI aiming to increase a stock price might behave well at first while it gathers resources. Later, once it thinks it has enough resources to overthrow humans, it could suddenly switch to trying to eliminate humans so it has complete control over the stock price. This switch in behaviour is known as a treacherous turn, and has been demonstrated in toy environments.

Companies don’t currently have safety techniques that they expect will solve this as we continue to scale up AI systems. This means even the most well-intentioned AI companies may not be able to control the systems they create.

Misuse offers another route to disaster

Finally, other AI users may not have such good intentions. For example, should North Korea gain access to this technology, they would effectively have unlimited access to expert nuclear physicists, engineers and military strategists. This could massively boost their efforts to create or threaten others with nuclear weapons.

It’s fairly unlikely North Korea could train the best models in the world (at least today). This would require far more data3 than North Korea is likely to have.

However, it seems very plausible that they could steal the best models given that North Korea (and China, their closest ally) have strong offensive cyber programs. OpenAI has already been subject to multiple security breaches, including one in which an attacker gained access to some of the firm’s internal systems. Meta’s original Llama model was inadvertently leaked on 4chan. Ex-OpenAI staff have warned ‘in the next 12-24 months, we will leak key AGI breakthroughs to the CCP. It will be the national security establishment’s single greatest regret before the decade is out.’

Doing something about this

To summarise, transformative AI systems could be built soon, and act like some of the world’s worst corporations - only far faster, with much greater skill, and with far more resources. This could result in more wrongdoing that is harder to detect, has more serious impacts, and is harder to learn from. This is before touching on the potential for greater misalignment or the intentional misuse of AI systems. And we don’t currently have great tools to make these systems safer.

The one silver lining is that there are a lot of ways to help. If you’re interested in helping, do consider applying for the AI Safety Fundamentals courses I help run. We accept people from a wide range of backgrounds (including those who haven’t studied computer science or similar), given the wide-ranging nature of the problem!

If that’s too much commitment, do consider learning more about the problem even if informally. AISafety.com lists a number of other resources, events and communities. Robert Miles’ YouTube channel is another good place to start.

Footnotes

  1. Understanding machine learning systems as fundamentally predictors based on past (input, output) pairs is sufficient for understanding the rest of the story here.

    If you want to learn how neural networks find patterns in these pairs, see 3Blue1Brown’s excellent series on neural networks, particularly the first four videos.

  2. This is accurate, although there are some transformation steps in the process:

    • Tokenization: Rather than inputting words directly, we usually turn them into tokens from a fixed vocabulary. This might mean less common words are broken into multiple tokens, like ‘bor-og-oves’. You can play with a tokenizer to see how sentences are broken up.
    • Vectorization: We then map this fixed set of tokens to vectors, as numbers are much easier for AI systems to process. These vectors are also known as word embeddings. For example, the token ‘bor’ might become a vector like [0.432, 0.199, 0.761, 0.0164]. Each token has a unique vector, and usually these vectors convey some semantic meaning. For example if you take V(token) to mean the vector for that token, then V(‘king’) + V(‘woman’) - V(‘man’) ≈ V(‘queen’).
    • The model then predicts the next token in the sequence by outputting a vector. The vectorization and tokenization steps are then reversed to turn this output back into new words.
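
    For example, here’s a small sketch of tokenization and vectorization using GPT-2’s tokenizer and embedding matrix from the Hugging Face transformers library (chosen purely for illustration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tokenization: less common words get broken into multiple tokens
ids = tokenizer("mimsy were the borogoves")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))

# Vectorization: each token id maps to a vector (a word embedding)
embeddings = model.get_input_embeddings()
print(embeddings.weight[ids[0]][:4])  # first few numbers of one token's vector
```
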
  3. And likely lots of compute to process that data. I’ve previously written an explainer of compute, compute governance and a body to track compute.

  4. Who are part of Scale AI, and provide training data for companies including OpenAI, Anthropic, Meta and Google.

  5. Many of their postings offer up to $50/hour, which is $104k if you work 40-hour weeks. From friends who have worked for Outlier, it seems like most of them are paid at this rate (so the ‘up to’ claim is not some cop out).

  6. In some sense, we understand what the inner workings of AI systems do: they do lots of matrix multiplication and manipulation. But this is like saying we understand what humans do by saying they release neurotransmitters inside their brains.

    There is some work to understand what these matrix multiplications mean, but it is challenging and still fairly primitive. For example, mechanistic interpretability attempts to break down a neural network’s activations into concepts. This work found that concepts are quite jumbled up in the network’s representation (formally, features are in superposition, and there are many polysemantic neurons). Recent papers have worked on disentangling different concepts using other neural networks, called sparse autoencoders (SAEs). However, experts would generally agree that we’re still a long way off understanding many of the concepts (features) neural networks seem to rely on, and the logic (circuits) that relate different concepts.

  7. Of course, this balance changes if regulators or prosecutors also have equal access to powerful AI systems - provided they can control them.

  8. And the companies developing AI don’t seem to be fans of whistleblowing.

  9. Given our assumption of the simplest version of transformative AI: an AI system that is as smart as a collective group of expert humans, because we’ve trained it to imitate those humans. This has superhuman intelligence in terms of breadth: no one human could be that good at so many things.

    Other training methods (particularly self-play) could achieve intelligence far beyond even the smartest, most coordinated group of humans - enabling superhuman breadth and depth of intelligence. This is how we’ve achieved superhuman performance in fields like the board game Go.

  10. Specifically, they have no regulatory institutions for AI like those they have for financial markets, such as the FCA or the SEC.

    They do have departments whose remit AI harms could fall under, even if it’s not their focus. For example, DSIT in the UK - although this is a policymaking department, not a regulator. Neither the UK nor the US AI safety institute is a regulator: they are research institutes that aim to understand (and not regulate) AI.