300,000 people are directly creating training data for AI

Adam Jones

This is my personal blog, where I write solely in my personal capacity. It does not represent the positions of any organisations I'm associated with.

Also, everything here is provided "as is", without warranties of any kind, express or implied.

21 April 2025

As of April 2025, at least 300,000 people work directly on creating training data for AI systems. This is roughly comparable to the population of a small island nation: somewhere between Barbados and Vanuatu.

The true number is likely higher, as it seems like many of these statistics might be a year out of date. My guess would be the true number is something like 400,000 to 500,000: between Iceland and Malta.

Hanging out at a Barbados beach bar sounds more fun than data annotation to me, but more people do the latter. Image by Unionville, under CC-0.

This is the total number of contractors, many of whom will be part-time. I have not attempted to evaluate the full-time equivalents of these.

Providers

Scale AI

Scale AI, which provides training data for OpenAI, Anthropic, Google DeepMind and Meta¹, has hired people under the Remotasks and Outlier brands. According to their websites, they have hired:

Remotasks: “240,000+ total taskers”
Outlier: “40,000 experts”

These numbers do seem accurate, based on:

Conversations I’ve had with some people who contract for Scale. They said a few Slack groups² they were added to had hundreds of thousands of members (though these have since been moved to Discourse).
A June 2024 article by The Information said “About 300,000 [taskers] take assignments through a Slack group run by Outlier, a Scale subsidiary”. The article suggests this might also be from inside sources.
The unofficial Outlier subreddit has 44k members. This suggests Outlier has a lot more than 40,000 contractors, given that not all of them will have joined the subreddit.

All these estimates are several months old, and the websites have not been updated for a while (based on Internet Archive snapshots). I expect the true numbers to be higher given the increased investment going into AI.

Prolific

Prolific claims to have “200,000+ active taskers” for AI data annotation projects. However, it’s unclear whether all of these are actually working on AI projects, as Prolific does a range of different data work.

The unofficial Prolific subreddit has 43k members.

Surge AI

Surge AI, operating under the DataAnnotation brand, say they have “100k+ Members”. They hire contractors in the USA, Canada, Australia, New Zealand, UK, and Ireland, and work with OpenAI, Anthropic, and Google DeepMind.

The unofficial subreddit has 31k members, which supports this scale claim.

LabelBox

LabelBox, operating under the brand Alignerr, has an unofficial subreddit with 14k members. They hire contractors in the USA, Canada, Australia, and New Zealand.

I did not find a public claim from them about their number of taskers, but we can estimate it from their subreddit numbers. The ratio of claimed taskers to subreddit members is about 4.7 taskers/redditor for Prolific (200k / 43k) and 3.2 taskers/redditor for Surge AI (100k / 31k). Assuming LabelBox has a similar ratio, maybe 4 taskers/redditor, this results in 14k × 4 = 56k estimated taskers.
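As a rough sketch of this arithmetic, using only the figures quoted in this post: the variable names are mine, for illustration, and the ratio of 4 is the guess stated above rather than anything measured.

```python
# Claimed tasker counts and unofficial subreddit sizes, as quoted in this post.
claimed_taskers = {"Prolific": 200_000, "Surge AI": 100_000}
subreddit_members = {"Prolific": 43_000, "Surge AI": 31_000, "LabelBox": 14_000}

# Taskers per subreddit member, for the providers that publish a tasker count.
ratios = {
    provider: round(claimed_taskers[provider] / subreddit_members[provider], 1)
    for provider in claimed_taskers
}

# Assumption: LabelBox sits roughly between the two observed ratios.
ASSUMED_RATIO = 4
labelbox_estimate = ASSUMED_RATIO * subreddit_members["LabelBox"]

print(ratios)
print(labelbox_estimate)
```

The estimate is sensitive to the assumed ratio: using Surge AI's 3.2 or Prolific's 4.7 instead would give roughly 45k or 66k respectively.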

Aggregating the above

Summing the above gets us about 280k + 200k + 100k + 56k = 636k taskers.

However, it’s likely that many contractors sign up to multiple platforms. The most conservative estimate would therefore say that we should take the maximum rather than sum here - pointing towards Scale AI with 280,000 contractors.

In practice not every contractor will also have registered with Scale AI - and Scale’s public numbers are likely an underestimate for the reasons above. I think this gives us at least 300k as a lower bound. My best guess is 400k-500k.
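The two extremes above can be sketched as follows. This is only an illustration of the bounding logic, with names I've made up; the inputs are the per-provider figures from this post, including the estimated (not claimed) LabelBox number.

```python
# Per-provider tasker counts used in this post.
taskers = {
    "Scale AI": 240_000 + 40_000,  # Remotasks + Outlier
    "Prolific": 200_000,
    "Surge AI": 100_000,
    "LabelBox": 56_000,            # estimated from subreddit size, not claimed
}

# If no contractor is on more than one platform, the total is the sum.
no_overlap_total = sum(taskers.values())

# If everyone on a smaller platform is also on the largest one,
# the total is just the largest platform.
full_overlap_total = max(taskers.values())

print(full_overlap_total, no_overlap_total)
```

The true figure lies somewhere between these two, depending on how much overlap there is and how out of date the inputs are.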

Footnotes

¹ OpenAI and Meta are listed openly on Scale AI’s website. Scale AI contractors discuss clients and their project codenames on Reddit, with multiple credible sources suggesting the following mappings:
- Meta: Flamingo
- ChatGPT: Ostrich
- Google DeepMind: Bulba, Dolphin
- Anthropic: Alpaca

² The “Experts Project Support” and “Data Collectors Team” Slacks.
