Running LLMs Locally in 2025: Speed tests on M2 Pro + 16 GB RAM
Curious how fast modern LLMs run on Apple Silicon? I tested some of the latest local models as of early 2025 on an M2 Pro MacBook with 16GB RAM.
Raw data table
Model | Average tokens per second |
---|---|
deepseek-r1:1.5b | 83.1 |
llama3.2:1b | 82.0 |
gemma3:1b | 58.4 |
llama3.2:3b | 56.1 |
gemma3:4b | 43.3 |
deepseek-r1:8b | 28.9 |
gemma3:12b | 17.6 |
deepseek-r1:14b | 16.2 |
phi4:14b | 15.0 |
mistral-small:24b | 1.18 |
gemma3:27b | 0.06 |
Takeaways
If you need something fast (80tps+), use a 1B or 1.5B parameter model.
If you need something at human reading speed (5tps+), models up to 14B parameters will work.
Really, anything above this limit slows to a crawl and isn't that usable locally for much. They also use up all your RAM which freezes up your computer.
Test details
Each model <= 14B parameters was tested generating 1000 tokens, at least three times. The model was loaded into memory and used for 60 seconds before the test to be more representative of real use (e.g. so if they reached thermal throttling, this would likely happen before the first test run).
The larger models were tested similarly, but only generating 25 tokens, as they're much slower.
Models were run using ollama, at their default settings for the model tag as of 2025-03-31.