Running LLMs Locally in 2025: Speed tests on M2 Pro + 16 GB RAM

Adam Jones

This is my personal blog, where I write solely in my personal capacity. It does not represent the positions of any organisations I'm associated with.

Also, everything here is provided "as is", without warranties of any kind, express or implied.

2 April 2025

Curious how fast modern LLMs run on Apple Silicon? I tested some of the latest local models as of early 2025 on an M2 Pro MacBook with 16GB RAM.

Raw data table

Model	Average tokens per second
deepseek-r1:1.5b	83.1
llama3.2:1b	82.0
gemma3:1b	58.4
llama3.2:3b	56.1
gemma3:4b	43.3
deepseek-r1:8b	28.9
gemma3:12b	17.6
deepseek-r1:14b	16.2
phi4:14b	15.0
mistral-small:24b	1.18
gemma3:27b	0.06

Takeaways

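If you need something fast (80+ tokens per second, or tps), use a 1B or 1.5B parameter model. Both deepseek-r1:1.5b and llama3.2:1b cleared that bar, although gemma3:1b only reached 58.4 tps.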

If you need output at roughly human reading speed (5+ tps), models up to 14B parameters will work. The slowest of them, phi4:14b, still managed 15.0 tps.

Anything larger slows to a crawl and isn't much use locally: mistral-small:24b managed 1.18 tps and gemma3:27b just 0.06 tps. These models also use up all 16 GB of RAM, which makes the whole computer freeze up.
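To make the thresholds above concrete, here is a small sketch that sorts the measured speeds into the three groups from the takeaways. The speeds are copied from the raw data table; the bucket names ("fast", "readable", "too slow") and the function names are just labels I've made up for illustration.

```python
# Average tokens per second, copied from the raw data table.
RESULTS = {
    "deepseek-r1:1.5b": 83.1,
    "llama3.2:1b": 82.0,
    "gemma3:1b": 58.4,
    "llama3.2:3b": 56.1,
    "gemma3:4b": 43.3,
    "deepseek-r1:8b": 28.9,
    "gemma3:12b": 17.6,
    "deepseek-r1:14b": 16.2,
    "phi4:14b": 15.0,
    "mistral-small:24b": 1.18,
    "gemma3:27b": 0.06,
}

FAST_TPS = 80.0     # "fast" threshold from the takeaways
READING_TPS = 5.0   # rough human reading speed from the takeaways


def bucket(tps: float) -> str:
    """Return an illustrative label for a given generation speed."""
    if tps >= FAST_TPS:
        return "fast"
    if tps >= READING_TPS:
        return "readable"
    return "too slow"


def group_by_bucket(results: dict) -> dict:
    """Group model names by bucket, keeping the table's order."""
    groups = {"fast": [], "readable": [], "too slow": []}
    for model, tps in results.items():
        groups[bucket(tps)].append(model)
    return groups


if __name__ == "__main__":
    for name, models in group_by_bucket(RESULTS).items():
        print(f"{name}: {', '.join(models)}")
```

Only the two smallest models land in "fast", everything else up to 14B is "readable", and the 24B and 27B models fall below reading speed.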

Test details

Each model with 14B parameters or fewer was tested by generating 1000 tokens, at least three times. Before the test, the model was loaded into memory and used for 60 seconds to be more representative of real use (e.g. so that any thermal throttling would likely kick in before the first test run).

The larger models were tested similarly, but only generating 25 tokens, as they're much slower.

Models were run using Ollama, with the default settings for each model tag as of 2025-03-31.
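If you want to try something similar on your own machine, here is a minimal sketch of one way to measure tokens per second through Ollama's local HTTP API. It is not necessarily the exact setup behind the numbers above. It assumes Ollama is running on its default port (11434) and relies on the eval_count and eval_duration fields that a non-streaming /api/generate response reports. The function names and the prompt are placeholders of mine, and the 60 second warm-up is left out for brevity.

```python
import json
import urllib.request

# Ollama's default local endpoint; change it if your setup differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Generation speed for one non-streaming /api/generate response.

    eval_count is the number of tokens generated, and eval_duration is the
    time spent generating them, in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(model: str, prompt: str, num_tokens: int) -> dict:
    """Request up to num_tokens tokens from a model and return the JSON reply."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_tokens},
    }
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def benchmark(model: str, prompt: str, num_tokens: int = 1000, runs: int = 3) -> float:
    """Average tokens per second over several runs (no warm-up included)."""
    speeds = [
        tokens_per_second(generate(model, prompt, num_tokens))
        for _ in range(runs)
    ]
    return sum(speeds) / len(speeds)


if __name__ == "__main__":
    # Placeholder prompt; the model must already be pulled locally.
    print(benchmark("llama3.2:1b", "Write a long story about a lighthouse."))
```

A model may stop before reaching num_predict tokens, so dividing the actual eval_count by the duration is safer than assuming every run produced the full 1000 tokens.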