Tested across real-world dev workflows on Apple Silicon (M2 Pro/Max/Ultra) using llama.cpp, Ollama, MLX/MLX‑LM, and MLC‑LLM. I’ll share practical setup notes, my debugging detours, and honest benchmarks you can reproduce. This is written for developers who want fast, private, local inference on their Macs without babysitting CUDA.
Why run LLMs locally on a Mac?
- Privacy by default. Your context, code, and documents never leave the machine.
- Snappy latencies. No cold starts, no mystery rate limits.
- Predictable cost. One-time hardware; zero token bills.
- Unified memory is a cheat code. Apple Silicon’s high-bandwidth unified memory lets you fit bigger models and KV caches than you’d expect on a laptop.
If you’re building agents, RAG, or code tooling, a fast local 8B is often “good enough” for interactive use. 70B on a desktop M2 Ultra is viable if you keep context sane and quantize sensibly.
TL;DR
- For M2 Pro/Max laptops: run Llama 3.x 8B in Q4_K_M with llama.cpp or MLX. Expect ~25–60 tok/s generation depending on model, quant, and settings.
- For M2 Ultra (96–192 GB) desktops: 70B Q4_K_M is workable for single-user interactive workloads at ~12–20 tok/s, prompt eval much higher. Keep context ≤ 16–32k unless you enjoy watching fans.
- Ollama is the friendliest runtime; llama.cpp gives the most control; MLX‑LM is surprisingly competitive on Apple GPUs; MLC‑LLM is solid if you want a compiler toolchain and mobile targets.
I include reproducible commands and a simple harness you can tweak for your own Mac.
What we’ll cover
- Model choices and quantisation that make sense on M2
- Three ways to run locally (llama.cpp, Ollama, MLX/MLX‑LM) + MLC‑LLM
- A reproducible benchmark harness
- Results on M2 Pro, M2 Max, and M2 Ultra
- Tuning for throughput vs latency
- Self‑hosting tips: APIs, process managers, reverse proxies
- Troubleshooting compilation, Metal, and memory headroom
Models and quantisation that actually work on M2
Good defaults
- Llama 3/3.1 8B Instruct in Q4_K_M (≈ 4.8 bpw) for general use
- Q5_K_M if you want a bit more quality and have RAM to spare
- Q6_K or Q8_0 only if you need quality and can eat the speed hit
When to go 70B on Mac
- You have M2 Ultra with 96–192 GB unified memory
- You’re OK with 12–20 tok/s generation and careful context limits
KV cache
- Start with default 16‑bit KV; if memory is tight, try 8‑bit KV cache variants in runtimes that support it.
Why this matters: Apple GPUs are memory‑bandwidth bound at small batch sizes. K‑quants keep accuracy reasonable without blowing RAM.
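If you still need a GGUF to point these runtimes at, the Hugging Face CLI is the least fiddly way to fetch one. A minimal sketch, assuming a community quant repo; the repo and filename below are examples, so check the model page for the exact Q4_K_M file:
pip install -U "huggingface_hub[cli]"
# Example repo/filename only; verify the exact GGUF name on the model page
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models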
Option A: llama.cpp (maximum control)
Build with Metal on macOS
# Prereqs
xcode-select --install || true
brew install cmake ninja
# Build
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -S . -B build -DGGML_METAL=ON -DBUILD_SHARED_LIBS=ON
cmake --build build -j
# Binaries appear in ./build/bin
./build/bin/llama-cli -h
If you prefer the classic Makefile:
LLAMA_METAL=1 make -j
from repo root. I use CMake for shared libs and consistent flags across machines.
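Before benchmarking anything, it’s worth a quick sanity check that the Metal backend initialised and layers were actually offloaded. The exact log wording shifts between versions, so grep loosely (the model path here is a placeholder):
./build/bin/llama-cli -m ./models/your-model.gguf -p "hello" -n 8 -ngl 999 2>&1 \
  | grep -iE "metal|offload"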
Run an 8B Q4_K_M model
# Example: Llama 3.1 8B Q4_K_M downloaded as ./models/llama-3.1-8b-q4_k_m.gguf
./build/bin/llama-cli \
-m ./models/llama-3.1-8b-q4_k_m.gguf \
-p "Explain the Raft consensus algorithm like I'm a junior backend dev" \
-c 8192 -n 256 -t 8 -ngl 999 -b 512 --verbose-prompt
Flags that matter
- -ngl 999: offloads as many layers as possible to the GPU
- -b 512: batch size; increase until you hit diminishing returns or OOM
- -t 8: threads; on laptops I stick to P‑cores, but experiment
- -c: context; larger contexts cost RAM and often reduce t/s on laptops
Built‑in benchmark
# Warm-up once, then run structured bench
./build/bin/llama-bench \
-m ./models/llama-3.1-8b-q4_k_m.gguf \
-p 4096,8192 \
-fa 1 -t 8 -b 512 -ub 1024 -n 256
This prints prompt eval and decode throughput separately. Capture results to JSON or Markdown with -o and script your runs.
Option B: Ollama (nicest DX, solid defaults)
# Install
brew install ollama
# Or download the macOS app from https://ollama.com/download (the install.sh script targets Linux)
# Pull and run Llama 3.1 8B
ollama pull llama3.1:8b
ollama run llama3.1:8b --verbose "Summarise this codebase migration plan in bullet points"
Use --verbose to get eval stats at the end. Create a Modelfile to pin quant, templates, and context limits for repeatable runs.
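Here’s a minimal Modelfile sketch; the tag and parameter values are just examples, but FROM and PARAMETER are the directives you want for pinning context and sampling:
# Build a derived model with fixed context and temperature
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 8192
PARAMETER temperature 0.2
EOF
ollama create llama3.1-8b-dev -f Modelfile
ollama run llama3.1-8b-dev --verbose "Summarise this codebase migration plan in bullet points"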
Option C: MLX / MLX‑LM (Apple’s stack)
MLX is Apple’s array framework with a lightweight LLM layer.
python -m venv .venv && source .venv/bin/activate
pip install mlx-lm
# Run an instruct model
python -m mlx_lm.generate \
--model mlx-community/Meta-Llama-3-8B-Instruct-4bit \
--prompt "Give me a practical example of rate limiting in Express.js"
MLX often matches llama.cpp on Apple GPUs, especially on recent releases. It’s a good baseline if you prefer Python and simple scripts.
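MLX‑LM also ships a small HTTP server with an OpenAI-style chat endpoint, which makes it easy to A/B against llama.cpp from the same client code. A sketch; the port is arbitrary:
# Serve the same model over HTTP (OpenAI-style /v1/chat/completions)
python -m mlx_lm.server \
  --model mlx-community/Meta-Llama-3-8B-Instruct-4bit \
  --host 127.0.0.1 --port 8081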
Option D: MLC‑LLM (compiler toolchain)
If you want to target iOS, Android, or WebGPU later, try MLC‑LLM. It compiles model libraries for the platform and can JIT on Mac.
python -m venv .venv && source .venv/bin/activate
# Wheels are hosted on the MLC index, not PyPI
pip install --pre -U -f https://mlc.ai/wheels mlc-ai-nightly mlc-llm-nightly
# Quick interactive run (weights fetched from Hugging Face, compiled JIT for Metal);
# type your prompt at the REPL, e.g. "Design a RAG eval harness"
python -m mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
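If you’d rather hit MLC over HTTP, there’s also a serve command with an OpenAI-compatible API; a sketch, assuming the nightly CLI accepts the same HF:// model reference:
python -m mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --host 127.0.0.1 --port 8082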
Benchmark plan you can actually reproduce
We care about both prompt eval (ingestion) and decode (generation). Settings below are conservative and designed to run on stock machines without tweaks.
Common settings
- Model: Llama 3.1 8B Instruct Q4_K_M and Llama 3/3.1 70B Q4_K_M where RAM allows
- Context: 8k for 8B, 16k for 70B
- Threads: 8 on laptops, 12–16 on desktops
- GPU offload:
-ngl 999
- Batch: 512; push to 1024 on desktops if stable
- Two prompts: a 4k-token synthetic doc for prompt eval; a 256-token decode for throughput
Harness
#!/usr/bin/env bash
set -euo pipefail
MODEL="$1"; NAME="$2"
FLAGS="-t 8 -ngl 999 -b 512 -fa 1 -n 256"
./build/bin/llama-bench -m "$MODEL" -p 4096,8192 -ub 1024 $FLAGS -o json > "bench_${NAME}.json"
Run it three times and take the median.
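Assuming you saved the harness as bench.sh (the script name and the JSON field names below are my assumptions; check the keys your llama-bench build actually emits), a run plus a quick extraction looks like this:
chmod +x bench.sh
./bench.sh ./models/llama-3.1-8b-q4_k_m.gguf m2max_8b_q4km
# Pull per-test throughput; recent llama-bench JSON reports avg_ts (tokens/sec)
jq -r '.[] | [.n_prompt, .n_gen, .avg_ts] | @tsv' bench_m2max_8b_q4km.json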
Results: expected ranges on Apple Silicon (M2 generation)
These are ranges, not lab‑perfect numbers. They reflect what I’ve seen and what the community reports with similar settings. Your mileage will vary with background load, thermal headroom, batch size, and quant.
Llama 3.x 8B (Q4_K_M)
Chip | Memory (UMA) | Runtime | Prompt eval (tok/s) | Decode (tok/s)
---|---|---|---|---
M2 Pro | 16 GB | llama.cpp | 500–900 | 22–35
M2 Pro | 16 GB | MLX‑LM | 550–950 | 24–38
M2 Max | 64–96 GB | llama.cpp | 700–1200 | 35–60
M2 Max | 64–96 GB | MLX‑LM | 750–1250 | 35–60
M2 Ultra | 96–192 GB | llama.cpp | 900–1400 | 40–70
M2 Ultra | 96–192 GB | MLX‑LM | 950–1500 | 40–70
Llama 3/3.1 70B (Q4_K_M) on M2 Ultra
Chip | Runtime | Prompt eval (tok/s) | Decode (tok/s)
---|---|---|---
M2 Ultra (192 GB) | llama.cpp | 200–450 | 12–18
M2 Ultra (192 GB) | MLX‑LM | 220–500 | 12–20
Notes
- Ranges assume plugged‑in power, cool ambient temps, minimal background tasks.
- If you see decode under the lower bound, reduce context, lower batch, or drop to Q3_K.
- A base M2 with 8 GB of RAM will force aggressive paging and kill throughput; M2 Pro machines start at 16 GB.
Tuning checklist
- Batch size: Lift -b until speed stops improving or you hit OOM.
- Context: Keep -c as low as your app can tolerate. Long context tanks t/s.
- Threads: On laptops, don’t max all cores during long runs. Stick to P‑cores for stability.
- GPU offload: Always use -ngl 999 on Apple Silicon.
- Quant: Start Q4_K_M. Try Q5_K_M if you have RAM and need fidelity. For 70B on 96 GB, use more aggressive quants and smaller context.
- KV cache: If supported, try 8‑bit KV to free memory for layers (see the sketch just below this list).
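For the KV-cache item above, llama.cpp exposes cache-type flags. A sketch; note that a quantised V cache needs flash attention, and newer builds may want a value after -fa (on/off/auto):
# 8-bit K and V caches roughly halve KV memory vs fp16
./build/bin/llama-cli \
  -m ./models/llama-3.1-8b-q4_k_m.gguf \
  -c 16384 -ngl 999 -fa \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -p "Long-context summarisation test" -n 256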
Serving locally (APIs and tooling)
llama.cpp server (HTTP)
./build/bin/llama-server \
-m ./models/llama-3.1-8b-q4_k_m.gguf \
--host 127.0.0.1 --port 8080 -c 8192 -ngl 999 -b 512 -t 8
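llama-server exposes an OpenAI-compatible chat endpoint, so most client SDKs work by just swapping the base URL. A quick smoke test with curl (the model field is largely cosmetic here):
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-q4_k_m",
    "messages": [{"role": "user", "content": "Say hello in one sentence"}],
    "max_tokens": 64
  }'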
llama-cpp-python (OpenAI‑compatible)
pip install "llama-cpp-python[server]"
python -m llama_cpp.server --model ./models/llama-3.1-8b-q4_k_m.gguf --n_ctx 8192 --n_gpu_layers -1
Ollama (REST + docker‑ish model management)
ollama serve &
ollama run llama3.1:8b --verbose "Generate TypeScript types from this JSON schema"
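Ollama’s REST API listens on port 11434 by default; a quick non-streaming generate call looks like this:
curl -s http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Generate TypeScript types for a User with id, email, createdAt",
  "stream": false
}'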
Reverse proxy and auth
Front with Caddy or Traefik, add basic auth or mTLS, and restrict to your LAN or Tailscale network. Don’t expose dev boxes to the internet.
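One way to wire up the Caddy variant, as a sketch: this assumes Caddy v2.8+ (where the directive is basic_auth; older releases call it basicauth) and a bcrypt hash you generate yourself. With a bare port address Caddy serves plain HTTP, so keep it on a trusted LAN or Tailscale, or add TLS.
# Generate the hash first: caddy hash-password --plaintext 'change-me'
cat > Caddyfile <<'EOF'
:8443 {
  basic_auth {
    dev <paste-bcrypt-hash-here>
  }
  reverse_proxy 127.0.0.1:8080
}
EOF
caddy run --config Caddyfile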
Troubleshooting (from my own bruises)
- Metal not engaging: rebuild with -DGGML_METAL=ON or LLAMA_METAL=1. Check the logs for GPU offload lines.
- Weirdly low t/s: close the IDE, stop Spotlight indexing, drop -c, lower -b a notch, verify you’re not on battery.
- OOM at start: reduce -b first, then -c, then try a lighter quant.
- “IQ” quants crawling: some IQ formats can be slower on Macs; switch to K‑quants for throughput.
- Long context regression: use NTK/RoPE‑aware models where possible. Don’t force 128k unless you have a real reason.
- llama-cpp-python slower than llama-cli: rebuild wheels locally with Metal enabled; avoid Rosetta.
Privacy and local‑first notes
- Running llama.cpp/MLX/Ollama locally means your prompts and docs can stay on device.
- Check each app’s privacy settings. Some GUIs keep local chat logs by default. If you care, disable history or store on encrypted volumes.
- If you serve an API, treat it like any internal service: auth, logs, and network boundaries.
Where this leaves us
- An M2 Pro/Max laptop does great work with 8B models today. It’s a sweet spot for private coding assistants, RAG, and agents.
- M2 Ultra opens the door to 70B locally, with careful tuning.
- Apple’s stack has caught up: MLX‑LM is competitive with llama.cpp for many setups.
If you need my exact scripts, grab the harness above and swap in your model paths. Then tune batch, context, and quant until the numbers stop getting better.
Quick memory sizing
- 8B Q4_K_M: ≈ 4.8–5.2 GB + KV cache + overhead → aim for ≥ 16 GB RAM
- 8B Q5_K_M: ≈ 5.7–6.2 GB → ≥ 16–24 GB comfortable
- 70B Q4_K_M: ≈ 35–40 GB model + large KV cache → ≥ 96 GB UMA to be happy
These are ballparks. KV cache scales with sequence length, batch size, and precision.
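To make that concrete, here’s the back-of-envelope KV arithmetic for Llama 3.1 8B, assuming the usual published shape (32 layers, 8 KV heads via GQA, head_dim 128, fp16 KV):
# KV bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element
echo $(( 2 * 32 * 8 * 128 * 2 ))        # 131072 bytes ≈ 128 KiB per token
echo $(( 2 * 32 * 8 * 128 * 2 * 8192 )) # ≈ 1 GiB of KV cache at an 8k context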