Tested across real-world dev workflows on Apple Silicon (M2 Pro/Max/Ultra) using llama.cpp, Ollama, MLX/MLX‑LM, and MLC‑LLM. I’ll share practical setup notes, my debugging detours, and honest benchmarks you can reproduce. This is written for developers who want fast, private, local inference on their Macs without babysitting CUDA.
Why run LLMs locally on a Mac?
- Privacy by default. Your context, code, and documents never leave the machine.
- Snappy latencies. No cold starts, no mystery rate limits.
- Predictable cost. One-time hardware; zero token bills.
- Unified memory is a cheat code. Apple Silicon’s high-bandwidth unified memory lets you fit bigger models and KV caches than you’d expect on a laptop.
If you’re building agents, RAG, or code tooling, a fast local 8B is often “good enough” for interactive use. 70B on a desktop M2 Ultra is viable if you keep context sane and quantize sensibly.
TL;DR
- For M2 Pro/Max laptops: run Llama 3.x 8B in Q4_K_M with llama.cpp or MLX. Expect ~25–60 tok/s generation depending on model, quant, and settings.
- For M2 Ultra (96–192 GB) desktops: 70B Q4_K_M is workable for single-user interactive workloads at ~12–20 tok/s, prompt eval much higher. Keep context ≤ 16–32k unless you enjoy watching fans.
- Ollama is the friendliest runtime; llama.cpp gives the most control; MLX‑LM is surprisingly competitive on Apple GPUs; MLC‑LLM is solid if you want a compiler toolchain and mobile targets.
I include reproducible commands and a simple harness you can tweak for your own Mac.
What we’ll cover
- Model choices and quantisation that make sense on M2
- Three ways to run locally (llama.cpp, Ollama, MLX/MLX‑LM) + MLC‑LLM
- A reproducible benchmark harness
- Results on M2 Pro, M2 Max, and M2 Ultra
- Tuning for throughput vs latency
- Self‑hosting tips: APIs, process managers, reverse proxies
- Troubleshooting compilation, Metal, and memory headroom
Models and quantisation that actually work on M2
Good defaults
- Llama 3/3.1 8B Instruct in Q4_K_M (≈ 4.8 bpw) for general use
- Q5_K_M if you want a bit more quality and have RAM to spare
- Q6_K or Q8_0 only if you need quality and can eat the speed hit
When to go 70B on Mac
- You have M2 Ultra with 96–192 GB unified memory
- You’re OK with 12–20 tok/s generation and careful context limits
KV cache
- Start with default 16‑bit KV; if memory is tight, try 8‑bit KV cache variants in runtimes that support it.
Why this matters: Apple GPUs are memory‑bandwidth bound at small batch sizes. K‑quants keep accuracy reasonable without blowing RAM.
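If you still need a GGUF to point these runtimes at, the Hugging Face CLI is the least fiddly way to fetch one. A minimal sketch, assuming a community quant repo; the repo and filename below are examples, so check the model page for the exact Q4_K_M file:
pip install -U "huggingface_hub[cli]"
# Example repo/filename only; verify the exact GGUF name on the model page
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models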
Option A: llama.cpp (maximum control)
Build with Metal on macOS
# Prereqs
xcode-select --install || true
brew install cmake ninja
# Build
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -S . -B build -DGGML_METAL=ON -DBUILD_SHARED_LIBS=ON
cmake --build build -j
# Binaries appear in ./build/bin
./build/bin/llama-cli -h
If you prefer the classic Makefile:
LLAMA_METAL=1 make -j
from repo root. I use CMake for shared libs and consistent flags across machines.
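Before benchmarking anything, it’s worth a quick sanity check that the Metal backend initialised and layers were actually offloaded. The exact log wording shifts between versions, so grep loosely (the model path here is a placeholder):
./build/bin/llama-cli -m ./models/your-model.gguf -p "hello" -n 8 -ngl 999 2>&1 \
  | grep -iE "metal|offload"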
Run an 8B Q4_K_M model
# Example: Llama 3.1 8B Q4_K_M downloaded as ./models/llama-3.1-8b-q4_k_m.gguf
./build/bin/llama-cli \
-m ./models/llama-3.1-8b-q4_k_m.gguf \
-p "Explain the Raft consensus algorithm like I'm a junior backend dev" \
-c 8192 -n 256 -t 8 -ngl 999 -b 512 --verbose-prompt
Flags that matter
- -ngl 999: offloads as many layers as possible to the GPU
- -b 512: batch size; increase until you hit diminishing returns or OOM
- -t 8: threads; on laptops I stick to P‑cores, but experiment
- -c: context; larger contexts cost RAM and often reduce t/s on laptops
Built‑in benchmark
# Warm-up once, then run structured bench
./build/bin/llama-bench \
-m ./models/llama-3.1-8b-q4_k_m.gguf \
-p 4096,8192 \
-fa 1 -t 8 -b 512 -ub 1024 -n 256
This prints prompt eval and decode throughput separately. Capture results to JSON or Markdown with -o and script your runs.
Option B: Ollama (nicest DX, solid defaults)
# Install
brew install ollama
# Or download the macOS app from https://ollama.com/download (the install.sh script targets Linux)
# Pull and run Llama 3.1 8B
ollama pull llama3.1:8b
ollama run llama3.1:8b --verbose "Summarise this codebase migration plan in bullet points"
Use --verbose to get eval stats at the end. Create a Modelfile to pin quant, templates, and context limits for repeatable runs.
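Here’s a minimal Modelfile sketch; the tag and parameter values are just examples, but FROM and PARAMETER are the directives you want for pinning context and sampling:
# Build a derived model with fixed context and temperature
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 8192
PARAMETER temperature 0.2
EOF
ollama create llama3.1-8b-dev -f Modelfile
ollama run llama3.1-8b-dev --verbose "Summarise this codebase migration plan in bullet points"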
Option C: MLX / MLX‑LM (Apple’s stack)
MLX is Apple’s array framework with a lightweight LLM layer.
python -m venv .venv && source .venv/bin/activate
pip install mlx-lm
# Run an instruct model
python -m mlx_lm.generate \
--model mlx-community/Meta-Llama-3-8B-Instruct-4bit \
--prompt "Give me a practical example of rate limiting in Express.js"
MLX often matches llama.cpp on Apple GPUs, especially on recent releases. It’s a good baseline if you prefer Python and simple scripts.
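MLX‑LM also ships a small HTTP server with an OpenAI-style chat endpoint, which makes it easy to A/B against llama.cpp from the same client code. A sketch; the port is arbitrary:
# Serve the same model over HTTP (OpenAI-style /v1/chat/completions)
python -m mlx_lm.server \
  --model mlx-community/Meta-Llama-3-8B-Instruct-4bit \
  --host 127.0.0.1 --port 8081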
Option D: MLC‑LLM (compiler toolchain)
If you want to target iOS, Android, or WebGPU later, try MLC‑LLM. It compiles model libraries for the platform and can JIT on Mac.
python -m venv .venv && source .venv/bin/activate
# Wheels are hosted on the MLC index, not PyPI
pip install --pre -U -f https://mlc.ai/wheels mlc-ai-nightly mlc-llm-nightly
# Quick interactive run (weights fetched from Hugging Face, compiled JIT for Metal);
# type your prompt at the REPL, e.g. "Design a RAG eval harness"
python -m mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
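If you’d rather hit MLC over HTTP, there’s also a serve command with an OpenAI-compatible API; a sketch, assuming the nightly CLI accepts the same HF:// model reference:
python -m mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --host 127.0.0.1 --port 8082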
Benchmark plan you can actually reproduce
We care about both prompt eval (ingestion) and decode (generation). Settings below are conservative and designed to run on stock machines without tweaks.
Common settings
- Model: Llama 3.1 8B Instruct Q4_K_M and Llama 3/3.1 70B Q4_K_M where RAM allows
- Context: 8k for 8B, 16k for 70B
- Threads: 8 on laptops, 12–16 on desktops
- GPU offload:
-ngl 999
- Batch: 512; push to 1024 on desktops if stable
- Two prompts: a 4k-token synthetic doc for prompt eval; a 256-token decode for throughput
Harness
#!/usr/bin/env bash
set -euo pipefail
MODEL="$1"; NAME="$2"
FLAGS="-t 8 -ngl 999 -b 512 -fa 1 -n 256"
./build/bin/llama-bench -m "$MODEL" -p 4096,8192 -ub 1024 $FLAGS -o json > "bench_${NAME}.json"
Run it three times and take the median.
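Assuming you saved the harness as bench.sh (the script name and the JSON field names below are my assumptions; check the keys your llama-bench build actually emits), a run plus a quick extraction looks like this:
chmod +x bench.sh
./bench.sh ./models/llama-3.1-8b-q4_k_m.gguf m2max_8b_q4km
# Pull per-test throughput; recent llama-bench JSON reports avg_ts (tokens/sec)
jq -r '.[] | [.n_prompt, .n_gen, .avg_ts] | @tsv' bench_m2max_8b_q4km.json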
Results: expected ranges on Apple Silicon (M2 generation)
These are ranges, not lab‑perfect numbers. They reflect what I’ve seen and what the community reports with similar settings. Your mileage will vary with background load, thermal headroom, batch size, and quant.
Llama 3.x 8B (Q4_K_M)
Chip | Memory (UMA) | Runtime | Prompt eval (tok/s) | Decode (tok/s)
---|---|---|---|---
M2 Pro | 16 GB | llama.cpp | 500–900 | 22–35
M2 Pro | 16 GB | MLX‑LM | 550–950 | 24–38
M2 Max | 64–96 GB | llama.cpp | 700–1200 | 35–60
M2 Max | 64–96 GB | MLX‑LM | 750–1250 | 35–60
M2 Ultra | 96–192 GB | llama.cpp | 900–1400 | 40–70
M2 Ultra | 96–192 GB | MLX‑LM | 950–1500 | 40–70
Llama 3/3.1 70B (Q4_K_M) on M2 Ultra
Chip | Runtime | Prompt eval (tok/s) | Decode (tok/s)
---|---|---|---
M2 Ultra (192 GB) | llama.cpp | 200–450 | 12–18
M2 Ultra (192 GB) | MLX‑LM | 220–500 | 12–20
Notes
- Ranges assume plugged‑in power, cool ambient temps, minimal background tasks.
- If you see decode under the lower bound, reduce context, lower batch, or drop to Q3_K.
- A base M2 with 8 GB of RAM will force aggressive paging and kill throughput; M2 Pro machines start at 16 GB.
Tuning checklist
- Batch size: Lift -b until speed stops improving or you hit OOM.
- Context: Keep -c as low as your app can tolerate. Long context tanks t/s.
- Threads: On laptops, don’t max all cores during long runs. Stick to P‑cores for stability.
- GPU offload: Always use -ngl 999 on Apple Silicon.
- Quant: Start Q4_K_M. Try Q5_K_M if you have RAM and need fidelity. For 70B on 96 GB, use more aggressive quants and smaller context.
- KV cache: If supported, try 8‑bit KV to free memory for layers (see the sketch just below this list).
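For the KV-cache item above, llama.cpp exposes cache-type flags. A sketch; note that a quantised V cache needs flash attention, and newer builds may want a value after -fa (on/off/auto):
# 8-bit K and V caches roughly halve KV memory vs fp16
./build/bin/llama-cli \
  -m ./models/llama-3.1-8b-q4_k_m.gguf \
  -c 16384 -ngl 999 -fa \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -p "Long-context summarisation test" -n 256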
Serving locally (APIs and tooling)
llama.cpp server (HTTP)
./build/bin/llama-server \
-m ./models/llama-3.1-8b-q4_k_m.gguf \
--host 127.0.0.1 --port 8080 -c 8192 -ngl 999 -b 512 -t 8
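llama-server exposes an OpenAI-compatible chat endpoint, so most client SDKs work by just swapping the base URL. A quick smoke test with curl (the model field is largely cosmetic here):
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-q4_k_m",
    "messages": [{"role": "user", "content": "Say hello in one sentence"}],
    "max_tokens": 64
  }'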
llama-cpp-python (OpenAI‑compatible)
pip install "llama-cpp-python[server]"
python -m llama_cpp.server --model ./models/llama-3.1-8b-q4_k_m.gguf --n_ctx 8192 --n_gpu_layers -1
Ollama (REST + docker‑ish model management)
ollama serve &
ollama run llama3.1:8b --verbose "Generate TypeScript types from this JSON schema"
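Ollama’s REST API listens on port 11434 by default; a quick non-streaming generate call looks like this:
curl -s http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Generate TypeScript types for a User with id, email, createdAt",
  "stream": false
}'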
Reverse proxy and auth
Front with Caddy or Traefik, add basic auth or mTLS, and restrict to your LAN or Tailscale network. Don’t expose dev boxes to the internet.
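One way to wire up the Caddy variant, as a sketch: this assumes Caddy v2.8+ (where the directive is basic_auth; older releases call it basicauth) and a bcrypt hash you generate yourself. With a bare port address Caddy serves plain HTTP, so keep it on a trusted LAN or Tailscale, or add TLS.
# Generate the hash first: caddy hash-password --plaintext 'change-me'
cat > Caddyfile <<'EOF'
:8443 {
  basic_auth {
    dev <paste-bcrypt-hash-here>
  }
  reverse_proxy 127.0.0.1:8080
}
EOF
caddy run --config Caddyfile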
Troubleshooting (from my own bruises)
- Metal not engaging: rebuild with -DGGML_METAL=ON or LLAMA_METAL=1. Check the logs for GPU offload lines.
- Weirdly low t/s: close the IDE, stop Spotlight indexing, drop -c, lower -b a notch, verify you’re not on battery.
- OOM at start: reduce -b first, then -c, then try a lighter quant.
- “IQ” quants crawling: some IQ formats can be slower on Macs; switch to K‑quants for throughput.
- Long context regression: use NTK/RoPE‑aware models where possible. Don’t force 128k unless you have a real reason.
- llama-cpp-python slower than llama-cli: rebuild wheels locally with Metal enabled; avoid Rosetta.
Privacy and local‑first notes
- Running llama.cpp/MLX/Ollama locally means your prompts and docs can stay on device.
- Check each app’s privacy settings. Some GUIs keep local chat logs by default. If you care, disable history or store on encrypted volumes.
- If you serve an API, treat it like any internal service: auth, logs, and network boundaries.
Where this leaves us
- An M2 Pro/Max laptop does great work with 8B models today. It’s a sweet spot for private coding assistants, RAG, and agents.
- M2 Ultra opens the door to 70B locally, with careful tuning.
- Apple’s stack has caught up: MLX‑LM is competitive with llama.cpp for many setups.
If you need my exact scripts, grab the harness above and swap in your model paths. Then tune batch, context, and quant until the numbers stop getting better.
Quick memory sizing
- 8B Q4_K_M: ≈ 4.8–5.2 GB + KV cache + overhead → aim for ≥ 16 GB RAM
- 8B Q5_K_M: ≈ 5.7–6.2 GB → ≥ 16–24 GB comfortable
- 70B Q4_K_M: ≈ 35–40 GB model + large KV cache → ≥ 96 GB UMA to be happy
These are ballparks. KV cache scales with sequence length, batch size, and precision.
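To make that concrete, here’s the back-of-envelope KV arithmetic for Llama 3.1 8B, assuming the usual published shape (32 layers, 8 KV heads via GQA, head_dim 128, fp16 KV):
# KV bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element
echo $(( 2 * 32 * 8 * 128 * 2 ))        # 131072 bytes ≈ 128 KiB per token
echo $(( 2 * 32 * 8 * 128 * 2 * 8192 )) # ≈ 1 GiB of KV cache at an 8k context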