How MLX Runs LLMs on Apple Silicon
If you are running models on a Mac and trying to work out whether MLX is worth the switch, the short answer is usually yes, with a couple of caveats we will get to. The longer answer is worth knowing, because it explains why the same model can be noticeably faster under MLX, and when it will not be.
For the broader landscape of inference servers across all hardware, see our comparison of vLLM, SGLang, llama.cpp and Ollama.
What MLX is#
MLX is an open-source array framework from Apple's machine-learning research team (the ml-explore organisation on GitHub). Think NumPy or PyTorch, but designed from the first commit for Apple Silicon rather than ported to it. It handles both training and inference, and the LLM-facing layer, mlx-lm, gives you text generation, fine-tuning and an OpenAI-compatible server.
Two design decisions define it:
- Unified memory model. MLX arrays live in memory that the CPU and GPU both see. There is no
tensor.to(device)copy step, because there is nothing to copy to. You choose which device runs an operation, and the data stays put. - Lazy evaluation. Operations build a computation graph that only runs when you actually need the result. That lets MLX fuse work together and schedule the GPU efficiently rather than firing one kernel at a time.
Apple positions MLX as a framework for research and power users, not as the path for shipping a model inside an App Store app (that is what Core ML and the newer Foundation Models framework are for). But "research framework" undersells where it has landed: it is now the fastest practical way to run open-weight models on a Mac.
Why unified memory changes the maths#
On a typical PC with a discrete NVIDIA GPU, the model weights live in the GPU's VRAM. If a model is bigger than your VRAM, you are stuck moving data across the PCIe bus, and that bottleneck dominates. Apple Silicon does not have that split. The CPU, GPU and Neural Engine all share one pool of memory on the same package.
That has two practical consequences.
The first is capacity. A Mac with 128GB of unified memory can hold a 70B model in 4-bit quantisation and serve it from the GPU at full speed, no multi-GPU rig required. The rule of thumb is simple: a model needs roughly its parameter count times its bytes-per-weight. An 8B model in 4-bit is about 4.5GB; a 70B in 4-bit is about 40GB.
The second is that token generation is bound by memory bandwidth, not raw compute. Generating each token means reading essentially the whole model out of memory once. So the ceiling on tokens-per-second is, roughly, your memory bandwidth divided by the model size in bytes. This is why bandwidth is the single most important spec on a Mac you intend to run LLMs on, and why it pays to know the numbers across the line-up (all figures in GB/s):
| Tier | M1 | M2 | M3 | M4 | M5 |
|---|---|---|---|---|---|
| Base | 68 | 100 | 100 | 120 | 153 |
| Pro | 200 | 200 | 150 | 273 | 307 |
| Max | 400 | 400 | 300-400 | 410-546 | 460-614 |
| Ultra | 800 | 800 | 819 | - | - |
A few notes reading that table: the M3 Max figure depends on the GPU bin (a 14-core part is 300 GB/s, the 16-core is 400). There was no M4 Ultra, and at the time of writing there is no announced M5 Ultra. And one number stands out for the wrong reasons.
The Neural Engine myth#
Almost everyone assumes MLX runs on the Apple Neural Engine. It does not, and neither does llama.cpp or Ollama. They all run on the GPU through Metal.
The Neural Engine (ANE) is a fixed-function accelerator built for the kind of small, fixed-shape neural networks that power features like on-device dictation and photo analysis. It is power-efficient, but it has three problems for LLM serving: there is no public API to program it directly (you reach it only through Core ML), it is built around static computation graphs rather than the variable-length, autoregressive generation an LLM does, and its small on-chip memory means large attention layers spill constantly.
So the ANE is an efficiency story, not a throughput story. Running a small model on it draws a couple of watts where the GPU would draw twenty, which is great for battery life, but the GPU generates several times more tokens per second. For interactive LLM work, the GPU is the right engine, and that is where MLX lives.
What the M5 changed#
Apple announced the M5 in October 2025, and for LLM work its headline feature is those GPU Neural Accelerators: dedicated matrix-multiplication hardware in every GPU core. Matrix multiplication is exactly what dominates the prompt-processing (prefill) phase, so this is aimed squarely at time-to-first-token.
Apple's own machine-learning research benchmarks, run with MLX, put the M5 at roughly 3.3x to 4x faster time-to-first-token than the M4 across a range of models from 1.7B up to a 30B mixture-of-experts. Image generation with FLUX saw a similar jump.
Token generation, though, only improved by around 20 to 30 percent. That is smaller than the prefill gains, and for the same reason as before: generation is bandwidth-bound, and bandwidth only rose from 120 to 153 GB/s on the base chip. The Neural Accelerators make the compute-heavy prefill much faster; they cannot move the bandwidth ceiling that governs decode.
So on an M5, long prompts start responding much sooner, which is the part of latency you actually feel. Sustained generation speed tracks memory bandwidth, as it always has.
MLX versus llama.cpp and Ollama#
The clearest signal of MLX's maturity is that Ollama - the most popular local-LLM tool there is - switched its Apple Silicon engine to MLX in version 0.19 (March 2026). GGUF models still route through llama.cpp; models in safetensors format now run on MLX. The numbers Ollama published with that announcement, on an M5 Max running a 35B mixture-of-experts model in 4-bit, are striking: prefill went from roughly 1,150 to 1,810 tokens per second, and decode from about 58 to 112 tokens per second. Close to double, for the same hardware and model.
The general pattern across community testing is consistent:
When each one wins
Smaller and MoE models
MLXLarge dense models
EvenVery long contexts
llama.cppEveryday interactive chat
EitherOne last catch: model availability lag. MLX needs models in its own safetensors-based format, and while the mlx-community on Hugging Face converts popular models quickly, GGUF for llama.cpp often appears the day a model drops. If you chase brand-new releases, GGUF is usually ready first.
Serving MLX with an OpenAI API#
MLX is a framework, not a server, so you put a serving layer on top. The main options, all OpenAI-compatible:
- mlx-lm's built-in server (
mlx_lm.server) - the official, no-frills choice from Apple's team. One model per process, and the base server has no continuous batching. Watch the defaultmax_tokensof 512, which catches people out. - oMLX (jundot/omlx) - a macOS-native, menu-bar server that serves many models from one process, loading them on demand. Its standout feature is a tiered KV cache: hot blocks in RAM, cold blocks persisted to SSD, so a long context can be restored after a restart instead of recomputed. OpenAI and native Anthropic endpoints, plus embeddings and reranking.
- LM Studio - a polished GUI built on mlx-lm, with a toggle between MLX and GGUF per model. Closed source, but the easiest on-ramp.
- vllm-mlx and similar community servers - vLLM-style continuous batching on top of MLX. Worth knowing that vLLM proper does not run usefully on Apple Silicon; "vLLM on a Mac" almost always means one of these MLX-based projects, not the real thing.
We go deeper on wiring these together, and on running several of them behind one endpoint, in our companion guide: running MLX behind Olla.
Where Apple is taking this#
MLX has become a first-class part of Apple's developer story. At WWDC 2025 Apple gave it dedicated sessions and shipped the Foundation Models framework, which exposes Apple's own on-device model to apps. At WWDC 2026 it went further: MLX got its own session track for local agentic AI, the mlx-lm server gained continuous batching, and you can now point Apple's LanguageModelSession API at open-weight MLX models from the mlx-community through a pluggable backend. Apple also previewed Core AI, positioned as the successor to Core ML for on-device inference.
Generation over generation, Apple Silicon is being shaped into hardware that runs language models well, and MLX is the open framework riding closest to the metal.
You do not have to bet on one backend
Whichever way the Mac inference landscape moves, you do not have to pick a single engine and rewrite your clients when it changes. Olla, our open-source LLM proxy, sits in front of oMLX, Ollama, LM Studio, llama.cpp and the rest, and gives your applications one endpoint regardless of what is running behind it - speaking both OpenAI and Anthropic, so OpenAI-shaped clients and Anthropic-shaped tools like Claude Code work against the same backends. As new MLX servers mature we add native support, so your application layer stays put while the backend world keeps moving.
Comparison at a glance#
| MLX (via mlx-lm / oMLX) | llama.cpp | Ollama | |
|---|---|---|---|
| Hardware | Apple Silicon (Metal GPU) | Almost anything (CPU, Metal, CUDA, ROCm, Vulkan) | macOS, Linux, Windows |
| Model format | safetensors (MLX-converted) | GGUF | GGUF + safetensors (MLX since 0.19) |
| Token generation on Mac | Fastest for small and MoE models | Strong, slightly behind on small models | Uses MLX or llama.cpp under the hood |
| Long-context prefill | Full prefill first - slower TTFT | Chunked - faster TTFT on huge prompts | Depends on backend |
| Day-one model availability | Lags (needs conversion) | Usually immediate | Tracks both |
| Uses the Neural Engine? | No - GPU via Metal | No | No |
Key takeaways
- MLX is Apple's own framework for Apple Silicon, built around unified memory so the CPU and GPU share one memory pool with no copying.
- It runs on the GPU through Metal, not the Neural Engine. The ANE is for low-power fixed-graph tasks, not interactive LLM generation.
- Token generation is bound by memory bandwidth. Check a Mac's bandwidth, not just its chip name - the M3 Pro is slower here than the M2 Pro.
- The M5's GPU Neural Accelerators cut time-to-first-token by 3-4x, but sustained generation still tracks bandwidth.
- MLX is generally faster than llama.cpp on small and MoE models, roughly even on large dense ones, and behind on very long contexts where chunked prefill wins.
- Ollama 0.19 adopting MLX is the strongest signal yet that MLX is the default fast path on a Mac.
Further reading#
- MLX on GitHub - Apple's array framework
- mlx-lm - LLM inference, fine-tuning and the OpenAI-compatible server
- Apple's M5 MLX benchmarks - time-to-first-token gains explained
- Ollama on MLX - the official announcement of the backend switch
- Running MLX behind Olla - one endpoint for oMLX and the rest of your Mac stack
- LLM inference servers compared - vLLM, SGLang, llama.cpp and Ollama
- What is an LLM proxy? - how a proxy sits in front of multiple backends