MLX is an open-source array and machine-learning framework built by Apple's machine-learning research team for Apple Silicon. It runs LLM inference and fine-tuning directly on M-series Macs, exploiting the unified memory architecture so the CPU and GPU share one memory pool with no copying between them.

Does MLX use the Apple Neural Engine?

No. MLX runs on the GPU through Metal, not on the Apple Neural Engine (ANE). The ANE is only reachable through Core ML and is designed for fixed-graph models, which suits always-on, low-power tasks rather than autoregressive LLM generation. On the M5, MLX uses the new GPU Neural Accelerators, which sit inside the GPU cores and are separate from the ANE.

Is MLX faster than llama.cpp on a Mac?

For smaller and mixture-of-experts models that fit in unified memory, MLX is generally faster than llama.cpp's Metal backend at token generation. On large dense models both are limited by memory bandwidth and converge. llama.cpp still wins for very long contexts because it processes the prompt in chunks, while MLX completes the full prefill before emitting the first token.

Can I serve MLX models with an OpenAI-compatible API?

Yes. mlx-lm ships an OpenAI-compatible server (mlx_lm.server), and there are several other options including oMLX, LM Studio's MLX engine and community projects such as vllm-mlx. Ollama also uses MLX as its Apple Silicon backend for safetensors models as of version 0.19.

How MLX Runs LLMs on Apple Silicon

MLX has gone from a research curiosity to the default way fast LLM inference happens on a Mac. The tipping point was Ollama switching its Apple Silicon backend to MLX in March 2026, but the groundwork was Apple's own: a framework built specifically for the way Apple Silicon handles memory. This guide explains what MLX actually is, why unified memory changes the maths, the one thing almost everyone gets wrong about the Neural Engine, and how MLX stacks up against llama.cpp and Ollama.

If you are running models on a Mac and trying to work out whether MLX is worth the switch, the short answer is usually yes, with a couple of caveats we will get to. The longer answer is worth knowing, because it explains why the same model can be noticeably faster under MLX, and when it will not be.

For the broader landscape of inference servers across all hardware, see our comparison of vLLM, SGLang, llama.cpp and Ollama.

What MLX is#

MLX is an open-source array framework from Apple's machine-learning research team (the ml-explore organisation on GitHub). Think NumPy or PyTorch, but designed from the first commit for Apple Silicon rather than ported to it. It handles both training and inference, and the LLM-facing layer, mlx-lm, gives you text generation, fine-tuning and an OpenAI-compatible server.

Two design decisions define it:

Unified memory model. MLX arrays live in memory that the CPU and GPU both see. There is no tensor.to(device) copy step, because there is nothing to copy to. You choose which device runs an operation, and the data stays put.
Lazy evaluation. Operations build a computation graph that only runs when you actually need the result. That lets MLX fuse work together and schedule the GPU efficiently rather than firing one kernel at a time.

Apple positions MLX as a framework for research and power users, not as the path for shipping a model inside an App Store app (that is what Core ML and the newer Foundation Models framework are for). But "research framework" undersells where it has landed: it is now the fastest practical way to run open-weight models on a Mac.

Why unified memory changes the maths#

On a typical PC with a discrete NVIDIA GPU, the model weights live in the GPU's VRAM. If a model is bigger than your VRAM, you are stuck moving data across the PCIe bus, and that bottleneck dominates. Apple Silicon does not have that split. The CPU, GPU and Neural Engine all share one pool of memory on the same package.

On Apple Silicon the CPU, GPU and Neural Engine sit on one package and share the same unified memory, so model weights never have to be copied between them.

That has two practical consequences.

The first is capacity. A Mac with 128GB of unified memory can hold a 70B model in 4-bit quantisation and serve it from the GPU at full speed, no multi-GPU rig required. The rule of thumb is simple: a model needs roughly its parameter count times its bytes-per-weight. An 8B model in 4-bit is about 4.5GB; a 70B in 4-bit is about 40GB.

The second is that token generation is bound by memory bandwidth, not raw compute. Generating each token means reading essentially the whole model out of memory once. So the ceiling on tokens-per-second is, roughly, your memory bandwidth divided by the model size in bytes. This is why bandwidth is the single most important spec on a Mac you intend to run LLMs on, and why it pays to know the numbers across the line-up (all figures in GB/s):

Tier	M1	M2	M3	M4	M5
Base	68	100	100	120	153
Pro	200	200	150	273	307
Max	400	400	300-400	410-546	460-614
Ultra	800	800	819	-	-

A few notes reading that table: the M3 Max figure depends on the GPU bin (a 14-core part is 300 GB/s, the 16-core is 400). There was no M4 Ultra, and at the time of writing there is no announced M5 Ultra. And one number stands out for the wrong reasons.

The Neural Engine myth#

Almost everyone assumes MLX runs on the Apple Neural Engine. It does not, and neither does llama.cpp or Ollama. They all run on the GPU through Metal.

The Neural Engine (ANE) is a fixed-function accelerator built for the kind of small, fixed-shape neural networks that power features like on-device dictation and photo analysis. It is power-efficient, but it has three problems for LLM serving: there is no public API to program it directly (you reach it only through Core ML), it is built around static computation graphs rather than the variable-length, autoregressive generation an LLM does, and its small on-chip memory means large attention layers spill constantly.

So the ANE is an efficiency story, not a throughput story. Running a small model on it draws a couple of watts where the GPU would draw twenty, which is great for battery life, but the GPU generates several times more tokens per second. For interactive LLM work, the GPU is the right engine, and that is where MLX lives.

What the M5 changed#

Apple announced the M5 in October 2025, and for LLM work its headline feature is those GPU Neural Accelerators: dedicated matrix-multiplication hardware in every GPU core. Matrix multiplication is exactly what dominates the prompt-processing (prefill) phase, so this is aimed squarely at time-to-first-token.

Apple's own machine-learning research benchmarks, run with MLX, put the M5 at roughly 3.3x to 4x faster time-to-first-token than the M4 across a range of models from 1.7B up to a 30B mixture-of-experts. Image generation with FLUX saw a similar jump.

Token generation, though, only improved by around 20 to 30 percent. That is smaller than the prefill gains, and for the same reason as before: generation is bandwidth-bound, and bandwidth only rose from 120 to 153 GB/s on the base chip. The Neural Accelerators make the compute-heavy prefill much faster; they cannot move the bandwidth ceiling that governs decode.

So on an M5, long prompts start responding much sooner, which is the part of latency you actually feel. Sustained generation speed tracks memory bandwidth, as it always has.

MLX versus llama.cpp and Ollama#

The clearest signal of MLX's maturity is that Ollama - the most popular local-LLM tool there is - switched its Apple Silicon engine to MLX in version 0.19 (March 2026). GGUF models still route through llama.cpp; models in safetensors format now run on MLX. The numbers Ollama published with that announcement, on an M5 Max running a 35B mixture-of-experts model in 4-bit, are striking: prefill went from roughly 1,150 to 1,810 tokens per second, and decode from about 58 to 112 tokens per second. Close to double, for the same hardware and model.

The general pattern across community testing is consistent:

When each one wins

Smaller and MoE models

MLX

Models that fit in unified memory, especially mixture-of-experts. MLX's kernel fusion and efficient sparse-routing give it a real edge on token generation, often substantial on MoE.

Large dense models

Even

Around 30B and up, both engines hit the memory-bandwidth ceiling, so the framework stops being the deciding factor and the chip's bandwidth takes over.

Very long contexts

llama.cpp

llama.cpp processes the prompt in chunks, so it starts responding sooner on huge inputs. It also runs on far more hardware and supports partial CPU offload for models that do not fit.

Everyday interactive chat

Either

On models that fit comfortably in memory the difference is real but not life-changing. Pick on ecosystem and convenience rather than raw speed.

One last catch: model availability lag. MLX needs models in its own safetensors-based format, and while the mlx-community on Hugging Face converts popular models quickly, GGUF for llama.cpp often appears the day a model drops. If you chase brand-new releases, GGUF is usually ready first.

Serving MLX with an OpenAI API#

MLX is a framework, not a server, so you put a serving layer on top. The main options, all OpenAI-compatible:

mlx-lm's built-in server (mlx_lm.server) - the official, no-frills choice from Apple's team. One model per process, and the base server has no continuous batching. Watch the default max_tokens of 512, which catches people out.
oMLX (jundot/omlx) - a macOS-native, menu-bar server that serves many models from one process, loading them on demand. Its standout feature is a tiered KV cache: hot blocks in RAM, cold blocks persisted to SSD, so a long context can be restored after a restart instead of recomputed. OpenAI and native Anthropic endpoints, plus embeddings and reranking.
LM Studio - a polished GUI built on mlx-lm, with a toggle between MLX and GGUF per model. Closed source, but the easiest on-ramp.
vllm-mlx and similar community servers - vLLM-style continuous batching on top of MLX. Worth knowing that vLLM proper does not run usefully on Apple Silicon; "vLLM on a Mac" almost always means one of these MLX-based projects, not the real thing.

We go deeper on wiring these together, and on running several of them behind one endpoint, in our companion guide: running MLX behind Olla.

Where Apple is taking this#

MLX has become a first-class part of Apple's developer story. At WWDC 2025 Apple gave it dedicated sessions and shipped the Foundation Models framework, which exposes Apple's own on-device model to apps. At WWDC 2026 it went further: MLX got its own session track for local agentic AI, the mlx-lm server gained continuous batching, and you can now point Apple's LanguageModelSession API at open-weight MLX models from the mlx-community through a pluggable backend. Apple also previewed Core AI, positioned as the successor to Core ML for on-device inference.

Generation over generation, Apple Silicon is being shaped into hardware that runs language models well, and MLX is the open framework riding closest to the metal.

You do not have to bet on one backend

Whichever way the Mac inference landscape moves, you do not have to pick a single engine and rewrite your clients when it changes. Olla, our open-source LLM proxy, sits in front of oMLX, Ollama, LM Studio, llama.cpp and the rest, and gives your applications one endpoint regardless of what is running behind it - speaking both OpenAI and Anthropic, so OpenAI-shaped clients and Anthropic-shaped tools like Claude Code work against the same backends. As new MLX servers mature we add native support, so your application layer stays put while the backend world keeps moving.

Comparison at a glance#

	MLX (via mlx-lm / oMLX)	llama.cpp	Ollama
Hardware	Apple Silicon (Metal GPU)	Almost anything (CPU, Metal, CUDA, ROCm, Vulkan)	macOS, Linux, Windows
Model format	safetensors (MLX-converted)	GGUF	GGUF + safetensors (MLX since 0.19)
Token generation on Mac	Fastest for small and MoE models	Strong, slightly behind on small models	Uses MLX or llama.cpp under the hood
Long-context prefill	Full prefill first - slower TTFT	Chunked - faster TTFT on huge prompts	Depends on backend
Day-one model availability	Lags (needs conversion)	Usually immediate	Tracks both
Uses the Neural Engine?	No - GPU via Metal	No	No

MLX is Apple's own framework for Apple Silicon, built around unified memory so the CPU and GPU share one memory pool with no copying.
It runs on the GPU through Metal, not the Neural Engine. The ANE is for low-power fixed-graph tasks, not interactive LLM generation.
Token generation is bound by memory bandwidth. Check a Mac's bandwidth, not just its chip name - the M3 Pro is slower here than the M2 Pro.
The M5's GPU Neural Accelerators cut time-to-first-token by 3-4x, but sustained generation still tracks bandwidth.
MLX is generally faster than llama.cpp on small and MoE models, roughly even on large dense ones, and behind on very long contexts where chunked prefill wins.
Ollama 0.19 adopting MLX is the strongest signal yet that MLX is the default fast path on a Mac.