Under the Hood of MLX: How a Model Runs on a Mac
If you want the higher-level picture of why MLX is fast on a Mac, our guide to how MLX runs LLMs on Apple Silicon covers it, and the companion piece on quantisation covers how the weights get small. This one goes a layer down, into how the framework actually turns your Python into work on the GPU. None of it is required reading to use MLX. It is the kind of thing that makes the performance make sense, and that helps when something is slower than you expected.
Nothing runs until you ask
The first surprise when you poke at MLX is that your code does not do anything when you think it does. Write b = a * 2 and no multiplication has happened. MLX has recorded that you want a times two and handed you back an array that knows how to produce the result later. This is lazy evaluation, and it is the foundation everything else sits on.
import mlx.core as mx
a = mx.array([1.0, 2.0, 3.0])
b = a * 2 # nothing has run yet
c = b + 1 # still nothing
mx.eval(c) # now the GPU does the work
The computation only fires when you call mx.eval, or when you do something that needs the actual numbers, like printing them or converting to NumPy. Up to that point MLX is collecting operations into a graph.
Why bother? Because seeing the whole graph at once is what lets MLX be clever. If it ran each operation the moment you wrote it, it would be firing one tiny GPU job after another, paying the launch overhead every time and never getting a chance to combine them. By waiting, it gets to look at the entire chunk of work and plan.
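To make the mechanism concrete, here is a toy sketch of deferred evaluation in plain Python. It is not how MLX is implemented: the LazyArray class, its constant helper and its evaluate method are hypothetical names invented for the illustration. Each operation returns a node that remembers its recipe, and nothing is computed until evaluate is called.

```python
class LazyArray:
    """Toy lazy value: remembers how to compute itself, computes only on demand."""

    evaluations = 0  # class-wide count of operations actually executed

    def __init__(self, fn, parents=()):
        self.fn = fn            # recipe that turns the parent's values into ours
        self.parents = parents  # upstream nodes in the graph
        self.cached = None      # filled in on first evaluation

    @classmethod
    def constant(cls, values):
        node = cls(None)
        node.cached = list(values)  # leaf nodes already hold their data
        return node

    def __mul__(self, k):
        return LazyArray(lambda xs: [x * k for x in xs], (self,))

    def __add__(self, k):
        return LazyArray(lambda xs: [x + k for x in xs], (self,))

    def evaluate(self):
        if self.cached is None:
            (parent,) = self.parents
            self.cached = self.fn(parent.evaluate())  # walk the graph upstream
            LazyArray.evaluations += 1
        return self.cached


a = LazyArray.constant([1.0, 2.0, 3.0])
b = a * 2                            # recorded, not run
c = b + 1                            # recorded, not run
ran_before = LazyArray.evaluations   # no operation has executed yet
result = c.evaluate()                # both operations run here
ran_after = LazyArray.evaluations
```

Because the whole chain is visible at the moment evaluate is called, a real framework has the chance to plan it as one piece of work rather than two.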
Compiling the graph
Lazy evaluation sets up the opportunity. mx.compile takes it. You wrap a function, MLX traces its graph once on the first call, and then fuses the operations so the GPU gets far fewer kernels to launch.
@mx.compile
def step(x, w):
    h = mx.matmul(x, w)
    h = h * mx.sigmoid(h)  # SiLU
    return h
The difference is large. A transformer layer is a long chain of small operations: a matmul, an add, a normalisation, another matmul, an activation, and so on. Run them eagerly and each is its own GPU dispatch, each with its own overhead. Fuse them and a whole stretch of that chain becomes a single kernel, and the entire forward pass can go to the GPU as one command buffer.
We put a number on it on our own M2 Max. A chain of small elementwise ops, the kind of glue that sits between the matmuls, ran at about 4.7 ms eagerly and 1.7 ms once compiled, roughly 2.8 times faster, because the fused version hands the GPU one kernel instead of a dozen separate dispatches. There is a caveat worth being honest about: our first attempt wrapped the same chain around two large matmuls and saw no speedup at all, because the matmuls dwarf the launch overhead they sit between. Fusion pays off on the many-small-ops glue, not on matmul-bound code. A real forward pass has plenty of both, and on a Mac, where you are already close to the memory-bandwidth ceiling, shaving the overhead around the useful work is a real part of what is left to win.
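A back-of-the-envelope cost model shows why the two experiments came out so differently. The 4.7 ms and 1.7 ms figures are the measurements quoted above; the launch overhead, per-op work and matmul cost below are hypothetical values chosen only to show the shape of the effect, and the model itself (one fixed overhead per kernel launch) is a simplification.

```python
def chain_time_ms(n_ops, launch_ms, work_ms_per_op, fused):
    """Time for a chain of ops when every kernel launch pays a fixed overhead."""
    launches = 1 if fused else n_ops
    return launches * launch_ms + n_ops * work_ms_per_op


# Measured figures quoted in the text (M2 Max, chain of small elementwise ops).
measured_speedup = 4.7 / 1.7

# Hypothetical inputs, not measurements.
N_OPS = 12          # "a dozen separate dispatches"
LAUNCH_MS = 0.3     # assumed fixed cost per kernel launch
GLUE_MS = 0.1       # assumed useful work per small op
MATMUL_MS = 100.0   # assumed cost of two large matmuls, untouched by fusion here

glue_eager = chain_time_ms(N_OPS, LAUNCH_MS, GLUE_MS, fused=False)
glue_fused = chain_time_ms(N_OPS, LAUNCH_MS, GLUE_MS, fused=True)
glue_speedup = glue_eager / glue_fused

# Same glue wrapped around the matmuls: the saving is identical in
# milliseconds but vanishes as a fraction of the total.
bound_speedup = (glue_eager + MATMUL_MS) / (glue_fused + MATMUL_MS)

print(f"measured glue speedup: {measured_speedup:.1f}x")
print(f"modelled glue speedup: {glue_speedup:.1f}x")
print(f"modelled matmul-bound speedup: {bound_speedup:.2f}x")
```

The absolute saving is the same in both cases. What changes is how much other work it is being compared against.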
There is a catch, and it is the usual one with tracing. A compiled function has to be pure. No branching in Python on the actual values inside an array, because MLX traced one path and will reuse it. No mx.eval in the middle, because that breaks the graph it is trying to capture. If you have used torch.compile or JAX's jit, the rules will feel familiar, and so will the payoff.
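The purity rule is easier to see in a toy. The sketch below is plain Python and every name in it (Sym, trace_once, scale, relu) is hypothetical; it mimics trace-and-replay in the simplest way we could think of and is not MLX's machinery, which may behave differently in detail. The point is only that a Python branch is decided once, at trace time, while a branch expressed as an operation is recorded and so holds for every input.

```python
class Sym:
    """Placeholder passed to the function while tracing; records ops on a tape."""

    def __init__(self, tape, example):
        self.tape = tape        # shared list of recorded operations
        self.example = example  # concrete value seen on the first call only

    def _record(self, op):
        self.tape.append(op)
        return Sym(self.tape, op(self.example))

    def scale(self, k):
        return self._record(lambda v: v * k)

    def relu(self):
        return self._record(lambda v: max(v, 0.0))


def trace_once(fn):
    """Toy tracing compiler: trace on the first call, replay the tape afterwards."""
    tape = []

    def wrapper(x):
        if not tape:
            fn(Sym(tape, x))    # first call: run the Python once to fill the tape
        for op in tape:         # every call: replay recorded ops, skip the Python
            x = op(x)
        return x

    return wrapper


@trace_once
def impure(x):
    # Python branch on a traced value: only the path taken while tracing is kept.
    if x.example > 0:
        return x.scale(2.0)
    return x.scale(0.0)


@trace_once
def pure(x):
    # The same intent written as recorded ops.
    return x.relu().scale(2.0)


impure_first = impure(3.0)    # traced with a positive input
impure_second = impure(-3.0)  # replays the positive path, so the result is wrong
pure_first = pure(3.0)
pure_second = pure(-3.0)      # the clamp is part of the tape, so this is correct
```

In this toy the impure version silently returns -6.0 for the second call where 0.0 was intended. Writing the condition as an operation on the array, rather than as a Python if, is what keeps a traced function correct.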
Where the weights live
Now the part that is specific to Apple Silicon, and the reason MLX exists at all. On a PC with a discrete GPU, your model weights live in the GPU's own memory, and getting them there means copying across the PCIe bus. On a Mac there is no bus to cross. The CPU and GPU share one pool of memory on the same package.
So an MLX array is just memory, visible to whichever device you ask to compute on it. There is no tensor.to("gpu") step because there is nowhere else to put it. You decide which device runs an operation, and the data stays exactly where it is. For inference that means the moment the weights are loaded, the GPU can read them, full stop.
Pinning the weights
Sharing memory is the upside. The catch is that the operating system still treats that memory like any other, which means it can decide to page parts of it out to disk when things get tight. For a model you are reading end to end on every token, that is the difference between fast and unusable.
MLX deals with this by asking Metal to keep memory wired: resident and non-pageable, so the weights cannot be swapped out from under the GPU. There is a limit on how much you are allowed to wire, and MLX exposes it as set_wired_limit. It usually sits around 70 percent of the unified pool, leaving the rest for the OS and everything else.
mlx-lm picks a sensible wired limit for you, so most of the time you never touch this. It matters when you push a model right up to the edge of memory. Go past the wired limit and macOS starts paging and compressing the weights to make room, and the moment that happens your tokens-per-second falls off a cliff. We have watched a Mac with a model sized a little too large spend more time compressing memory than generating, and the fix is almost always the same: drop to a smaller quant or a smaller model so the working set sits comfortably under the wired limit, with headroom for the KV cache to grow. The on-disk size is a floor, not the whole bill: on our M2 Max the 8B 4-bit model that is 4.6GB on disk peaked at 5.6GB in use, once the cache and activations were counted.
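The sizing rule reduces to a few lines of arithmetic. The sketch below uses the 4.6 GB and 5.6 GB figures from above and the rough 70 percent wired fraction; the machine sizes are hypothetical, and the helper names are ours rather than anything MLX provides. The real limit on a given Mac may differ, so treat the output as a first estimate.

```python
def wired_limit_gb(total_memory_gb, wired_fraction=0.70):
    """Approximate wired limit, using the rough 70 percent figure from the text."""
    return total_memory_gb * wired_fraction


def headroom_gb(total_memory_gb, peak_use_gb, wired_fraction=0.70):
    """Memory left under the wired limit after the model's peak working set."""
    return wired_limit_gb(total_memory_gb, wired_fraction) - peak_use_gb


ON_DISK_GB = 4.6  # 8B 4-bit model on disk (from the text)
PEAK_GB = 5.6     # peak in use with cache and activations (from the text)

overhead_ratio = PEAK_GB / ON_DISK_GB  # how far the real bill exceeds the floor

# Hypothetical machine sizes, for illustration only.
for total in (8, 16, 32):
    spare = headroom_gb(total, PEAK_GB)
    print(f"{total} GB Mac: limit {wired_limit_gb(total):.1f} GB, "
          f"headroom {spare:.1f} GB")
```

On these assumptions an 8 GB machine has essentially no headroom for this model even though the file is only 4.6 GB, which is the "floor, not the whole bill" point in numbers.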
Inside the matmul
The last stop is the kernel that actually does the multiplying. When the model is quantised, the work runs in MLX's quantised matrix-multiply kernels, and which one depends on what phase you are in. Generating a token at a time is a matrix-vector multiply (QMV), because there is one vector of activations going through the weights. Processing a whole prompt at once is a matrix-matrix multiply (QMM), because there is a stack of them. Same weights, different shape of work, different kernel.
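One way to see why the two phases want different kernels is to count arithmetic per byte of weights read. The sketch below assumes 4-bit weights and a hypothetical 4096 by 4096 layer, and it ignores the per-group scales and biases as well as activation traffic, so the numbers are a rough guide rather than a measurement.

```python
def weight_bytes(rows, cols, bits=4):
    """Bytes of packed weights, ignoring per-group scales and biases."""
    return rows * cols * bits / 8


def flops(rows, cols, tokens):
    """One multiply and one add per weight, per token."""
    return 2 * rows * cols * tokens


def flops_per_weight_byte(rows, cols, tokens, bits=4):
    return flops(rows, cols, tokens) / weight_bytes(rows, cols, bits)


ROWS, COLS = 4096, 4096  # hypothetical layer shape

decode = flops_per_weight_byte(ROWS, COLS, tokens=1)     # matrix-vector work
prefill = flops_per_weight_byte(ROWS, COLS, tokens=512)  # matrix-matrix work

print(f"decode:  {decode:.0f} FLOPs per weight byte")
print(f"prefill: {prefill:.0f} FLOPs per weight byte")
```

Every weight is read once in both cases, but prefill reuses it across the whole stack of tokens. Decode gets almost no reuse, which is why it tends to be limited by how fast the weights can be streamed from memory.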
One detail inside those kernels is worth knowing if you care about output quality. They dequantise the weights and accumulate the sums in 32-bit float, even when the model is fp16 or bf16, so a long running sum of many small products keeps its precision the whole way through. Skip that and a long sum in lower precision quietly drops its low bits; you would see it as slightly worse output and never guess the kernel was the reason. The kernels themselves are written in Metal and lean on the GPU's simdgroup matrix instructions, and on the M5 they reach the new Neural Accelerators through Metal 4, which is a story we tell in our roofline piece.
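The precision point can be reproduced without a GPU. The sketch below is plain Python, not the Metal kernel: it rounds the running sum to half or single precision after every add, using the standard library's struct module to do the rounding. The input values and the length of the sum are arbitrary choices for the demonstration.

```python
import struct


def fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(weights, activations, accumulate):
    """Dot product whose running sum is rounded by `accumulate` after every add."""
    total = 0.0
    for w, a in zip(weights, activations):
        total = accumulate(total + w * a)
    return total


N = 10_000
weights = [fp16(0.01)] * N      # inputs stored in half precision
activations = [fp16(0.1)] * N

low = dot(weights, activations, fp16)         # accumulate in fp16
high = dot(weights, activations, fp32)        # accumulate in fp32
reference = dot(weights, activations, float)  # Python's double precision

print(f"fp16 accumulator: {low}")
print(f"fp32 accumulator: {high}")
print(f"reference:        {reference}")
```

The half-precision accumulator stops growing well short of the true sum: once the running total is large enough that one product is smaller than half the gap between neighbouring fp16 values, adding it changes nothing. The single-precision accumulator stays close to the reference on the same inputs.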
Why this shape of design keeps paying off
Lazy evaluation, fusion, unified memory and wired residency all come from one instinct: do as little work around the work as possible, and never move data you do not have to. That is exactly what matters when you are bandwidth-bound, which on a Mac you almost always are. If you run a stack of these backends and want one stable endpoint over the lot, Olla sits in front and routes for you, as we cover in running MLX behind Olla.
When the machinery matters
You can use MLX without any of this in your head. We keep it there because it tells us where to look when a run is off. The one that has bitten us most: a model that ran fine yesterday crawling today. Nine times out of ten the conversation had grown the KV cache past the headroom we left, tipped over the wired limit, and the Mac was busy paging weights instead of multiplying them. Knowing the weights are pinned, and that the cache grows into the same pool, turned a baffling slowdown into a one-line fix.
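How quickly a conversation eats its headroom is also simple arithmetic. The sketch below assumes an unquantised fp16 cache and a model shape of 32 layers, 8 key-value heads and a head dimension of 128, which is typical of some 8B models but is our assumption and not a figure from the runs described here. The 2 GiB of headroom is hypothetical too.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Cache growth per token: keys and values (the factor of 2) for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def tokens_until_limit(headroom_bytes, per_token_bytes):
    """Whole tokens of context that fit in the remaining headroom."""
    return headroom_bytes // per_token_bytes


# Assumed model shape and cache precision (not taken from the text).
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

headroom = 2 * 1024**3  # hypothetical: 2 GiB left under the wired limit
max_tokens = tokens_until_limit(headroom, per_token)

print(f"{per_token // 1024} KiB of cache per token")
print(f"{max_tokens} tokens before the headroom is gone")
```

With these assumptions each token costs 128 KiB, so 2 GiB of headroom is gone after 16,384 tokens of context. A long chat gets there sooner than you might expect.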
The framework does a lot on your behalf. Knowing roughly what it does, and why, is usually enough to find the knob that moved. If you want to poke at any of this on your own Mac, the small benchmark we used for the numbers here is open source at github.com/tensorfoundrylabs/benchmark-mlx.
Key takeaways
- MLX is lazy. Operations build a graph and run nothing until mx.eval (or a print, or a NumPy conversion) asks for a result.
- mx.compile fuses that graph, turning a forward pass of many small dispatches into a single command buffer, which is where a lot of the speed comes from.
- Unified memory means no copy. An MLX array is visible to CPU and GPU alike, so weights are usable the moment they load.
- Wired residency pins the weights up to a limit around 70 percent of memory. Cross it and the OS pages weights out, and throughput collapses.
- The quantised kernels accumulate in fp32 for quality, and switch between matrix-vector (decode) and matrix-matrix (prefill) work.
- Knowing the machinery is a debugging tool, even if you never call any of these APIs directly.
Further reading
- How MLX runs LLMs on Apple Silicon - the higher-level why
- MLX vs GGUF quantisation - how the weights get small
- The bandwidth wall: a roofline for local LLMs - why decode is bound by memory
- MLX lazy evaluation docs - the official explanation
- MLX compile docs - fusion and the purity rules
- Running MLX behind Olla - one endpoint over your Mac inference stack