Does MLX run operations immediately?

No. MLX is lazy. When you write an expression it builds up a graph of operations and runs nothing until you actually need a result, at which point mx.eval (or printing, or converting to NumPy) triggers the computation. This lets MLX see a whole chunk of work at once and schedule it efficiently.

What does mx.compile do in MLX?

mx.compile traces a function's graph once and fuses adjacent operations together, so the GPU receives far fewer kernel launches. A forward pass that would otherwise be dozens or hundreds of separate dispatches can become a single command buffer, which cuts overhead and speeds the whole thing up. Compiled functions have to be pure: no Python branching on array values and no mx.eval inside.

What is set_wired_limit in MLX?

It tells Metal how much memory MLX may pin as wired, meaning resident and non-pageable, so the GPU always has the weights to hand. It commonly sits around 70 percent of the unified memory pool. mlx-lm sets a sensible value for you. Push a model past the wired limit and the OS starts paging and compressing weights, and throughput drops sharply.

Does MLX copy model weights to the GPU?

No. On Apple Silicon the CPU and GPU share one unified memory pool, so an MLX array is visible to both with no transfer step. There's no tensor.to(device) copy because there's nowhere to copy it to. You pick which device runs an operation and the data stays put.

Under the Hood of MLX: How a Model Runs on a Mac

You run a model with mlx-lm, the tokens come back quickly, and that's usually where curiosity stops. Between your Python and the GPU, MLX builds a graph of your whole forward pass, waits until the last possible moment to run any of it, fuses it into a handful of kernels, and serves it from memory the GPU never has to copy. This is a look at that machinery, and why each piece is there.

If you want the higher-level picture of why MLX is fast on a Mac, our guide to how MLX runs LLMs on Apple Silicon covers it, and the companion piece on quantisation covers how the weights get small. This one goes a layer down, into how the framework actually turns your Python into work on the GPU. None of it's required reading to use MLX. It's the kind of thing that makes the performance make sense, and that helps when something's slower than you expected.

Nothing runs until you ask

The first surprise when you poke at MLX is that your code doesn't do anything when you think it does. Write b = a * 2 and no multiplication has happened. MLX has recorded that you want a times two and handed you back an array that knows how to produce the result later. This is lazy evaluation, and it's the foundation everything else sits on.

import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])
b = a * 2          # nothing has run yet
c = b + 1          # still nothing
mx.eval(c)         # now the GPU does the work

The computation only fires when you call mx.eval, or when you do something that needs the actual numbers, like printing them or converting to NumPy. Up to that point MLX is collecting operations into a graph.

Why bother? Because seeing the whole graph at once is what lets MLX be clever. If it ran each operation the moment you wrote it, it'd be firing one tiny GPU job after another, paying the launch overhead every time and never getting a chance to combine them. By waiting, it gets to look at the entire chunk of work and plan.
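
To make the idea concrete, here's a toy lazy value in plain Python. It isn't how MLX is implemented (the Lazy class and its runs counter are made up for this sketch), but it shows the same behaviour: arithmetic only records a node, and nothing executes until something asks for the result.

```python
class Lazy:
    """Toy lazy value: records an operation instead of running it."""

    runs = 0  # how many recorded operations have actually executed

    def __init__(self, fn, *parents):
        self.fn = fn
        self.parents = parents
        self.value = None

    @classmethod
    def const(cls, data):
        # A leaf node that already holds its data
        node = cls(None)
        node.value = data
        return node

    def __mul__(self, k):
        return Lazy(lambda xs: [x * k for x in xs], self)

    def __add__(self, k):
        return Lazy(lambda xs: [x + k for x in xs], self)

    def eval(self):
        # Walk back through the graph, running each node at most once
        if self.value is None:
            args = [p.eval() for p in self.parents]
            self.value = self.fn(*args)
            Lazy.runs += 1
        return self.value


a = Lazy.const([1.0, 2.0, 3.0])
b = a * 2                  # recorded, not run
c = b + 1                  # recorded, not run
runs_before = Lazy.runs    # still zero at this point
result = c.eval()          # now the multiply and the add both run
```

MLX's real graph carries shapes, dtypes and devices as well, but the shape of the idea is the same: b and c are recipes until an eval walks them.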

Compiling the graph

Lazy evaluation sets things up. mx.compile takes it a step further. You wrap a function, MLX traces its graph once on the first call, and then fuses the operations so the GPU gets far fewer kernels to launch.

import mlx.core as mx

@mx.compile
def step(x, w):
    h = mx.matmul(x, w)
    h = h * mx.sigmoid(h)   # SiLU activation: h * sigmoid(h)
    return h

The difference is large. A transformer layer is a long chain of small operations: a matmul, an add, a normalisation, another matmul, an activation, and so on. Run them eagerly and each is its own GPU dispatch, each with its own overhead. Fuse them and a whole stretch of that chain becomes a single kernel, and the entire forward pass can go to the GPU as one command buffer.

Without compile, each operation is its own GPU dispatch. mx.compile traces the graph once and fuses the operations, so the whole forward pass goes to the GPU as a single command buffer.

We put a number on it on our own M2 Max. A chain of small elementwise ops, the kind of glue that sits between the matmuls, ran at about 4.7 ms eagerly and 1.7 ms once compiled, roughly 2.8 times faster, because the fused version hands the GPU one kernel instead of a dozen separate dispatches. One small thing to note: our first attempt wrapped the same chain around two large matmuls and saw no speedup at all, because the matmuls dwarf the launch overhead they sit between. Fusion pays off on the many-small-ops glue, not on matmul-bound code. A real forward pass has plenty of both, and on a Mac, where you're already close to the memory-bandwidth ceiling, shaving the overhead around the useful work is a real part of what's left to win.
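
A rough cost model shows why those two experiments came out so differently. The launch and per-op timings below are invented for illustration, not measured; only the 4.7 ms and 1.7 ms figures come from our benchmark. The model also ignores the memory traffic that fusion saves on intermediates, so if anything it undersells the win.

```python
def eager_ms(n_ops, launch_ms, work_ms, matmul_ms=0.0):
    """Every small op pays its own launch overhead."""
    return matmul_ms + n_ops * (launch_ms + work_ms)


def fused_ms(n_ops, launch_ms, work_ms, matmul_ms=0.0):
    """The small ops share one launch; the matmul time is untouched."""
    return matmul_ms + launch_ms + n_ops * work_ms


# Hypothetical glue: 12 small ops, 0.03 ms per launch, 0.005 ms of work each
glue = dict(n_ops=12, launch_ms=0.03, work_ms=0.005)
glue_speedup = eager_ms(**glue) / fused_ms(**glue)

# The same glue wrapped around 10 ms of matmul (also hypothetical)
heavy_speedup = eager_ms(**glue, matmul_ms=10.0) / fused_ms(**glue, matmul_ms=10.0)

# Measured on our M2 Max: 4.7 ms eager, 1.7 ms compiled
measured_speedup = 4.7 / 1.7
```

With these made-up inputs the glue alone speeds up several times over, while the matmul-heavy version barely moves, which matches what we saw.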

There's a catch, the usual one with tracing. A compiled function has to be pure. No branching in Python on the actual values inside an array, because MLX traced one path and will reuse it. No mx.eval in the middle, because that breaks the graph it's trying to capture. If you've used torch.compile or JAX's jit, the rules will feel familiar.
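
The "traced one path" hazard is easy to reproduce with a toy tracer. This is a sketch of the general idea, not MLX's compiler: trace_once and Tracer are hypothetical names, and a real framework may reject code like this with an error rather than silently replaying it.

```python
def trace_once(fn):
    """Toy tracing 'compiler': record ops on the first call, replay them after."""
    tape = []
    traced = False

    class Tracer:
        def __init__(self, value):
            self.value = value

        def __mul__(self, k):
            tape.append(lambda v: v * k)
            return Tracer(self.value * k)

        def __add__(self, k):
            tape.append(lambda v: v + k)
            return Tracer(self.value + k)

    def run(x):
        nonlocal traced
        if not traced:
            traced = True
            return fn(Tracer(x)).value   # first call: record the ops
        for op in tape:                  # later calls: replay the recording
            x = op(x)
        return x

    return run


@trace_once
def impure(x):
    # Python-level branch on the value: decided once, while tracing
    if x.value > 0:
        return x * 2
    return x + 100


first = impure(3)     # traces the "positive" path
second = impure(-3)   # replays that path; the "+ 100" branch is never seen
```

In MLX the usual way out is to keep the choice inside the graph, for example with mx.where, so both outcomes are part of what gets traced.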

Where the weights live

Now the part that's specific to Apple Silicon, and the reason MLX exists at all. On a PC with a discrete GPU, your model weights live in the GPU's own memory, and getting them there means copying across the PCIe bus. On a Mac there's no bus to cross. The CPU and GPU share one pool of memory on the same package.

On Apple Silicon the CPU, GPU and Neural Engine sit on one package and share the same unified memory, so model weights never have to be copied between them.

So an MLX array is just memory, visible to whichever device you ask to compute on it. There's no tensor.to("gpu") step because there's nowhere else to put it. You decide which device runs an operation, and the data stays exactly where it is. For inference that means the moment the weights are loaded, the GPU can read them straight away.

Pinning the weights

Sharing memory is the upside. The catch is that the operating system still treats that memory like any other, which means it can decide to page parts of it out to disk when things get tight. For a model you're reading end to end on every token, that's the difference between keeping up and stalling.

MLX deals with this by asking Metal to keep memory wired: resident and non-pageable, so the weights can't be swapped out from under the GPU. There's a limit on how much you're allowed to wire, and MLX exposes it as set_wired_limit. It usually sits around 70 percent of the unified pool, leaving the rest for the OS and everything else.

MLX pins the weights in wired memory so the GPU always has them. The wired limit, around 70 percent of the pool by default, is the line you do not want to cross, because past it the OS pages weights back out.

mlx-lm picks a sensible wired limit for you, so most of the time you never touch this. It matters when you push a model right up to the edge of memory. Go past the wired limit and macOS starts paging and compressing the weights to make room, and the moment that happens your tokens-per-second falls off a cliff. We've watched a Mac with a model sized a little too large spend more time compressing memory than generating, and the fix is almost always the same: drop to a smaller quant or a smaller model so the working set sits comfortably under the wired limit, with headroom for the KV cache to grow. The on-disk size is a floor, not the whole bill: on our M2 Max the 8B 4-bit model that's 4.6 GB on disk peaked at 5.6 GB in use, once the cache and activations were counted.
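
The sizing check is simple arithmetic. A minimal sketch, assuming the roughly-70-percent figure above and two hypothetical machines (the real limit on your Mac may differ); the 5.6 GB peak is the one we measured, and the function names are made up here.

```python
def wired_budget_gb(total_gb, fraction=0.70):
    """Approximate wired limit, using the ~70 percent rule of thumb."""
    return total_gb * fraction


def headroom_gb(total_gb, peak_gb, fraction=0.70):
    """Room left under the wired limit; negative means expect paging."""
    return wired_budget_gb(total_gb, fraction) - peak_gb


# Hypothetical 32 GB Mac running the 8B 4-bit model (5.6 GB peak in use)
comfortable = headroom_gb(32, 5.6)

# Hypothetical 16 GB Mac with a model that peaks at 12 GB
too_tight = headroom_gb(16, 12.0)
```

The first case leaves plenty of room for the cache to grow. The second is already over budget before a single token is generated, which is the situation where a smaller quant is the fix.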

Inside the matmul

The last stop is the kernel that actually does the multiplying. When the model is quantised, the work runs in MLX's quantised matrix-multiply kernels, and which one depends on what phase you're in. Generating a token at a time is a matrix-vector multiply (QMV), because there's one vector of activations going through the weights. Processing a whole prompt at once is a matrix-matrix multiply (QMM), because there's a stack of them. Same weights, different shape of work, different kernel.
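
A back-of-envelope count shows why the two phases are such different work. This sketch assumes 4-bit weights and ignores the scales, biases and activation traffic, so treat the numbers as rough; flops_per_weight_byte is a name made up for the example.

```python
def flops_per_weight_byte(n_tokens, bits=4):
    """Arithmetic done per byte of quantised weights read from memory."""
    flops_per_weight = 2 * n_tokens   # one multiply and one add per token
    bytes_per_weight = bits / 8       # 4-bit weights: half a byte each
    return flops_per_weight / bytes_per_weight


decode = flops_per_weight_byte(n_tokens=1)      # matrix-vector work (QMV)
prefill = flops_per_weight_byte(n_tokens=512)   # matrix-matrix work (QMM)
```

Decode does only a handful of operations for every byte of weights it reads, so it leans on memory bandwidth. Prefill reuses each weight across the whole prompt, so it has far more arithmetic per byte to keep the GPU busy.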

One detail inside those kernels matters if you care about output quality. They dequantise the weights and accumulate the sums in 32-bit float, even when the model is fp16 or bf16, so a long running sum of many small products keeps its precision the whole way through. Skip that and a long sum in lower precision drops its low bits; you'd see it as slightly worse output and never guess the kernel was the reason. The kernels themselves are written in Metal and lean on the GPU's simdgroup matrix instructions, and on the M5 they reach the new Neural Accelerators through Metal 4, which is a story we tell in our roofline piece.
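
Python's struct module can round to half precision ("e") and single precision ("f"), which is enough to watch a low-precision running sum go wrong. This is a demonstration of the effect, not of MLX's kernels, and the input (ten thousand copies of 0.001) is made up.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def running_sum(values, round_fn):
    """Accumulate, rounding the accumulator after every addition."""
    acc = 0.0
    for v in values:
        acc = round_fn(acc + v)
    return acc


# Many small fp16 "products", as a stand-in for one long dot product
products = [to_fp16(0.001)] * 10_000

sum16 = running_sum(products, to_fp16)   # accumulator kept in fp16
sum32 = running_sum(products, to_fp32)   # accumulator kept in fp32
```

The half-precision total stops growing once each new term is smaller than half the gap between neighbouring fp16 values, so it ends up well short. The 32-bit accumulator stays close to the true total of about 10.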

How MLX differs from CUDA

If you've written CUDA, the lazy graph and the no-copy memory both go against how you're used to working. Here's the same toy computation in both, b = a * 2 then c = b + 1. The maths is the same; everything around it changes.

On a discrete NVIDIA GPU, host RAM and GPU VRAM are separate pools. The inputs start in host RAM, so you allocate device memory, copy them across PCIe, run the kernel, then copy the results back:

#include <cuda_runtime.h>

__global__ void transform(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // a * 2, then + 1
    if (i < n) out[i] = in[i] * 2.0f + 1.0f;
}

int main() {
    const int N = 4;
    const size_t bytes = N * sizeof(float);
    float h_in[] = {1.0f, 2.0f, 3.0f, 4.0f}, h_out[N];

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    // copy in, across PCIe
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // launch now, returns straight away
    transform<<<1, N>>>(d_in, d_out, N);

    // copy back
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
}

The same thing in MLX is the code we started with, and there's no cudaMalloc, no cudaMemcpy, no <<< >>>:

import mlx.core as mx

# already in the shared pool
a = mx.array([1.0, 2.0, 3.0, 4.0])
# nothing runs
b = a * 2
# still nothing
c = b + 1
# now the GPU does it, with the data left where it is
mx.eval(c)

The biggest difference is memory. On the Mac the array was never anywhere the GPU couldn't reach, so there's nothing to copy and no .to(device) to call, because there's no device to move it to. CUDA can paper over the copy for you with managed memory (cudaMallocManaged hands you a single pointer), but on a discrete GPU that's the runtime migrating pages back and forth on demand. The copy still happens; the driver just does it for you.

On a discrete GPU the weights cross the bus into VRAM before the GPU can touch them. On Apple Silicon the CPU and GPU read the same memory, so there is nothing to copy.

The exception is NVIDIA's Grace Blackwell superchips, the GB200 and GB300, which wire a Grace CPU to the GPU over a coherent NVLink-C2C link, the same approach the earlier Grace Hopper GH200 took. On those the GPU reads CPU memory directly, no copy. That coherence comes from the Grace-plus-GPU package itself: put a standalone B200 in a server with an x86 host and you're back to separate memory and a PCIe copy, like any other discrete card. The NVIDIA GPU in your laptop or desktop still copies across PCIe.

The other big difference is when work runs. CUDA is eager: each <<< >>> is a kernel you launch right now, and you pay a little host-side overhead every time. CUDA Graphs let you capture a run of launches and replay them as one to win that overhead back. That batches the launches; the kernels still run separately and still push their intermediates through memory. MLX defers instead: it collects b and c into a graph and runs nothing until mx.eval, and mx.compile goes one step further, fusing that chain into a single kernel so b never lands in memory at all.

Why this shape of design keeps paying off

Lazy evaluation, fusion, unified memory and wired residency all come from one instinct: do as little work around the work as possible, and never move data you don't have to. That's exactly what matters when you're bandwidth-bound, which on a Mac you almost always are. If you run a stack of these backends and want one stable endpoint over the lot, Olla sits in front and routes for you, as we cover in running MLX behind Olla.

When Mike & The Mechanic matters

In practice, you don't need to think about any of this until performance changes unexpectedly.

The most common cause we've encountered is the KV cache gradually consuming the available memory headroom. Once that happens, macOS starts paging model weights and throughput drops quickly. Knowing how MLX manages model weights and the KV cache makes these issues much easier to diagnose.
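
How quickly the cache eats headroom is easy to estimate. A sketch with assumed dimensions (32 layers, 8 KV heads, head size 128, a 16-bit cache), which are illustrative rather than read from any particular model; the function names are hypothetical too.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """One key vector and one value vector per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def tokens_until_full(headroom_bytes, per_token_bytes):
    """How many tokens of context fit before the headroom is gone."""
    return headroom_bytes // per_token_bytes


# Assumed model shape, for illustration only
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

# With 1 GiB of headroom left under the wired limit
max_tokens = tokens_until_full(1024**3, per_token)
```

Under those assumptions each token costs 128 KiB of cache, so a gibibyte of headroom is gone after roughly eight thousand tokens. A long chat or a large batch can get there sooner than you'd expect.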

Fortunately, MLX handles almost all of this for you. A basic understanding of what's happening under the hood is usually enough to explain why a model has slowed down.

The benchmark used throughout this article is available at github.com/tensorfoundrylabs/benchmark-mlx if you'd like to run the same tests on your own Mac.

MLX is lazy. Operations build a graph and run nothing until mx.eval (or a print, or a NumPy conversion) asks for a result.
mx.compile fuses that graph, turning a forward pass of many small dispatches into a single command buffer, which is where a lot of the speed comes from.
Unified memory means no copy. An MLX array is visible to CPU and GPU alike, so weights are usable the moment they're loaded.
Wired residency pins the weights up to a limit around 70 percent of memory. Cross it and the OS pages weights out, and throughput collapses.
The quantised kernels accumulate in fp32 for quality, and switch between matrix-vector (decode) and matrix-matrix (prefill) work.
Knowing the machinery is a debugging tool, even if you never call any of these APIs directly.