The Bandwidth Wall: A Roofline for Local LLMs
The picture is the roofline, a performance model borrowed from high-performance computing. It is worth the few minutes it takes, because it predicts decode speed to within a rough half and tells you which upgrades will help before you spend on them. If you have read our guide to how MLX runs LLMs on Apple Silicon, you have met the headline: token generation is bound by memory bandwidth. Here is the machinery under it.
Compute versus bandwidth
Every piece of hardware has two limits that matter here. One is how fast it can do arithmetic, measured in floating-point operations per second. The other is how fast it can move data in and out of memory, measured in bytes per second. For most of computing history the arithmetic was the bottleneck, so that is the number everyone quotes. For LLM inference, on the machine you own, it is usually the other one.
The thing that decides which limit you hit is a ratio called arithmetic intensity: how many calculations you do for every byte you read. Do a lot of maths with each byte and the arithmetic is your limit. Read a lot of bytes and do little with them, and memory bandwidth is your limit. That single ratio is the x-axis of the roofline.
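As a minimal sketch, the ratio is one division. The function name and the two workloads below are made up for illustration; they are not measurements of any real model.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Floating-point operations performed per byte read from memory."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


# Illustrative workloads (hypothetical numbers, not measurements).
streaming = arithmetic_intensity(flops=2e9, bytes_moved=2e9)   # little maths per byte
dense = arithmetic_intensity(flops=2e12, bytes_moved=2e9)      # a lot of maths per byte

print(f"streaming: {streaming:.0f} FLOPs/byte")
print(f"dense:     {dense:.0f} FLOPs/byte")
```

The first workload sits far to the left on the x-axis, the second far to the right.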
The roofline
Plot achievable throughput against arithmetic intensity and you get a shape like a roof. On the left, where intensity is low, throughput climbs along a slope set by memory bandwidth: every extra calculation per byte buys you more, because you are limited by how fast bytes arrive. At some point the line hits the flat compute ceiling, where you are doing all the arithmetic the chip can manage and reading bytes is no longer the constraint. The corner where they meet is the ridge point.
The whole game is knowing which side of the ridge you are on. To the left, more compute power does nothing, because you are waiting on memory. To the right, faster memory does nothing, because you are waiting on the arithmetic. Spend money or effort on the limit you are not hitting and you get nothing back.
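The roof shape is just the lower of two lines, which a few lines of Python can sketch. The chip figures here (10 TFLOP/s and 400 GB/s) are hypothetical round numbers chosen to make the ridge easy to read, not a real spec sheet.

```python
def ridge_point(peak_flops: float, bandwidth_bytes_s: float) -> float:
    """Arithmetic intensity (FLOPs per byte) where the slope meets the ceiling."""
    return peak_flops / bandwidth_bytes_s


def attainable_flops(intensity: float, peak_flops: float, bandwidth_bytes_s: float) -> float:
    """Roofline: the lower of the bandwidth slope and the compute ceiling."""
    return min(peak_flops, bandwidth_bytes_s * intensity)


def bound_by(intensity: float, peak_flops: float, bandwidth_bytes_s: float) -> str:
    """Which limit a workload of the given intensity runs into."""
    ridge = ridge_point(peak_flops, bandwidth_bytes_s)
    return "bandwidth" if intensity < ridge else "compute"


# Hypothetical chip, chosen for round numbers (an assumption, not a real part).
PEAK_FLOPS = 10e12   # 10 TFLOP/s
BANDWIDTH = 400e9    # 400 GB/s

print("ridge point:", ridge_point(PEAK_FLOPS, BANDWIDTH), "FLOPs/byte")
for intensity in (1, 10, 100):
    limit = bound_by(intensity, PEAK_FLOPS, BANDWIDTH)
    rate = attainable_flops(intensity, PEAK_FLOPS, BANDWIDTH)
    print(f"intensity {intensity:>3}: {limit}-bound, {rate / 1e12:.1f} TFLOP/s attainable")
```

On this imaginary chip anything below 25 FLOPs per byte is waiting on memory, and raising `PEAK_FLOPS` changes nothing for it.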
Prefill and decode live in different places
A single request has two phases, and they sit on opposite sides of the ridge.
Prefill is the prompt-processing phase. It reads your whole prompt at once and pushes all those tokens through the model together, which is a big matrix-matrix multiply with plenty of arithmetic for every byte of weights it loads. High intensity. It sits up under the compute ceiling, so it is compute-bound, and more arithmetic horsepower makes it faster.
Decode is the generation phase, one token at a time. For each token, the model reads essentially all of its weights out of memory and does comparatively little maths with them before producing the next token. Low intensity. It sits down on the slope, firmly bandwidth-bound. The ceiling on how fast you can decode is roughly your memory bandwidth divided by the model's size in bytes, and no amount of extra compute changes that.
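A rough way to see why the two phases land so far apart is to count the work done per weight. The sketch below assumes weight traffic dominates (activations and the KV cache are ignored) and that each weight costs one multiply and one add per token; the 512-token prompt is an arbitrary example.

```python
def matmul_intensity(n_tokens: int, bytes_per_weight: float) -> float:
    """Approximate FLOPs per byte when n_tokens share one read of the weights.

    Simplifying assumptions: only weight bytes are counted, and each weight
    contributes one multiply and one add (2 FLOPs) per token.
    """
    flops_per_weight = 2 * n_tokens
    return flops_per_weight / bytes_per_weight


BF16_BYTES = 2.0

decode = matmul_intensity(n_tokens=1, bytes_per_weight=BF16_BYTES)     # one token per pass
prefill = matmul_intensity(n_tokens=512, bytes_per_weight=BF16_BYTES)  # whole prompt per pass

print(f"decode:  {decode:.0f} FLOP/byte")
print(f"prefill: {prefill:.0f} FLOPs/byte")
```

Under these assumptions decode does about one operation per byte, while a 512-token prefill does hundreds, which is the gap between the slope and the ceiling.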
This split explains why people say a new Mac did not feel faster. They upgraded compute and measured decode. Time-to-first-token, which prefill governs, and tokens-per-second, which decode governs, respond to completely different things, so a change that transforms one can leave the other flat.
The bandwidth ladder
If decode speed is roughly bandwidth divided by model size, then bandwidth is the spec to look at on any Mac you intend to generate tokens on. The catch is that the number you see quoted, the base chip's, is the floor. The tiers above it tell the real story.
The base chips, such as the M4 in a MacBook Air or the M5 in the entry MacBook Pro, sit between 68 and 153 GB/s depending on the generation. Step up to Pro and you roughly double it, to Max and you double it again, and the Ultra in a Mac Studio lands near 800 GB/s. That is more than a ten-fold spread across the line-up, all generating tokens from the same kind of model. So when someone says a Mac is fast or slow at LLMs, the tier matters more than the generation: an M1 Max at 400 GB/s out-generates an M3 Pro at 150 GB/s, two years newer.
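To put the ladder into tokens per second, divide each bandwidth figure by the size of the model. The sketch below uses the bandwidth numbers quoted above and assumes a 4.6 GB model (an 8B model at 4-bit); the results are ceilings, not speeds you should expect to measure.

```python
MODEL_GB = 4.6  # assumed model size: an 8B model quantised to 4-bit

# Bandwidth figures as quoted in the text, in GB/s.
bandwidth_gb_s = {
    "base (low end)": 68,
    "base (high end)": 153,
    "M3 Pro": 150,
    "M1 Max": 400,
    "Ultra (approx.)": 800,
}


def decode_ceiling(bandwidth: float, model_gb: float) -> float:
    """Upper bound on tokens per second: one full read of the weights per token."""
    return bandwidth / model_gb


for name, bw in bandwidth_gb_s.items():
    ceiling = decode_ceiling(bw, MODEL_GB)
    print(f"{name:16s} {bw:4d} GB/s -> at most {ceiling:6.1f} tok/s")
```

The older M1 Max ends up with a ceiling well over twice that of the newer M3 Pro, purely from the bandwidth column.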
You can do the sum, then check it against a real machine. Our M2 Max has 400 GB/s of bandwidth. An 8B model in 4-bit is 4.6 GB on disk, so the ceiling is about 400 divided by 4.6, roughly 87 tokens per second. We measured 64.9, which is 75 percent of the theoretical limit. That fraction is normal, and it is the whole point: the ceiling is set by bandwidth, and you only ever get part of it.
The clean way to see the roofline is to hold the model fixed and change only its size. So we ran the same Qwen3-8B at three bit widths on the same Mac:
| Qwen3-8B | On disk | Decode | Theoretical | Of ceiling |
|---|---|---|---|---|
| 4-bit | 4.6 GB | 64.9 tok/s | 87 tok/s | 75% |
| 8-bit | 8.7 GB | 40.0 tok/s | 46 tok/s | 87% |
| bf16 | 16.4 GB | 22.6 tok/s | 24 tok/s | 93% |
Decode falls almost exactly in step with the bytes. Double the size, roughly halve the speed, which is the roofline working as advertised. The surprise is the last column: efficiency climbs as the model gets heavier, from 75 percent at 4-bit to 93 percent at bf16. The bigger the per-token read, the more it swamps the fixed overheads, so a heavy model runs closer to its bandwidth ceiling than a light one does. A 3B 4-bit model on the same Mac hit 145.9 tok/s, but only 66 percent of its ceiling, for the same reason in reverse.
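The last two columns of the table can be recomputed from the first two. This sketch uses only the sizes and decode speeds reported above plus the 400 GB/s figure for the M2 Max; the function name is ours.

```python
BANDWIDTH_GB_S = 400.0  # M2 Max memory bandwidth, as quoted in the text

# (label, size on disk in GB, measured decode in tok/s), copied from the table.
runs = [
    ("4-bit", 4.6, 64.9),
    ("8-bit", 8.7, 40.0),
    ("bf16", 16.4, 22.6),
]


def ceiling_and_efficiency(size_gb: float, measured_tok_s: float,
                           bandwidth: float = BANDWIDTH_GB_S) -> tuple[float, float]:
    """Return (theoretical tok/s, fraction of that ceiling actually achieved)."""
    ceiling = bandwidth / size_gb
    return ceiling, measured_tok_s / ceiling


for label, size_gb, measured in runs:
    ceiling, efficiency = ceiling_and_efficiency(size_gb, measured)
    print(f"{label:5s} ceiling {ceiling:5.1f} tok/s, measured {measured:5.1f}, "
          f"efficiency {efficiency:.0%}")
```

Swapping in your own bandwidth, model size and measured speed gives the same comparison for another machine.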
There is a second thing buried in those numbers. Prefill went the other way: in the high 300s of tokens per second at 4-bit, up in the high 500s at bf16 (the exact figure wobbles a little between runs, but the direction does not). Quantising the model sped up decode and slowed down prefill, because unpacking 4-bit weights costs arithmetic, and prefill is the phase that is short on arithmetic to spare. Quantisation is a decode win you pay for, a little, at the prompt.
What the M5 changed, and what it did not
The M5 is the cleanest illustration of this we have seen. Its headline feature for AI is the Neural Accelerators: matrix-multiply hardware built into each GPU core, reached through Metal 4. They make matmul dramatically faster, which is to say they lift the compute ceiling.
Look back at the roofline and you can predict the result before reading a single benchmark. Prefill is compute-bound, sitting under the ceiling, so lifting the ceiling speeds it up a lot. Apple's own figures put the M5 at roughly three to four times faster than the M4 on time-to-first-token. Decode is bandwidth-bound, sitting on the slope, and the slope barely moved because base-tier bandwidth only went from 120 to 153 GB/s. So decode improved only around 20 to 30 percent.
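The prediction can be written down as a tiny rule: prefill follows the compute ceiling, decode follows the bandwidth slope. In the sketch below the bandwidth figures come from the text, while the 3.5x compute lift is an assumed midpoint of the quoted three to four times, used only for illustration.

```python
def predicted_speedup(phase: str, compute_ratio: float, bandwidth_ratio: float) -> float:
    """Idealised roofline prediction: each phase scales only with its own limit."""
    if phase == "prefill":
        return compute_ratio      # compute-bound: follows the ceiling
    if phase == "decode":
        return bandwidth_ratio    # bandwidth-bound: follows the slope
    raise ValueError(f"unknown phase: {phase!r}")


BANDWIDTH_RATIO = 153 / 120    # base-tier M5 vs M4 bandwidth, from the text
ASSUMED_COMPUTE_RATIO = 3.5    # assumption: midpoint of the quoted 3x to 4x

for phase in ("prefill", "decode"):
    speedup = predicted_speedup(phase, ASSUMED_COMPUTE_RATIO, BANDWIDTH_RATIO)
    print(f"{phase:8s} predicted speed-up: {speedup:.2f}x ({speedup - 1:+.0%})")
```

The decode line comes out inside the 20 to 30 percent range without any benchmark being run.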
Same chip, same upgrade, two completely different outcomes, and the roofline called both. Long prompts start responding much sooner on an M5. Sustained generation speed tracks bandwidth, the way it always has.
What actually speeds up decode
If decode is bandwidth-bound, only a few things genuinely move it, and the roofline says which.
Read fewer bytes. That is quantisation, and the sweep above is the proof: 4-bit decodes at nearly three times the rate of bf16 on the same model, because it reads less than a third of the bytes (4.6 GB against 16.4 GB). It is the most reliable speed-up most people have.
Reuse the bytes you do read. That is batching: serve several requests at once and the weights you load get spread across all of them, so you do more maths per byte and climb the slope toward the compute ceiling. It is why server engines batch hard, and why a single-user Mac, with nothing to batch, sits stubbornly on the bandwidth-bound part of the curve.
Or buy more bandwidth. It is the reason a Max or Ultra chip generates so much faster than a base chip of the same year, and the reason an older Mac with a wider bus can out-generate a newer one with a narrow one.
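All three levers show up in one back-of-envelope formula. The sketch below is a simplified model under stated assumptions: a batch shares one read of the weights, each token costs about 2 FLOPs per parameter, the KV cache is ignored, and the 10 TFLOP/s peak is a hypothetical figure rather than any real chip's spec.

```python
def aggregate_decode_ceiling(batch: int, bandwidth_gb_s: float, model_gb: float,
                             params_billion: float, peak_tflops: float) -> float:
    """Upper bound on total tokens per second across a batch of requests."""
    # Bandwidth limit: one pass over the weights yields one token per request.
    bandwidth_limit = batch * bandwidth_gb_s / model_gb
    # Compute limit: roughly 2 FLOPs per parameter for every token generated.
    compute_limit = (peak_tflops * 1e12) / (2 * params_billion * 1e9)
    return min(bandwidth_limit, compute_limit)


PEAK_TFLOPS = 10.0  # hypothetical compute ceiling, for illustration only
PARAMS_B = 8.0      # an 8B-parameter model

# Lever 1: read fewer bytes (bf16 at 16.4 GB vs 4-bit at 4.6 GB, 400 GB/s).
print("bf16 :", round(aggregate_decode_ceiling(1, 400, 16.4, PARAMS_B, PEAK_TFLOPS), 1))
print("4-bit:", round(aggregate_decode_ceiling(1, 400, 4.6, PARAMS_B, PEAK_TFLOPS), 1))

# Lever 2: reuse the bytes (batching) until the compute ceiling takes over.
for batch in (1, 4, 16):
    total = aggregate_decode_ceiling(batch, 400, 4.6, PARAMS_B, PEAK_TFLOPS)
    print(f"batch {batch:2d}: {total:6.1f} tok/s in total")

# Lever 3: buy more bandwidth (150 GB/s vs 400 GB/s, same 4-bit model).
print("150 GB/s:", round(aggregate_decode_ceiling(1, 150, 4.6, PARAMS_B, PEAK_TFLOPS), 1))
print("400 GB/s:", round(aggregate_decode_ceiling(1, 400, 4.6, PARAMS_B, PEAK_TFLOPS), 1))
```

With a batch of one, the compute term never comes into play, which is the single-user Mac case. Only at larger batch sizes does the `min` switch over to the compute limit.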
The same wall, at every scale
The same wall stands at data-centre scale. A GPU running a 70B model is bandwidth-bound on decode for the same reason your Mac is, which is why the tricks that help, quantisation and batching, are exactly what the big inference servers lean on. The numbers are larger and the bandwidth is enormous, but the wall is in the same place. For the GPU end of that spectrum, our work on FoundryOS and Forge runs up against the same physics described here.
Once the roofline is in your head, a lot of buying advice turns into arithmetic. Buy for bandwidth, quantise to read fewer bytes, and stop expecting a compute upgrade to do anything for generation. You can run the same measurements on your own Mac: the benchmark we used is open source at github.com/tensorfoundrylabs/benchmark-mlx.
Key takeaways
- Token generation is bandwidth-bound. Decode reads the whole model per token and does little maths with it, so memory bandwidth is the limit, not compute.
- The roofline tells you which limit you are hitting. To the left of the ridge point, at low arithmetic intensity, you are bandwidth-bound; to the right of it you are compute-bound. Spending on the wrong one buys nothing.
- Prefill and decode sit on opposite sides. Prefill is compute-bound (time-to-first-token); decode is bandwidth-bound (tokens per second). They respond to different upgrades.
- Decode speed is roughly bandwidth divided by model size, times a real-world efficiency that climbs with model weight, from about two-thirds on a small model to over 90 percent on a heavy one (measured on an M2 Max).
- The M5 lifted the compute ceiling, not the bandwidth slope, so it made time-to-first-token three to four times faster but only nudged generation up by 20 to 30 percent.
- The levers that work are quantising, batching and bandwidth. A pure compute upgrade does nothing for generation.
Further reading
- How MLX runs LLMs on Apple Silicon - the bandwidth story at a higher level
- Under the hood of MLX - lazy graphs, fusion and the kernels
- MLX vs GGUF quantisation - reading fewer bytes per token
- Apple's M5 MLX benchmarks - the prefill and decode numbers
- Roofline model - the original idea from high-performance computing
- Running MLX behind Olla - one endpoint over your Mac inference stack