
The Bandwidth Wall: A Roofline for Local LLMs

You upgrade your Mac, load the same model, and the tokens come out maybe 20 percent faster. The chip benchmarks promised a lot more. Nothing is broken. You have hit the bandwidth wall, the single idea that governs how fast a model generates on the machine you own. It explains why a newer Mac, a bigger GPU or a faster chip so often does so little for token generation. We will use real numbers from our own M2 Max throughout, because the effect is much easier to believe when you watch it happen.

The picture behind it is the roofline, a performance model borrowed from high-performance computing. It is worth the few minutes it takes, because it gives a ceiling on decode speed that our measurements reach between two-thirds and nine-tenths of, and it tells you which upgrades will help before you spend on them. If you have read our guide to how MLX runs LLMs on Apple Silicon, you have met the headline: token generation is bound by memory bandwidth. Here is the machinery under it.

Compute versus bandwidth

Every piece of hardware has two limits that matter here. One is how fast it can do arithmetic, measured in floating-point operations per second. The other is how fast it can move data in and out of memory, measured in bytes per second. For most of computing history the arithmetic was the bottleneck, so that is the number everyone quotes. For LLM inference, on the machine you own, it is usually the other one.

The thing that decides which limit you hit is a ratio called arithmetic intensity: how many calculations you do for every byte you read. Do a lot of maths with each byte and the arithmetic is your limit. Read a lot of bytes and do little with them, and memory bandwidth is your limit. That single ratio is the X axis of the roofline.

The roofline

Plot achievable throughput against arithmetic intensity and you get a shape like a roof. On the left, where intensity is low, throughput climbs along a slope set by memory bandwidth: every extra calculation per byte buys you more, because you are limited by how fast bytes arrive. At some point the line hits the flat compute ceiling, where you are doing all the arithmetic the chip can manage and reading bytes is no longer the constraint. The corner where they meet is the ridge point.

[Figure: The roofline. Throughput against arithmetic intensity (FLOPs per byte): a bandwidth-bound slope rising to a compute ceiling, with decode (memory-bound) on the slope, prefill (compute-bound) at the ceiling, and the M4 ceiling and a higher M5 ceiling marked.]
Decode sits on the sloped, bandwidth-bound part of the roofline, so a higher compute ceiling does nothing for it. Prefill sits under the ceiling, so the M5 lifts it. That is why the M5 cuts time-to-first-token sharply but barely moves tokens per second.

The whole game is knowing which side of the ridge you are on. To the left, more compute power does nothing, because you are waiting on memory. To the right, faster memory does nothing, because you are waiting on the arithmetic. Spend money or effort on the limit you are not hitting and you get nothing back.
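The shape described above fits in two small functions. This is a minimal sketch: the peak-compute figure is a made-up round number chosen for illustration, not a measured spec for any Mac, and the function names are ours.

```python
def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline: throughput is capped by whichever limit is lower."""
    return min(peak_flops, bandwidth * intensity)


def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (FLOPs per byte) where the slope meets the ceiling."""
    return peak_flops / bandwidth


# Hypothetical machine, for illustration only.
PEAK_FLOPS = 10e12   # 10 TFLOP/s, an assumed round number
BANDWIDTH = 400e9    # 400 GB/s

ridge = ridge_point(PEAK_FLOPS, BANDWIDTH)            # 25 FLOPs per byte

# Left of the ridge: bandwidth-bound, so doubling compute changes nothing.
low = attainable_flops(4, PEAK_FLOPS, BANDWIDTH)
low_more_compute = attainable_flops(4, 2 * PEAK_FLOPS, BANDWIDTH)

# Right of the ridge: compute-bound, so doubling bandwidth changes nothing.
high = attainable_flops(200, PEAK_FLOPS, BANDWIDTH)
high_more_bandwidth = attainable_flops(200, PEAK_FLOPS, 2 * BANDWIDTH)

print(ridge, low == low_more_compute, high == high_more_bandwidth)
```

On either side of the ridge, the upgrade aimed at the other limit leaves the result unchanged.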

Prefill and decode live in different places

A single request has two phases, and they sit on opposite sides of the ridge.

[Figure: Prefill versus decode. Prefill processes the whole prompt at once in one big matmul (compute-bound); decode reads all the weights for each token (memory-bound).]
Prefill chews the whole prompt in one parallel pass, so it is compute-bound. Decode reads the entire model out of memory for every single token, so it is memory-bound. They live at opposite ends of the roofline.

Prefill is the prompt-processing phase. It reads your whole prompt at once and pushes all those tokens through the model together, which is a big matrix-matrix multiply with plenty of arithmetic for every byte of weights it loads. High intensity. It sits up under the compute ceiling, so it is compute-bound, and more arithmetic horsepower makes it faster.

Decode is the generation phase, one token at a time. For each token, the model reads essentially all of its weights out of memory and does comparatively little maths with them before producing the next token. Low intensity. It sits down on the slope, firmly bandwidth-bound. The ceiling on how fast you can decode is roughly your memory bandwidth divided by the model's size in bytes, and no amount of extra compute changes that.
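To put rough numbers on "high intensity" and "low intensity", here is a sketch for a single weight matrix. It rests on simplifying assumptions of our own: a hypothetical 4096 by 4096 layer, 4-bit weights counted as half a byte each, a multiply-add counted as two FLOPs, and only weight traffic counted (activations and quantisation scales are ignored).

```python
def matmul_intensity(tokens, d_in, d_out, bytes_per_weight):
    """FLOPs per byte of weights read when `tokens` rows share one weight load."""
    flops = 2 * tokens * d_in * d_out               # multiply-add = 2 FLOPs
    weight_bytes = d_in * d_out * bytes_per_weight  # weights are read once
    return flops / weight_bytes


# Hypothetical layer size; 0.5 bytes per weight stands in for 4-bit.
decode = matmul_intensity(tokens=1, d_in=4096, d_out=4096, bytes_per_weight=0.5)
prefill = matmul_intensity(tokens=512, d_in=4096, d_out=4096, bytes_per_weight=0.5)

print(decode)   # 4.0 FLOPs per byte
print(prefill)  # 2048.0 FLOPs per byte
```

Under these assumptions the layer size cancels out and intensity is simply twice the token count divided by the bytes per weight, which is why a 512-token prompt lands hundreds of times further right on the roofline than a single decoded token.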

This split is what people mean when they say a new Mac did not feel faster. They upgraded compute and measured decode. Time-to-first-token, which prefill governs, and tokens-per-second, which decode governs, respond to completely different things, so a change that transforms one can leave the other flat.

The bandwidth ladder

If decode speed is roughly bandwidth divided by model size, then bandwidth is the spec to look at on any Mac you intend to generate tokens on. The catch is that the number you see quoted, the base chip's, is the floor. The tiers above it tell the real story.

[Figure: M-series memory bandwidth by tier (GB/s). Bubble chart of unified-memory bandwidth across M1 to M5, with separate bands for the Base, Pro, Max and Ultra tiers. Plotted values:]

| Tier | M1 | M2 | M3 | M4 | M5 |
| --- | --- | --- | --- | --- | --- |
| Base | 68 | 100 | 100 | 120 | 153 |
| Pro | 200 | 200 | 150 | 273 | 307 |
| Max | 400 | 400 | 400 | 546 | 614 |
| Ultra | 800 | 800 | 819 | none | none |
The base chips are the floor, 68 to 153 GB/s. Pro roughly doubles that, Max multiplies it again, and the Ultra sits near 800. Token generation tracks bandwidth, so the tier matters more than the generation. Max figures are the higher GPU bin; there is no M4 or M5 Ultra.

The base chips, the M4 in a MacBook Air or the M5 in the entry MacBook Pro, sit between 68 and 153 GB/s. Step up to Pro and you roughly double it, to Max and you double it again, and the Ultra in a Mac Studio lands near 800. That is more than a ten-fold spread across the line-up, all generating tokens from the same kind of model. So when someone says a Mac is fast or slow at LLMs, the tier matters more than the generation: an M1 Max at 400 GB/s out-generates an M3 Pro at 150, two years newer.

You can do the sum, then check it against a real machine. Our M2 Max has 400 GB/s of bandwidth. An 8B model in 4-bit is 4.6 GB on disk, so the ceiling is about 400 divided by 4.6, roughly 87 tokens per second. We measured 64.9, which is 75 percent of the theoretical limit. That fraction is normal, and it is the whole point: the ceiling is set by bandwidth, and you only ever get part of it.
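The same sum as a few lines of Python, in case you want to plug in your own machine. The function name is ours, and it assumes the on-disk size is a fair stand-in for the bytes read per token.

```python
def decode_ceiling(bandwidth_gb_s, model_gb):
    """Upper bound on tokens per second if every token reads the whole model."""
    return bandwidth_gb_s / model_gb


ceiling = decode_ceiling(bandwidth_gb_s=400, model_gb=4.6)
measured = 64.9                      # tok/s, the M2 Max figure quoted above
efficiency = measured / ceiling

print(f"ceiling {ceiling:.0f} tok/s, efficiency {efficiency:.0%}")
# ceiling 87 tok/s, efficiency 75%
```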

The clean way to see the roofline is to hold the model fixed and change only its size. So we ran the same Qwen3-8B at three bit widths on the same Mac:

| Qwen3-8B | On disk | Decode | Theoretical | Of ceiling |
| --- | --- | --- | --- | --- |
| 4-bit | 4.6 GB | 64.9 tok/s | 87 tok/s | 75% |
| 8-bit | 8.7 GB | 40.0 tok/s | 46 tok/s | 87% |
| bf16 | 16.4 GB | 22.6 tok/s | 24 tok/s | 93% |

Decode falls almost exactly in step with the bytes. Double the size, roughly halve the speed, which is the roofline working as advertised. The surprise is the last column: efficiency climbs as the model gets heavier, from 75 percent at 4-bit to 93 percent at bf16. The bigger the per-token read, the more it swamps the fixed overheads, so a heavy model runs closer to its bandwidth ceiling than a light one does. A 3B 4-bit model on the same Mac hit 145.9 tok/s, but only 66 percent of its ceiling, for the same reason in reverse.
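One way to test the fixed-overhead explanation is to split each token's time into a pure weight read plus whatever is left over. The additive model is a simplifying assumption of this sketch, not something the benchmark reports directly; the sizes and speeds are the ones from the sweep.

```python
BANDWIDTH_GB_S = 400  # M2 Max

# (label, size on disk in GB, measured decode in tok/s) from the sweep
runs = [("4-bit", 4.6, 64.9), ("8-bit", 8.7, 40.0), ("bf16", 16.4, 22.6)]


def analyse(size_gb, measured_tok_s, bandwidth_gb_s=BANDWIDTH_GB_S):
    """Split per-token time into weight-read time and leftover overhead (ms)."""
    read_ms = 1000 * size_gb / bandwidth_gb_s   # time to stream the weights once
    total_ms = 1000 / measured_tok_s            # measured time per token
    return {
        "efficiency": read_ms / total_ms,
        "overhead_ms": total_ms - read_ms,
    }


results = {label: analyse(size, speed) for label, size, speed in runs}

for label, r in results.items():
    print(f"{label:5s} {r['efficiency']:.0%} of ceiling, "
          f"{r['overhead_ms']:.1f} ms overhead per token")
```

Under that assumption the leftover comes out at roughly 3 to 4 milliseconds per token at all three widths, while the read time grows from 11.5 to 41 milliseconds. A near-constant cost against a growing read is exactly the pattern that makes the efficiency column climb.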

There is a second thing buried in those numbers. Prefill went the other way: in the high 300s of tokens per second at 4-bit, up in the high 500s at bf16 (the exact figure wobbles a little between runs, the direction does not). Quantising the model sped up decode and slowed down prefill, because unpacking 4-bit weights costs arithmetic, and prefill is the phase that is short on arithmetic to spare. Quantisation is a decode win you pay for, a little, at the prompt.

What the M5 changed, and what it did not

The M5 is the cleanest illustration of this we have seen. Its headline feature for AI is the Neural Accelerators: matrix-multiply hardware built into each GPU core, reached through Metal 4. They make matmul dramatically faster, which is to say they lift the compute ceiling.

Look back at the roofline and you can predict the result before reading a single benchmark. Prefill is compute-bound, sitting under the ceiling, so lifting the ceiling speeds it up a lot. Apple's own figures put the M5 at roughly three to four times the M4 on time-to-first-token. Decode is bandwidth-bound, sitting on the slope, and the slope barely moved because base-tier bandwidth only went from 120 to 153 GB/s. So decode improved only around 20 to 30 percent.
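The decode half of that prediction is a single division. A minimal sketch, assuming decode scales directly with memory bandwidth and everything else is held equal (the function name is ours):

```python
def bandwidth_bound_gain(old_gb_s, new_gb_s):
    """Fractional speed-up of a bandwidth-bound workload when bandwidth changes."""
    return new_gb_s / old_gb_s - 1


# Base-tier bandwidth, M4 to M5, as quoted above.
decode_gain = bandwidth_bound_gain(old_gb_s=120, new_gb_s=153)

print(f"{decode_gain:.1%}")  # 27.5%
```

That lands inside the 20 to 30 percent range, with no reference to how much faster the matmul hardware became.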

Same chip, same upgrade, two completely different outcomes, and the roofline called both. Long prompts start responding much sooner on an M5. Sustained generation speed tracks bandwidth, the way it always has.

What actually speeds up decode

If decode is bandwidth-bound, only a few things genuinely move it, and the roofline says which.

Read fewer bytes. That is quantisation, and the sweep above is the proof: 4-bit decodes at nearly three times the rate of bf16 on the same model, because it reads a little over a quarter of the bytes (4.6 GB against 16.4 GB). It is the most reliable speed-up most people have.

Reuse the bytes you do read. That is batching: serve several requests at once and the weights you load get spread across all of them, so you do more maths per byte and climb the slope toward the compute ceiling. It is why server engines batch hard, and why a single-user Mac, with nothing to batch, sits stubbornly on the bandwidth-bound part of the curve.
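A toy model shows how batching climbs the slope. Everything here is an assumption for illustration: a model of 8 billion parameters stored in 4.6 GB, a made-up peak of 10 TFLOP/s, two FLOPs per parameter per token, and no allowance for attention or KV-cache traffic.

```python
def batched_tokens_per_second(batch, weight_bytes, params, bandwidth, peak_flops):
    """Aggregate tok/s when one weight read is shared by `batch` sequences."""
    read_s = weight_bytes / bandwidth             # stream the weights once per step
    compute_s = 2 * params * batch / peak_flops   # arithmetic grows with the batch
    step_s = max(read_s, compute_s)               # the slower limit sets the pace
    return batch / step_s


# Hypothetical machine and model, for illustration only.
machine = dict(weight_bytes=4.6e9, params=8e9, bandwidth=400e9, peak_flops=10e12)

throughput = {b: batched_tokens_per_second(b, **machine) for b in (1, 4, 8, 16)}

for batch, tok_s in throughput.items():
    print(f"batch {batch:2d}: {tok_s:6.1f} tok/s in total")
```

With these numbers a batch of four gets four times the total throughput of a single request, because the step is still paced by the weight read. By a batch of eight the arithmetic has become the limit, and going to sixteen adds nothing further.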

Or buy more bandwidth. It is the reason a Max or Ultra chip generates so much faster than a base chip of the same year, and the reason an older Mac with a wider bus can out-generate a newer one with a narrow one.
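To see what the ladder means for one fixed model, here is the decode ceiling for the 4.6 GB 4-bit model across a handful of chips, using bandwidth figures from the chart earlier in this article. These are ceilings only; as the M2 Max measurements showed, real speeds land somewhat below them.

```python
MODEL_GB = 4.6  # the 4-bit Qwen3-8B from the sweep

# Memory bandwidth in GB/s, taken from the tier chart in this article.
bandwidth_gb_s = {
    "M4": 120,
    "M3 Pro": 150,
    "M5": 153,
    "M1 Max": 400,
    "M3 Ultra": 819,
}

ceilings = {chip: bw / MODEL_GB for chip, bw in bandwidth_gb_s.items()}

for chip, tok_s in sorted(ceilings.items(), key=lambda item: item[1]):
    print(f"{chip:9s} {tok_s:6.1f} tok/s ceiling")
```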

The same wall, at every scale

The same wall stands at data-centre scale. A GPU running a 70B model is bandwidth-bound on decode for the same reason your Mac is, which is why the tricks that help, quantisation and batching, are exactly what the big inference servers lean on. The numbers are larger and the bandwidth is enormous, but the wall is in the same place. For the GPU end of that spectrum, our work on FoundryOS and Forge runs up against the same physics described here.

Once the roofline is in your head, a lot of buying advice turns into arithmetic. Buy for bandwidth, quantise to read fewer bytes, and stop expecting a compute upgrade to do anything for generation. You can run the same measurements on your own Mac: the benchmark we used is open source at github.com/tensorfoundrylabs/benchmark-mlx.


Key takeaways

  • Token generation is bandwidth-bound. Decode reads the whole model per token and does little maths with it, so memory bandwidth is the limit, not compute.
  • The roofline tells you which limit you are hitting. Below the ridge point you are bandwidth-bound; above it you are compute-bound. Spending on the wrong one buys nothing.
  • Prefill and decode sit on opposite sides. Prefill is compute-bound (time-to-first-token); decode is bandwidth-bound (tokens per second). They respond to different upgrades.
  • Decode speed is roughly bandwidth divided by model size, times a real-world efficiency that climbs with model size, from about two-thirds on a small model to over 90 percent on a heavy one (measured on an M2 Max).
  • The M5 lifted the compute ceiling, not the bandwidth slope, so by Apple's figures it made time-to-first-token three to four times faster but only nudged generation by 20 to 30 percent.
  • The levers that work are quantising, batching and bandwidth. A pure compute upgrade does nothing for generation.

Further reading