Why is LLM token generation memory-bound?

Because generating each token means reading essentially the whole model out of memory once, while doing relatively little arithmetic with it. That's a low arithmetic intensity, which on a roofline puts you in the bandwidth-bound region. The ceiling on tokens per second is roughly your memory bandwidth divided by the model size in bytes, so bandwidth, not compute, is what limits you.

What is the roofline model?

A simple performance model that plots achievable throughput against arithmetic intensity, the ratio of compute to memory traffic. Below a threshold you're bandwidth-bound and throughput rises with intensity along a slope. Above it you're compute-bound and throughput flattens at the hardware's compute ceiling. The two meet at the ridge point.

Why didn't a faster Mac speed up token generation much?

Because token generation is bandwidth-bound, and a newer chip with more compute doesn't necessarily have proportionally more memory bandwidth. If bandwidth only rose a little, tokens per second only rises a little, no matter how much faster the compute got. To speed up generation you need more bandwidth or a smaller model; extra FLOPs do nothing for it.

Why is the M5 so much faster at prompt processing but not token generation?

The M5's Neural Accelerators make matrix multiplication much faster, which lifts the compute ceiling. Prefill, the prompt-processing phase, is compute-bound and sits under that ceiling, so it gets dramatically faster, around three to four times the M4 on time-to-first-token. Token generation is bandwidth-bound and sits on the slope, which barely moved, so it only improves around 20 to 30 percent.

The Bandwidth Wall: A Roofline for Local LLMs

You upgrade your Mac, load the same model, and the tokens come out maybe 20 percent faster. The chip benchmarks promised a lot more. Nothing's broken. There's one idea that governs how fast a model generates on the machine you own, and it explains why a newer Mac, a bigger GPU or a faster chip so often does so little for token generation. We'll use real numbers off our own M2 Max throughout, because the effect is a lot easier to believe when you watch it happen.

The picture is the roofline, a performance model borrowed from high-performance computing. It pays for the few minutes it takes, because it predicts decode speed to within about a third (our measurements came in at 66 to 93 percent of the predicted ceiling) and tells you which upgrades will help before you spend on them. If you've read our guide to how MLX runs LLMs on Apple Silicon you've met the headline: token generation is bound by memory bandwidth. Here's the machinery under it.

Compute versus bandwidth

Every piece of hardware has two limits that matter here. One is how fast it can do arithmetic, measured in floating-point operations per second. The other is how fast it can move data in and out of memory, measured in bytes per second. For most of computing history the arithmetic was the bottleneck, so that's the number everyone quotes. For LLM inference, on the machine you own, it's usually the other one.

What decides which limit you hit is a ratio called arithmetic intensity: how many calculations you do for every byte you read. Do a lot of maths with each byte and the arithmetic is your limit. Read a lot of bytes and do little with them, and memory bandwidth is your limit. That single ratio is the X axis of the roofline.

The roofline

Plot achievable throughput against arithmetic intensity and you get a shape like a roof. On the left, where intensity is low, throughput climbs along a slope set by memory bandwidth: every extra calculation per byte buys you more, because you're limited by how fast bytes arrive. At some point the line hits the flat compute ceiling, where you're doing all the arithmetic the chip can manage and reading bytes is no longer the constraint. The corner where they meet is the ridge point.

Figure: Decode sits on the sloped, bandwidth-bound part of the roofline, so a higher compute ceiling does nothing for it. Prefill sits under the ceiling, so the M5 lifts it. That is why the M5 cuts time-to-first-token sharply but barely moves tokens per second.

The whole game is knowing which side of the ridge you're on. To the left, more compute power does nothing, because you're waiting on memory. To the right, faster memory does nothing, because you're waiting on the arithmetic. Spend on the limit you're not hitting and it does nothing.
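The shape described above fits in a few lines of Python. This is a minimal sketch, not a measurement: the 400 GB/s bandwidth is the M2 Max figure used in this article, while `PEAK_FLOPS` is a hypothetical placeholder chosen only to make the ridge visible, and the function names are ours.

```python
def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline: the lower of the compute ceiling and the bandwidth slope."""
    return min(peak_flops, bandwidth * intensity)


def ridge_point(peak_flops, bandwidth):
    """Intensity (FLOPs per byte) at which the slope meets the ceiling."""
    return peak_flops / bandwidth


BANDWIDTH = 400e9    # bytes/s, the M2 Max figure quoted in this article
PEAK_FLOPS = 10e12   # FLOP/s, hypothetical placeholder, not a measured spec

ridge = ridge_point(PEAK_FLOPS, BANDWIDTH)

for intensity in (1, 4, ridge, 100):
    side = "bandwidth-bound" if intensity < ridge else "compute-bound"
    flops = attainable_flops(intensity, PEAK_FLOPS, BANDWIDTH)
    print(f"{intensity:>6.1f} FLOPs/byte -> {flops / 1e12:5.2f} TFLOP/s ({side})")
```

With these placeholder numbers the ridge sits at 25 FLOPs per byte. Anything to the left of it gains nothing from a higher `PEAK_FLOPS`, and anything to the right gains nothing from a higher `BANDWIDTH`.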

Prefill and decode live in different places

A single request has two phases, and they sit on opposite sides of the ridge.

Figure: Prefill chews the whole prompt in one parallel pass, so it is compute-bound. Decode reads the entire model out of memory for every single token, so it is memory-bound. They live at opposite ends of the roofline.

Prefill is the prompt-processing phase. It reads your whole prompt at once and pushes all those tokens through the model together, which is a big matrix-matrix multiply with plenty of arithmetic for every byte of weights it loads. High intensity. It sits up under the compute ceiling, so it's compute-bound, and more arithmetic horsepower makes it faster.

Decode is the generation phase, one token at a time. For each token, the model reads essentially all of its weights out of memory and does comparatively little maths with them before producing the next token. Low intensity. It sits down on the slope, firmly bandwidth-bound. The ceiling on how fast you can decode is roughly your memory bandwidth divided by the model's size in bytes, and no amount of extra compute changes that.
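A rough way to see the gap between the two phases is to count the arithmetic done per byte of weights read. The sketch below assumes the usual estimate of two floating-point operations (a multiply and an add) per weight per token, and it ignores activations, the KV cache, attention and quantisation scales. The 512-token prompt is a hypothetical example, and `weight_intensity` is our own helper name.

```python
def weight_intensity(tokens_per_pass, bytes_per_weight):
    """FLOPs per byte of weights read in one pass through the model.

    Assumes 2 FLOPs (multiply + add) per weight per token, and counts
    weight traffic only. Everything else is ignored in this sketch.
    """
    flops_per_weight = 2 * tokens_per_pass
    return flops_per_weight / bytes_per_weight


FOUR_BIT = 0.5   # bytes per weight at 4-bit
BF16 = 2.0       # bytes per weight at bf16

decode_4bit = weight_intensity(tokens_per_pass=1, bytes_per_weight=FOUR_BIT)
decode_bf16 = weight_intensity(tokens_per_pass=1, bytes_per_weight=BF16)
prefill_4bit = weight_intensity(tokens_per_pass=512, bytes_per_weight=FOUR_BIT)

print(f"decode, 4-bit:       {decode_4bit:.0f} FLOPs/byte")
print(f"decode, bf16:        {decode_bf16:.0f} FLOPs/byte")
print(f"prefill, 512 tokens: {prefill_4bit:.0f} FLOPs/byte")
```

Under these assumptions decode does a handful of operations per byte, while prefill does hundreds of times more, because every weight it loads is reused across every token in the prompt.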

This split is what people mean when they say a new Mac didn't feel faster. They upgraded compute and measured decode. Time-to-first-token, which prefill governs, and tokens-per-second, which decode governs, respond to completely different things, so a change that transforms one can leave the other flat.

The bandwidth ladder

If decode speed is roughly bandwidth divided by model size, then bandwidth is the spec to look at on any Mac you intend to generate tokens on. The number you see quoted, the base chip's, is the floor. The tiers above it tell the real story.

Figure: The base chips are the floor, 68 to 153 GB/s. Pro roughly doubles that, Max multiplies it again, and the Ultra sits near 800. Token generation tracks bandwidth, so the tier matters more than the generation. Max figures are the higher GPU bin; there is no M4 or M5 Ultra.

The base chips, the M4 in a MacBook Air or the M5 in the entry MacBook Pro, sit between 68 and 153 GB/s. Step up to Pro and you roughly double it, to Max and you double it again, and the Ultra in a Mac Studio lands near 800. That's more than a ten-fold spread across the line-up, all generating tokens from the same kind of model. So when someone says a Mac is fast or slow at LLMs, the tier matters more than the generation: an M1 Max at 400 GB/s out-generates an M3 Pro at 150, two years newer.

You can do the sum, then check it against a real machine. Our M2 Max has 400 GB/s of bandwidth. An 8B model in 4-bit is 4.6 GB on disk, so the ceiling is about 400 divided by 4.6, roughly 87 tokens per second. We measured 64.9, which is 75 percent of the theoretical limit. That fraction is normal, and it's the whole point: the ceiling is set by bandwidth, and you only ever get part of it.
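The same sum in Python, as a small sketch. The bandwidth, model size and measured speed are the figures quoted above; the tier labels and the `decode_ceiling` helper are ours, and the ladder uses the approximate bandwidths mentioned in this section, not official specifications.

```python
def decode_ceiling(bandwidth_gb_s, model_gb):
    """Upper bound on decode speed: one full read of the weights per token."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 4.6        # 8B model at 4-bit, size on disk
MEASURED_TOK_S = 64.9  # measured on the M2 Max

ceiling = decode_ceiling(400, MODEL_GB)
efficiency = MEASURED_TOK_S / ceiling
print(f"ceiling {ceiling:.0f} tok/s, measured {MEASURED_TOK_S} tok/s, "
      f"{efficiency:.0%} of ceiling")

# The same model across the bandwidth ladder (approximate figures from the text).
ladder_gb_s = {
    "base chip, low end": 68,
    "base chip, high end": 153,
    "M2 Max": 400,
    "Ultra (approximate)": 800,
}
for tier, bandwidth in ladder_gb_s.items():
    print(f"{tier:<22} {decode_ceiling(bandwidth, MODEL_GB):6.1f} tok/s ceiling")
```

The ceilings are upper bounds, not forecasts: a real machine lands somewhere below them, as the measured 64.9 tok/s shows.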

The clean way to see the roofline is to hold the model fixed and change only its size. So we ran the same Qwen3-8B at three bit widths on the same Mac:

Qwen3-8B precision	Size on disk	Measured decode	Bandwidth ceiling	Share of ceiling
4-bit	4.6 GB	64.9 tok/s	87 tok/s	75%
8-bit	8.7 GB	40.0 tok/s	46 tok/s	87%
bf16	16.4 GB	22.6 tok/s	24 tok/s	93%

Decode falls almost exactly in step with the bytes. Double the size, roughly halve the speed, which is the roofline working as advertised. The surprise is the last column: efficiency climbs as the model gets heavier, from 75 percent at 4-bit to 93 percent at bf16. The bigger the per-token read, the more it swamps the fixed overheads, so a heavy model runs closer to its bandwidth ceiling than a light one does. A 3B 4-bit model on the same Mac hit 145.9 tok/s, but only 66 percent of its ceiling, for the same reason in reverse.
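One way to put a number on "fixed overheads" is to subtract the time a perfect weight read would take from the measured time per token. The sketch below does that for the three rows of the sweep. The additive model (read time plus a constant) is our simplifying assumption, not something the benchmark reports, and `implied_overhead_ms` is a helper name of ours.

```python
BANDWIDTH_GB_S = 400  # M2 Max

# (precision, size on disk in GB, measured decode in tok/s) from the sweep
SWEEP = [
    ("4-bit", 4.6, 64.9),
    ("8-bit", 8.7, 40.0),
    ("bf16", 16.4, 22.6),
]


def implied_overhead_ms(size_gb, measured_tok_s, bandwidth_gb_s=BANDWIDTH_GB_S):
    """Measured time per token minus the time to read the weights once."""
    per_token_ms = 1000 / measured_tok_s
    read_ms = 1000 * size_gb / bandwidth_gb_s
    return per_token_ms - read_ms


for precision, size_gb, tok_s in SWEEP:
    ceiling = BANDWIDTH_GB_S / size_gb
    overhead = implied_overhead_ms(size_gb, tok_s)
    print(f"{precision:<6} {tok_s / ceiling:4.0%} of ceiling, "
          f"about {overhead:.1f} ms per token beyond the weight read")
```

The leftover lands between roughly 3 and 4 ms per token on all three rows. A near-constant cost is a small share of a 41 ms bf16 read and a large share of an 11.5 ms 4-bit read, which matches the climb in the last column.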

There's a second thing buried in those numbers. Prefill went the other way: in the high 300s of tokens per second at 4-bit, up in the high 500s at bf16 (the exact figure wobbles a little between runs, the direction doesn't). Quantising the model sped up decode and slowed down prefill, because unpacking 4-bit weights costs arithmetic, and prefill is the phase that's short on arithmetic to spare. Quantisation is a decode win you pay for, a little, at the prompt.

What the M5 changed, and what it did not

The M5 shows this most cleanly. Its headline feature for AI is the Neural Accelerators: matrix-multiply hardware built into each GPU core, reached through Metal 4. They make matmul dramatically faster, which is to say they lift the compute ceiling.

Look back at the roofline and you can predict the result before reading a single benchmark. Prefill is compute-bound, sitting under the ceiling, so lifting the ceiling speeds it up a lot. Apple's own figures put the M5 at roughly three to four times the M4 on time-to-first-token. Decode is bandwidth-bound, sitting on the slope, and the slope barely moved because base-tier bandwidth only went from 120 to 153 GB/s. So decode improved only around 20 to 30 percent.

Same upgrade, two outcomes, both predicted by the roofline. Long prompts start responding much sooner on an M5; sustained generation still tracks bandwidth.
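The prediction can be sketched with two rooflines. Only the bandwidth figures (120 and 153 GB/s) come from the text. The compute ceilings are hypothetical placeholders set four times apart to mirror the upper end of the quoted prefill gain, and the two intensities are rough assumptions for a 4-bit model (about 4 FLOPs per byte for decode, about 2048 for a 512-token prefill).

```python
def attainable(intensity, peak_flops, bandwidth):
    """Roofline: the lower of the compute ceiling and the bandwidth slope."""
    return min(peak_flops, bandwidth * intensity)


def speedup(intensity, old_chip, new_chip):
    """Ratio of attainable throughput at a fixed arithmetic intensity."""
    return attainable(intensity, *new_chip) / attainable(intensity, *old_chip)


# (peak FLOP/s, bandwidth in bytes/s). Peaks are hypothetical placeholders;
# bandwidths are the base-tier figures quoted in the text.
OLD_CHIP = (4e12, 120e9)
NEW_CHIP = (16e12, 153e9)

DECODE_INTENSITY = 4      # FLOPs/byte, assumed
PREFILL_INTENSITY = 2048  # FLOPs/byte, assumed

decode_speedup = speedup(DECODE_INTENSITY, OLD_CHIP, NEW_CHIP)
prefill_speedup = speedup(PREFILL_INTENSITY, OLD_CHIP, NEW_CHIP)

print(f"decode:  {decode_speedup:.2f}x (follows bandwidth)")
print(f"prefill: {prefill_speedup:.2f}x (follows the compute ceiling)")
```

The 4x on prefill is an input here, not a result. What the sketch shows is that decode never sees it: at low intensity the speedup is the bandwidth ratio, 153 over 120, whatever the compute ceiling does.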

What actually speeds up decode

If decode is bandwidth-bound, only a few things move it, and the roofline says which.

Read fewer bytes. That's quantisation, and the sweep above is the proof: 4-bit decodes at nearly three times the rate of bf16 on the same model, because it reads less than a third of the bytes (4.6 GB against 16.4 GB). It's the most reliable speed-up most people have.

Reuse the bytes you do read. That's batching: serve several requests at once and the weights you load get spread across all of them, so you do more maths per byte and climb the slope toward the compute ceiling. It's why server engines batch hard, and why a single-user Mac, with nothing to batch, sits stubbornly on the bandwidth-bound part of the curve.

Or buy more bandwidth. It's the reason a Max or Ultra chip generates so much faster than a base chip of the same year, and the reason an older Mac with a wider bus can out-generate a newer one with a narrow one.
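The batching lever is the least intuitive of the three, so here is a minimal sketch of it. Each decode step reads the weights once and does arithmetic for every request in the batch, and the step takes as long as the slower of the two. The model size and bandwidth are the M2 Max figures from this article; the 8e9 parameter count is a round figure for an 8B model, `PEAK_FLOPS` is a hypothetical placeholder, and KV-cache traffic, which grows with batch size, is ignored.

```python
WEIGHT_BYTES = 4.6e9  # 8B model at 4-bit
PARAMS = 8e9          # round figure for an 8B model
BANDWIDTH = 400e9     # bytes/s, M2 Max
PEAK_FLOPS = 10e12    # FLOP/s, hypothetical placeholder


def step_time(batch):
    """Seconds per decode step: the slower of the weight read and the maths."""
    read = WEIGHT_BYTES / BANDWIDTH
    compute = 2 * PARAMS * batch / PEAK_FLOPS  # 2 FLOPs per weight per token
    return max(read, compute)


def aggregate_tok_s(batch):
    """Tokens per second summed over every request in the batch."""
    return batch / step_time(batch)


for batch in (1, 2, 4, 8, 16, 32):
    print(f"batch {batch:>2}: {aggregate_tok_s(batch):6.0f} tok/s in total, "
          f"{aggregate_tok_s(batch) / batch:5.1f} per request")
```

With these placeholder numbers total throughput grows almost linearly until the batch reaches about 7, then flattens at the compute ceiling. A single user sits at batch 1, where the weight read is the whole cost.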

The same wall, at every scale

The same wall stands at data-centre scale. A GPU running a 70B model is bandwidth-bound on decode for the same reason your Mac is, which is why the tricks that help, quantisation and batching, are exactly what the big inference servers lean on. The numbers are larger and the bandwidth is enormous, but the wall is in the same place. For the GPU end of that spectrum, our work on FoundryOS and Forge runs up against the same physics we've described here.

Once the roofline is in your head, a lot of buying advice turns into arithmetic. Buy for bandwidth, quantise to read fewer bytes, and stop expecting a compute upgrade to do anything for generation. You can run the same measurements on your own Mac: the benchmark we used is open source at github.com/tensorfoundrylabs/benchmark-mlx.

Token generation is bandwidth-bound. Decode reads the whole model per token and does little maths with it, so memory bandwidth sets the ceiling.
The roofline tells you which limit you're hitting. Below the ridge point you're bandwidth-bound; above it you're compute-bound. Spending on the wrong one buys nothing.
Prefill and decode sit on opposite sides. Prefill is compute-bound (time-to-first-token); decode is bandwidth-bound (tokens per second). They respond to different upgrades.
Decode speed is roughly bandwidth divided by model size, times a real-world efficiency that climbs with model weight, from about two-thirds on a small model to over 90 percent on a heavy one (measured on an M2 Max).
The M5 lifted the compute ceiling, not the bandwidth slope, so it cut time-to-first-token by three to four times but only nudged generation by 20 to 30 percent.
The levers that work are quantising, batching and bandwidth. A pure compute upgrade does nothing for generation.