
MLX vs GGUF: How 4-bit Quantisation Really Works

The first time we watched a 70B model load on a laptop it felt like a glitch. It was 4-bit, and that one word is what makes local LLMs practical. Almost everyone runs it as a black box. This is a look inside that box: what "4-bit" actually means, how MLX and GGUF get there by different routes, why they trade quality for speed in opposite directions, and the rotation trick that is quietly pushing useful inference below four bits.

If you have read our guide to how MLX runs LLMs on Apple Silicon, you will know the punchline of local inference: token generation is bound by memory bandwidth, not raw compute. Quantisation is the lever that moves that bound. Shrink the weights and you read fewer bytes per token, so you generate faster and you fit a bigger model in the same memory. The interesting part is how you shrink them without wrecking the model, and that is where MLX and llama.cpp made different choices worth understanding.

Why we bother shrinking models

A model's weights are usually trained and shipped in 16-bit floating point (fp16 or bf16). At two bytes per weight, an 8B model is about 16GB and a 70B is around 140GB. On a Mac that is the difference between "runs comfortably" and "does not run at all", and on every machine it is the difference between a fast first token and a slow one.

Quantisation stores each weight in fewer bits. Drop from 16-bit to 4-bit and the model is roughly a quarter of the size. You read a quarter of the bytes per token, so on a bandwidth-bound decode you generate close to four times faster, and a 70B model in 4-bit fits in about 40GB instead of 140. The cost is precision: you are approximating each weight, and if you do it carelessly the model gets noticeably worse. The whole craft is in doing it carefully.

What 4-bit actually costs

So why isn't 4-bit actually four bits per weight? It cannot be, because a 4-bit number only has sixteen distinct values, and weights span a continuous range. So you store each weight as a small integer that indexes a set of evenly spaced levels, and you store the levels separately as a scale and a bias shared across a group of weights.

[Figure: Affine group quantisation. Quantising a group of weights: full-precision weights snap down to sixteen evenly spaced levels (0 to 15) that share one scale and one bias.]
In 4-bit quantisation a whole group of weights shares one scale and one bias, so each weight only needs a 4-bit index into sixteen evenly spaced levels.
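To make the scheme concrete, here is a minimal NumPy sketch of affine group quantisation. It assumes plain min/max scaling per group, and the function names are ours for illustration; it is not MLX's or llama.cpp's actual code.

```python
import numpy as np

def quantise_affine(w, group_size=64, bits=4):
    """Quantise a 1-D float array group by group with a shared scale and bias."""
    levels = 2**bits - 1                        # 15 steps between 16 levels
    groups = w.astype(np.float32).reshape(-1, group_size)
    bias = groups.min(axis=1, keepdims=True)    # lowest level of each group
    span = groups.max(axis=1, keepdims=True) - bias
    scale = np.where(span > 0, span / levels, 1.0)   # guard against flat groups
    idx = np.clip(np.round((groups - bias) / scale), 0, levels).astype(np.uint8)
    # Scale and bias are stored in 16-bit, as in the layout described in the text.
    return idx, scale.astype(np.float16), bias.astype(np.float16)

def dequantise_affine(idx, scale, bias):
    """Rebuild approximate weights: index * scale + bias."""
    return idx.astype(np.float32) * scale.astype(np.float32) + bias.astype(np.float32)

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # synthetic weights

idx, scale, bias = quantise_affine(weights)
restored = dequantise_affine(idx, scale, bias).reshape(-1)
max_err = float(np.abs(weights - restored).max())
print(f"largest round-trip error: {max_err:.5f}")
```

Every restored weight lands on one of sixteen levels for its group, so the error per weight is bounded by roughly half a level spacing plus a little rounding from the 16-bit scale and bias.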

That shared scale and bias is the overhead. MLX's default 4-bit layout uses a group size of 64: every 64 weights get one scale and one bias, both stored in 16-bit. Do the arithmetic and 64 weights cost 64 times 4 bits for the indices, plus two 16-bit values for the scale and bias, which is 288 bits for 64 weights, or about 4.5 bits per weight. The "4-bit" label is the index width; the real footprint is always a little higher. This is the affine, or asymmetric, quantisation scheme, and it is the foundation both MLX and GGUF build on.
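The same arithmetic as a few lines of Python, so you can plug in other widths. The helper name is ours, and the 70B figure is a back-of-envelope estimate that ignores any tensors left unquantised.

```python
def bits_per_weight(index_bits=4, group_size=64, scale_bits=16, bias_bits=16):
    """Effective storage cost per weight, including the shared scale and bias."""
    total_bits = group_size * index_bits + scale_bits + bias_bits
    return total_bits / group_size

default_bpw = bits_per_weight()              # 288 bits / 64 weights
size_70b_gb = 70e9 * default_bpw / 8 / 1e9   # bits -> bytes -> GB

print(default_bpw)    # 4.5
print(size_70b_gb)    # 39.375
```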

The size of the group is a real dial. Smaller groups track the local range of the weights more tightly, so they quantise more accurately, but they spend more on scales and biases. Larger groups are cheaper but blunter. A group of 64 is the common sweet spot; you will also see 32 for higher quality and 128 for smaller files.
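A rough way to see both sides of that dial is to quantise the same synthetic weights at each group size and compare the overhead with the round-trip error. This is a sketch on Gaussian data, assuming min/max affine quantisation with a 16-bit scale and bias; real model weights will give different absolute numbers.

```python
import numpy as np

def roundtrip_rms(w, group_size, bits=4):
    """RMS error after quantising and dequantising with the given group size."""
    levels = 2**bits - 1
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    scale = (g.max(axis=1, keepdims=True) - lo) / levels
    scale = np.where(scale > 0, scale, 1.0)     # guard against flat groups
    q = np.clip(np.round((g - lo) / scale), 0, levels)
    return float(np.sqrt(np.mean((q * scale + lo - g) ** 2)))

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1 << 16)         # synthetic weights

results = {}
for group_size in (32, 64, 128):
    bpw = 4 + 32 / group_size                   # 4-bit index + 16-bit scale + 16-bit bias
    results[group_size] = (bpw, roundtrip_rms(w, group_size))
    print(f"group {group_size:3d}: {bpw:.2f} bits/weight, "
          f"RMS error {results[group_size][1]:.6f}")
```

The overhead falls as the group grows (5.0, 4.5 and 4.25 bits per weight) while the error rises, because a larger group has to stretch its sixteen levels over a wider range.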

How MLX packs the weights

MLX keeps this deliberately flat. Each group of 64 weights gets its scale and bias, the 4-bit indices are packed into 32-bit integers (eight indices to a word), and that is the whole structure. You can produce it in one command with mlx-lm:
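Packing eight 4-bit indices into one 32-bit word is just shifts and ORs. The sketch below shows the idea in NumPy. Putting the first index in the lowest four bits is our assumption for illustration; check the MLX source if you need the real bit order.

```python
import numpy as np

def pack_nibbles(idx):
    """Pack 4-bit indices (values 0..15) into uint32 words, eight per word."""
    idx = np.asarray(idx, dtype=np.uint32).reshape(-1, 8)
    shifts = np.arange(8, dtype=np.uint32) * 4       # 0, 4, ..., 28
    return np.bitwise_or.reduce(idx << shifts, axis=1)

def unpack_nibbles(words):
    """Recover the 4-bit indices from packed uint32 words."""
    shifts = np.arange(8, dtype=np.uint32) * 4
    return ((words[:, None] >> shifts) & 0xF).astype(np.uint8).reshape(-1)

example = pack_nibbles([1, 2, 3, 4, 5, 6, 7, 8])
print(hex(int(example[0])))                          # 0x87654321

group = np.arange(64) % 16                           # one group of 64 indices
packed = pack_nibbles(group)                         # 8 words, 32 bytes
roundtrip = unpack_nibbles(packed)
```

A group of 64 indices becomes eight words, which is the 256 bits of index storage counted in the arithmetic earlier.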

mlx_lm.convert --hf-path Qwen/Qwen3-8B -q --q-bits 4 --q-group-size 64

When that model runs, the work happens in MLX's quantised matrix-multiply kernels (quantized_matmul and its vector variants). One detail there matters for quality: those kernels accumulate in 32-bit float even when the model is fp16 or bf16, so the sum of many small products keeps its precision all the way through. It is a quiet decision you only notice by its absence, when another implementation cuts that corner and the output slips.
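Here is a small demonstration of why the accumulator width matters. It is a deliberately extreme toy case in NumPy, not a reproduction of MLX's kernels: a dot product of 4096 small, equal terms, summed once with a 16-bit running total and once with a 32-bit one.

```python
import numpy as np

def dot_accumulate(a, b, acc_dtype):
    """Dot product with the running sum held in acc_dtype."""
    acc = acc_dtype(0)
    for x, y in zip(a, b):
        acc = acc_dtype(acc + acc_dtype(x) * acc_dtype(y))
    return float(acc)

activations = np.full(4096, 0.01, dtype=np.float16)  # many small terms
weights = np.ones(4096, dtype=np.float16)

exact = float(np.dot(activations.astype(np.float64), weights.astype(np.float64)))
sum_fp16 = dot_accumulate(activations, weights, np.float16)
sum_fp32 = dot_accumulate(activations, weights, np.float32)

print(f"reference {exact:.4f}  fp16 accumulator {sum_fp16:.4f}  "
      f"fp32 accumulator {sum_fp32:.4f}")
```

Once the 16-bit running total grows large enough, each new term is smaller than half the spacing between representable values and rounds away to nothing, so the sum stalls. The 32-bit accumulator has enough spare precision to keep adding.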

The flat layout is simple to reason about, and on Apple Silicon it is quick. Community comparisons tend to put MLX ahead of llama.cpp's Metal backend on tokens per second for models that fit in memory, often by a useful margin. What it gives up is the trick GGUF plays next.

How GGUF does it differently

llama.cpp's "k-quant" formats, the ones you see as Q4_K, Q5_K and friends, add a second level. Instead of one flat group, Q4_K nests eight small 32-weight blocks inside a larger 256-weight superblock. Each block has its own scale and minimum, and those block values are themselves quantised against a pair of 16-bit values stored once for the whole superblock. That is the "double quantisation" you may have heard mentioned, and it is what lets GGUF spend its overhead bits more efficiently.
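A simplified sketch of that two-level idea in NumPy follows. It assumes 6-bit block scales and offsets under two 16-bit superblock values, which is our understanding of Q4_K, but it does not reproduce llama.cpp's actual bit packing or its search for the best scales. Treat the names and details as illustrative.

```python
import numpy as np

BLOCK, SUPER = 32, 256      # block and superblock sizes from the text
SCALE_BITS = 6              # assumption: 6-bit block scales and offsets

def quantise_superblock(w):
    """Two-level 4-bit quantisation of one 256-weight superblock (simplified)."""
    blocks = w.astype(np.float32).reshape(SUPER // BLOCK, BLOCK)
    # Level 1: a float scale and a non-negative offset per 32-weight block.
    offset = np.maximum(-blocks.min(axis=1), 0.0)
    scale = (blocks.max(axis=1) + offset) / 15.0
    # Level 2: quantise the eight scales and offsets against superblock-wide values.
    top = 2**SCALE_BITS - 1
    d = max(float(scale.max()) / top, 1e-12)
    dmin = max(float(offset.max()) / top, 1e-12)
    scale_q = np.round(scale / d).astype(np.uint8)
    offset_q = np.round(offset / dmin).astype(np.uint8)
    # Choose indices against the scales the decoder will actually reconstruct.
    eff_scale = np.maximum(d * scale_q, 1e-12)[:, None]
    eff_offset = (dmin * offset_q)[:, None]
    idx = np.clip(np.round((blocks + eff_offset) / eff_scale), 0, 15).astype(np.uint8)
    return idx, scale_q, offset_q, d, dmin

def dequantise_superblock(idx, scale_q, offset_q, d, dmin):
    """weight ~= d * block_scale * index - dmin * block_offset"""
    return (d * scale_q[:, None] * idx - dmin * offset_q[:, None]).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=SUPER)               # synthetic weights
idx, scale_q, offset_q, d, dmin = quantise_superblock(w)
restored = dequantise_superblock(idx, scale_q, offset_q, d, dmin)
max_err = float(np.abs(w - restored).max())

bits = SUPER * 4 + (SUPER // BLOCK) * 2 * SCALE_BITS + 2 * 16
bpw = bits / SUPER
print(f"{bpw} bits/weight, largest error {max_err:.5f}")
```

Under these assumptions the overhead works out to 4.5 bits per weight, the same as a flat group of 64, but with a scale per 32 weights rather than per 64.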

[Figure: How MLX and GGUF pack 4-bit. MLX 4-bit: flat groups of 64 weights, each carrying a scale and bias. GGUF Q4_K: blocks of 32 nested inside one superblock of 256, with the block scales also quantised.]
MLX stores one scale and bias per group of 64 weights. GGUF Q4_K nests small blocks inside a larger superblock and quantises the scales themselves, trading a touch more bookkeeping for slightly better fidelity.

There is a second thing going on in the name. The _M suffix in Q4_K_M (M for medium, alongside the smaller _S variant) selects a mixed-precision recipe: llama.cpp keeps a handful of the most sensitive tensors, the attention and feed-forward weights that hurt most when degraded, at a higher precision like Q6_K, and quantises the rest at 4-bit. So Q4_K_M is not uniformly four bits. It spends extra bits precisely where the model is fragile, which is why it tends to hold quality slightly better than a uniform 4-bit scheme at roughly the same average file size.
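To see how a mix moves the average, here is a back-of-envelope calculation. The per-format costs come from the block sizes (1152 bits per 256 weights for Q4_K, 1680 for Q6_K, as far as we understand the formats). The 15 percent share kept at Q6_K is a made-up figure for illustration, not llama.cpp's actual recipe.

```python
Q4_K_BPW = 1152 / 256    # 4.5 bits per weight
Q6_K_BPW = 1680 / 256    # 6.5625 bits per weight

def average_bpw(tensors):
    """Weighted average bits per weight over (weight_count, bits_per_weight) pairs."""
    total_bits = sum(count * bpw for count, bpw in tensors)
    total_weights = sum(count for count, _ in tensors)
    return total_bits / total_weights

# Hypothetical 8B model: 85% of weights at Q4_K, 15% kept at Q6_K.
mix = [(6_800_000_000, Q4_K_BPW), (1_200_000_000, Q6_K_BPW)]
avg = average_bpw(mix)
print(f"{avg:.3f} bits/weight")   # 4.809 bits/weight
```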

There is a whole second family below this, the "i-quants" (IQ4_XS, IQ2 and so on), which use a small codebook of weight patterns rather than plain levels, and squeeze quality into very low bit widths at the cost of slower kernels on some hardware. They are worth knowing about, but the k-quants are what most people run.

So which should you use?

On a Mac you usually end up choosing on the rest of the stack, because the gap between the formats is small. Here is how they line up.

| | MLX 4-bit | GGUF Q4_K_M |
|---|---|---|
| Layout | Flat group of 64, one scale + bias | 32-weight blocks in a 256 superblock, scales quantised |
| Precision mix | Uniform 4-bit | Mixed, sensitive tensors kept higher |
| Real footprint | About 4.5 bits/weight | About 4.5 to 4.8 bits/weight |
| Speed on Apple Silicon | Typically faster | Strong, usually a little behind |
| Quality at 4-bit | Very good | Slightly better, on average |
| Hardware | Apple Silicon (Metal) | Almost anything |
| Day-one availability | Lags, needs conversion | Usually immediate |

A quick rule of thumb. If you are on a Mac and chasing tokens per second, MLX 4-bit is the natural default, and if you want more quality you step up to 5 or 6 bits rather than worrying about the format. If you need the model the day it drops, or you run across mixed hardware, GGUF is ready first and runs everywhere. And if you are running both on the same machine anyway, which is what we end up doing, a proxy like Olla lets you keep one endpoint over the lot, as we cover in running MLX behind Olla.

The outlier problem

Both schemes so far share a weakness, and it is the thing the frontier is attacking. A handful of values in a layer, both in the weights and in the activations flowing through, are much larger than the rest. Those outliers stretch the range the grid has to cover, so the evenly spaced levels end up far apart, and every ordinary value loses resolution to make room for a few extreme ones. At 4-bit it is annoying. At 3-bit and below it is the difference between a working model and noise.
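The effect is easy to reproduce on synthetic data. This sketch quantises one group of 64 values twice, once as drawn and once with a single value replaced by an outlier, and measures the error on the 63 ordinary values only. The outlier size of 20 is chosen for illustration.

```python
import numpy as np

def quantise_roundtrip(values, bits=4):
    """Min/max affine quantisation of one group, returned in dequantised form."""
    levels = 2**bits - 1
    lo, hi = values.min(), values.max()
    scale = (hi - lo) / levels
    q = np.clip(np.round((values - lo) / scale), 0, levels)
    return q * scale + lo

def rms(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

rng = np.random.default_rng(0)
group = rng.normal(0.0, 1.0, size=64)
spiky = group.copy()
spiky[0] = 20.0                          # one outlier stretches the grid

# Compare error on the ordinary values only (everything except index 0).
err_clean = rms(quantise_roundtrip(group)[1:], group[1:])
err_spiky = rms(quantise_roundtrip(spiky)[1:], spiky[1:])
print(f"without outlier: {err_clean:.3f}   with outlier: {err_spiky:.3f}")
```

One extreme value widens the spacing between levels for the whole group, so every other value pays for it.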

You can throw bits at the problem, which is what the mixed-precision tensors in Q4_K_M are quietly doing. Or you can go after the outliers themselves, which is where most of the last couple of years of research has gone.

The rotation trick

The clever idea, and the one that has reshaped low-bit quantisation over the last couple of years, is to rotate the values before you quantise them. Multiply the weights and activations by a Hadamard matrix and the energy that was concentrated in a few outliers gets spread evenly across the whole vector. The distribution flattens, the range tightens, and a uniform grid suddenly fits beautifully.

[Figure: Rotating away the outliers. Before rotation: a spiky weight distribution with a coarse grid. After rotation: an even distribution with a tight grid.]
A handful of outliers force a coarse grid that wastes precision on the rest. A Hadamard rotation spreads the values out so a uniform grid fits them all, and the maths around it cancels the rotation so the output does not change.

The reason it feels like a free lunch is that the rotation cancels out. If you rotate one side of a matrix multiply by a Hadamard and the other side by its inverse, the result is unchanged, because the two rotations multiply back to the identity. So you get the outlier-free distribution for quantisation without changing what the model computes. QuaRot, the paper that popularised this, reports a 4-bit LLaMA-2 70B landing within about half a point of WikiText perplexity of the full-precision model while keeping roughly 99 percent of its zero-shot accuracy, with weights, activations and the KV cache all in 4-bit.
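Both halves of that claim can be checked in a few lines of NumPy. This is a toy sketch with a Sylvester-construction Hadamard matrix and synthetic data, not QuaRot's implementation: it confirms that the rotated product matches the original, and that the rotation flattens a vector with one large outlier.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's construction (n a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def peak_to_rms(v):
    """How far the largest value sticks out above the typical magnitude."""
    return float(np.abs(v).max() / np.sqrt(np.mean(v**2)))

rng = np.random.default_rng(0)
n = 64
H = hadamard(n)
x = rng.normal(0.0, 1.0, size=n)
x[3] = 40.0                              # one activation outlier (illustrative)
W = rng.normal(0.0, 0.02, size=(n, n))

y_plain = x @ W
y_rotated = (x @ H) @ (H.T @ W)          # H @ H.T is the identity, so it cancels

print("outputs match:", np.allclose(y_plain, y_rotated))
print(f"peak/RMS before: {peak_to_rms(x):.2f}  after: {peak_to_rms(x @ H):.2f}")
```

In a real pipeline the rotated activations and rotated weights are what get quantised, so the grid only ever sees the flattened distributions.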

Stack a learned codebook on top of the rotation, the way the QuIP and TurboQuant methods do, and you can hold usable quality down to two or three bits, which a year ago was firmly in "broken model" territory. The interesting work now is in reshaping the numbers before you quantise, so an ordinary grid is enough. That is where this is going.

We have built this, on the other GPU

We have spent a fair bit of time on this exact idea. Forge, our pure-Rust inference server for NVIDIA GPUs, ships a quant format called TQ4 built on this principle: a Hadamard rotation to kill the outliers, then a Lloyd-Max codebook to place the levels where the data actually sits. That work is firmly on the CUDA side, and Forge does not run on Apple Silicon, so it is not something you would point at a Mac. We bring it up because it is the same technique now arriving in the MLX world through TurboQuant and friends, and having built it at the kernel level is partly why this corner of the field is fun to write about. If you run NVIDIA hardware, Forge is where we put these ideas to work.
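For readers who have not met Lloyd-Max before, it is essentially one-dimensional k-means: assign each value to its nearest level, move each level to the mean of its members, repeat. The sketch below is a generic textbook version on synthetic Gaussian data, not Forge's TQ4 code, and the function names are ours.

```python
import numpy as np

def lloyd_max(samples, n_levels=16, iters=50):
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    # Start from the uniform grid so each iteration can only reduce the error.
    levels = np.linspace(samples.min(), samples.max(), n_levels)
    for _ in range(iters):
        nearest = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(n_levels):
            members = samples[nearest == k]
            if members.size:             # leave empty levels where they are
                levels[k] = members.mean()
    return np.sort(levels)

def mse(samples, levels):
    """Mean squared error when each sample snaps to its nearest level."""
    nearest = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
    return float(np.mean((samples - levels[nearest]) ** 2))

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=20000)  # stand-in for rotated weights

uniform = np.linspace(data.min(), data.max(), 16)
codebook = lloyd_max(data)
mse_uniform = mse(data, uniform)
mse_codebook = mse(data, codebook)
print(f"uniform grid MSE {mse_uniform:.4f}   Lloyd-Max MSE {mse_codebook:.4f}")
```

The learned levels crowd towards the centre where most of the values sit, which is where a uniform grid wastes its resolution.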

Quantising the KV cache too

Weights are only half the memory story. As a conversation grows, the KV cache (the attention keys and values for every token so far) grows with it, and at long context it can outgrow the model itself. Quantising that cache is the other big lever, and it is where most of the 2026 movement is.
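A quick estimate shows how fast the cache grows. The model shape below is hypothetical (32 layers, 8 KV heads, head dimension 128), chosen only to make the arithmetic clean, and 4.5 bits per value assumes the same group-of-64 overhead as the weights.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits_per_value=16):
    """Size of the KV cache: one key and one value vector per layer, head and token."""
    values = 2 * layers * kv_heads * head_dim * tokens   # keys + values
    return values * bits_per_value / 8

# Hypothetical model shape at a 128k-token context.
cfg = dict(layers=32, kv_heads=8, head_dim=128, tokens=131072)

fp16_gib = kv_cache_bytes(**cfg) / 2**30
q4_gib = kv_cache_bytes(**cfg, bits_per_value=4.5) / 2**30
print(f"fp16 cache: {fp16_gib} GiB   4-bit cache: {q4_gib} GiB")   # 16.0 and 4.5
```

For a model of this shape the 16-bit cache at 128k tokens is several times larger than an 8B model's 4-bit weights.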

mlx-lm can quantise the cache to a few bits, and it makes long-context work practical on memory that would otherwise run out. There are two catches. The first is that attention still wants to run in higher precision, so a quantised cache has to be dequantised on the way into the attention step, and doing that naively can briefly balloon memory: one MLX issue reports a spike of around 31GB at a 128k context on a 31B model, which is exactly why native quantised attention kernels matter. The second is plumbing: at the time of writing the library supports cache quantisation but the bundled mlx_lm.server does not expose the flags for it, so you reach for it from Python or from a multi-model server such as oMLX that wires it up for you.
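The first catch can be illustrated with a toy NumPy example. It computes attention scores against a quantised key cache two ways: by dequantising the whole cache up front, and by dequantising one chunk of tokens at a time. This is a sketch of the memory argument only, with sizes and names of our choosing; it is not how MLX's kernels are written, and a real fused kernel would dequantise on the fly without a staging buffer.

```python
import numpy as np

GROUP = 64

def quantise(x, bits=4):
    """Affine 4-bit quantisation with one scale and bias per group of 64."""
    levels = 2**bits - 1
    g = x.reshape(-1, GROUP)
    lo = g.min(axis=1, keepdims=True)
    scale = np.maximum((g.max(axis=1, keepdims=True) - lo) / levels, 1e-12)
    idx = np.clip(np.round((g - lo) / scale), 0, levels).astype(np.uint8)
    return idx, scale, lo

def dequantise(idx, scale, lo, shape):
    return (idx * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
tokens, head_dim, chunk = 4096, 64, 512
keys = rng.normal(size=(tokens, head_dim)).astype(np.float32)
query = rng.normal(size=head_dim).astype(np.float32)
idx, scale, lo = quantise(keys)          # one group per token, as head_dim == GROUP

# Naive: materialise the whole cache at full precision before attention.
full = dequantise(idx, scale, lo, (tokens, head_dim))
scores_naive = full @ query

# Chunked: only `chunk` tokens are held at full precision at any moment.
scores_chunked = np.concatenate([
    dequantise(idx[s:s + chunk], scale[s:s + chunk], lo[s:s + chunk], (-1, head_dim)) @ query
    for s in range(0, tokens, chunk)
])

peak_naive = full.nbytes                 # transient float32 copy of every key
peak_chunked = chunk * head_dim * 4      # transient float32 copy of one chunk
print(f"transient bytes: naive {peak_naive}, chunked {peak_chunked}")
```

The scores are the same either way; what changes is how much full-precision data exists at once, which is the spike the issue describes.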

The rotation idea shows up here too. TurboQuant and PolarQuant point the same rotate-then-codebook approach at the cache, and push it to two or three bits with very little quality loss. That is the difference between a 128k-token context you can afford to keep around and one you cannot.

What we would actually do

Pulling it together into something you can act on:

Picking a quant in practice

  • Mac, speed first: MLX 4-bit. The default for tokens per second on Apple Silicon. Group size 64 is the sweet spot. If a model has not been converted yet, the mlx-community organisation on Hugging Face usually has it within a day or two.
  • Quality matters more: 5 to 6-bit. Do not agonise over MLX versus GGUF at 4-bit. Step the bit width up instead. Five or six bits buys back most of what 4-bit gives away, for a modest size increase.
  • Mixed hardware or day-one: GGUF Q4_K_M. Runs everywhere, ready the moment a model drops, and its mixed precision holds quality well. The natural choice when you are not all-in on Apple Silicon.
  • Long context: quantise the KV cache. Once your prompts run to tens of thousands of tokens, the cache is the memory you run out of first. Quantise it, and watch for the dequantisation cost on the way into attention.

Quantisation has quietly become the most important performance lever in local inference, and it is moving fast. The grids are mostly a solved problem; the action now is in reshaping the numbers before you quantise them, so that four bits, or three, or two, is enough. Worth keeping an eye on, because every step down that ladder puts a bigger model on the hardware you already own.


Key takeaways

  • "4-bit" is really about 4.5 bits per weight. The extra half a bit is the scale and bias each group of weights has to carry.
  • MLX keeps the layout flat, one scale and bias per group of 64, which is fast and simple on Apple Silicon.
  • GGUF Q4_K nests blocks in superblocks and quantises the scales, and Q4_K_M keeps sensitive tensors higher, which holds quality slightly better at the same size.
  • The gap between them at 4-bit is small. On a Mac, pick on speed and ecosystem, and reach for 5 or 6 bits when you want more quality.
  • Outliers are the real enemy of low-bit quantisation, and Hadamard rotation spreads them out so a uniform grid fits, without changing the model's output.
  • Quantising the KV cache is the next lever, and the rotate-then-codebook methods are pushing both weights and cache below four bits.

Further reading