LLM Quantisation: A Field Guide for 2026
By the end of this you should be able to read any quant label and know roughly what you are getting, why someone chose it, and when you would pick something else. The MLX and GGUF mechanics live in our companion piece if you want them.
Why make models small
A model's weights are typically trained and released in 16-bit floating point, two bytes per weight. An 8B model in 16-bit is about 16GB; a 70B is around 140GB. Quantisation stores each weight in fewer bits, and that buys two things:
- Less memory: the model takes up a fraction of the space, so a bigger one fits on the same card.
- More speed: token generation is bound by memory bandwidth, so reading fewer bytes per token runs faster.
Drop from 16-bit to 4-bit and a model is roughly a quarter of the size, reads a quarter of the bytes, and a 70B suddenly fits in about 40GB instead of 140.
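The arithmetic behind those sizes is short enough to check by hand. This is a rough sketch that counts only weight storage; the 4.5 figure is an assumed effective bit width for a 4-bit quant once shared scales are included, and `model_size_gb` is a name made up for the illustration.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring everything that is not a weight."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9

for params in (8, 70):
    for bits in (16, 8, 4.5):
        print(f"{params}B at {bits} bits per weight: {model_size_gb(params, bits):.1f} GB")
```

At 4.5 bits per weight a 70B model comes out just under 40GB, which is where the "about 40GB" figure comes from.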
The cost is precision. You are approximating every weight and a careless approximation makes the model worse.
The number line
Start with the formats, because half the labels you will see are just naming one. These are the ones you will actually meet.
A float spends its bits on three parts:
- Sign: one bit, positive or negative.
- Exponent: how large or small the number can get, the range.
- Mantissa: how finely that range is resolved, the precision.
BF16 keeps FP32's wide range (eight exponent bits, against FP16's five) but spends fewer bits on the mantissa than FP16, which is why it is the training default. FP8 comes in two flavours, and the split is the whole story:
- E4M3: four exponent bits, three mantissa. More precision, less range, and the usual choice for inference.
- E5M2: five exponent bits, two mantissa. More range, less precision.
Integers are flatter: INT8 and INT4 drop the exponent entirely and carve the range into even steps.
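The range-against-precision trade can be computed directly from the bit counts. The sketch below is illustrative rather than a full decoder: `float_format` is a hypothetical helper, and it assumes E4M3 uses the finite-only layout (no infinities, a single NaN pattern) while the other formats follow the usual IEEE-style layout.

```python
def float_format(exp_bits: int, man_bits: int, ieee_like: bool = True):
    """Return (largest finite value, step size just above 1.0) for a small float format."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # The top exponent code is reserved for inf/NaN; the mantissa can be all ones.
        max_exp = (2 ** exp_bits - 2) - bias
        max_mantissa = 2 - 2 ** -man_bits
    else:
        # Finite-only E4M3: the top exponent code is usable, and only the
        # all-ones mantissa at that exponent is reserved for NaN.
        max_exp = (2 ** exp_bits - 1) - bias
        max_mantissa = 2 - 2 * 2 ** -man_bits
    return max_mantissa * 2.0 ** max_exp, 2.0 ** -man_bits

formats = {
    "FP16": (5, 10, True),
    "BF16": (8, 7, True),
    "FP8 E4M3": (4, 3, False),
    "FP8 E5M2": (5, 2, True),
}
for name, (e, m, ieee) in formats.items():
    largest, step = float_format(e, m, ieee)
    print(f"{name:9s} largest {largest:.4g}, step above 1.0 {step}")
```

E4M3 tops out at 448 with steps of 0.125 near 1.0, while E5M2 reaches 57344 with steps of 0.25: one more exponent bit buys range, one more mantissa bit buys resolution.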
The interesting recent arrivals are the 4-bit floats. FP4 (E2M1) only has sixteen possible values, so on its own it is hopeless. The trick is microscaling: split the tensor into small blocks and give each block its own scale factor, so each block can zoom in on its local range. The industry landed on two:
- MXFP4, the open standard: 32-element blocks with a simple power-of-two scale.
- NVFP4, NVIDIA's version: tighter 16-element blocks with a higher-precision FP8 scale, so more accurate at a touch more overhead.
Both run natively on Blackwell GPUs, and that hardware support is why 4-bit floats suddenly matter.
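To make microscaling concrete, here is a minimal sketch of quantising one block in the MXFP4 style: FP4 (E2M1) element values with one shared power-of-two scale. It is a simplification under stated assumptions, not a reference implementation: the function names are hypothetical, the scale is kept as a Python float rather than packed into an 8-bit exponent, and rounding ties are not handled the way hardware would.

```python
import math

# The eight non-negative magnitudes an E2M1 float can hold.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantise_block_mx(block):
    """Quantise one block to FP4 values with a shared power-of-two scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Pick the power of two that puts the largest magnitude between 4 and 8
    # after scaling; anything that lands above 6 saturates at 6.
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    codes = []
    for x in block:
        target = min(abs(x) / scale, 6.0)
        nearest = min(FP4_GRID, key=lambda g: abs(g - target))
        codes.append(math.copysign(nearest, x))
    return scale, codes

def dequantise_block(scale, codes):
    return [scale * c for c in codes]

block = [0.75, -0.5, 0.125, 0.0]
scale, codes = quantise_block_mx(block)
print(scale, codes, dequantise_block(scale, codes))
```

Because the scale is chosen per block, a block of tiny weights and a block of large ones each get to use the full FP4 grid, which is the whole point of the scheme.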
How it works, at heart
Underneath every format is the same small recipe. Take a group of weights, find the range they span, and lay a fixed number of evenly spaced levels across it. Each weight snaps to its nearest level and is stored as a small integer index. To get the real number back, you multiply the index by a scale (the gap between levels) and add an offset (where the range starts).
That is affine quantisation, and four small steps run it:
- Pick the scale and offset for the group.
- Subtract the offset from each weight and divide by the scale.
- Round to the nearest whole number.
- Clamp anything that falls past the ends of the range.
Reading it back is just the reverse. The reason "4-bit" is never really four bits per weight is those shared scale and offset values, which cost a little on top, so a 4-bit model lands closer to 4.5 bits in practice.
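The recipe above fits in a dozen lines. This is a minimal sketch of affine quantisation for a single group, with hypothetical function names; the overhead line at the end assumes a 16-bit scale and a 16-bit offset shared by a group of 64 weights, which is one common way to arrive at 4.5 bits.

```python
def quantise_group(weights, bits=4):
    """Affine-quantise one group: returns (scale, offset, integer codes)."""
    levels = 2 ** bits - 1               # highest code, 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0    # gap between levels; 1.0 if the group is constant
    offset = lo                          # where the range starts
    codes = []
    for w in weights:
        q = round((w - offset) / scale)          # shift, divide, round
        codes.append(max(0, min(levels, q)))     # clamp to the valid codes
    return scale, offset, codes

def dequantise_group(scale, offset, codes):
    return [q * scale + offset for q in codes]

group = [-0.30, -0.12, 0.00, 0.07, 0.15, 0.45]
scale, offset, codes = quantise_group(group)
restored = dequantise_group(scale, offset, codes)
worst = max(abs(a - b) for a, b in zip(group, restored))
print(codes, f"worst error {worst:.4f}, half a step is {scale / 2:.4f}")

# Storage cost per weight: 4 bits plus the shared scale and offset.
print(4 + (16 + 16) / 64, "bits per weight")
```

No weight ends up more than half a step from its stored value, so the scale is the error budget: halve the range a group has to cover and you halve the worst-case error.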
Three choices decide how good the result is, and they are the switches every method further down is flicking.
Symmetric or asymmetric. Symmetric quantisation centres the levels on zero and stores only a scale. That suits weights, which sit roughly evenly either side of zero. Asymmetric quantisation also stores an offset, the "zero point", so the levels can start anywhere, which suits lopsided ranges like the output of a ReLU that is never negative. As a rule, weights go symmetric and activations asymmetric.
How many weights share a scale. This is granularity, and it is the single biggest lever on quality.
- Per-tensor: one scale for the whole weight matrix. Cheapest, and a single outlier ruins it for everyone.
- Per-channel: a scale for each row or column. Much better, and the standard for 8-bit.
- Per-group (or per-block): a scale for every small run of weights, commonly 64 or 128. This is what modern 4-bit uses, and it is why a 4-bit model holds up at all. Smaller groups track the weights more tightly and cost more overhead; bigger groups are cheaper and blunter.
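A toy comparison shows why granularity matters so much. The sketch below uses symmetric 4-bit quantisation with an absmax scale and an unrealistically small group size of four so the numbers stay readable; the helper names and the weights are invented for the illustration.

```python
def absmax_roundtrip(values, bits=4):
    """Symmetric quantisation with one shared scale, then back to floats."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_abs_error(values, group_size):
    """Quantise in runs of group_size, each with its own scale, and measure the damage."""
    restored = []
    for i in range(0, len(values), group_size):
        restored += absmax_roundtrip(values[i:i + group_size])
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

# Seven ordinary weights and one outlier at the end.
weights = [0.02, -0.05, 0.03, 0.04, -0.01, 0.06, -0.03, 2.00]
per_tensor = mean_abs_error(weights, group_size=len(weights))
per_group = mean_abs_error(weights, group_size=4)
print(f"one scale for everything: {per_tensor:.4f}")
print(f"one scale per group of 4: {per_group:.4f}")
```

With a single scale the outlier flattens every ordinary weight to zero. With groups, only the outlier's own group pays for it, and the other group keeps its resolution.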
Where to clip. A few extreme weights can stretch the range so far that everything else loses resolution. Clipping throws those few away so the levels pack tightly around the values that matter, at the cost of representing the extremes badly. Getting that boundary right is a surprising amount of what separates a good quant from a sloppy one.
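One simple way to pick the boundary is to try a handful of clip values and keep the one with the lowest reconstruction error. The sketch below does exactly that on made-up data; real methods use smarter searches and better error measures, so treat the grid search and the mean-squared-error criterion as assumptions of the example.

```python
def roundtrip_error(values, clip, bits=4):
    """Mean squared error of symmetric quantisation with the range clipped at +/-clip."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax
    total = 0.0
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))   # values past the clip saturate
        total += (v - q * scale) ** 2
    return total / len(values)

# 1001 ordinary weights spread over [-0.2, 0.2], plus one extreme value.
weights = [0.0004 * i for i in range(-500, 501)] + [1.5]

full = roundtrip_error(weights, clip=max(abs(w) for w in weights))
best_clip = min((0.05 * k for k in range(4, 31)),
                key=lambda c: roundtrip_error(weights, c))
print(f"no clipping: {full:.5f}")
print(f"clip at {best_clip:.2f}: {roundtrip_error(weights, best_clip):.5f}")
```

The best clip sits well inside the full range: the single extreme weight is represented badly, and the thousand ordinary ones are represented much better.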
How MLX and GGUF each pack all this is the subject of our 4-bit deep dive; the shape above is common to all of them.
Two ways to get there: after, or during
There are two moments you can quantise a model, and the difference matters more than any single method.
Post-training quantisation (PTQ) takes a finished model and compresses it after the fact. Weights are fixed, so their scales can be worked out once and baked in. Activations are trickier, because they change with every input, so you pick how to handle them:
- Dynamic: measure the activation ranges live as the model runs. Accurate, a little slower.
- Static calibration: estimate them ahead of time by pushing a few hundred representative samples through and watching the ranges. Faster, but only as good as the samples you chose.
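The difference between the two shows up as soon as a live input exceeds anything the calibration set contained. A minimal sketch with invented numbers, using symmetric INT8 and an absmax scale for both modes:

```python
def int8_roundtrip(values, scale):
    """Symmetric INT8 quantise-then-dequantise with a given scale."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

# Static: fix the scale ahead of time from calibration samples.
calibration = [[0.2, -1.1, 0.7], [0.9, -0.4, 1.3], [-0.8, 0.5, 1.0]]
static_scale = max(abs(v) for sample in calibration for v in sample) / 127

# A live input with a value larger than anything seen during calibration.
live = [0.3, -0.6, 2.6]

# Dynamic: measure the range of this particular input at run time.
dynamic_scale = max(abs(v) for v in live) / 127

static_out = int8_roundtrip(live, static_scale)
dynamic_out = int8_roundtrip(live, dynamic_scale)
print("static :", [round(v, 3) for v in static_out])
print("dynamic:", [round(v, 3) for v in dynamic_out])
```

The static scale clips 2.6 down to 1.3, the largest value it was calibrated for, while the dynamic scale keeps it at the price of a coarser grid for the smaller values and an extra pass over the activations.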
PTQ takes minutes to hours, needs no retraining, and is what almost every quant you download is. At 4-bit and above it is excellent.
Quantisation-aware training (QAT) simulates the quantisation noise during a short fine-tune, so the model adjusts its own weights to survive being squeezed. It costs real training compute, but it holds quality far better at low bit widths.
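A toy version of the idea, under one loud assumption: it uses the straight-through estimator, a common way to train through the rounding step, which the text above does not name. The forward pass sees the quantised weight, and the gradient is applied to a full-precision copy as if the rounding were not there. Everything here (one weight, a fixed scale of 0.1, the target 0.37) is invented for the illustration.

```python
def fake_quant(w, scale=0.1, bits=4):
    """Snap w to the nearest signed 4-bit level but keep it as a float."""
    qmax = 2 ** (bits - 1) - 1
    return max(-qmax - 1, min(qmax, round(w / scale))) * scale

# Toy task: learn a single weight so that w * x matches y = 0.37 * x.
data = [(x, 0.37 * x) for x in (1.0, 2.0, 3.0)]

w, lr = 0.0, 0.01                    # w is the full-precision "latent" weight
for _ in range(500):
    for x, y in data:
        pred = fake_quant(w) * x     # forward pass sees the quantised weight
        grad = 2 * (pred - y) * x    # gradient with respect to the quantised weight
        w -= lr * grad               # straight-through: apply it to the latent weight

print(f"latent weight {w:.3f}, quantised weight {fake_quant(w):.1f}")
```

The latent weight settles near the boundary between the 0.3 and 0.4 levels, the two grid points either side of the target, so the model ends up on the best level the grid offers rather than wherever rounding happens to drop it.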
For years QAT was treated as too expensive to bother with at scale. That changed, and 2026 made it ordinary for small models. The big one this year was Gemma 4: alongside the usual weights, Google shipped quantisation-aware checkpoints and a new mobile format that gets the smallest model, E2B, from 9.6GB down to about 1GB, small enough to live on a phone. It manages that by quantising the token-generation layers all the way to 2 bits, with the scales worked out at training time rather than guessed afterwards. Gemma 3 the year before was the run-up: it took its 27B model from 54GB to about 14GB at 4-bit and roughly halved the quality loss against plain PTQ. Apple's on-device model on current iPhones is squeezed to 2 bits the same way. The recipe is simple: spend about 90 percent of your training compute at full precision, then the last 10 percent on a quantisation-aware fine-tune, so the model only adapts to the low bits at the very end.
Weights are easy, activations are mean
So far this has all been about weights. But a model also has activations, the numbers flowing through it as it runs, and they are where quantisation gets hard.
Weight distributions are roughly bell-shaped and quantise nicely. Activation distributions have outliers: a few channels, often after attention or a normalisation layer, blow up to a hundred times the size of everything else. Quantise naively and those outliers stretch the range until every ordinary value collapses to almost nothing.
This is what the W and A in a quant's name are telling you, the bit width of the weights and of the activations:
| Notation | Weights | Activations | Where you see it |
|---|---|---|---|
| W4A16 | 4-bit | 16-bit | GGUF, GPTQ, AWQ, most local models |
| W8A8 | 8-bit | 8-bit | server inference with INT8 or FP8 |
| W4A8 | 4-bit | 8-bit | high-throughput serving |
| W4A4 | 4-bit | 4-bit | the research frontier, needs rotation |
Weight-only W4A16 is the easy, common case, because the activations stay in full precision and never meet an outlier they cannot hold. Pushing the activations down to 8 or 4 bits is the real fight, and two ideas tackle it:
- SmoothQuant: mathematically shifts the difficulty off the activations and onto the weights, where it is easier to handle.
- Rotation: spreads the outliers out so no single channel dominates the range.
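The SmoothQuant move can be shown with plain lists. The sketch below divides each activation channel by a per-channel factor and multiplies the matching weight row by the same factor, so the product is unchanged. The factor follows the form given in the SmoothQuant paper, the channel's activation maximum raised to alpha over its weight maximum raised to one minus alpha; alpha of 0.5 and all the numbers are assumptions of the example.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def smooth(x, w, alpha=0.5):
    """Move range from activation channels onto the matching weight rows."""
    channels = len(w)
    x_max = [max(abs(row[j]) for row in x) for j in range(channels)]
    w_max = [max(abs(v) for v in w[j]) for j in range(channels)]
    s = [x_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(channels)]
    x_smooth = [[row[j] / s[j] for j in range(channels)] for row in x]
    w_smooth = [[v * s[j] for v in w[j]] for j in range(channels)]
    return x_smooth, w_smooth

# Two tokens, three input channels; channel 1 carries an outlier.
x = [[0.5, 80.0, -0.3], [-0.4, 95.0, 0.6]]
w = [[0.2, -0.1], [0.05, 0.3], [-0.25, 0.15]]   # shape: channels x outputs
x_s, w_s = smooth(x, w)

print("largest activation before:", max(abs(v) for row in x for v in row))
print("largest activation after: ", round(max(abs(v) for row in x_s for v in row), 2))
print("same output:", matmul(x, w)[0], matmul(x_s, w_s)[0])
```

The outlier channel shrinks from 95 to about 5, its weight row grows to match, and the layer's output is the same up to floating-point rounding.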
The toolbox
Most of these are PTQ recipes for shrinking a finished model so it runs leaner. The common ones:
- GPTQ quantises a layer's weights one at a time, and after each one it nudges the weights it has not done yet to absorb the error it just made, guided by second-order (Hessian) information. Slow to produce, still in wide use at 4-bit.
- AWQ spots the roughly one percent of weight channels that matter most to the activations and scales them up before quantising, so rounding costs them less when everything drops to 4-bit. No backprop, quick to produce, and the default weight-only 4-bit format in most GPU serving stacks.
- GGUF k-quants and i-quants are the llama.cpp family: k-quants nest blocks inside superblocks, i-quants add a small codebook calibrated by an importance matrix. The `Q4_K_M` and `IQ4_XS` names you see on local models come from here, and it is what Ollama runs.
- HQQ and AutoRound are newer and faster. HQQ needs no calibration data at all; AutoRound learns the rounding itself for a few extra accuracy points.
All of those shrink a model you intend to run. There is a second job, quantising a model so you can cheaply fine-tune it on hardware that could never hold it in full precision, and that is what QLoRA is for:
- NF4 (NormalFloat-4) is its 4-bit data type. The sixteen levels are placed to match the bell curve a weight matrix actually follows, rather than spaced evenly, so it spends its resolution where the weights actually are.
- Double quantisation then quantises the scales themselves, shaving a little more off the footprint.
- You freeze the base model in NF4 and train small LoRA adapters in 16-bit on top, so only a sliver of the parameters ever update.
The upshot is that you can fine-tune a 70B model on a single consumer GPU, something that not long ago needed a server rack. It is the technique that put fine-tuning within reach of people without a datacentre.
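The idea behind NF4's level placement can be sketched with the standard library. To be clear about what this is not: real NF4 uses a fixed published table with an exact zero and an asymmetric split, and this sketch does not reproduce it. It only shows the underlying principle, placing sixteen levels at evenly spaced quantiles of a normal distribution, with a hypothetical helper name.

```python
from statistics import NormalDist

def normal_quantile_levels(bits=4):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled into [-1, 1]."""
    n = 2 ** bits
    raw = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]

levels = normal_quantile_levels()
gaps = [b - a for a, b in zip(levels, levels[1:])]
print([round(v, 3) for v in levels])
print(f"gap at the centre {min(gaps):.3f}, gap at the edge {max(gaps):.3f}")
```

The levels crowd together near zero, where most weights live, and spread out towards the tails, which is the opposite of what an evenly spaced integer grid does.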
Then there is the frontier, and it is built on two ideas. The first is rotation. Multiply the weights and activations by a Hadamard matrix and the outlier energy spreads evenly across every channel, so a uniform grid suddenly fits.
The rotation cancels out in the surrounding maths, so the model computes the same thing, but the numbers going into the quantiser are now outlier-free. QuaRot introduced this; SpinQuant made the rotation learned rather than fixed, and Meta ships SpinQuant 4-bit Llama models that run on phones. The second idea is codebooks: instead of evenly spaced levels, learn a small dictionary of weight patterns and store indices into it. Methods like QuIP# and AQLM use this to reach two and three bits where plain grids fall apart.
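The rotation trick is easy to verify on four channels. The sketch below uses a fixed 4x4 Hadamard matrix scaled to be orthonormal, in the spirit of the fixed-rotation approach; the vectors are invented, and real implementations fold the rotation into neighbouring weight matrices rather than applying it at run time like this.

```python
H = [[1, 1, 1, 1],
     [1, -1, 1, -1],
     [1, 1, -1, -1],
     [1, -1, -1, 1]]
R = [[h / 2 for h in row] for row in H]   # divide by sqrt(4) so R is orthonormal

def rotate(vec):
    return [sum(r * v for r, v in zip(row, vec)) for row in R]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

activations = [100.0, 1.0, 1.0, 1.0]      # one outlier channel
weights = [0.2, -0.1, 0.4, 0.3]

rot_act = rotate(activations)
rot_w = rotate(weights)
print("rotated activations:", rot_act)     # [51.5, 49.5, 49.5, 49.5]
print("dot product before:", dot(activations, weights))
print("dot product after: ", dot(rot_act, rot_w))
```

Before the rotation one channel is a hundred times the others; afterwards all four are within a few percent of each other, and the dot product the layer actually computes is unchanged up to floating-point rounding.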
How low can you go
This is the question everyone actually wants answered, and the shape of the answer is a cliff.
From 16 bits down to about 4, quality barely moves, so 4-bit is where we would start unless there is a reason to go higher. Three bits is doable with the rotation and codebook tricks above, though reasoning and coding suffer more than chat does. Two bits is where ordinary post-training quantisation simply breaks, and only QAT keeps a usable model. One research result is worth flagging: more-trained models are harder to quantise, because they have packed more into each weight, so as frontier models train on more tokens the cheap quantisation gets riskier.
The clearest exception is BitNet, Microsoft's line of natively low-bit models, trained at 1.58 bits (ternary weights, just minus-one, zero and plus-one). It sits off the curve because it never had the precision to lose in the first place. You cannot convert an existing model to it; you have to train it that way. It is a glimpse of a different path, where the model is built small from the start.
The other half: the KV cache
Quantising weights is only half the memory story, and at long context it is the smaller half. As a conversation grows, the KV cache (the attention keys and values for every token so far) grows with it, and it does not stop. A 70B model at a 128k context can hold around 40GB of KV cache, roughly the size of the 4-bit weights themselves. Past that, you run out of cache before you run out of model.
So you quantise it too. FP8 KV cache is production-standard now in vLLM and SGLang and roughly doubles the context or batch you can fit, near losslessly on ordinary tasks. The research frontier pushes further with rotation-and-codebook methods like Google's TurboQuant and PolarQuant, which reach three bits or below. One caveat: the "quality neutral at extreme compression" headlines hold only at their gentlest settings, and vLLM's own testing found plain FP8 is still the best default once you account for the throughput cost and the hit to reasoning at the aggressive end.
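The 40GB figure and the FP8 saving both fall out of one multiplication. The model shape below is an assumption, not something stated above: a Llama-style 70B with 80 layers, 8 key-value heads under grouped-query attention, and a head dimension of 128. Other 70B models will differ.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Bytes for keys plus values across all layers, for one sequence."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

shape = dict(layers=80, kv_heads=8, head_dim=128, tokens=128 * 1024)
fp16 = kv_cache_bytes(**shape)
fp8 = kv_cache_bytes(**shape, bytes_per_value=1)
print(fp16 / 2**30, "GiB in 16-bit,", fp8 / 2**30, "GiB in FP8")   # 40.0 and 20.0
```

Halving the bytes per value halves the cache, which is the same thing as fitting twice the context or twice the batch in the memory you already have.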
In practice it is one flag. Here is Gemma 4 served with an FP8 KV cache on vLLM (SGLang exposes an equivalent setting); `--kv-cache-dtype fp8` is the switch, and `--max-model-len` is the value to change when you move the context up.

    vllm serve google/gemma-4-31B-it \
      --tensor-parallel-size 2 \
      --max-model-len 131072 \
      --kv-cache-dtype fp8 \
      --gpu-memory-utilization 0.90 \
      --host 0.0.0.0 \
      --port 8000

It is in the silicon now
The biggest shift is that low precision has moved from a software trick into the hardware. NVIDIA's Blackwell GPUs run FP4 natively and roughly double their FP8 throughput doing it; a DeepSeek-R1 on a B200 in NVFP4 reportedly serves around three times the tokens of an H200 in FP8. Apple's M5 added native FP8 and INT4 to its GPU cores. And the formats are converging: FP8 (the microscaling MXFP8 flavour) is becoming the datacentre baseline, with FP4 the next step for inference.
It has reached training, too. DeepSeek has trained in FP8 since V3, and ships its current flagship, v4, with FP8 as the native format you download. Meta trained Llama 4's largest model in FP8. NVIDIA has shown a 12B model pretrained in 4-bit NVFP4 on ten trillion tokens with almost no quality gap. No shipping model is born in FP4 yet, but you can see where this is going: the precision a model is trained in, run in, and stored in is all drifting down together, now that the silicon rewards it.
What we run on the NVIDIA side
This is not abstract for us. Forge, our pure-Rust inference server for NVIDIA GPUs, supports nine of these formats directly: BF16 and FP8, INT4 in GPTQ and AWQ, AutoRound, Compressed-Tensors, and the two 4-bit floats MXFP4 and NVFP4. It also ships its own format, TQ4, built on exactly the rotation-plus-codebook idea from the frontier above: a Hadamard rotation to kill outliers, then a Lloyd-Max codebook to place the levels where the data actually sits, for about 5.7x compression over BF16. That work is firmly CUDA-side, so it is an NVIDIA story, not a Mac one. We mention it because having built one of these formats at the kernel level is partly why this whole area is worth a field guide.
A word on file formats
We can think of model formats as container formats (similar to AVI, Matroska, or MP4 for video), because they define how everything is organised and transported. The format you download usually indicates which inference toolchain it is intended for.
- Safetensors is the Hugging Face standard for raw weights. It replaced the old pickle-based `.bin` files, which could run arbitrary code when you loaded them, with a plain tensor store plus a small JSON header, memory-mapped so it loads fast. It holds the weights and little else, and it is what the GPU stacks expect. Full-precision, FP8 and most GPTQ and AWQ checkpoints ship as safetensors.
- GGUF is the llama.cpp container, and it is the all-in-one of the bunch: weights, tokenizer, chat template and quant settings in one self-contained file. It grew out of GGML, Georgi Gerganov's original format, which was fast but brittle, with no metadata and a habit of breaking compatibility between versions. GGUF (GPT-Generated Unified Format) replaced it in 2023 with versioning and room to grow, and it is where the k-quants and i-quants live. Run Ollama or llama.cpp and you run GGUF.
- MLX is barely a separate container at all. Apple's framework stores its models as safetensors too, with its own quantisation baked in, so an "MLX model" is really a safetensors file the mlx-community converted for Apple Silicon.
In practice: safetensors for the servers and GPU stacks, GGUF for local llama.cpp and Ollama, and MLX for Macs, which is safetensors in Apple Silicon clothing.
Reading a quant label
A model filename is just labelled parts glued together. Take `Qwen3.6-32B-Instruct-AWQ`: model family and version, parameter count, variant, then the quant tag.
The quant tag at the end has its own little grammar, and each piece is worth being able to read:
- `Q4_K_M` (GGUF), a 4-bit k-quant at the medium tier:
  - `Q4` the bit width
  - `K` the k-quant block structure
  - `M` the size tier, from `S` small to `L` large
  - an `I` prefix (as in `IQ4_XS`) marks an i-quant, a codebook tuned by an importance matrix
- `W4A16`, 4-bit weights with the activations left untouched:
  - `W4` 4-bit weights
  - `A16` 16-bit activations
- `FP8`, `INT4`, `NVFP4`, where the number is the bit width:
  - `FP` a float, `INT` an integer
  - a vendor prefix like `NV` or `MX` names a particular 4-bit-float layout
- `-AWQ`, `-GPTQ`, `-GGUF`, the method or container used to make it:
  - `AWQ` and `GPTQ` are usually weight-only 4-bit
  - `GGUF` is the [llama.cpp](https://github.com/ggml-org/llama.cpp) container holding the k- and i-quants
So Qwen3.6-32B-Instruct-AWQ reads as a 32B instruct model, weight-only 4-bit via AWQ, activations left at 16-bit. With the parts named, the label tells you most of what you need before you download anything.
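The tag grammar is regular enough to capture in a few patterns. This is a rough sketch, not a complete parser: `read_quant_tag` is a hypothetical helper, it covers only the tag shapes discussed in this section, and real filenames contain plenty of variants it will report as unknown.

```python
import re

def read_quant_tag(tag: str) -> str:
    """Describe a quant tag in words; returns 'unknown' for anything unrecognised."""
    if m := re.fullmatch(r"Q(\d)_K_(S|M|L)", tag):
        return f"GGUF {m.group(1)}-bit k-quant, size tier {m.group(2)}"
    if m := re.fullmatch(r"IQ(\d)_(XXS|XS|S|M)", tag):
        return f"GGUF {m.group(1)}-bit i-quant, size tier {m.group(2)}"
    if m := re.fullmatch(r"W(\d+)A(\d+)", tag):
        return f"{m.group(1)}-bit weights, {m.group(2)}-bit activations"
    if m := re.fullmatch(r"(NV|MX)?(FP|INT)(\d+)", tag):
        kind = "float" if m.group(2) == "FP" else "integer"
        vendor = f" ({m.group(1)} layout)" if m.group(1) else ""
        return f"{m.group(3)}-bit {kind}{vendor}"
    if tag in ("AWQ", "GPTQ"):
        return f"weight-only quantisation via {tag}, usually 4-bit"
    return "unknown"

for tag in ("Q4_K_M", "IQ4_XS", "W4A16", "FP8", "NVFP4", "AWQ"):
    print(f"{tag:7s} {read_quant_tag(tag)}")
```

Splitting a filename on hyphens and passing the last piece to a function like this is usually enough to tell what you are about to download.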
Choosing a quant
Stripping it back to something you can act on:
- Running a model locally? 4-bit is your default. GGUF Q4_K_M or an AWQ checkpoint. Step up to 5 or 6 bits if you do reasoning or coding and quality matters more than the last gigabyte. The current crop, Qwen 3.6, DeepSeek v4, GLM 5.2, MiniMax 3, all turn up as community 4-bit GGUFs within days of release.
- Serving on GPUs? FP8 weights and FP8 KV cache on Hopper or Blackwell. Those same big models almost all ship an FP8 release aimed at exactly this. Drop to AWQ 4-bit weights when memory is tighter than throughput.
- Targeting a phone or a tiny memory budget? This is QAT territory. Reach for a model that was quantised that way (Gemma 4's QAT releases) rather than squeezing one yourself to 2 or 3 bits.
- Running long contexts? Quantise the KV cache, FP8 first. It is often the single biggest memory win once your prompts run to tens of thousands of tokens.
The field moves fast, but the shape of it is stable. Pick 4-bit unless you have a reason, watch the activations and the KV cache as much as the weights, and keep in mind that four bits is roughly the floor for squeezing a finished model. Below that, the model has to be trained for low precision from the start.
Key takeaways
- Quantisation stores weights in fewer bits to save memory and read fewer bytes per token. 4-bit is about a quarter the size of 16-bit.
- The formats are the alphabet soup: FP16/BF16 for training, FP8 for servers, INT4 for local models, and FP4 (MXFP4/NVFP4) now that Blackwell runs it in hardware.
- PTQ is fine at 4-bit and up; QAT wins below it, which is why Google, Apple and Meta now ship quantisation-aware models for on-device use.
- Activations are the hard part, because of outliers, and rotation (QuaRot, SpinQuant) is the trick that tames them.
- Quality falls off a cliff below 4 bits. Three needs tricks, two needs QAT, and natively low-bit models like BitNet are a different path entirely.
- The KV cache is the other half of the memory bill at long context, and quantising it (FP8 first) is often the biggest win.
- Low precision is now in the silicon, in both inference and training, and the formats are converging downward.
Further reading
- MLX vs GGUF quantisation - the 4-bit mechanics, pulled apart
- The bandwidth wall - why reading fewer bytes per token makes generation faster
- Under the hood of MLX - where the quantised kernels actually run
- Introducing NVFP4 - NVIDIA on 4-bit inference on Blackwell
- Gemma 3 QAT - quantisation-aware training shipped to consumers
- BitNet b1.58 - a model trained natively at 1.58 bits
- Forge - our NVIDIA inference server and its TQ4 format