LearnInference & Production ScaleModel Quantization: GPTQ, AWQ & GGUF

🚀HardInference Optimization

Model Quantization: GPTQ, AWQ & GGUF

Understand how GPTQ, AWQ, and GGUF trade off accuracy, memory footprint, and portability when serving LLMs on GPUs or local hardware.

36 min read

Learning path

Step 131 of 155 in the full curriculum

Model Parallelism for LLM Inference Local LLM Deployment

Model Quantization: GPTQ, AWQ, and GGUF

Model parallelism splits a large model across several accelerators. Quantization asks a different question: what if each weight used fewer bytes before you split anything?

Model quantization stores selected tensors with fewer bits so large models can require less memory and potentially less bandwidth. This chapter explains GPTQ, a Hessian-aware post-training quantization method, AWQ (Activation-Aware Weight Quantization), and GGUF, a portable local-inference container format, from the serving tradeoff: what footprint you reduce, what quality can move, and which runtime can use the artifact.

Imagine you run the AI team for an online marketplace. A 70 billion parameter model needs about 140 GB of weight storage in FP16 or BF16. That exceeds one 80 GB accelerator before KV cache or runtime buffers. You need to test whether a lower-bit artifact fits while preserving required task quality.

Quantization is that technique. It works like replacing a precise ruler with a coarser grid: the storage reduction is predictable, while the error depends on which values the grid distorts. Instead of storing every model weight as a high-precision floating-point number, a weight-only method packs approximate low-bit values plus metadata. Some models and tasks tolerate that approximation well; others regress enough that the artifact must be rejected.

By this point in the curriculum you already know how a Transformer turns tokens into predictions. The next question is how to fit the resulting model onto real hardware. LLM inference is often memory-bandwidth bound, which means the accelerator spends much of its time streaming weights from memory instead of saturating the math units. Quantization helps because 4-bit weights cut raw weight traffic to about one quarter of FP16. Real speedups are smaller than 4x because kernels still have to dequantize and accumulate in higher precision, but the memory savings are real. Shrinking weights from about 140 GB to about 35 GB can be the difference between "needs multiple datacenter GPUs" and "fits on a workstation or mixed CPU/GPU local setup." This article breaks down GPTQ, AWQ, and GGUF: what problem each one solves, and when to use each.

What is quantization?

Think of it like rounding numbers on a ruler. A precise ruler has millimeter markings, so you can measure 3.7 mm, 4.2 mm, 12.1 mm. A coarse ruler only has centimeter markings, so those same values become 4 cm, 4 cm, 12 cm. You lose some precision, but the ruler is simpler and cheaper. Quantization does this to every number in a neural network.

A tiny worked example

Before the general formula, walk through one weight by hand. Suppose a weight has the value $w = 0.73$ and you choose a scale of $s = 0.1$ . That means each integer step represents $0.1$ in the original space.

Divide: $0.73 / 0.1 = 7.3$
Round to the nearest integer: $7$
Store the integer $q = 7$

During inference you reverse the process:

$\hat{w} = s \cdot q = 0.1 \cdot 7 = 0.7$

The stored value is $0.7$ , not $0.73$ . The error is tiny ( $0.03$ ), and when this happens across billions of weights the model usually stays useful. If you had chosen a coarser scale of $s = 0.5$ , the same weight would become $q = \text{round}(0.73 / 0.5) = 1$ and the reconstructed value would be $0.5$ , which is a much larger error. The art of quantization is choosing the right scale so the errors stay small where they matter most.

At the simplest level, quantization maps a floating-point weight to a smaller integer range using a scale and, in some schemes, a zero point:

$q = \text{clip}\left(\text{round}\left(\frac{w}{s}\right) + z,\ q_{\min},\ q_{\max}\right)$

The quantization formula

Take the original weight $w$ , divide by the scale $s$ to express it in "integer-sized steps," shift by the zero point $z$ , round to the nearest integer, and clamp to the representable range. That's how a high-precision weight becomes an INT8 or INT4 value.

$w$ is the original floating-point weight
$q$ is the stored integer
$s$ is the scale factor
$z$ is the zero point
$q_{\min}, q_{\max}$ are the integer limits (for example, 0 to 15 for unsigned 4-bit)

Dequantization reverses the process during inference:

$\hat{w} = s \cdot (q - z)$

Reading the formula

Subtract the zero point from the stored integer, then multiply by the scale. The result $\hat{w}$ is only an approximation of the original weight because rounding already threw away information.

symmetric-int4-roundtrip.py

weights = [0.15, -1.22, 2.40, -0.45]
qmax = 7
scale = max(abs(weight) for weight in weights) / qmax
quantized = [max(-qmax, min(qmax, round(weight / scale))) for weight in weights]
restored = [value * scale for value in quantized]
mean_error = sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)

print(f"scale: {scale:.5f}")
print(f"INT4 values: {quantized}")
print(f"reconstructed: {[round(value, 3) for value in restored]}")
print(f"mean absolute error: {mean_error:.3f}")

Output

scale: 0.34286
INT4 values: [0, -4, 7, -1]
reconstructed: [0.0, -1.371, 2.4, -0.343]
mean absolute error: 0.102

Symmetric vs. asymmetric

Imagine a warehouse scale that stores package-weight readings using only 16 possible markings. Two strategies are available:

Symmetric: center the 16 markings around zero, such as -8 to +7. This is simple and hardware-friendly, but it wastes range if the values are skewed.
Asymmetric: move the markings to the actual range you observed. This uses the integer range more efficiently, but it requires the extra zero-point offset.

More formally:

Symmetric quantization: uses $z = 0$ and maps weights around zero. This is common for weight-only LLM kernels because the math is simpler.
Asymmetric quantization: uses a non-zero $z$ to better cover distributions that are not centered at zero. This is common for activations and some weight formats.

The key idea: most LLM weight tensors are roughly zero-centered, so symmetric or near-symmetric per-group quantization often works well for weights. Activations are harder because a few channels can contain very large outliers. Techniques like SmoothQuant^[1] make activation quantization easier by shifting some of that difficulty into the weights.

asymmetric-activation-range.py

activations = [0.0, 1.2, 1.8, 2.6, 3.0]
signed_qmax = 7
unsigned_qmax = 15

symmetric_scale = max(activations) / signed_qmax
asymmetric_scale = (max(activations) - min(activations)) / unsigned_qmax

print(f"symmetric signed step: {symmetric_scale:.3f}")
print(f"asymmetric unsigned step: {asymmetric_scale:.3f}")
print("A non-negative activation range can use more 4-bit levels asymmetrically.")

Output

symmetric signed step: 0.429
asymmetric unsigned step: 0.200
A non-negative activation range can use more 4-bit levels asymmetrically.

The figure below shows the basic quantization pipeline. Start with high-precision weights, estimate the statistics needed to compute scales, then pack the results into a low-bit representation plus metadata.

Quantization pipeline: high-precision weights become calibrated low-bit values plus the metadata an inference kernel needs. — Quantized weights still need metadata. Scales, group size, tensor layout, and kernel packing explain why real artifacts are larger than ideal bit math.

Memory savings

The most obvious benefit of quantization is memory reduction. That directly translates to lower serving cost, larger batch sizes, and the ability to fit bigger models on smaller devices.

The table below is weights only. Actual runtime memory is higher because you still pay for activations, the KV cache, scale metadata, and framework overhead.

Bar chart comparing 70B model weight and artifact footprints: FP16 at 140 GB, INT8 at 70 GB, ideal INT4 at 35 GB, packed Q4 at about 43 GB, and ideal INT3 at 26 GB. — Ideal bit math is the starting point. Packed Q4 is illustrative: artifact size varies with quant type, metadata, tensor layout, alignment, and mixed tensor choices.

Precision	Bits/Weight	7B Weights	70B Weights
FP16 / BF16	16	~14 GB	~140 GB
INT8 / FP8	8	~7 GB	~70 GB
INT4	4	~3.5 GB	~35 GB
INT3	3	~2.6 GB	~26 GB

Those numbers are idealized weight math. Real packed formats are larger when they also store per-group scales, alignment, or tensors kept at higher precision. For a GGUF artifact, inspect the selected quantization type and actual file/runtime footprint rather than assuming the ideal 35 GB number.^[2]

Because LLM generation is often weight-bandwidth bound, shrinking the weights can also speed up inference. The raw bandwidth demand drops almost linearly with bit width, although the observed throughput gain depends on the kernel, batch size, and how much extra work the runtime does to unpack the weights.

group-scale-metadata.py

parameters = 70_000_000_000
bits_per_weight = 4
group_size = 128
bytes_per_scale = 2

ideal_weight_gb = parameters * bits_per_weight / 8 / 1_000_000_000
scale_metadata_gb = parameters / group_size * bytes_per_scale / 1_000_000_000

print(f"ideal INT4 weights: {ideal_weight_gb:.2f} GB")
print(f"one FP16 scale per {group_size} weights: {scale_metadata_gb:.2f} GB")
print("Alignment and mixed-precision tensors can add more.")

Output

ideal INT4 weights: 35.00 GB
one FP16 scale per 128 weights: 1.09 GB
Alignment and mixed-precision tensors can add more.

Sizing exercise

Try this before moving on. You have a workstation with one NVIDIA RTX 4060 (8 GB VRAM) and you want to run Llama 3 70B. The model has roughly 70 billion parameters.

How much VRAM would the model need in FP16?
How much would it need at INT4?
Can you run it on the 4060?

Use this tiny calculator to check the arithmetic without any framework overhead:

quantization-sizing.py

def weight_gb(parameters_billion: float, bits_per_weight: int) -> float:
    return parameters_billion * bits_per_weight / 8

params = 70
for bits in (16, 8, 4, 3):
    print(f"70B at {bits:>2}-bit weights: {weight_gb(params, bits):5.1f} GB")

gpu_vram_gb = 8
usable_vram_gb = gpu_vram_gb * 0.8
print(f"8 GB GPU with 20% reserve: {usable_vram_gb:.1f} GB usable")

Output

70B at 16-bit weights: 140.0 GB
70B at  8-bit weights:  70.0 GB
70B at  4-bit weights:  35.0 GB
70B at  3-bit weights:  26.2 GB
8 GB GPU with 20% reserve: 6.4 GB usable

Solution

FP16: $70 \times 2 = 140$ GB. The model doesn't fit.
INT4: $70 \times 0.5 = 35$ GB. The model still doesn't fit on an 8 GB card.
You can't run the full model on the GPU alone. You need GGUF with heavy CPU offloading, or a larger GPU, or a smaller model. The INT4 size is a huge improvement, but it isn't magic.

This is exactly the calculation you should do before choosing a quantization strategy. The formula is simple: parameters $\times$ bytes per parameter = weight footprint. Remember that runtime memory is higher because of activations, KV cache, and metadata.

Weight-only vs. weight-activation quantization

Two broad approaches to quantization exist, and confusing them causes a lot of interview mistakes.

Weight-only quantization is what GPTQ and AWQ target. The stored weights are low precision, while optimized kernels unpack or dequantize low-bit values during computation and accumulate with higher-precision activations or accumulators. This is why you often see notation like W4A16: 4-bit stored weights, 16-bit activations.

Weight-activation quantization pushes both sides of the matmul down, for example W8A8 or W4A8. This can be faster and more memory-efficient, but it's harder because activation distributions are spikier and less stable than weight distributions. Techniques like SmoothQuant^[1] exist specifically to make weight-activation quantization practical.

GPTQ and AWQ are weight-only methods. They capture large weight-memory savings without requiring equally low-bit activations. Lower-bit activation paths such as W4A4 need separate hardware, kernel, and quality validation.

Common mistake: Candidates often claim W4A16 quantization gives a 4x end-to-end speedup. It does not guarantee that. Low-bit weights reduce raw weight traffic, but kernel overhead and non-weight work remain, and the workload may not be bandwidth-bound. The guaranteed first-order gain is smaller stored weights, not a fixed tokens-per-second multiplier.

bandwidth-saving-is-not-speedup.py

bandwidth_gb_s = 1_000
fp16_weight_gb = 14.0
int4_weight_gb = 3.5
other_work_ms = 3.0

fp16_ms = fp16_weight_gb / bandwidth_gb_s * 1000 + other_work_ms
int4_ms = int4_weight_gb / bandwidth_gb_s * 1000 + other_work_ms

print(f"raw weight traffic reduction: {fp16_weight_gb / int4_weight_gb:.1f}x")
print(f"illustrative step speedup with fixed overhead: {fp16_ms / int4_ms:.2f}x")

Output

raw weight traffic reduction: 4.0x
illustrative step speedup with fixed overhead: 2.62x

GPTQ (post-training quantization)

The intuition: balancing errors like a checkbook

Imagine a layer has only two weights for simplicity: $w_1 = 1.2$ and $w_2 = 0.8$ . A naive quantizer might round both toward the nearest integer, turning them into $1.0$ and $1.0$ . The first weight lost $0.2$ , the second gained $0.2$ . The total output change depends on how the layer uses those weights.

If the calibration data shows that $w_1$ is multiplied by large activations and $w_2$ by small ones, the $-0.2$ error on $w_1$ hurts the output far more than the $+0.2$ error on $w_2$ . GPTQ notices this through the Hessian approximation and compensates: it might quantize $w_1$ more carefully, or adjust $w_2$ in the opposite direction to cancel some of the damage. The goal isn't to make every weight close to its original value. The goal is to keep the layer's output on real data as close as possible to the original output.

Algorithm: minimize output error with curvature information

GPTQ works like carefully removing blocks from a Jenga tower. When you remove one block (round one weight), the tower shifts. A careless player ignores the shift and keeps pulling until the tower collapses. GPTQ compensates for the shift. After quantizing one set of weights, it updates the remaining floating-point weights so the layer output changes as little as possible.

GPTQ^[3] is a one-shot post-training quantization method based on approximate second-order information. It builds on the layer-wise Optimal Brain Quantization (OBQ) solver, which quantizes weights one at a time and updates the remaining weights to minimize the layer's output error. GPTQ makes that idea fast enough for billion-parameter models by quantizing weights in a fixed order and using lazy batched updates. The key objective isn't "make the quantized weights numerically close to the originals." It's "make the layer output on real activations stay close to the original output." For a weight row $w$ and calibration activations $X$ , GPTQ approximates:

\begin{aligned} \hat{w} &= \arg\min_{\tilde{w} \in \mathcal{Q}} \|wX - \tilde{w}X\|_2^2 \\ &\approx \arg\min_{\tilde{w} \in \mathcal{Q}} (w - \tilde{w})^T H (w - \tilde{w}) \\ H &\approx XX^T \end{aligned}

Reading the formula

The Hessian approximation $H$ tells GPTQ which input directions matter most on the calibration set. A small error on an unimportant direction is cheap. The same numeric error on a frequently used direction is expensive. That's why GPTQ usually beats plain round-to-nearest quantization at the same bit width.

In practice, GPTQ looks like this:

Collect representative activations from a calibration set such as C4^[4].
Approximate $H \approx XX^T$ for each linear layer.
Quantize the weights sequentially while using an approximate inverse Hessian to compensate the remaining floating-point weights.
Pack the result into a low-bit format that an inference kernel can consume efficiently.

The original paper reports quantizing 175B-class models (OPT-175B and BLOOM-176B) in about four GPU-hours while preserving strong accuracy at 3-bit and 4-bit settings.^[3]

gptq-curvature-proxy.py

rounding_errors = [0.20, 0.20]
curvature_proxy = [25.0, 1.0]
weighted_cost = [h * error**2 for h, error in zip(curvature_proxy, rounding_errors)]

print(f"same absolute errors: {rounding_errors}")
print(f"curvature-weighted costs: {weighted_cost}")
print(f"first direction costs {weighted_cost[0] / weighted_cost[1]:.0f}x more to distort")

Output

same absolute errors: [0.2, 0.2]
curvature-weighted costs: [1.0000000000000002, 0.04000000000000001]
first direction costs 25x more to distort

The configuration sketch below shows the shape of a transformers GPTQ workflow. Backend packages and supported arguments change over time, so verify current library documentation and evaluate the resulting artifact before deployment.^[5]

reading-the-formula.py

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

# If quantized with device_map="auto", gather the model onto one device before saving.
model.to("cpu")
model.save_pretrained("opt-125m-gptq")
tokenizer.save_pretrained("opt-125m-gptq")

Granularity: per-tensor, per-channel, per-group

A naive quantizer uses one scale for the entire tensor. This is per-tensor quantization. It's cheap, but one large outlier can ruin the precision of everything else.

Modern LLM quantizers almost always use finer granularity:

Granularity	What Gets Its Own Scale	Accuracy	Metadata Overhead
Per-tensor	Entire tensor	Lowest	Lowest
Per-channel	One output channel / row	Better	Moderate
Per-group	Small group of weights, often 64 or 128	Common 4-bit choice	Moderate

Per-group quantization is the common compromise for 4-bit LLM inference. Smaller groups usually improve fidelity, but they also require storing more scale metadata and may reduce kernel efficiency.

AWQ (activation-aware weight quantization)

Not all weights are equally dangerous to quantize

Imagine renovating a building and needing to remove material wherever possible. Most walls can be thinned or moved with little impact. A few are load-bearing. Disturb those and the whole structure becomes unstable. AWQ tries to identify the "load-bearing" weight channels.

AWQ^[6] starts from the observation that activation magnitudes are not evenly distributed. A small fraction of channels carry disproportionately large activations. If the corresponding weight columns are quantized poorly, the downstream matmul error gets amplified. The AWQ paper reports that protecting only about 1% of salient weights can greatly reduce quantization error.^[6]

awq-salient-channel-proxy.py

activations = [100.0, 0.1]
naive_weight_errors = [0.10, 0.10]
protected_weight_errors = [0.02, 0.10]

naive_output_error = sum(a * e for a, e in zip(activations, naive_weight_errors))
protected_output_error = sum(a * e for a, e in zip(activations, protected_weight_errors))

print(f"naive output-error proxy: {naive_output_error:.2f}")
print(f"protect high-activation channel: {protected_output_error:.2f}")

Output

naive output-error proxy: 10.01
protect high-activation channel: 2.01

Why protecting a few weights matters

Imagine the same two-weight layer, but now the activation pattern is extreme: $w_1$ is almost always multiplied by $100$ , while $w_2$ is multiplied by $0.1$ . A $0.1$ rounding error on $w_1$ becomes a $10$ unit output error, while the same error on $w_2$ becomes only $0.01$ units. AWQ identifies these "high-traffic" channels and rescales them so the quantizer spends more of its limited integer range on the weights that matter most.

Algorithm

AWQ doesn't rebuild the entire weight matrix the way GPTQ does. Instead, it uses an equivalent rescaling trick:

$W x = W \cdot \mathrm{diag}(s) \cdot \mathrm{diag}(s)^{-1} x \approx Q\!\left(W \cdot \mathrm{diag}(s)\right) \cdot \mathrm{diag}(s)^{-1} x$

Reading the formula

Multiply important weight columns by a scaling vector $s$ before quantization so they occupy more of the available integer range. Then divide the corresponding activation channels by the same factor. The floating-point computation stays equivalent, but the quantizer now spends more precision on the columns that matter most.

The practical workflow is:

Run representative inputs through the model and collect activation statistics.
Identify salient channels with unusually large activation magnitude.
Search for scaling factors that reduce the quantization error on those channels.
Quantize the rescaled weights into a hardware-friendly 4-bit format.

AWQ artifacts are typically produced offline and loaded by a serving runtime that understands the artifact's quantization metadata, such as group size and zero-point policy. Loader and kernel support varies by runtime version, so verify the chosen artifact/runtime pair before benchmarking.^[7]

Compared with GPTQ, AWQ is lighter-weight because it avoids GPTQ's reconstruction step. It often works especially well on instruction-tuned checkpoints, but the final speed and latency picture still depends on the runtime and kernel implementation.^[6]^[7]

Comparison of GPTQ and AWQ mechanics: GPTQ compensates reconstruction error while AWQ protects activation-sensitive channels before packing 4-bit weights. — GPTQ and AWQ both compress weights, but protect different signals. GPTQ targets reconstructed layer output; AWQ protects activation-sensitive channels.

GGUF (llama.cpp format)

What is GGUF?

GGUF is a file format, not a quantization algorithm. It's the container format used by the ggml / llama.cpp ecosystem for local inference.^[2]

The key idea: GPTQ and AWQ mainly answer the question "How should I quantize the weights?" GGUF answers a different question: "How should I package model tensors and metadata so local runtimes can load and run them efficiently?"

GGUF matters because it bundles the tensors with the metadata needed to run them:

tokenizer and vocabulary information
architecture metadata and tensor shapes
tensor-by-tensor quantization types inside one portable file
compatibility with CPU inference and partial GPU offload in local runtimes

That last point is why GGUF is so important for local LLMs. If the full model doesn't fit in VRAM, llama.cpp-style runtimes can keep some layers on the GPU and spill the rest to system memory.

GGUF packaging and partial offload: a portable file stores tensors and metadata, then a local runtime places some layers in GPU VRAM and the rest in system RAM. — GGUF is a local runtime package. It can hold mixed quantized tensor types and enough metadata for a runtime to place layers across GPU VRAM and system RAM.

gguf-partial-offload-budget.py

artifact_gib = 5.0
layers = 32
usable_gpu_gib = 4.0
runtime_reserve_gib = 0.5

layer_budget_gib = usable_gpu_gib - runtime_reserve_gib
gpu_layers = int(layer_budget_gib / (artifact_gib / layers))

print(f"GPU budget for model layers: {layer_budget_gib:.1f} GiB")
print(f"even-size approximation: {gpu_layers}/{layers} layers fit on GPU")
print("Measure real tensor placement and KV memory in the chosen runtime.")

Output

GPU budget for model layers: 3.5 GiB
even-size approximation: 22/32 layers fit on GPU
Measure real tensor placement and KV memory in the chosen runtime.

Quantization families inside GGUF

GGUF can store several quantization families. The format doesn't force one specific quantizer.

Family	Example	Extra Calibration	Typical Use
Legacy block quantization	`Q4_0`, `Q5_0`	No	Simple and widely supported
K-quants	`Q4_K_M`, `Q5_K_M`	No	Common local default for size/quality
IQ / iMatrix-aware formats	`IQ4_XS`, `IQ3_M`	Usually yes	Better quality when squeezing below comfortable 4-bit settings

Q4_K_M is a commonly encountered local-inference candidate, but the right choice depends on target model, quality gate, and hardware. If the full model fits in VRAM, benchmark GPU-oriented GPTQ or AWQ artifacts against the local runtime; if partial offload is required, GGUF is a useful packaging option.

iMatrix quantization

Importance-matrix quantization uses representative text to estimate which directions are expensive to distort. That extra signal lets IQ formats spend precision where it buys the most quality. Conceptually, it fills the same role as calibration in GPTQ and AWQ: representative data tells the quantizer what errors matter most.

The commands below illustrate a llama.cpp-style conversion and quantization path. Binary names and supported quant types can change, so check the installed revision's documentation before running it.^[2]

terminal

# 1) Convert a Hugging Face checkpoint to GGUF
python3 convert_hf_to_gguf.py ./Meta-Llama-3.1-8B-Instruct \
    --outtype f16 \
    --outfile llama-3.1-8b-f16.gguf

# 2) If needed: build an importance matrix from representative text
llama-imatrix \
    -m llama-3.1-8b-f16.gguf \
    -f calibration.txt \
    -o llama-3.1.imatrix.dat

# 3a) Common default without iMatrix
llama-quantize \
    llama-3.1-8b-f16.gguf \
    llama-3.1-8b-Q4_K_M.gguf \
    Q4_K_M

# 3b) Importance-aware quantization
llama-quantize \
    --imatrix llama-3.1.imatrix.dat \
    llama-3.1-8b-f16.gguf \
    llama-3.1-8b-IQ4_XS.gguf \
    IQ4_XS

Beyond weight-only: FP8 and KV cache quantization

GPTQ, AWQ, and GGUF all shrink the model weights. Once the weights are small, the next bottleneck is often the KV cache, the attention state that grows with sequence length.

FP8 sits adjacent to the big three rather than replacing them. It's an 8-bit floating-point format that becomes attractive when the serving hardware has native FP8 kernels.^[8]

FP8 has two common encodings:^[8]

E4M3: more mantissa precision, less dynamic range
E5M2: less mantissa precision, more dynamic range

Unlike 4-bit weight-only methods, FP8 is usually chosen when you want a milder accuracy-memory tradeoff and the accelerator is built to exploit FP8 directly.

A newer branch of the same idea is 4-bit floating-point. MXFP4 and NVFP4 are block-scaled FP4 formats that pair tiny FP4 values with a shared per-block scale: MXFP4 uses 32-element blocks with a power-of-two (E8M0) scale, while NVFP4 uses 16-element blocks with an FP8 (E4M3) scale for finer resolution. These are not competitors to GPTQ and AWQ; in practice GPTQ-style or AWQ-style calibration is applied on top of an FP4 format to recover accuracy. Native FP4 tensor-core acceleration requires recent NVIDIA Blackwell hardware, and on older GPUs runtimes still load FP4 weights through software kernels that give the memory savings without the full compute speedup. MXFP4 has seen production use (for example in GPT-OSS releases); confirm format and hardware support against your runtime before committing to it.

Quantizing the KV cache is a separate lever. Weight quantization shrinks static model weights. KV-cache quantization shrinks attention state that grows with sequence length. Once weights fit, long-context concurrency can become limited by KV state; a runtime may then offer lower-precision KV storage as another measured tradeoff.

kv-cache-after-weight-quantization.py

weights_int4_gib = 7_000_000_000 * 0.5 / 1024**3
batch, sequence = 32, 8_192
layers, kv_heads, head_dim, kv_bytes = 32, 8, 128, 2
kv_gib = 2 * batch * sequence * layers * kv_heads * head_dim * kv_bytes / 1024**3

print(f"hypothetical 7B INT4 weights: {weights_int4_gib:.1f} GiB")
print(f"FP16 KV cache at batch={batch}, context={sequence}: {kv_gib:.1f} GiB")
print("Shrinking weights alone does not solve long-context capacity.")

Output

hypothetical 7B INT4 weights: 3.3 GiB
FP16 KV cache at batch=32, context=8192: 32.0 GiB
Shrinking weights alone does not solve long-context capacity.

Comparison

When choosing a quantization strategy, the real question isn't "Which one is best?" It's "What hardware constraint am I solving for?"

Feature	GPTQ	AWQ	GGUF
Meaning	Weight-only PTQ algorithm	Weight-only PTQ algorithm	Portable file/container format
Core idea	Minimize layer output error with Hessian-weighted reconstruction	Protect salient weight channels using activation statistics	Store tensors + metadata + chosen ggml quantizers in one artifact
Calibration	Required	Required	Depends on quantizer; iMatrix uses representative data
Common deployment target	Fully GPU-resident serving	Fully GPU-resident serving	CPU, Apple Silicon, or mixed CPU/GPU
Strength	Mature second-order method	Strong 4-bit quality with hardware-friendly kernels	Single-file portability and partial GPU offload
Tradeoff	Offline quantization is heavier	Runtime/kernel compatibility still matters	Usually slower than specialized full-GPU kernels

AWQ's paper reports strong 4-bit results by protecting activation-sensitive channels.^[6] GPTQ remains relevant when a runtime or kernel stack supports its packed artifacts efficiently.^[3] On GPU servers, benchmark the exact low-bit kernel path. For local or partial-offload deployments, benchmark the GGUF runtime and placement plan rather than assuming a format name decides performance.

Common pitfalls

Treating all quantizers as equivalent

Symptom: You choose "4-bit" from a model hub without checking whether it is GPTQ, AWQ, GGUF, or a runtime quantization path.

Cause: Bit width describes storage size, not calibration method, tensor layout, kernel support, offload behavior, or quality profile.

Fix: Name the artifact and the runtime together: "AWQ on vLLM," "GPTQ on Transformers," or "Q4_K_M GGUF on llama.cpp." Then test that exact pair.

The calibration trap

Symptom: Your quantized French customer-support model speaks gibberish, even though the English version quantized fine.

Cause: GPTQ and AWQ both rely on calibration data to understand which weights matter. If you use English Wikipedia to quantize a model trained on French retail data, the activation statistics are wrong and the quantizer throws away precision in the wrong places.

Fix: Use calibration text that matches the target domain and language, then verify quality on held-out target tasks.

Confusing weight-only and full quantization

Symptom: A design doc claims GPTQ or AWQ makes the entire model 4-bit.

Cause: GPTQ and AWQ are weight-only methods. Activations usually stay at FP16 or BF16, and accumulation happens in higher precision.

Fix: Write the precision contract explicitly. W4A16 means 4-bit stored weights and 16-bit activations, not full W4A4 inference.

The speed fallacy

Symptom: You quantize to 4-bit expecting a 4x speedup, but tokens per second barely improve.

Cause: 4-bit weights save memory bandwidth, but the kernel still has to dequantize them into higher precision before the matrix multiply. If the dequantization code path is slow or the GPU isn't memory-bound to begin with, the speedup shrinks.

Fix: Measure end-to-end tokens per second on your exact hardware and batch size. Bandwidth savings are real, but they only translate to speed when the runtime is optimized for your GPU.

Ignoring perplexity

Symptom: The quantized model still chats politely, but it hallucinates order statuses or generates invalid JSON for your shipping API.

Cause: Perplexity on held-out text is a fast sanity check, but a model can show only a small perplexity increase while regressing sharply on structured tasks like code generation or multi-step reasoning.

Fix: Always pair perplexity with task-specific benchmarks. For a logistics model, run your own production eval set that includes the exact output formats the model must produce.

Confusing GGUF with the quantizer

Symptom: Someone says "we used GGUF quantization" as if that fully specifies the quality and runtime behavior.

Cause: GGUF is the container. Q4_K_M, IQ4_XS, Q5_K_M, and related tensor types describe the actual low-bit encoding inside the file.

Fix: Report both: "GGUF Q4_K_M with 20 GPU layers," "GGUF IQ4_XS with iMatrix," or another concrete artifact/runtime pairing.

How to evaluate a quantized model

Perplexity is a good first sanity check, but it is not enough. A quantized model can show a modest change in perplexity while still regressing on code generation, structured output, or domain decisions. A returns-classification model should therefore be tested on representative extraction and action-format cases, not only general text.

GPTQ and AWQ both report that 4-bit weight-only quantization can preserve language-modeling quality surprisingly well on large models, while more aggressive bit widths degrade more sharply.^[3]^[6]

Three-gate approval workflow for quantized models: broad language drift, task regressions, and serving performance all must pass before deployment. — A strong quantization job checks quality and serving behavior together. Approval depends on target workload gates, not checkpoint size alone.

Evaluation Axis	What To Measure	Why It Matters
Language modeling	Held-out perplexity	Fast check that next-token behavior didn't drift too far
Reasoning and knowledge	MMLU^[9], GSM8K^[10]	Catches multi-step failures that perplexity can hide
Code generation	HumanEval^[11] or your own coding eval	Code is often more brittle than chat completion
Systems performance	Tokens/s, VRAM use, max context, cold-start time	Quantization is a systems tradeoff, not just an accuracy number

Three practical rules:

GPTQ and AWQ results support testing 4-bit weight-only artifacts before pushing to more aggressive bit widths.^[3]^[6]
Task-specific regressions cannot be inferred from a generic quality score; reasoning, structured output, and tool-use tasks need their own gates.
Group size, calibration data, and the serving kernel can matter as much as the headline format name.

quantized-artifact-approval-gate.py

candidates = [
    {"name": "FP16", "task_accuracy": 0.93, "p95_ms": 70, "vram_gib": 14.0},
    {"name": "AWQ-4bit", "task_accuracy": 0.92, "p95_ms": 49, "vram_gib": 4.1},
    {"name": "aggressive-3bit", "task_accuracy": 0.85, "p95_ms": 43, "vram_gib": 3.2},
]
minimum_accuracy = 0.90
maximum_vram_gib = 8.0
approved = [c["name"] for c in candidates if c["task_accuracy"] >= minimum_accuracy and c["vram_gib"] <= maximum_vram_gib]

print(f"approved artifacts: {approved}")
print("Smaller artifact is rejected when task quality misses the gate.")

Output

approved artifacts: ['AWQ-4bit']
Smaller artifact is rejected when task quality misses the gate.

Decision guide

Choose based on the bottleneck

Decision table mapping deployment bottlenecks to candidate quantization paths: GPU residency to AWQ or GPTQ, local offload to GGUF, native FP8 hardware to FP8, and long-context memory pressure to KV-cache tuning. — Choose candidates from the deployment bottleneck, then benchmark them. A model that fits in VRAM needs different testing from a local model that must offload layers.

Scenario	Recommended Starting Point	Why
Production GPU server, model fits in VRAM	Benchmark AWQ or GPTQ	Candidate artifacts for specialized GPU kernels
24 GB GPU, model is too large	GGUF with partial offload	Lets system RAM absorb the overflow
CPU or Apple Silicon laptop	GGUF	Common local-runtime artifact path
Datacenter accelerator with native FP8 path	FP8 + KV-cache tuning	Better quality/memory tradeoff than jumping straight to INT4

Practical rule of thumb: if the whole model fits on GPU, benchmark an AWQ or GPTQ path supported by your runtime. If it does not, benchmark a GGUF partial-offload path.

Mastery check

By the end of this lesson, you should be able to:

Distinguish symmetric quantization from asymmetric quantization with scale and zero point.
Explain why GPTQ uses approximate second-order information to minimize layer reconstruction error.
Describe AWQ's activation-aware channel protection and why activation statistics matter.
Explain GGUF as a portable container format, not a quantization algorithm.
Predict how calibration data and group size affect quantization quality.
Evaluate a quantized model with perplexity, task benchmarks, and serving metrics.
Separate weight-only quantization, activation quantization, and KV-cache quantization.

Evaluation rubric

Needs work: You can repeat bit widths, but you can't explain why low-bit storage helps bandwidth before it helps model quality or speed.
Developing: You can describe one format such as GPTQ or GGUF, but you still mix up quantization algorithms, container formats, and runtime offload choices.
Solid: You can size a model at FP16, INT8, and INT4, explain why runtime memory exceeds raw weight math, and choose between GPTQ, AWQ, and GGUF for a concrete hardware limit.
Strong: You can explain why GPTQ protects layer outputs, why AWQ protects activation-sensitive channels, and why calibration data and kernel support decide whether a quantized checkpoint is actually deployable.
Excellent: You can design an approval gate that combines perplexity, task evals, serving metrics, fallback artifacts, and KV-cache checks before routing production traffic to a low-bit model.

Follow-up questions

Your 34B model is 6 GB over the VRAM budget after ordinary FP16 sizing. Which artifact do you try first?

If 4-bit compression would let the full checkpoint stay resident on GPU, benchmark a supported AWQ or GPTQ kernel path. If the model still will not fit after low-bit packing, evaluate GGUF because partial offload is now the main constraint.

Your French customer-support model quantizes badly after calibrating on English Wikipedia. What likely went wrong?

The quantizer protected the wrong directions. GPTQ uses calibration activations to estimate which output errors matter, and AWQ uses them to find activation-sensitive channels. If deployment traffic is French retail support, English Wikipedia gives the wrong activation distribution.

A single outlier channel ruins your per-tensor 4-bit result. Why does per-group or per-channel scaling help?

One global scale forces every weight to share the outlier's range. Per-channel or per-group scaling lets smaller neighborhoods keep their own range, which preserves more local precision while adding moderate metadata overhead.

After moving to W4A16, your long-context agent still runs out of memory at high concurrency. What lever do you pull next?

Look at the KV cache. Weight quantization shrinks static parameters, but the KV cache grows with context length and concurrent sessions. Once weights fit, cache compression or shorter context budgets often become the next bottleneck.

You have hardware with strong native FP8 support and an 8B model that already fits cleanly in VRAM. Why might FP8 beat a jump straight to INT4?

FP8 is a milder compression step than INT4 while still reducing memory pressure. If the accelerator has native FP8 kernels, benchmark it as a candidate before accepting the larger compression risk of INT4.

Summary

Quantization basics: quantization stores weights with fewer bits using scales and, sometimes, zero points. The main win is lower memory bandwidth and smaller model artifacts.
GPTQ: uses representative activations and approximate second-order information to minimize output error during post-training quantization.^[3]
AWQ: identifies activation-sensitive channels and rescales them so the quantizer spends precision where it matters most.^[6]
GGUF: is a portable container for local inference that can store many ggml quantization types and supports CPU/GPU-offload workflows.^[2]
Deployment choice: benchmark AWQ or GPTQ for fully GPU-resident serving, and benchmark GGUF when portability or partial offload is the main constraint.

Course handoff

You now understand how to shrink model weights for production. You can calculate memory footprints, choose between GPTQ, AWQ, and GGUF based on hardware constraints, and evaluate whether a quantized model is good enough for your task. These decisions are the foundation of cost-effective LLM serving.

The next challenge is local deployment. Even a quantized model needs the right runtime, hardware plan, container, health check, and evaluation gate. In Local LLM Deployment, you'll turn compression choices into a repeatable serving plan.

Next Step

Continue to Local LLM Deployment

Quantization reduces the bytes moved for each model call; local deployment shows how to package that compressed model into a measured service with hardware budgets and rollback.

PreviousModel Parallelism for LLM Inference

Share this article

X Facebook LinkedIn Bluesky Reddit Hacker News Email

References

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.

Xiao, G., et al. · 2023 · ICML 2023

llama.cpp: Inference of LLaMA model in pure C/C++

Gerganov, G. · 2023

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.

Frantar, E., et al. · 2023 · ICLR 2023

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.

Raffel, C., et al. · 2020 · JMLR

GPTQ

Hugging Face · 2026

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.

Lin, J., et al. · 2023 · MLSys 2024

AWQ

Hugging Face · 2026

FP8 Formats for Deep Learning.

Micikevicius, P., et al. · 2022

Measuring Massive Multitask Language Understanding (MMLU).

Hendrycks, D., et al. · 2021 · ICLR 2021

Training Verifiers to Solve Math Word Problems (GSM8K).

Cobbe, K., et al. · 2021

Evaluating Large Language Models Trained on Code (HumanEval).

Chen, M., et al. · 2021 · arXiv preprint

Back to Topics

LearnInference & Production ScaleModel Quantization: GPTQ, AWQ & GGUF

🚀HardInference Optimization

Model Quantization: GPTQ, AWQ & GGUF

Understand how GPTQ, AWQ, and GGUF trade off accuracy, memory footprint, and portability when serving LLMs on GPUs or local hardware.

36 min read

Learning path

Step 131 of 155 in the full curriculum

Model Parallelism for LLM Inference Local LLM Deployment

Model Quantization: GPTQ, AWQ, and GGUF

Model parallelism splits a large model across several accelerators. Quantization asks a different question: what if each weight used fewer bytes before you split anything?

What is quantization?

A tiny worked example

Divide: $0.73 / 0.1 = 7.3$
Round to the nearest integer: $7$
Store the integer $q = 7$

During inference you reverse the process:

$\hat{w} = s \cdot q = 0.1 \cdot 7 = 0.7$

At the simplest level, quantization maps a floating-point weight to a smaller integer range using a scale and, in some schemes, a zero point:

$q = \text{clip}\left(\text{round}\left(\frac{w}{s}\right) + z,\ q_{\min},\ q_{\max}\right)$

The quantization formula

$w$ is the original floating-point weight
$q$ is the stored integer
$s$ is the scale factor
$z$ is the zero point
$q_{\min}, q_{\max}$ are the integer limits (for example, 0 to 15 for unsigned 4-bit)

Dequantization reverses the process during inference:

$\hat{w} = s \cdot (q - z)$

Reading the formula

Subtract the zero point from the stored integer, then multiply by the scale. The result $\hat{w}$ is only an approximation of the original weight because rounding already threw away information.

symmetric-int4-roundtrip.py

weights = [0.15, -1.22, 2.40, -0.45]
qmax = 7
scale = max(abs(weight) for weight in weights) / qmax
quantized = [max(-qmax, min(qmax, round(weight / scale))) for weight in weights]
restored = [value * scale for value in quantized]
mean_error = sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)

print(f"scale: {scale:.5f}")
print(f"INT4 values: {quantized}")
print(f"reconstructed: {[round(value, 3) for value in restored]}")
print(f"mean absolute error: {mean_error:.3f}")

Output

scale: 0.34286
INT4 values: [0, -4, 7, -1]
reconstructed: [0.0, -1.371, 2.4, -0.343]
mean absolute error: 0.102

Symmetric vs. asymmetric

Imagine a warehouse scale that stores package-weight readings using only 16 possible markings. Two strategies are available:

Symmetric: center the 16 markings around zero, such as -8 to +7. This is simple and hardware-friendly, but it wastes range if the values are skewed.
Asymmetric: move the markings to the actual range you observed. This uses the integer range more efficiently, but it requires the extra zero-point offset.

More formally:

Symmetric quantization: uses $z = 0$ and maps weights around zero. This is common for weight-only LLM kernels because the math is simpler.
Asymmetric quantization: uses a non-zero $z$ to better cover distributions that are not centered at zero. This is common for activations and some weight formats.

asymmetric-activation-range.py

activations = [0.0, 1.2, 1.8, 2.6, 3.0]
signed_qmax = 7
unsigned_qmax = 15

symmetric_scale = max(activations) / signed_qmax
asymmetric_scale = (max(activations) - min(activations)) / unsigned_qmax

print(f"symmetric signed step: {symmetric_scale:.3f}")
print(f"asymmetric unsigned step: {asymmetric_scale:.3f}")
print("A non-negative activation range can use more 4-bit levels asymmetrically.")

Output

symmetric signed step: 0.429
asymmetric unsigned step: 0.200
A non-negative activation range can use more 4-bit levels asymmetrically.

Memory savings

The most obvious benefit of quantization is memory reduction. That directly translates to lower serving cost, larger batch sizes, and the ability to fit bigger models on smaller devices.

The table below is weights only. Actual runtime memory is higher because you still pay for activations, the KV cache, scale metadata, and framework overhead.

Precision	Bits/Weight	7B Weights	70B Weights
FP16 / BF16	16	~14 GB	~140 GB
INT8 / FP8	8	~7 GB	~70 GB
INT4	4	~3.5 GB	~35 GB
INT3	3	~2.6 GB	~26 GB

group-scale-metadata.py

parameters = 70_000_000_000
bits_per_weight = 4
group_size = 128
bytes_per_scale = 2

ideal_weight_gb = parameters * bits_per_weight / 8 / 1_000_000_000
scale_metadata_gb = parameters / group_size * bytes_per_scale / 1_000_000_000

print(f"ideal INT4 weights: {ideal_weight_gb:.2f} GB")
print(f"one FP16 scale per {group_size} weights: {scale_metadata_gb:.2f} GB")
print("Alignment and mixed-precision tensors can add more.")

Output

ideal INT4 weights: 35.00 GB
one FP16 scale per 128 weights: 1.09 GB
Alignment and mixed-precision tensors can add more.

Sizing exercise

Try this before moving on. You have a workstation with one NVIDIA RTX 4060 (8 GB VRAM) and you want to run Llama 3 70B. The model has roughly 70 billion parameters.

How much VRAM would the model need in FP16?
How much would it need at INT4?
Can you run it on the 4060?

Use this tiny calculator to check the arithmetic without any framework overhead:

quantization-sizing.py

def weight_gb(parameters_billion: float, bits_per_weight: int) -> float:
    return parameters_billion * bits_per_weight / 8

params = 70
for bits in (16, 8, 4, 3):
    print(f"70B at {bits:>2}-bit weights: {weight_gb(params, bits):5.1f} GB")

gpu_vram_gb = 8
usable_vram_gb = gpu_vram_gb * 0.8
print(f"8 GB GPU with 20% reserve: {usable_vram_gb:.1f} GB usable")

Output

70B at 16-bit weights: 140.0 GB
70B at  8-bit weights:  70.0 GB
70B at  4-bit weights:  35.0 GB
70B at  3-bit weights:  26.2 GB
8 GB GPU with 20% reserve: 6.4 GB usable

Solution

FP16: $70 \times 2 = 140$ GB. The model doesn't fit.
INT4: $70 \times 0.5 = 35$ GB. The model still doesn't fit on an 8 GB card.
You can't run the full model on the GPU alone. You need GGUF with heavy CPU offloading, or a larger GPU, or a smaller model. The INT4 size is a huge improvement, but it isn't magic.

Weight-only vs. weight-activation quantization

Two broad approaches to quantization exist, and confusing them causes a lot of interview mistakes.

Common mistake: Candidates often claim W4A16 quantization gives a 4x end-to-end speedup. It does not guarantee that. Low-bit weights reduce raw weight traffic, but kernel overhead and non-weight work remain, and the workload may not be bandwidth-bound. The guaranteed first-order gain is smaller stored weights, not a fixed tokens-per-second multiplier.

bandwidth-saving-is-not-speedup.py

bandwidth_gb_s = 1_000
fp16_weight_gb = 14.0
int4_weight_gb = 3.5
other_work_ms = 3.0

fp16_ms = fp16_weight_gb / bandwidth_gb_s * 1000 + other_work_ms
int4_ms = int4_weight_gb / bandwidth_gb_s * 1000 + other_work_ms

print(f"raw weight traffic reduction: {fp16_weight_gb / int4_weight_gb:.1f}x")
print(f"illustrative step speedup with fixed overhead: {fp16_ms / int4_ms:.2f}x")

Output

raw weight traffic reduction: 4.0x
illustrative step speedup with fixed overhead: 2.62x

GPTQ (post-training quantization)

The intuition: balancing errors like a checkbook

Algorithm: minimize output error with curvature information

\begin{aligned} \hat{w} &= \arg\min_{\tilde{w} \in \mathcal{Q}} \|wX - \tilde{w}X\|_2^2 \\ &\approx \arg\min_{\tilde{w} \in \mathcal{Q}} (w - \tilde{w})^T H (w - \tilde{w}) \\ H &\approx XX^T \end{aligned}

Reading the formula

In practice, GPTQ looks like this:

Collect representative activations from a calibration set such as C4^[4].
Approximate $H \approx XX^T$ for each linear layer.
Quantize the weights sequentially while using an approximate inverse Hessian to compensate the remaining floating-point weights.
Pack the result into a low-bit format that an inference kernel can consume efficiently.

The original paper reports quantizing 175B-class models (OPT-175B and BLOOM-176B) in about four GPU-hours while preserving strong accuracy at 3-bit and 4-bit settings.^[3]

gptq-curvature-proxy.py

rounding_errors = [0.20, 0.20]
curvature_proxy = [25.0, 1.0]
weighted_cost = [h * error**2 for h, error in zip(curvature_proxy, rounding_errors)]

print(f"same absolute errors: {rounding_errors}")
print(f"curvature-weighted costs: {weighted_cost}")
print(f"first direction costs {weighted_cost[0] / weighted_cost[1]:.0f}x more to distort")

Output

same absolute errors: [0.2, 0.2]
curvature-weighted costs: [1.0000000000000002, 0.04000000000000001]
first direction costs 25x more to distort

reading-the-formula.py

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

# If quantized with device_map="auto", gather the model onto one device before saving.
model.to("cpu")
model.save_pretrained("opt-125m-gptq")
tokenizer.save_pretrained("opt-125m-gptq")

Granularity: per-tensor, per-channel, per-group

A naive quantizer uses one scale for the entire tensor. This is per-tensor quantization. It's cheap, but one large outlier can ruin the precision of everything else.

Modern LLM quantizers almost always use finer granularity:

Granularity	What Gets Its Own Scale	Accuracy	Metadata Overhead
Per-tensor	Entire tensor	Lowest	Lowest
Per-channel	One output channel / row	Better	Moderate
Per-group	Small group of weights, often 64 or 128	Common 4-bit choice	Moderate

Per-group quantization is the common compromise for 4-bit LLM inference. Smaller groups usually improve fidelity, but they also require storing more scale metadata and may reduce kernel efficiency.

AWQ (activation-aware weight quantization)

Not all weights are equally dangerous to quantize

awq-salient-channel-proxy.py

activations = [100.0, 0.1]
naive_weight_errors = [0.10, 0.10]
protected_weight_errors = [0.02, 0.10]

naive_output_error = sum(a * e for a, e in zip(activations, naive_weight_errors))
protected_output_error = sum(a * e for a, e in zip(activations, protected_weight_errors))

print(f"naive output-error proxy: {naive_output_error:.2f}")
print(f"protect high-activation channel: {protected_output_error:.2f}")

Output

naive output-error proxy: 10.01
protect high-activation channel: 2.01

Why protecting a few weights matters

Algorithm

AWQ doesn't rebuild the entire weight matrix the way GPTQ does. Instead, it uses an equivalent rescaling trick:

$W x = W \cdot \mathrm{diag}(s) \cdot \mathrm{diag}(s)^{-1} x \approx Q\!\left(W \cdot \mathrm{diag}(s)\right) \cdot \mathrm{diag}(s)^{-1} x$

Reading the formula

The practical workflow is:

Run representative inputs through the model and collect activation statistics.
Identify salient channels with unusually large activation magnitude.
Search for scaling factors that reduce the quantization error on those channels.
Quantize the rescaled weights into a hardware-friendly 4-bit format.

GGUF (llama.cpp format)

What is GGUF?

GGUF is a file format, not a quantization algorithm. It's the container format used by the ggml / llama.cpp ecosystem for local inference.^[2]

GGUF matters because it bundles the tensors with the metadata needed to run them:

tokenizer and vocabulary information
architecture metadata and tensor shapes
tensor-by-tensor quantization types inside one portable file
compatibility with CPU inference and partial GPU offload in local runtimes

That last point is why GGUF is so important for local LLMs. If the full model doesn't fit in VRAM, llama.cpp-style runtimes can keep some layers on the GPU and spill the rest to system memory.

gguf-partial-offload-budget.py

artifact_gib = 5.0
layers = 32
usable_gpu_gib = 4.0
runtime_reserve_gib = 0.5

layer_budget_gib = usable_gpu_gib - runtime_reserve_gib
gpu_layers = int(layer_budget_gib / (artifact_gib / layers))

print(f"GPU budget for model layers: {layer_budget_gib:.1f} GiB")
print(f"even-size approximation: {gpu_layers}/{layers} layers fit on GPU")
print("Measure real tensor placement and KV memory in the chosen runtime.")

Output

GPU budget for model layers: 3.5 GiB
even-size approximation: 22/32 layers fit on GPU
Measure real tensor placement and KV memory in the chosen runtime.

Quantization families inside GGUF

GGUF can store several quantization families. The format doesn't force one specific quantizer.

Family	Example	Extra Calibration	Typical Use
Legacy block quantization	`Q4_0`, `Q5_0`	No	Simple and widely supported
K-quants	`Q4_K_M`, `Q5_K_M`	No	Common local default for size/quality
IQ / iMatrix-aware formats	`IQ4_XS`, `IQ3_M`	Usually yes	Better quality when squeezing below comfortable 4-bit settings

iMatrix quantization

terminal

# 1) Convert a Hugging Face checkpoint to GGUF
python3 convert_hf_to_gguf.py ./Meta-Llama-3.1-8B-Instruct \
    --outtype f16 \
    --outfile llama-3.1-8b-f16.gguf

# 2) If needed: build an importance matrix from representative text
llama-imatrix \
    -m llama-3.1-8b-f16.gguf \
    -f calibration.txt \
    -o llama-3.1.imatrix.dat

# 3a) Common default without iMatrix
llama-quantize \
    llama-3.1-8b-f16.gguf \
    llama-3.1-8b-Q4_K_M.gguf \
    Q4_K_M

# 3b) Importance-aware quantization
llama-quantize \
    --imatrix llama-3.1.imatrix.dat \
    llama-3.1-8b-f16.gguf \
    llama-3.1-8b-IQ4_XS.gguf \
    IQ4_XS

Beyond weight-only: FP8 and KV cache quantization

GPTQ, AWQ, and GGUF all shrink the model weights. Once the weights are small, the next bottleneck is often the KV cache, the attention state that grows with sequence length.

FP8 sits adjacent to the big three rather than replacing them. It's an 8-bit floating-point format that becomes attractive when the serving hardware has native FP8 kernels.^[8]

FP8 has two common encodings:^[8]

E4M3: more mantissa precision, less dynamic range
E5M2: less mantissa precision, more dynamic range

Unlike 4-bit weight-only methods, FP8 is usually chosen when you want a milder accuracy-memory tradeoff and the accelerator is built to exploit FP8 directly.

kv-cache-after-weight-quantization.py

weights_int4_gib = 7_000_000_000 * 0.5 / 1024**3
batch, sequence = 32, 8_192
layers, kv_heads, head_dim, kv_bytes = 32, 8, 128, 2
kv_gib = 2 * batch * sequence * layers * kv_heads * head_dim * kv_bytes / 1024**3

print(f"hypothetical 7B INT4 weights: {weights_int4_gib:.1f} GiB")
print(f"FP16 KV cache at batch={batch}, context={sequence}: {kv_gib:.1f} GiB")
print("Shrinking weights alone does not solve long-context capacity.")

Output

hypothetical 7B INT4 weights: 3.3 GiB
FP16 KV cache at batch=32, context=8192: 32.0 GiB
Shrinking weights alone does not solve long-context capacity.

Comparison

When choosing a quantization strategy, the real question isn't "Which one is best?" It's "What hardware constraint am I solving for?"

Feature	GPTQ	AWQ	GGUF
Meaning	Weight-only PTQ algorithm	Weight-only PTQ algorithm	Portable file/container format
Core idea	Minimize layer output error with Hessian-weighted reconstruction	Protect salient weight channels using activation statistics	Store tensors + metadata + chosen ggml quantizers in one artifact
Calibration	Required	Required	Depends on quantizer; iMatrix uses representative data
Common deployment target	Fully GPU-resident serving	Fully GPU-resident serving	CPU, Apple Silicon, or mixed CPU/GPU
Strength	Mature second-order method	Strong 4-bit quality with hardware-friendly kernels	Single-file portability and partial GPU offload
Tradeoff	Offline quantization is heavier	Runtime/kernel compatibility still matters	Usually slower than specialized full-GPU kernels

Common pitfalls

Treating all quantizers as equivalent

Symptom: You choose "4-bit" from a model hub without checking whether it is GPTQ, AWQ, GGUF, or a runtime quantization path.

Cause: Bit width describes storage size, not calibration method, tensor layout, kernel support, offload behavior, or quality profile.

Fix: Name the artifact and the runtime together: "AWQ on vLLM," "GPTQ on Transformers," or "Q4_K_M GGUF on llama.cpp." Then test that exact pair.

The calibration trap

Symptom: Your quantized French customer-support model speaks gibberish, even though the English version quantized fine.

Fix: Use calibration text that matches the target domain and language, then verify quality on held-out target tasks.

Confusing weight-only and full quantization

Symptom: A design doc claims GPTQ or AWQ makes the entire model 4-bit.

Cause: GPTQ and AWQ are weight-only methods. Activations usually stay at FP16 or BF16, and accumulation happens in higher precision.

Fix: Write the precision contract explicitly. W4A16 means 4-bit stored weights and 16-bit activations, not full W4A4 inference.

The speed fallacy

Symptom: You quantize to 4-bit expecting a 4x speedup, but tokens per second barely improve.

Fix: Measure end-to-end tokens per second on your exact hardware and batch size. Bandwidth savings are real, but they only translate to speed when the runtime is optimized for your GPU.

Ignoring perplexity

Symptom: The quantized model still chats politely, but it hallucinates order statuses or generates invalid JSON for your shipping API.

Fix: Always pair perplexity with task-specific benchmarks. For a logistics model, run your own production eval set that includes the exact output formats the model must produce.

Confusing GGUF with the quantizer

Symptom: Someone says "we used GGUF quantization" as if that fully specifies the quality and runtime behavior.

Cause: GGUF is the container. Q4_K_M, IQ4_XS, Q5_K_M, and related tensor types describe the actual low-bit encoding inside the file.

Fix: Report both: "GGUF Q4_K_M with 20 GPU layers," "GGUF IQ4_XS with iMatrix," or another concrete artifact/runtime pairing.

How to evaluate a quantized model

GPTQ and AWQ both report that 4-bit weight-only quantization can preserve language-modeling quality surprisingly well on large models, while more aggressive bit widths degrade more sharply.^[3]^[6]

Evaluation Axis	What To Measure	Why It Matters
Language modeling	Held-out perplexity	Fast check that next-token behavior didn't drift too far
Reasoning and knowledge	MMLU^[9], GSM8K^[10]	Catches multi-step failures that perplexity can hide
Code generation	HumanEval^[11] or your own coding eval	Code is often more brittle than chat completion
Systems performance	Tokens/s, VRAM use, max context, cold-start time	Quantization is a systems tradeoff, not just an accuracy number

Three practical rules:

GPTQ and AWQ results support testing 4-bit weight-only artifacts before pushing to more aggressive bit widths.^[3]^[6]
Task-specific regressions cannot be inferred from a generic quality score; reasoning, structured output, and tool-use tasks need their own gates.
Group size, calibration data, and the serving kernel can matter as much as the headline format name.

quantized-artifact-approval-gate.py

candidates = [
    {"name": "FP16", "task_accuracy": 0.93, "p95_ms": 70, "vram_gib": 14.0},
    {"name": "AWQ-4bit", "task_accuracy": 0.92, "p95_ms": 49, "vram_gib": 4.1},
    {"name": "aggressive-3bit", "task_accuracy": 0.85, "p95_ms": 43, "vram_gib": 3.2},
]
minimum_accuracy = 0.90
maximum_vram_gib = 8.0
approved = [c["name"] for c in candidates if c["task_accuracy"] >= minimum_accuracy and c["vram_gib"] <= maximum_vram_gib]

print(f"approved artifacts: {approved}")
print("Smaller artifact is rejected when task quality misses the gate.")

Output

approved artifacts: ['AWQ-4bit']
Smaller artifact is rejected when task quality misses the gate.

Decision guide

Choose based on the bottleneck

Scenario	Recommended Starting Point	Why
Production GPU server, model fits in VRAM	Benchmark AWQ or GPTQ	Candidate artifacts for specialized GPU kernels
24 GB GPU, model is too large	GGUF with partial offload	Lets system RAM absorb the overflow
CPU or Apple Silicon laptop	GGUF	Common local-runtime artifact path
Datacenter accelerator with native FP8 path	FP8 + KV-cache tuning	Better quality/memory tradeoff than jumping straight to INT4

Practical rule of thumb: if the whole model fits on GPU, benchmark an AWQ or GPTQ path supported by your runtime. If it does not, benchmark a GGUF partial-offload path.

Mastery check

By the end of this lesson, you should be able to:

Distinguish symmetric quantization from asymmetric quantization with scale and zero point.
Explain why GPTQ uses approximate second-order information to minimize layer reconstruction error.
Describe AWQ's activation-aware channel protection and why activation statistics matter.
Explain GGUF as a portable container format, not a quantization algorithm.
Predict how calibration data and group size affect quantization quality.
Evaluate a quantized model with perplexity, task benchmarks, and serving metrics.
Separate weight-only quantization, activation quantization, and KV-cache quantization.

Evaluation rubric

Needs work: You can repeat bit widths, but you can't explain why low-bit storage helps bandwidth before it helps model quality or speed.
Developing: You can describe one format such as GPTQ or GGUF, but you still mix up quantization algorithms, container formats, and runtime offload choices.
Solid: You can size a model at FP16, INT8, and INT4, explain why runtime memory exceeds raw weight math, and choose between GPTQ, AWQ, and GGUF for a concrete hardware limit.
Strong: You can explain why GPTQ protects layer outputs, why AWQ protects activation-sensitive channels, and why calibration data and kernel support decide whether a quantized checkpoint is actually deployable.
Excellent: You can design an approval gate that combines perplexity, task evals, serving metrics, fallback artifacts, and KV-cache checks before routing production traffic to a low-bit model.

Follow-up questions

Your 34B model is 6 GB over the VRAM budget after ordinary FP16 sizing. Which artifact do you try first?

Your French customer-support model quantizes badly after calibrating on English Wikipedia. What likely went wrong?

A single outlier channel ruins your per-tensor 4-bit result. Why does per-group or per-channel scaling help?

After moving to W4A16, your long-context agent still runs out of memory at high concurrency. What lever do you pull next?

You have hardware with strong native FP8 support and an 8B model that already fits cleanly in VRAM. Why might FP8 beat a jump straight to INT4?

Summary

Quantization basics: quantization stores weights with fewer bits using scales and, sometimes, zero points. The main win is lower memory bandwidth and smaller model artifacts.
GPTQ: uses representative activations and approximate second-order information to minimize output error during post-training quantization.^[3]
AWQ: identifies activation-sensitive channels and rescales them so the quantizer spends precision where it matters most.^[6]
GGUF: is a portable container for local inference that can store many ggml quantization types and supports CPU/GPU-offload workflows.^[2]
Deployment choice: benchmark AWQ or GPTQ for fully GPU-resident serving, and benchmark GGUF when portability or partial offload is the main constraint.

Course handoff

Next Step

Continue to Local LLM Deployment

Quantization reduces the bytes moved for each model call; local deployment shows how to package that compressed model into a measured service with hardware budgets and rollback.

PreviousModel Parallelism for LLM Inference

Share this article

X Facebook LinkedIn Bluesky Reddit Hacker News Email

References

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.

Xiao, G., et al. · 2023 · ICML 2023

llama.cpp: Inference of LLaMA model in pure C/C++

Gerganov, G. · 2023

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.

Frantar, E., et al. · 2023 · ICLR 2023

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.

Raffel, C., et al. · 2020 · JMLR

GPTQ

Hugging Face · 2026

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.

Lin, J., et al. · 2023 · MLSys 2024

AWQ

Hugging Face · 2026

FP8 Formats for Deep Learning.

Micikevicius, P., et al. · 2022

Measuring Massive Multitask Language Understanding (MMLU).

Hendrycks, D., et al. · 2021 · ICLR 2021

Training Verifiers to Solve Math Word Problems (GSM8K).

Cobbe, K., et al. · 2021

Evaluating Large Language Models Trained on Code (HumanEval).

Chen, M., et al. · 2021 · arXiv preprint

Model Quantization: GPTQ, AWQ & GGUF

Model Quantization: GPTQ, AWQ, and GGUF

What problem does quantization solve first: speed, memory, or accuracy?

What is quantization?

A tiny worked example

In the tiny example, why is scale choice the main quality decision?

The quantization formula

Reading the formula

What information is lost during quantization, and what do scale and zero point preserve?

Symmetric vs. asymmetric

Why are activations usually harder to quantize than weights?

Memory savings

Why is ideal INT4 storage for a 70B model about 35 GB, but real files can be larger?

Sizing exercise

Solution

Why can't an 8 GB RTX 4060 run a full 70B model entirely on GPU even at INT4?

Weight-only vs. weight-activation quantization

What does W4A16 mean, and why doesn't it imply that every computation is 4-bit?

GPTQ (post-training quantization)

The intuition: balancing errors like a checkbook

Algorithm: minimize output error with curvature information

Reading the formula

What makes GPTQ different from plain round-to-nearest quantization?

Granularity: per-tensor, per-channel, per-group

Why is per-group quantization a common 4-bit compromise?

AWQ (activation-aware weight quantization)

Not all weights are equally dangerous to quantize

What does AWQ mean by a salient weight channel?

Why protecting a few weights matters

Algorithm

Reading the formula

How should you choose between GPTQ and AWQ when both are available?

GGUF (llama.cpp format)

What is GGUF?

Why is GGUF not the same kind of thing as GPTQ or AWQ?

Quantization families inside GGUF

iMatrix quantization

When does iMatrix-style GGUF quantization help most?

Beyond weight-only: FP8 and KV cache quantization

Why can KV-cache quantization matter after weight quantization succeeds?

Comparison

What is the fastest decision rule for GPTQ, AWQ, and GGUF?

Common pitfalls

Treating all quantizers as equivalent

The calibration trap

Confusing weight-only and full quantization

The speed fallacy

Ignoring perplexity

Confusing GGUF with the quantizer

How to evaluate a quantized model

Why isn't perplexity enough to approve a quantized model?

Decision guide

Choose based on the bottleneck

For a 24 GB GPU and a model too large for full VRAM residency, why is GGUF a reasonable first artifact?

Mastery check

Evaluation rubric

Follow-up questions

Your 34B model is 6 GB over the VRAM budget after ordinary FP16 sizing. Which artifact do you try first?

Your French customer-support model quantizes badly after calibrating on English Wikipedia. What likely went wrong?

A single outlier channel ruins your per-tensor 4-bit result. Why does per-group or per-channel scaling help?

After moving to W4A16, your long-context agent still runs out of memory at high concurrency. What lever do you pull next?

You have hardware with strong native FP8 support and an 8B model that already fits cleanly in VRAM. Why might FP8 beat a jump straight to INT4?

Summary

Course handoff

Model Quantization: GPTQ, AWQ & GGUF

Model Quantization: GPTQ, AWQ, and GGUF

What problem does quantization solve first: speed, memory, or accuracy?

What is quantization?

A tiny worked example

In the tiny example, why is scale choice the main quality decision?

The quantization formula

Reading the formula

What information is lost during quantization, and what do scale and zero point preserve?

Symmetric vs. asymmetric

Why are activations usually harder to quantize than weights?

Memory savings

Why is ideal INT4 storage for a 70B model about 35 GB, but real files can be larger?

Sizing exercise

Solution

Why can't an 8 GB RTX 4060 run a full 70B model entirely on GPU even at INT4?