Understand how GPTQ, AWQ, and GGUF trade off accuracy, memory footprint, and portability when serving LLMs on GPUs or local hardware.
Model parallelism splits a large model across several accelerators. Quantization asks a different question: what if each weight used fewer bytes before you split anything?
Model quantization stores selected tensors with fewer bits so large models can require less memory and potentially less bandwidth. This chapter explains GPTQ, a Hessian-aware post-training quantization method, AWQ (Activation-Aware Weight Quantization), and GGUF, a portable local-inference container format, from the serving tradeoff: what footprint you reduce, what quality can move, and which runtime can use the artifact.
Imagine you run the AI team for an online marketplace. A 70 billion parameter model needs about 140 GB of weight storage in FP16 or BF16. That exceeds one 80 GB accelerator before KV cache or runtime buffers. You need to test whether a lower-bit artifact fits while preserving required task quality.
Quantization is that technique. It works like replacing a precise ruler with a coarser grid: the storage reduction is predictable, while the error depends on which values the grid distorts. Instead of storing every model weight as a high-precision floating-point number, a weight-only method packs approximate low-bit values plus metadata. Some models and tasks tolerate that approximation well; others regress enough that the artifact must be rejected.
By this point in the curriculum you already know how a Transformer turns tokens into predictions. The next question is how to fit the resulting model onto real hardware. LLM inference is often memory-bandwidth bound, which means the accelerator spends much of its time streaming weights from memory instead of saturating the math units. Quantization helps because 4-bit weights cut raw weight traffic to about one quarter of FP16. Real speedups are smaller than 4x because kernels still have to dequantize and accumulate in higher precision, but the memory savings are real. Shrinking weights from about 140 GB to about 35 GB can be the difference between "needs multiple datacenter GPUs" and "fits on a workstation or mixed CPU/GPU local setup." This article breaks down GPTQ, AWQ, and GGUF: what problem each one solves, and when to use each.
Think of it like rounding numbers on a ruler. A precise ruler has millimeter markings, so you can measure 3.7 mm, 4.2 mm, 12.1 mm. A coarse ruler only has centimeter markings, so those same values become 4 cm, 4 cm, 12 cm. You lose some precision, but the ruler is simpler and cheaper. Quantization does this to every number in a neural network.
Before the general formula, walk through one weight by hand. Suppose a weight has the value and you choose a scale of . That means each integer step represents in the original space.
During inference you reverse the process:
The stored value is , not . The error is tiny (), and when this happens across billions of weights the model usually stays useful. If you had chosen a coarser scale of , the same weight would become and the reconstructed value would be , which is a much larger error. The art of quantization is choosing the right scale so the errors stay small where they matter most.
At the simplest level, quantization maps a floating-point weight to a smaller integer range using a scale and, in some schemes, a zero point:
Take the original weight , divide by the scale to express it in "integer-sized steps," shift by the zero point , round to the nearest integer, and clamp to the representable range. That's how a high-precision weight becomes an INT8 or INT4 value.
Dequantization reverses the process during inference:
Subtract the zero point from the stored integer, then multiply by the scale. The result is only an approximation of the original weight because rounding already threw away information.
1weights = [0.15, -1.22, 2.40, -0.45]
2qmax = 7
3scale = max(abs(weight) for weight in weights) / qmax
4quantized = [max(-qmax, min(qmax, round(weight / scale))) for weight in weights]
5restored = [value * scale for value in quantized]
6mean_error = sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)
7
8print(f"scale: {scale:.5f}")
9print(f"INT4 values: {quantized}")
10print(f"reconstructed: {[round(value, 3) for value in restored]}")
11print(f"mean absolute error: {mean_error:.3f}")1scale: 0.34286
2INT4 values: [0, -4, 7, -1]
3reconstructed: [0.0, -1.371, 2.4, -0.343]
4mean absolute error: 0.102Imagine a warehouse scale that stores package-weight readings using only 16 possible markings. Two strategies are available:
More formally:
The key idea: most LLM weight tensors are roughly zero-centered, so symmetric or near-symmetric per-group quantization often works well for weights. Activations are harder because a few channels can contain very large outliers. Techniques like SmoothQuant[1] make activation quantization easier by shifting some of that difficulty into the weights.
1activations = [0.0, 1.2, 1.8, 2.6, 3.0]
2signed_qmax = 7
3unsigned_qmax = 15
4
5symmetric_scale = max(activations) / signed_qmax
6asymmetric_scale = (max(activations) - min(activations)) / unsigned_qmax
7
8print(f"symmetric signed step: {symmetric_scale:.3f}")
9print(f"asymmetric unsigned step: {asymmetric_scale:.3f}")
10print("A non-negative activation range can use more 4-bit levels asymmetrically.")1symmetric signed step: 0.429
2asymmetric unsigned step: 0.200
3A non-negative activation range can use more 4-bit levels asymmetrically.The figure below shows the basic quantization pipeline. Start with high-precision weights, estimate the statistics needed to compute scales, then pack the results into a low-bit representation plus metadata.
The most obvious benefit of quantization is memory reduction. That directly translates to lower serving cost, larger batch sizes, and the ability to fit bigger models on smaller devices.
The table below is weights only. Actual runtime memory is higher because you still pay for activations, the KV cache, scale metadata, and framework overhead.
| Precision | Bits/Weight | 7B Weights | 70B Weights |
|---|---|---|---|
| FP16 / BF16 | 16 | ~14 GB | ~140 GB |
| INT8 / FP8 | 8 | ~7 GB | ~70 GB |
| INT4 | 4 | ~3.5 GB | ~35 GB |
| INT3 | 3 | ~2.6 GB | ~26 GB |
Those numbers are idealized weight math. Real packed formats are larger when they also store per-group scales, alignment, or tensors kept at higher precision. For a GGUF artifact, inspect the selected quantization type and actual file/runtime footprint rather than assuming the ideal 35 GB number.[2]
Because LLM generation is often weight-bandwidth bound, shrinking the weights can also speed up inference. The raw bandwidth demand drops almost linearly with bit width, although the observed throughput gain depends on the kernel, batch size, and how much extra work the runtime does to unpack the weights.
1parameters = 70_000_000_000
2bits_per_weight = 4
3group_size = 128
4bytes_per_scale = 2
5
6ideal_weight_gb = parameters * bits_per_weight / 8 / 1_000_000_000
7scale_metadata_gb = parameters / group_size * bytes_per_scale / 1_000_000_000
8
9print(f"ideal INT4 weights: {ideal_weight_gb:.2f} GB")
10print(f"one FP16 scale per {group_size} weights: {scale_metadata_gb:.2f} GB")
11print("Alignment and mixed-precision tensors can add more.")1ideal INT4 weights: 35.00 GB
2one FP16 scale per 128 weights: 1.09 GB
3Alignment and mixed-precision tensors can add more.Try this before moving on. You have a workstation with one NVIDIA RTX 4060 (8 GB VRAM) and you want to run Llama 3 70B. The model has roughly 70 billion parameters.
Use this tiny calculator to check the arithmetic without any framework overhead:
1def weight_gb(parameters_billion: float, bits_per_weight: int) -> float:
2 return parameters_billion * bits_per_weight / 8
3
4params = 70
5for bits in (16, 8, 4, 3):
6 print(f"70B at {bits:>2}-bit weights: {weight_gb(params, bits):5.1f} GB")
7
8gpu_vram_gb = 8
9usable_vram_gb = gpu_vram_gb * 0.8
10print(f"8 GB GPU with 20% reserve: {usable_vram_gb:.1f} GB usable")170B at 16-bit weights: 140.0 GB
270B at 8-bit weights: 70.0 GB
370B at 4-bit weights: 35.0 GB
470B at 3-bit weights: 26.2 GB
58 GB GPU with 20% reserve: 6.4 GB usableThis is exactly the calculation you should do before choosing a quantization strategy. The formula is simple: parameters bytes per parameter = weight footprint. Remember that runtime memory is higher because of activations, KV cache, and metadata.
Two broad approaches to quantization exist, and confusing them causes a lot of interview mistakes.
Weight-only quantization is what GPTQ and AWQ target. The stored weights are low precision, while optimized kernels unpack or dequantize low-bit values during computation and accumulate with higher-precision activations or accumulators. This is why you often see notation like W4A16: 4-bit stored weights, 16-bit activations.
Weight-activation quantization pushes both sides of the matmul down, for example W8A8 or W4A8. This can be faster and more memory-efficient, but it's harder because activation distributions are spikier and less stable than weight distributions. Techniques like SmoothQuant[1] exist specifically to make weight-activation quantization practical.
GPTQ and AWQ are weight-only methods. They capture large weight-memory savings without requiring equally low-bit activations. Lower-bit activation paths such as W4A4 need separate hardware, kernel, and quality validation.
Common mistake: Candidates often claim W4A16 quantization gives a 4x end-to-end speedup. It does not guarantee that. Low-bit weights reduce raw weight traffic, but kernel overhead and non-weight work remain, and the workload may not be bandwidth-bound. The guaranteed first-order gain is smaller stored weights, not a fixed tokens-per-second multiplier.
1bandwidth_gb_s = 1_000
2fp16_weight_gb = 14.0
3int4_weight_gb = 3.5
4other_work_ms = 3.0
5
6fp16_ms = fp16_weight_gb / bandwidth_gb_s * 1000 + other_work_ms
7int4_ms = int4_weight_gb / bandwidth_gb_s * 1000 + other_work_ms
8
9print(f"raw weight traffic reduction: {fp16_weight_gb / int4_weight_gb:.1f}x")
10print(f"illustrative step speedup with fixed overhead: {fp16_ms / int4_ms:.2f}x")1raw weight traffic reduction: 4.0x
2illustrative step speedup with fixed overhead: 2.62xImagine a layer has only two weights for simplicity: and . A naive quantizer might round both toward the nearest integer, turning them into and . The first weight lost , the second gained . The total output change depends on how the layer uses those weights.
If the calibration data shows that is multiplied by large activations and by small ones, the error on hurts the output far more than the error on . GPTQ notices this through the Hessian approximation and compensates: it might quantize more carefully, or adjust in the opposite direction to cancel some of the damage. The goal isn't to make every weight close to its original value. The goal is to keep the layer's output on real data as close as possible to the original output.
GPTQ works like carefully removing blocks from a Jenga tower. When you remove one block (round one weight), the tower shifts. A careless player ignores the shift and keeps pulling until the tower collapses. GPTQ compensates for the shift. After quantizing one set of weights, it updates the remaining floating-point weights so the layer output changes as little as possible.
GPTQ[3] is a one-shot post-training quantization method based on approximate second-order information. It builds on the layer-wise Optimal Brain Quantization (OBQ) solver, which quantizes weights one at a time and updates the remaining weights to minimize the layer's output error. GPTQ makes that idea fast enough for billion-parameter models by quantizing weights in a fixed order and using lazy batched updates. The key objective isn't "make the quantized weights numerically close to the originals." It's "make the layer output on real activations stay close to the original output." For a weight row and calibration activations , GPTQ approximates:
The Hessian approximation tells GPTQ which input directions matter most on the calibration set. A small error on an unimportant direction is cheap. The same numeric error on a frequently used direction is expensive. That's why GPTQ usually beats plain round-to-nearest quantization at the same bit width.
In practice, GPTQ looks like this:
The original paper reports quantizing 175B-class models (OPT-175B and BLOOM-176B) in about four GPU-hours while preserving strong accuracy at 3-bit and 4-bit settings.[3]
1rounding_errors = [0.20, 0.20]
2curvature_proxy = [25.0, 1.0]
3weighted_cost = [h * error**2 for h, error in zip(curvature_proxy, rounding_errors)]
4
5print(f"same absolute errors: {rounding_errors}")
6print(f"curvature-weighted costs: {weighted_cost}")
7print(f"first direction costs {weighted_cost[0] / weighted_cost[1]:.0f}x more to distort")1same absolute errors: [0.2, 0.2]
2curvature-weighted costs: [1.0000000000000002, 0.04000000000000001]
3first direction costs 25x more to distortThe configuration sketch below shows the shape of a transformers GPTQ workflow. Backend packages and supported arguments change over time, so verify current library documentation and evaluate the resulting artifact before deployment.[5]
1from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
2
3model_id = "facebook/opt-125m"
4tokenizer = AutoTokenizer.from_pretrained(model_id)
5
6gptq_config = GPTQConfig(
7 bits=4,
8 dataset="c4",
9 tokenizer=tokenizer,
10)
11
12model = AutoModelForCausalLM.from_pretrained(
13 model_id,
14 device_map="auto",
15 quantization_config=gptq_config,
16)
17
18# If quantized with device_map="auto", gather the model onto one device before saving.
19model.to("cpu")
20model.save_pretrained("opt-125m-gptq")
21tokenizer.save_pretrained("opt-125m-gptq")A naive quantizer uses one scale for the entire tensor. This is per-tensor quantization. It's cheap, but one large outlier can ruin the precision of everything else.
Modern LLM quantizers almost always use finer granularity:
| Granularity | What Gets Its Own Scale | Accuracy | Metadata Overhead |
|---|---|---|---|
| Per-tensor | Entire tensor | Lowest | Lowest |
| Per-channel | One output channel / row | Better | Moderate |
| Per-group | Small group of weights, often 64 or 128 | Common 4-bit choice | Moderate |
Per-group quantization is the common compromise for 4-bit LLM inference. Smaller groups usually improve fidelity, but they also require storing more scale metadata and may reduce kernel efficiency.
Imagine renovating a building and needing to remove material wherever possible. Most walls can be thinned or moved with little impact. A few are load-bearing. Disturb those and the whole structure becomes unstable. AWQ tries to identify the "load-bearing" weight channels.
AWQ[6] starts from the observation that activation magnitudes are not evenly distributed. A small fraction of channels carry disproportionately large activations. If the corresponding weight columns are quantized poorly, the downstream matmul error gets amplified. The AWQ paper reports that protecting only about 1% of salient weights can greatly reduce quantization error.[6]
1activations = [100.0, 0.1]
2naive_weight_errors = [0.10, 0.10]
3protected_weight_errors = [0.02, 0.10]
4
5naive_output_error = sum(a * e for a, e in zip(activations, naive_weight_errors))
6protected_output_error = sum(a * e for a, e in zip(activations, protected_weight_errors))
7
8print(f"naive output-error proxy: {naive_output_error:.2f}")
9print(f"protect high-activation channel: {protected_output_error:.2f}")1naive output-error proxy: 10.01
2protect high-activation channel: 2.01Imagine the same two-weight layer, but now the activation pattern is extreme: is almost always multiplied by , while is multiplied by . A rounding error on becomes a unit output error, while the same error on becomes only units. AWQ identifies these "high-traffic" channels and rescales them so the quantizer spends more of its limited integer range on the weights that matter most.
AWQ doesn't rebuild the entire weight matrix the way GPTQ does. Instead, it uses an equivalent rescaling trick:
Multiply important weight columns by a scaling vector before quantization so they occupy more of the available integer range. Then divide the corresponding activation channels by the same factor. The floating-point computation stays equivalent, but the quantizer now spends more precision on the columns that matter most.
The practical workflow is:
AWQ artifacts are typically produced offline and loaded by a serving runtime that understands the artifact's quantization metadata, such as group size and zero-point policy. Loader and kernel support varies by runtime version, so verify the chosen artifact/runtime pair before benchmarking.[7]
Compared with GPTQ, AWQ is lighter-weight because it avoids GPTQ's reconstruction step. It often works especially well on instruction-tuned checkpoints, but the final speed and latency picture still depends on the runtime and kernel implementation.[6][7]
GGUF is a file format, not a quantization algorithm. It's the container format used by the ggml / llama.cpp ecosystem for local inference.[2]
The key idea: GPTQ and AWQ mainly answer the question "How should I quantize the weights?" GGUF answers a different question: "How should I package model tensors and metadata so local runtimes can load and run them efficiently?"
GGUF matters because it bundles the tensors with the metadata needed to run them:
That last point is why GGUF is so important for local LLMs. If the full model doesn't fit in VRAM, llama.cpp-style runtimes can keep some layers on the GPU and spill the rest to system memory.
1artifact_gib = 5.0
2layers = 32
3usable_gpu_gib = 4.0
4runtime_reserve_gib = 0.5
5
6layer_budget_gib = usable_gpu_gib - runtime_reserve_gib
7gpu_layers = int(layer_budget_gib / (artifact_gib / layers))
8
9print(f"GPU budget for model layers: {layer_budget_gib:.1f} GiB")
10print(f"even-size approximation: {gpu_layers}/{layers} layers fit on GPU")
11print("Measure real tensor placement and KV memory in the chosen runtime.")1GPU budget for model layers: 3.5 GiB
2even-size approximation: 22/32 layers fit on GPU
3Measure real tensor placement and KV memory in the chosen runtime.GGUF can store several quantization families. The format doesn't force one specific quantizer.
| Family | Example | Extra Calibration | Typical Use |
|---|---|---|---|
| Legacy block quantization | Q4_0, Q5_0 | No | Simple and widely supported |
| K-quants | Q4_K_M, Q5_K_M | No | Common local default for size/quality |
| IQ / iMatrix-aware formats | IQ4_XS, IQ3_M | Usually yes | Better quality when squeezing below comfortable 4-bit settings |
Q4_K_M is a commonly encountered local-inference candidate, but the right choice depends on target model, quality gate, and hardware. If the full model fits in VRAM, benchmark GPU-oriented GPTQ or AWQ artifacts against the local runtime; if partial offload is required, GGUF is a useful packaging option.
Importance-matrix quantization uses representative text to estimate which directions are expensive to distort. That extra signal lets IQ formats spend precision where it buys the most quality. Conceptually, it fills the same role as calibration in GPTQ and AWQ: representative data tells the quantizer what errors matter most.
The commands below illustrate a llama.cpp-style conversion and quantization path. Binary names and supported quant types can change, so check the installed revision's documentation before running it.[2]
1# 1) Convert a Hugging Face checkpoint to GGUF
2python3 convert_hf_to_gguf.py ./Meta-Llama-3.1-8B-Instruct \
3 --outtype f16 \
4 --outfile llama-3.1-8b-f16.gguf
5
6# 2) If needed: build an importance matrix from representative text
7llama-imatrix \
8 -m llama-3.1-8b-f16.gguf \
9 -f calibration.txt \
10 -o llama-3.1.imatrix.dat
11
12# 3a) Common default without iMatrix
13llama-quantize \
14 llama-3.1-8b-f16.gguf \
15 llama-3.1-8b-Q4_K_M.gguf \
16 Q4_K_M
17
18# 3b) Importance-aware quantization
19llama-quantize \
20 --imatrix llama-3.1.imatrix.dat \
21 llama-3.1-8b-f16.gguf \
22 llama-3.1-8b-IQ4_XS.gguf \
23 IQ4_XSGPTQ, AWQ, and GGUF all shrink the model weights. Once the weights are small, the next bottleneck is often the KV cache, the attention state that grows with sequence length.
FP8 sits adjacent to the big three rather than replacing them. It's an 8-bit floating-point format that becomes attractive when the serving hardware has native FP8 kernels.[8]
FP8 has two common encodings:[8]
Unlike 4-bit weight-only methods, FP8 is usually chosen when you want a milder accuracy-memory tradeoff and the accelerator is built to exploit FP8 directly.
A newer branch of the same idea is 4-bit floating-point. MXFP4 and NVFP4 are block-scaled FP4 formats that pair tiny FP4 values with a shared per-block scale: MXFP4 uses 32-element blocks with a power-of-two (E8M0) scale, while NVFP4 uses 16-element blocks with an FP8 (E4M3) scale for finer resolution. These are not competitors to GPTQ and AWQ; in practice GPTQ-style or AWQ-style calibration is applied on top of an FP4 format to recover accuracy. Native FP4 tensor-core acceleration requires recent NVIDIA Blackwell hardware, and on older GPUs runtimes still load FP4 weights through software kernels that give the memory savings without the full compute speedup. MXFP4 has seen production use (for example in GPT-OSS releases); confirm format and hardware support against your runtime before committing to it.
Quantizing the KV cache is a separate lever. Weight quantization shrinks static model weights. KV-cache quantization shrinks attention state that grows with sequence length. Once weights fit, long-context concurrency can become limited by KV state; a runtime may then offer lower-precision KV storage as another measured tradeoff.
1weights_int4_gib = 7_000_000_000 * 0.5 / 1024**3
2batch, sequence = 32, 8_192
3layers, kv_heads, head_dim, kv_bytes = 32, 8, 128, 2
4kv_gib = 2 * batch * sequence * layers * kv_heads * head_dim * kv_bytes / 1024**3
5
6print(f"hypothetical 7B INT4 weights: {weights_int4_gib:.1f} GiB")
7print(f"FP16 KV cache at batch={batch}, context={sequence}: {kv_gib:.1f} GiB")
8print("Shrinking weights alone does not solve long-context capacity.")1hypothetical 7B INT4 weights: 3.3 GiB
2FP16 KV cache at batch=32, context=8192: 32.0 GiB
3Shrinking weights alone does not solve long-context capacity.When choosing a quantization strategy, the real question isn't "Which one is best?" It's "What hardware constraint am I solving for?"
| Feature | GPTQ | AWQ | GGUF |
|---|---|---|---|
| Meaning | Weight-only PTQ algorithm | Weight-only PTQ algorithm | Portable file/container format |
| Core idea | Minimize layer output error with Hessian-weighted reconstruction | Protect salient weight channels using activation statistics | Store tensors + metadata + chosen ggml quantizers in one artifact |
| Calibration | Required | Required | Depends on quantizer; iMatrix uses representative data |
| Common deployment target | Fully GPU-resident serving | Fully GPU-resident serving | CPU, Apple Silicon, or mixed CPU/GPU |
| Strength | Mature second-order method | Strong 4-bit quality with hardware-friendly kernels | Single-file portability and partial GPU offload |
| Tradeoff | Offline quantization is heavier | Runtime/kernel compatibility still matters | Usually slower than specialized full-GPU kernels |
AWQ's paper reports strong 4-bit results by protecting activation-sensitive channels.[6] GPTQ remains relevant when a runtime or kernel stack supports its packed artifacts efficiently.[3] On GPU servers, benchmark the exact low-bit kernel path. For local or partial-offload deployments, benchmark the GGUF runtime and placement plan rather than assuming a format name decides performance.
Symptom: You choose "4-bit" from a model hub without checking whether it is GPTQ, AWQ, GGUF, or a runtime quantization path.
Cause: Bit width describes storage size, not calibration method, tensor layout, kernel support, offload behavior, or quality profile.
Fix: Name the artifact and the runtime together: "AWQ on vLLM," "GPTQ on Transformers," or "Q4_K_M GGUF on llama.cpp." Then test that exact pair.
Symptom: Your quantized French customer-support model speaks gibberish, even though the English version quantized fine.
Cause: GPTQ and AWQ both rely on calibration data to understand which weights matter. If you use English Wikipedia to quantize a model trained on French retail data, the activation statistics are wrong and the quantizer throws away precision in the wrong places.
Fix: Use calibration text that matches the target domain and language, then verify quality on held-out target tasks.
Symptom: A design doc claims GPTQ or AWQ makes the entire model 4-bit.
Cause: GPTQ and AWQ are weight-only methods. Activations usually stay at FP16 or BF16, and accumulation happens in higher precision.
Fix: Write the precision contract explicitly. W4A16 means 4-bit stored weights and 16-bit activations, not full W4A4 inference.
Symptom: You quantize to 4-bit expecting a 4x speedup, but tokens per second barely improve.
Cause: 4-bit weights save memory bandwidth, but the kernel still has to dequantize them into higher precision before the matrix multiply. If the dequantization code path is slow or the GPU isn't memory-bound to begin with, the speedup shrinks.
Fix: Measure end-to-end tokens per second on your exact hardware and batch size. Bandwidth savings are real, but they only translate to speed when the runtime is optimized for your GPU.
Symptom: The quantized model still chats politely, but it hallucinates order statuses or generates invalid JSON for your shipping API.
Cause: Perplexity on held-out text is a fast sanity check, but a model can show only a small perplexity increase while regressing sharply on structured tasks like code generation or multi-step reasoning.
Fix: Always pair perplexity with task-specific benchmarks. For a logistics model, run your own production eval set that includes the exact output formats the model must produce.
Symptom: Someone says "we used GGUF quantization" as if that fully specifies the quality and runtime behavior.
Cause: GGUF is the container. Q4_K_M, IQ4_XS, Q5_K_M, and related tensor types describe the actual low-bit encoding inside the file.
Fix: Report both: "GGUF Q4_K_M with 20 GPU layers," "GGUF IQ4_XS with iMatrix," or another concrete artifact/runtime pairing.
Perplexity is a good first sanity check, but it is not enough. A quantized model can show a modest change in perplexity while still regressing on code generation, structured output, or domain decisions. A returns-classification model should therefore be tested on representative extraction and action-format cases, not only general text.
GPTQ and AWQ both report that 4-bit weight-only quantization can preserve language-modeling quality surprisingly well on large models, while more aggressive bit widths degrade more sharply.[3][6]
| Evaluation Axis | What To Measure | Why It Matters |
|---|---|---|
| Language modeling | Held-out perplexity | Fast check that next-token behavior didn't drift too far |
| Reasoning and knowledge | MMLU[9], GSM8K[10] | Catches multi-step failures that perplexity can hide |
| Code generation | HumanEval[11] or your own coding eval | Code is often more brittle than chat completion |
| Systems performance | Tokens/s, VRAM use, max context, cold-start time | Quantization is a systems tradeoff, not just an accuracy number |
Three practical rules:
1candidates = [
2 {"name": "FP16", "task_accuracy": 0.93, "p95_ms": 70, "vram_gib": 14.0},
3 {"name": "AWQ-4bit", "task_accuracy": 0.92, "p95_ms": 49, "vram_gib": 4.1},
4 {"name": "aggressive-3bit", "task_accuracy": 0.85, "p95_ms": 43, "vram_gib": 3.2},
5]
6minimum_accuracy = 0.90
7maximum_vram_gib = 8.0
8approved = [c["name"] for c in candidates if c["task_accuracy"] >= minimum_accuracy and c["vram_gib"] <= maximum_vram_gib]
9
10print(f"approved artifacts: {approved}")
11print("Smaller artifact is rejected when task quality misses the gate.")1approved artifacts: ['AWQ-4bit']
2Smaller artifact is rejected when task quality misses the gate.
| Scenario | Recommended Starting Point | Why |
|---|---|---|
| Production GPU server, model fits in VRAM | Benchmark AWQ or GPTQ | Candidate artifacts for specialized GPU kernels |
| 24 GB GPU, model is too large | GGUF with partial offload | Lets system RAM absorb the overflow |
| CPU or Apple Silicon laptop | GGUF | Common local-runtime artifact path |
| Datacenter accelerator with native FP8 path | FP8 + KV-cache tuning | Better quality/memory tradeoff than jumping straight to INT4 |
Practical rule of thumb: if the whole model fits on GPU, benchmark an AWQ or GPTQ path supported by your runtime. If it does not, benchmark a GGUF partial-offload path.
By the end of this lesson, you should be able to:
If 4-bit compression would let the full checkpoint stay resident on GPU, benchmark a supported AWQ or GPTQ kernel path. If the model still will not fit after low-bit packing, evaluate GGUF because partial offload is now the main constraint.
The quantizer protected the wrong directions. GPTQ uses calibration activations to estimate which output errors matter, and AWQ uses them to find activation-sensitive channels. If deployment traffic is French retail support, English Wikipedia gives the wrong activation distribution.
One global scale forces every weight to share the outlier's range. Per-channel or per-group scaling lets smaller neighborhoods keep their own range, which preserves more local precision while adding moderate metadata overhead.
Look at the KV cache. Weight quantization shrinks static parameters, but the KV cache grows with context length and concurrent sessions. Once weights fit, cache compression or shorter context budgets often become the next bottleneck.
FP8 is a milder compression step than INT4 while still reducing memory pressure. If the accelerator has native FP8 kernels, benchmark it as a candidate before accepting the larger compression risk of INT4.
You now understand how to shrink model weights for production. You can calculate memory footprints, choose between GPTQ, AWQ, and GGUF based on hardware constraints, and evaluate whether a quantized model is good enough for your task. These decisions are the foundation of cost-effective LLM serving.
The next challenge is local deployment. Even a quantized model needs the right runtime, hardware plan, container, health check, and evaluation gate. In Local LLM Deployment, you'll turn compression choices into a repeatable serving plan.
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.
Xiao, G., et al. · 2023 · ICML 2023
llama.cpp: Inference of LLaMA model in pure C/C++
Gerganov, G. · 2023
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.
Frantar, E., et al. · 2023 · ICLR 2023
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
Raffel, C., et al. · 2020 · JMLR
GPTQ
Hugging Face · 2026
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.
Lin, J., et al. · 2023 · MLSys 2024
AWQ
Hugging Face · 2026
FP8 Formats for Deep Learning.
Micikevicius, P., et al. · 2022
Measuring Massive Multitask Language Understanding (MMLU).
Hendrycks, D., et al. · 2021 · ICLR 2021
Training Verifiers to Solve Math Word Problems (GSM8K).
Cobbe, K., et al. · 2021
Evaluating Large Language Models Trained on Code (HumanEval).
Chen, M., et al. · 2021 · arXiv preprint