Model Quantization: GPTQ, AWQ & GGUF

Understand how GPTQ, AWQ, and GGUF trade off accuracy, memory footprint, and portability when serving LLMs on GPUs or local hardware.


Model parallelism splits a large model across several accelerators. Quantization asks a different question: what if each weight used fewer bytes before you split anything?

Model quantization stores selected tensors with fewer bits, so large models need less memory and often less memory bandwidth. This chapter covers three names you will meet constantly: GPTQ, a Hessian-aware post-training quantization method; AWQ (Activation-Aware Weight Quantization); and GGUF, a portable container format for local inference. Each is examined through the serving tradeoff: what footprint you reduce, how quality can move, and which runtime can use the artifact.

Imagine you run the AI team for an online marketplace. A 70-billion-parameter model needs about 140 GB of weight storage in FP16 or BF16, at 2 bytes per parameter. That exceeds one 80 GB accelerator before you account for the KV cache or runtime buffers. You need to test whether a lower-bit artifact fits while preserving the task quality you require.

Quantization is that technique. It works like replacing a precise ruler with a coarser grid: the storage reduction is predictable, while the error depends on which values the grid distorts. Instead of storing every model weight as a high-precision floating-point number, a weight-only method packs approximate low-bit values plus metadata. Some models and tasks tolerate that approximation well; others regress enough that the artifact must be rejected.

By this point in the curriculum you already know how a Transformer turns tokens into predictions. The next question is how to fit the resulting model onto real hardware. LLM inference is often memory-bandwidth bound, which means the accelerator spends much of its time streaming weights from memory instead of saturating the math units. Quantization helps because 4-bit weights cut raw weight traffic to about one quarter of FP16. Real speedups are smaller than 4x because kernels still have to dequantize and accumulate in higher precision, but the memory savings are real. Shrinking weights from about 140 GB to about 35 GB can be the difference between "needs multiple datacenter GPUs" and "fits on a workstation or mixed CPU/GPU local setup." This article breaks down GPTQ, AWQ, and GGUF: what problem each one solves, and when to use each.


What is quantization?

Think of it like rounding numbers on a ruler. A precise ruler has millimeter markings, so you can measure 3.7 cm, 4.2 cm, 12.1 cm. A coarse ruler only has centimeter markings, so those same values become 4 cm, 4 cm, 12 cm. You lose some precision, but the ruler is simpler and cheaper. Quantization does this to every number in a neural network.

A tiny worked example

Before the general formula, walk through one weight by hand. Suppose a weight has the value $w = 0.73$ and you choose a scale of $s = 0.1$. That means each integer step represents $0.1$ in the original space.

  1. Divide: $0.73 / 0.1 = 7.3$
  2. Round to the nearest integer: $7$
  3. Store the integer $q = 7$

During inference you reverse the process:

$$\hat{w} = s \cdot q = 0.1 \cdot 7 = 0.7$$

The reconstructed value is $0.7$, not $0.73$. The error is tiny ($0.03$), and when this happens across billions of weights the model usually stays useful. If you had chosen a coarser scale of $s = 0.5$, the same weight would become $q = \text{round}(0.73 / 0.5) = 1$ and the reconstructed value would be $0.5$, which is a much larger error ($0.23$). The art of quantization is choosing the right scale so the errors stay small where they matter most.
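The hand calculation above can be checked with a few lines of Python. This is a minimal sketch of scale-only rounding; the helper names `quantize` and `dequantize` are illustrative and not taken from any library.

```python
def quantize(weight: float, scale: float) -> int:
    """Express the weight in integer steps of size `scale`."""
    return round(weight / scale)


def dequantize(code: int, scale: float) -> float:
    """Map the stored integer back to the original value range."""
    return code * scale


weight = 0.73
for scale in (0.1, 0.5):
    code = quantize(weight, scale)
    restored = dequantize(code, scale)
    error = abs(weight - restored)
    print(f"scale={scale}: q={code}, restored={restored:.2f}, error={error:.2f}")
```

The finer scale keeps the error at about 0.03, while the coarser scale pushes it to about 0.23, which matches the numbers worked out by hand.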

At the simplest level, quantization maps a floating-point weight to a smaller integer range using a scale and, in some schemes, a zero point:

$$q = \text{clip}\left(\text{round}\left(\frac{w}{s}\right) + z,\ q_{\min},\ q_{\max}\right)$$

The quantization formula

Take the original weight $w$, divide by the scale $s$ to express it in "integer-sized steps," round to the nearest integer, shift by the zero point $z$, and clamp to the representable range. That's how a high-precision weight becomes an INT8 or INT4 value.

  • $w$ is the original floating-point weight
  • $q$ is the stored integer
  • $s$ is the scale factor
  • $z$ is the zero point
  • $q_{\min}, q_{\max}$ are the integer limits (for example, 0 to 15 for unsigned 4-bit)

Dequantization reverses the process during inference:

$$\hat{w} = s \cdot (q - z)$$

Reading the formula

Subtract the zero point from the stored integer, then multiply by the scale. The result $\hat{w}$ is only an approximation of the original weight because rounding already threw away information.

symmetric-int4-roundtrip.py
    weights = [0.15, -1.22, 2.40, -0.45]
    qmax = 7
    scale = max(abs(weight) for weight in weights) / qmax
    quantized = [max(-qmax, min(qmax, round(weight / scale))) for weight in weights]
    restored = [value * scale for value in quantized]
    mean_error = sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)

    print(f"scale: {scale:.5f}")
    print(f"INT4 values: {quantized}")
    print(f"reconstructed: {[round(value, 3) for value in restored]}")
    print(f"mean absolute error: {mean_error:.3f}")

Output

    scale: 0.34286
    INT4 values: [0, -4, 7, -1]
    reconstructed: [0.0, -1.371, 2.4, -0.343]
    mean absolute error: 0.102

Symmetric vs. asymmetric

Imagine a warehouse scale that stores package-weight readings using only 16 possible markings. Two strategies are available:

  • Symmetric: center the 16 markings around zero, such as -8 to +7. This is simple and hardware-friendly, but it wastes range if the values are skewed.
  • Asymmetric: move the markings to the actual range you observed. This uses the integer range more efficiently, but it requires the extra zero-point offset.

More formally:

  1. Symmetric quantization: uses z=0z = 0z=0 and maps weights around zero. This is common for weight-only LLM kernels because the math is simpler.
  2. Asymmetric quantization: uses a non-zero zzz to better cover distributions that are not centered at zero. This is common for activations and some weight formats.

The key idea: most LLM weight tensors are roughly zero-centered, so symmetric or near-symmetric per-group quantization often works well for weights. Activations are harder because a few channels can contain very large outliers. Techniques like SmoothQuant[1] make activation quantization easier by shifting some of that difficulty into the weights.

asymmetric-activation-range.py
    activations = [0.0, 1.2, 1.8, 2.6, 3.0]
    signed_qmax = 7
    unsigned_qmax = 15

    symmetric_scale = max(activations) / signed_qmax
    asymmetric_scale = (max(activations) - min(activations)) / unsigned_qmax

    print(f"symmetric signed step: {symmetric_scale:.3f}")
    print(f"asymmetric unsigned step: {asymmetric_scale:.3f}")
    print("A non-negative activation range can use more 4-bit levels asymmetrically.")

Output

    symmetric signed step: 0.429
    asymmetric unsigned step: 0.200
    A non-negative activation range can use more 4-bit levels asymmetrically.
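To see the zero point at work, here is a minimal sketch of a full asymmetric round trip that follows the clip/round and dequantization formulas from earlier in this section. The helper names and the sample values are illustrative assumptions; the range is deliberately shifted below zero so the zero point is non-zero.

```python
def asymmetric_params(values, qmin=0, qmax=15):
    """Derive scale and zero point from the observed min/max range."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = qmin - round(lo / scale)  # integer code that represents 0.0
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=0, qmax=15):
    """q = clip(round(w / s) + z, qmin, qmax)"""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """w_hat = s * (q - z)"""
    return [scale * (q - zero_point) for q in codes]


values = [-0.4, 0.25, 1.07, 2.6]
scale, zero_point = asymmetric_params(values)
codes = quantize(values, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(values, restored))

print(f"scale={scale:.3f}, zero_point={zero_point}")
print(f"codes: {codes}")
print(f"restored: {[round(v, 3) for v in restored]}")
print(f"max absolute error: {max_error:.3f}")
```

The smallest value lands on code 0 and the largest on code 15, so all 16 levels cover the observed range, and the worst-case rounding error stays within half a step.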

The figure below shows the basic quantization pipeline. Start with high-precision weights, estimate the statistics needed to compute scales, then pack the results into a low-bit representation plus metadata.

Figure: Quantization pipeline: high-precision weights become calibrated low-bit values plus the metadata an inference kernel needs.
Quantized weights still need metadata. Scales, group size, tensor layout, and kernel packing explain why real artifacts are larger than ideal bit math.

Memory savings

The most obvious benefit of quantization is memory reduction. That directly translates to lower serving cost, larger batch sizes, and the ability to fit bigger models on smaller devices.

The table below is weights only. Actual runtime memory is higher because you still pay for activations, the KV cache, scale metadata, and framework overhead.

Figure: Bar chart comparing 70B model weight and artifact footprints: FP16 at 140 GB, INT8 at 70 GB, ideal INT4 at 35 GB, packed Q4 at about 43 GB, and ideal INT3 at 26 GB.
Ideal bit math is the starting point. Packed Q4 is illustrative: artifact size varies with quant type, metadata, tensor layout, alignment, and mixed tensor choices.

| Precision | Bits/Weight | 7B Weights | 70B Weights |
|---|---|---|---|
| FP16 / BF16 | 16 | ~14 GB | ~140 GB |
| INT8 / FP8 | 8 | ~7 GB | ~70 GB |
| INT4 | 4 | ~3.5 GB | ~35 GB |
| INT3 | 3 | ~2.6 GB | ~26 GB |

Those numbers are idealized weight math. Real packed formats are larger when they also store per-group scales, alignment, or tensors kept at higher precision. For a GGUF artifact, inspect the selected quantization type and actual file/runtime footprint rather than assuming the ideal 35 GB number.[2]
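One quick way to compare an artifact against the ideal math is to convert its size back into effective bits per weight. The sketch below uses the illustrative 43 GB packed Q4 figure from the chart; the helper name is hypothetical, and a real check should use the measured file size of your artifact.

```python
def effective_bits_per_weight(artifact_bytes: float, parameters: float) -> float:
    """Average stored bits per parameter, including metadata and mixed-precision tensors."""
    return artifact_bytes * 8 / parameters


parameters = 70e9
ideal_int4 = effective_bits_per_weight(35e9, parameters)
packed_q4 = effective_bits_per_weight(43e9, parameters)  # illustrative size, not a measurement

print(f"ideal INT4: {ideal_int4:.2f} bits/weight")
print(f"illustrative packed Q4: {packed_q4:.2f} bits/weight")
```

An artifact labeled "4-bit" that averages closer to 5 bits per weight is normal: the difference is scales, alignment, and tensors kept at higher precision.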

Because LLM generation is often weight-bandwidth bound, shrinking the weights can also speed up inference. The raw bandwidth demand drops almost linearly with bit width, although the observed throughput gain depends on the kernel, batch size, and how much extra work the runtime does to unpack the weights.

group-scale-metadata.py
    parameters = 70_000_000_000
    bits_per_weight = 4
    group_size = 128
    bytes_per_scale = 2

    ideal_weight_gb = parameters * bits_per_weight / 8 / 1_000_000_000
    scale_metadata_gb = parameters / group_size * bytes_per_scale / 1_000_000_000

    print(f"ideal INT4 weights: {ideal_weight_gb:.2f} GB")
    print(f"one FP16 scale per {group_size} weights: {scale_metadata_gb:.2f} GB")
    print("Alignment and mixed-precision tensors can add more.")

Output

    ideal INT4 weights: 35.00 GB
    one FP16 scale per 128 weights: 1.09 GB
    Alignment and mixed-precision tensors can add more.

Sizing exercise

Try this before moving on. You have a workstation with one NVIDIA RTX 4060 (8 GB VRAM) and you want to run Llama 3 70B. The model has roughly 70 billion parameters.

  1. How much VRAM would the model need in FP16?
  2. How much would it need at INT4?
  3. Can you run it on the 4060?

Use this tiny calculator to check the arithmetic without any framework overhead:

quantization-sizing.py
    def weight_gb(parameters_billion: float, bits_per_weight: int) -> float:
        return parameters_billion * bits_per_weight / 8

    params = 70
    for bits in (16, 8, 4, 3):
        print(f"70B at {bits:>2}-bit weights: {weight_gb(params, bits):5.1f} GB")

    gpu_vram_gb = 8
    usable_vram_gb = gpu_vram_gb * 0.8
    print(f"8 GB GPU with 20% reserve: {usable_vram_gb:.1f} GB usable")

Output

    70B at 16-bit weights: 140.0 GB
    70B at  8-bit weights:  70.0 GB
    70B at  4-bit weights:  35.0 GB
    70B at  3-bit weights:  26.2 GB
    8 GB GPU with 20% reserve: 6.4 GB usable

Solution

  1. FP16: $70 \times 2 = 140$ GB. The model doesn't fit.
  2. INT4: $70 \times 0.5 = 35$ GB. The model still doesn't fit on an 8 GB card.
  3. You can't run the full model on the GPU alone. You need GGUF with heavy CPU offloading, or a larger GPU, or a smaller model. The INT4 size is a huge improvement, but it isn't magic.

This is exactly the calculation you should do before choosing a quantization strategy. The formula is simple: parameters $\times$ bytes per parameter = weight footprint. Remember that runtime memory is higher because of activations, KV cache, and metadata.


Weight-only vs. weight-activation quantization

Two broad approaches to quantization exist, and confusing them causes a lot of interview mistakes.

Weight-only quantization is what GPTQ and AWQ target. The stored weights are low precision, while optimized kernels unpack or dequantize low-bit values during computation and accumulate with higher-precision activations or accumulators. This is why you often see notation like W4A16: 4-bit stored weights, 16-bit activations.

Weight-activation quantization pushes both sides of the matmul down, for example W8A8 or W4A8. This can be faster and more memory-efficient, but it's harder because activation distributions are spikier and less stable than weight distributions. Techniques like SmoothQuant[1] exist specifically to make weight-activation quantization practical.

GPTQ and AWQ are weight-only methods. They capture large weight-memory savings without requiring equally low-bit activations. Lower-bit activation paths such as W4A4 need separate hardware, kernel, and quality validation.

Common mistake: Candidates often claim W4A16 quantization gives a 4x end-to-end speedup. It does not guarantee that. Low-bit weights reduce raw weight traffic, but kernel overhead and non-weight work remain, and the workload may not be bandwidth-bound. The guaranteed first-order gain is smaller stored weights, not a fixed tokens-per-second multiplier.

bandwidth-saving-is-not-speedup.py
    bandwidth_gb_s = 1_000
    fp16_weight_gb = 14.0
    int4_weight_gb = 3.5
    other_work_ms = 3.0

    fp16_ms = fp16_weight_gb / bandwidth_gb_s * 1000 + other_work_ms
    int4_ms = int4_weight_gb / bandwidth_gb_s * 1000 + other_work_ms

    print(f"raw weight traffic reduction: {fp16_weight_gb / int4_weight_gb:.1f}x")
    print(f"illustrative step speedup with fixed overhead: {fp16_ms / int4_ms:.2f}x")

Output

    raw weight traffic reduction: 4.0x
    illustrative step speedup with fixed overhead: 2.62x

GPTQ (post-training quantization)

The intuition: balancing errors like a checkbook

Imagine a layer has only two weights for simplicity: $w_1 = 1.2$ and $w_2 = 0.8$. A naive quantizer might round both toward the nearest integer, turning them into $1.0$ and $1.0$. The first weight lost $0.2$, the second gained $0.2$. The total output change depends on how the layer uses those weights.

If the calibration data shows that $w_1$ is multiplied by large activations and $w_2$ by small ones, the $-0.2$ error on $w_1$ hurts the output far more than the $+0.2$ error on $w_2$. GPTQ notices this through the Hessian approximation and compensates: it might quantize $w_1$ more carefully, or adjust $w_2$ in the opposite direction to cancel some of the damage. The goal isn't to make every weight close to its original value. The goal is to keep the layer's output on real data as close as possible to the original output.

Algorithm: minimize output error with curvature information

GPTQ works like carefully removing blocks from a Jenga tower. When you remove one block (round one weight), the tower shifts. A careless player ignores the shift and keeps pulling until the tower collapses. GPTQ compensates for the shift. After quantizing one set of weights, it updates the remaining floating-point weights so the layer output changes as little as possible.

GPTQ[3] is a one-shot post-training quantization method based on approximate second-order information. It builds on the layer-wise Optimal Brain Quantization (OBQ) solver, which quantizes weights one at a time and updates the remaining weights to minimize the layer's output error. GPTQ makes that idea fast enough for billion-parameter models by quantizing weights in a fixed order and using lazy batched updates. The key objective isn't "make the quantized weights numerically close to the originals." It's "make the layer output on real activations stay close to the original output." For a weight row $w$ and calibration activations $X$, GPTQ approximates:

$$
\begin{aligned}
\hat{w} &= \arg\min_{\tilde{w} \in \mathcal{Q}} \|wX - \tilde{w}X\|_2^2 \\
&\approx \arg\min_{\tilde{w} \in \mathcal{Q}} (w - \tilde{w})\, H\, (w - \tilde{w})^T \\
H &\approx XX^T
\end{aligned}
$$

Reading the formula

The Hessian approximation $H$ tells GPTQ which input directions matter most on the calibration set. A small error on an unimportant direction is cheap. The same numeric error on a frequently used direction is expensive. That's why GPTQ usually beats plain round-to-nearest quantization at the same bit width.

In practice, GPTQ looks like this:

  1. Collect representative activations from a calibration set such as C4[4].
  2. Approximate $H \approx XX^T$ for each linear layer.
  3. Quantize the weights sequentially while using an approximate inverse Hessian to compensate the remaining floating-point weights.
  4. Pack the result into a low-bit format that an inference kernel can consume efficiently.
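Step 3 is the heart of the method, so it helps to see it on a toy problem. The sketch below is not the GPTQ implementation: it applies an OBQ-style compensation update to a single two-weight row, uses an integer grid, inverts the 2x2 Hessian in closed form, and skips the damping, Cholesky factorization, and batched updates a real implementation would use. The weights, calibration matrix, and helper names are all hand-picked assumptions.

```python
def round_to_grid(value: float, step: float = 1.0) -> float:
    return round(value / step) * step


def hessian(X):
    """H = X X^T, where X[i] holds the calibration samples for input feature i."""
    d = len(X)
    return [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(d)] for i in range(d)]


def inverse_2x2(H):
    (a, b), (c, d) = H
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def output_error(w, w_hat, X):
    """Squared layer-output error ||wX - w_hat X||^2 over the calibration samples."""
    total = 0.0
    for n in range(len(X[0])):
        diff = sum((w[i] - w_hat[i]) * X[i][n] for i in range(len(w)))
        total += diff * diff
    return total


def quantize_with_compensation(w, X):
    """Quantize w[0], then shift the still-float w[1] to absorb the error before rounding it."""
    h_inv = inverse_2x2(hessian(X))
    q0 = round_to_grid(w[0])
    adjusted_w1 = w[1] - (w[0] - q0) / h_inv[0][0] * h_inv[0][1]
    return [q0, round_to_grid(adjusted_w1)]


X = [[1.0, 1.0], [1.0, 0.8]]  # two strongly correlated input features, two samples
w = [1.4, 0.4]

nearest = [round_to_grid(v) for v in w]
compensated = quantize_with_compensation(w, X)
nearest_error = output_error(w, nearest, X)
compensated_error = output_error(w, compensated, X)

print(f"round-to-nearest: {nearest}, output error {nearest_error:.4f}")
print(f"with compensation: {compensated}, output error {compensated_error:.4f}")
```

In this example the compensated row rounds the second weight up instead of down. That moves it further from its original value, yet the layer output error drops sharply because the two rounding errors now cancel on correlated inputs.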

The original paper reports quantizing 175B-class models (OPT-175B and BLOOM-176B) in about four GPU-hours while preserving strong accuracy at 3-bit and 4-bit settings.[3]

gptq-curvature-proxy.py
    rounding_errors = [0.20, 0.20]
    curvature_proxy = [25.0, 1.0]
    weighted_cost = [h * error**2 for h, error in zip(curvature_proxy, rounding_errors)]

    print(f"same absolute errors: {rounding_errors}")
    print(f"curvature-weighted costs: {weighted_cost}")
    print(f"first direction costs {weighted_cost[0] / weighted_cost[1]:.0f}x more to distort")

Output

    same absolute errors: [0.2, 0.2]
    curvature-weighted costs: [1.0000000000000002, 0.04000000000000001]
    first direction costs 25x more to distort

The configuration sketch below shows the shape of a transformers GPTQ workflow. Backend packages and supported arguments change over time, so verify current library documentation and evaluate the resulting artifact before deployment.[5]

reading-the-formula.py
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    model_id = "facebook/opt-125m"
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    gptq_config = GPTQConfig(
        bits=4,
        dataset="c4",
        tokenizer=tokenizer,
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=gptq_config,
    )

    # If quantized with device_map="auto", gather the model onto one device before saving.
    model.to("cpu")
    model.save_pretrained("opt-125m-gptq")
    tokenizer.save_pretrained("opt-125m-gptq")

Granularity: per-tensor, per-channel, per-group

A naive quantizer uses one scale for the entire tensor. This is per-tensor quantization. It's cheap, but one large outlier can ruin the precision of everything else.

Modern LLM quantizers almost always use finer granularity:

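| Granularity | What Gets Its Own Scale | Accuracy | Metadata Overhead |
|---|---|---|---|
| Per-tensor | Entire tensor | Lowest | Lowest |
| Per-channel | One output channel / row | Better | Low (one scale per row) |
| Per-group | Small group of weights, often 64 or 128 | Usually best; common 4-bit choice | Highest (one scale per group) |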

Per-group quantization is the common compromise for 4-bit LLM inference. Smaller groups usually improve fidelity, but they also require storing more scale metadata and may reduce kernel efficiency.
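A small experiment makes the outlier problem visible. This sketch reuses symmetric INT4 rounding and only changes how many weights share a scale; the weight values are hand-picked to include one outlier, and `quantize_groups` is an illustrative helper, not a library function.

```python
def quantize_groups(weights, group_size, qmax=7):
    """Symmetric round trip where every `group_size` consecutive weights share one scale."""
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0  # guard against an all-zero group
        restored.extend(max(-qmax, min(qmax, round(w / scale))) * scale for w in group)
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


weights = [0.1, -0.2, 0.15, 0.05, 8.0, -0.1, 0.2, -0.05]  # 8.0 is the outlier

per_tensor_error = mean_abs_error(weights, quantize_groups(weights, len(weights)))
per_group_error = mean_abs_error(weights, quantize_groups(weights, 4))

print(f"per-tensor (1 scale):  mean abs error {per_tensor_error:.4f}")
print(f"per-group  (2 scales): mean abs error {per_group_error:.4f}")
```

With one shared scale, the outlier stretches the grid so far that every small weight rounds to zero. Splitting into two groups confines the damage to the group that contains the outlier, at the cost of storing a second scale.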


AWQ (activation-aware weight quantization)

Not all weights are equally dangerous to quantize

Imagine renovating a building and needing to remove material wherever possible. Most walls can be thinned or moved with little impact. A few are load-bearing. Disturb those and the whole structure becomes unstable. AWQ tries to identify the "load-bearing" weight channels.

AWQ[6] starts from the observation that activation magnitudes are not evenly distributed. A small fraction of channels carry disproportionately large activations. If the corresponding weight columns are quantized poorly, the downstream matmul error gets amplified. The AWQ paper reports that protecting only about 1% of salient weights can greatly reduce quantization error.[6]

awq-salient-channel-proxy.py
    activations = [100.0, 0.1]
    naive_weight_errors = [0.10, 0.10]
    protected_weight_errors = [0.02, 0.10]

    naive_output_error = sum(a * e for a, e in zip(activations, naive_weight_errors))
    protected_output_error = sum(a * e for a, e in zip(activations, protected_weight_errors))

    print(f"naive output-error proxy: {naive_output_error:.2f}")
    print(f"protect high-activation channel: {protected_output_error:.2f}")

Output

    naive output-error proxy: 10.01
    protect high-activation channel: 2.01

Why protecting a few weights matters

Imagine the same two-weight layer, but now the activation pattern is extreme: $w_1$ is almost always multiplied by $100$, while $w_2$ is multiplied by $0.1$. A $0.1$ rounding error on $w_1$ becomes a $10$ unit output error, while the same error on $w_2$ becomes only $0.01$ units. AWQ identifies these "high-traffic" channels and rescales them so the quantizer spends more of its limited integer range on the weights that matter most.

Algorithm

AWQ doesn't rebuild the entire weight matrix the way GPTQ does. Instead, it uses an equivalent rescaling trick:

$$W x = W \cdot \mathrm{diag}(s) \cdot \mathrm{diag}(s)^{-1} x \approx Q\!\left(W \cdot \mathrm{diag}(s)\right) \cdot \mathrm{diag}(s)^{-1} x$$

Reading the formula

Multiply important weight columns by a scaling vector $s$ before quantization so they occupy more of the available integer range. Then divide the corresponding activation channels by the same factor. The floating-point computation stays equivalent, but the quantizer now spends more precision on the columns that matter most.

The practical workflow is:

  1. Run representative inputs through the model and collect activation statistics.
  2. Identify salient channels with unusually large activation magnitude.
  3. Search for scaling factors that reduce the quantization error on those channels.
  4. Quantize the rescaled weights into a hardware-friendly 4-bit format.
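The rescaling identity and steps 2 to 4 can be sketched on a single weight row. This is a toy illustration, not the AWQ implementation: the scale factor of 2 for the salient channel is fixed by hand rather than searched, the row is treated as one quantization group, and all values and helper names are assumptions.

```python
def quantize_row(row, qmax=7):
    """Symmetric INT4 round trip with one scale for the whole row."""
    scale = max(abs(w) for w in row) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.19, -0.9, 0.5, 0.1]
activations = [100.0, 0.1, 0.1, 0.1]   # channel 0 carries the large activation
channel_scales = [2.0, 1.0, 1.0, 1.0]  # hand-picked; AWQ searches for these

scaled_weights = [w * s for w, s in zip(weights, channel_scales)]
scaled_activations = [x / s for x, s in zip(activations, channel_scales)]

reference = dot(weights, activations)
equivalent = dot(scaled_weights, scaled_activations)  # same result before quantization
naive_error = abs(reference - dot(quantize_row(weights), activations))
rescaled_error = abs(reference - dot(quantize_row(scaled_weights), scaled_activations))

print(f"float output: {reference:.3f}, rescaled float output: {equivalent:.3f}")
print(f"output error, plain quantization: {naive_error:.3f}")
print(f"output error, salient channel rescaled: {rescaled_error:.3f}")
```

Because the salient weight is not the largest in the row, scaling it up leaves the shared quantization step unchanged, and dividing the activation by the same factor shrinks that weight's effective rounding error. Real checkpoints behave less cleanly, which is why AWQ searches over scaling factors on calibration data.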

AWQ artifacts are typically produced offline and loaded by a serving runtime that understands the artifact's quantization metadata, such as group size and zero-point policy. Loader and kernel support varies by runtime version, so verify the chosen artifact/runtime pair before benchmarking.[7]

Compared with GPTQ, AWQ is lighter-weight because it avoids GPTQ's reconstruction step. It often works especially well on instruction-tuned checkpoints, but the final speed and latency picture still depends on the runtime and kernel implementation.[6][7]

Figure: Comparison of GPTQ and AWQ mechanics: GPTQ compensates reconstruction error while AWQ protects activation-sensitive channels before packing 4-bit weights.
GPTQ and AWQ both compress weights, but protect different signals. GPTQ targets reconstructed layer output; AWQ protects activation-sensitive channels.


GGUF (llama.cpp format)

What is GGUF?

GGUF is a file format, not a quantization algorithm. It's the container format used by the ggml / llama.cpp ecosystem for local inference.[2]

The key idea: GPTQ and AWQ mainly answer the question "How should I quantize the weights?" GGUF answers a different question: "How should I package model tensors and metadata so local runtimes can load and run them efficiently?"

GGUF matters because it bundles the tensors with the metadata needed to run them:

  • tokenizer and vocabulary information
  • architecture metadata and tensor shapes
  • tensor-by-tensor quantization types inside one portable file
  • compatibility with CPU inference and partial GPU offload in local runtimes

That last point is why GGUF is so important for local LLMs. If the full model doesn't fit in VRAM, llama.cpp-style runtimes can keep some layers on the GPU and spill the rest to system memory.
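To make "file format, not algorithm" concrete, the sketch below parses the fixed-size start of a GGUF file: a 4-byte magic, a 32-bit version, and two 64-bit counts for tensors and metadata key-value pairs, all little-endian. That layout reflects the GGUF specification for versions 2 and later as far as we are aware, so confirm it against the specification for your llama.cpp revision. The header bytes here are synthetic, built in memory for the demo, and `read_gguf_header` is a hypothetical helper.

```python
import io
import struct


def read_gguf_header(stream):
    """Read the fixed GGUF header fields that precede the metadata key-value section."""
    magic = stream.read(4)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    (version,) = struct.unpack("<I", stream.read(4))
    tensor_count, metadata_kv_count = struct.unpack("<QQ", stream.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Synthetic header for illustration only; the counts are made up.
fake_file = io.BytesIO(b"GGUF" + struct.pack("<IQQ", 3, 291, 24))
header = read_gguf_header(fake_file)
print(header)
```

Everything a runtime needs to interpret the tensors, including each tensor's quantization type, follows this header inside the same file, which is what makes a single GGUF artifact portable.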

Figure: GGUF packaging and partial offload: a portable file stores tensors and metadata, then a local runtime places some layers in GPU VRAM and the rest in system RAM.
GGUF is a local runtime package. It can hold mixed quantized tensor types and enough metadata for a runtime to place layers across GPU VRAM and system RAM.

gguf-partial-offload-budget.py
    artifact_gib = 5.0
    layers = 32
    usable_gpu_gib = 4.0
    runtime_reserve_gib = 0.5

    layer_budget_gib = usable_gpu_gib - runtime_reserve_gib
    gpu_layers = int(layer_budget_gib / (artifact_gib / layers))

    print(f"GPU budget for model layers: {layer_budget_gib:.1f} GiB")
    print(f"even-size approximation: {gpu_layers}/{layers} layers fit on GPU")
    print("Measure real tensor placement and KV memory in the chosen runtime.")

Output

    GPU budget for model layers: 3.5 GiB
    even-size approximation: 22/32 layers fit on GPU
    Measure real tensor placement and KV memory in the chosen runtime.

Quantization families inside GGUF

GGUF can store several quantization families. The format doesn't force one specific quantizer.

| Family | Example | Extra Calibration | Typical Use |
|---|---|---|---|
| Legacy block quantization | Q4_0, Q5_0 | No | Simple and widely supported |
| K-quants | Q4_K_M, Q5_K_M | No | Common local default for size/quality |
| IQ / iMatrix-aware formats | IQ4_XS, IQ3_M | Usually yes | Better quality when squeezing below comfortable 4-bit settings |

Q4_K_M is a commonly encountered local-inference candidate, but the right choice depends on target model, quality gate, and hardware. If the full model fits in VRAM, benchmark GPU-oriented GPTQ or AWQ artifacts against the local runtime; if partial offload is required, GGUF is a useful packaging option.

iMatrix quantization

Importance-matrix quantization uses representative text to estimate which directions are expensive to distort. That extra signal lets IQ formats spend precision where it buys the most quality. Conceptually, it fills the same role as calibration in GPTQ and AWQ: representative data tells the quantizer what errors matter most.

The commands below illustrate a llama.cpp-style conversion and quantization path. Binary names and supported quant types can change, so check the installed revision's documentation before running it.[2]

terminal
    # 1) Convert a Hugging Face checkpoint to GGUF
    python3 convert_hf_to_gguf.py ./Meta-Llama-3.1-8B-Instruct \
      --outtype f16 \
      --outfile llama-3.1-8b-f16.gguf

    # 2) If needed: build an importance matrix from representative text
    llama-imatrix \
      -m llama-3.1-8b-f16.gguf \
      -f calibration.txt \
      -o llama-3.1.imatrix.dat

    # 3a) Common default without iMatrix
    llama-quantize \
      llama-3.1-8b-f16.gguf \
      llama-3.1-8b-Q4_K_M.gguf \
      Q4_K_M

    # 3b) Importance-aware quantization
    llama-quantize \
      --imatrix llama-3.1.imatrix.dat \
      llama-3.1-8b-f16.gguf \
      llama-3.1-8b-IQ4_XS.gguf \
      IQ4_XS


Beyond weight-only: FP8 and KV cache quantization

GPTQ, AWQ, and GGUF all shrink the model weights. Once the weights are small, the next bottleneck is often the KV cache, the attention state that grows with sequence length.

FP8 sits adjacent to the big three rather than replacing them. It's an 8-bit floating-point format that becomes attractive when the serving hardware has native FP8 kernels.[8]

FP8 has two common encodings:[8]

  • E4M3: more mantissa precision, less dynamic range
  • E5M2: less mantissa precision, more dynamic range
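The range-versus-precision tradeoff between the two encodings can be checked with a few lines of arithmetic. The sketch below assumes the conventions described in the FP8 formats paper: E5M2 reserves the all-ones exponent for infinities and NaN, while E4M3 reserves only a single NaN bit pattern. The function name `fp8_summary` is made up for this example.

```python
# Sketch: largest finite value, smallest normal value, and mantissa step
# for the two FP8 encodings, under the conventions stated in the lead-in.

def fp8_summary(exp_bits, man_bits, ieee_style_specials):
    bias = 2 ** (exp_bits - 1) - 1
    top_exp_field = 2 ** exp_bits - 1
    if ieee_style_specials:
        # E5M2: the all-ones exponent is reserved for inf/NaN.
        max_exp_field = top_exp_field - 1
        max_mantissa = 2 ** man_bits - 1
    else:
        # E4M3: only all-ones exponent plus all-ones mantissa encodes NaN.
        max_exp_field = top_exp_field
        max_mantissa = 2 ** man_bits - 2
    max_finite = (1 + max_mantissa / 2 ** man_bits) * 2.0 ** (max_exp_field - bias)
    min_normal = 2.0 ** (1 - bias)
    relative_step = 2.0 ** -man_bits  # spacing between neighbors within one binade
    return max_finite, min_normal, relative_step

for name, args in [("E4M3", (4, 3, False)), ("E5M2", (5, 2, True))]:
    max_finite, min_normal, step = fp8_summary(*args)
    print(f"{name}: max={max_finite}, min normal={min_normal}, relative step={step}")
```

E4M3 tops out at 448 with a relative step of 1/8, while E5M2 reaches 57344 with a coarser step of 1/4, which is the tradeoff the two bullets describe.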

Unlike 4-bit weight-only methods, FP8 is usually chosen when you want a milder accuracy-memory tradeoff and the accelerator is built to exploit FP8 directly.

A newer branch of the same idea is 4-bit floating point. MXFP4 and NVFP4 are block-scaled FP4 formats that pair tiny FP4 values with a shared per-block scale: MXFP4 uses 32-element blocks with a power-of-two (E8M0) scale, while NVFP4 uses 16-element blocks with an FP8 (E4M3) scale for finer resolution. These formats don't compete with GPTQ and AWQ: GPTQ-style or AWQ-style calibration can be applied on top of an FP4 format to recover accuracy. On NVIDIA GPUs, native FP4 tensor-core acceleration requires Blackwell-generation hardware. On older GPUs, runtimes that support FP4 weights typically fall back to software dequantization kernels, which keep the memory savings but not the full compute speedup. MXFP4 has seen production use (for example in GPT-OSS releases); confirm format and hardware support against your runtime before committing to it.
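A small sketch makes the block-scaling idea concrete. It computes the amortized storage cost of the shared scale and runs a simplified MXFP4-like round trip on one block. The scale-selection rule here (the smallest power of two that keeps the block maximum on the grid) is a simplification chosen for this example, not the rule from any specification, and the helper names are hypothetical.

```python
import math

# Magnitudes representable by an E2M1 (FP4) element; the sign is a separate bit.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def effective_bits(element_bits, scale_bits, block_size):
    """Storage per weight once the shared block scale is amortized."""
    return element_bits + scale_bits / block_size

def quantize_block_pow2(block):
    """Simplified MXFP4-like round trip: power-of-two scale, nearest grid value."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 1.0, list(block)
    # Assumption: smallest power of two that keeps max_abs within the grid.
    scale = 2.0 ** math.ceil(math.log2(max_abs / E2M1_GRID[-1]))
    restored = []
    for x in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        restored.append(math.copysign(nearest * scale, x))
    return scale, restored

print(f"MXFP4 bits/weight: {effective_bits(4, 8, 32)}")  # 8-bit scale per 32 elements
print(f"NVFP4 bits/weight: {effective_bits(4, 8, 16)}")  # 8-bit scale per 16 elements

scale, restored = quantize_block_pow2([0.1, -0.4, 1.2, 3.0])
print(f"scale={scale}, restored={restored}")
```

The smaller NVFP4 block costs a quarter of a bit more per weight in exchange for a scale that tracks local magnitude more closely. The figures ignore any per-tensor metadata a runtime may add.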

Quantizing the KV cache is a separate lever. Weight quantization shrinks static model weights. KV-cache quantization shrinks attention state that grows with sequence length. Once weights fit, long-context concurrency can become limited by KV state; a runtime may then offer lower-precision KV storage as another measured tradeoff.

kv-cache-after-weight-quantization.py
weights_int4_gib = 7_000_000_000 * 0.5 / 1024**3
batch, sequence = 32, 8_192
layers, kv_heads, head_dim, kv_bytes = 32, 8, 128, 2
kv_gib = 2 * batch * sequence * layers * kv_heads * head_dim * kv_bytes / 1024**3

print(f"hypothetical 7B INT4 weights: {weights_int4_gib:.1f} GiB")
print(f"FP16 KV cache at batch={batch}, context={sequence}: {kv_gib:.1f} GiB")
print("Shrinking weights alone does not solve long-context capacity.")
Output
hypothetical 7B INT4 weights: 3.3 GiB
FP16 KV cache at batch=32, context=8192: 32.0 GiB
Shrinking weights alone does not solve long-context capacity.
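To see what lower-precision KV storage would buy for the same hypothetical shape, the sizing formula can be swept across storage widths. This sketch ignores the per-block scale metadata that a real quantized KV cache would add, so treat the results as lower bounds.

```python
def kv_cache_gib(batch, sequence, layers, kv_heads, head_dim, bytes_per_value):
    # The factor of 2 covers both keys and values.
    return 2 * batch * sequence * layers * kv_heads * head_dim * bytes_per_value / 1024**3

# Same hypothetical shape as the example above.
shape = dict(batch=32, sequence=8_192, layers=32, kv_heads=8, head_dim=128)

for label, bytes_per_value in [("FP16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    size = kv_cache_gib(**shape, bytes_per_value=bytes_per_value)
    print(f"{label} KV cache: {size:.1f} GiB")
```

Halving the storage width halves the cache, so 8-bit KV storage would cut this example from 32 GiB to 16 GiB. Whether quality survives that change is a separate measurement.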

Comparison

When choosing a quantization strategy, the real question isn't "Which one is best?" It's "What hardware constraint am I solving for?"

| Feature | GPTQ | AWQ | GGUF |
| --- | --- | --- | --- |
| Meaning | Weight-only PTQ algorithm | Weight-only PTQ algorithm | Portable file/container format |
| Core idea | Minimize layer output error with Hessian-weighted reconstruction | Protect salient weight channels using activation statistics | Store tensors + metadata + chosen ggml quantizers in one artifact |
| Calibration | Required | Required | Depends on quantizer; iMatrix uses representative data |
| Common deployment target | Fully GPU-resident serving | Fully GPU-resident serving | CPU, Apple Silicon, or mixed CPU/GPU |
| Strength | Mature second-order method | Strong 4-bit quality with hardware-friendly kernels | Single-file portability and partial GPU offload |
| Tradeoff | Offline quantization is heavier | Runtime/kernel compatibility still matters | Usually slower than specialized full-GPU kernels |

AWQ's paper reports strong 4-bit results by protecting activation-sensitive channels.[6] GPTQ remains relevant when a runtime or kernel stack supports its packed artifacts efficiently.[3] On GPU servers, benchmark the exact low-bit kernel path. For local or partial-offload deployments, benchmark the GGUF runtime and placement plan rather than assuming a format name decides performance.


Common pitfalls

Treating all quantizers as equivalent

Symptom: You choose "4-bit" from a model hub without checking whether it is GPTQ, AWQ, GGUF, or a runtime quantization path.

Cause: Bit width describes storage size, not calibration method, tensor layout, kernel support, offload behavior, or quality profile.

Fix: Name the artifact and the runtime together: "AWQ on vLLM," "GPTQ on Transformers," or "Q4_K_M GGUF on llama.cpp." Then test that exact pair.

The calibration trap

Symptom: Your quantized French customer-support model speaks gibberish, even though the English version quantized fine.

Cause: GPTQ and AWQ both rely on calibration data to understand which weights matter. If you use English Wikipedia to quantize a model trained on French retail data, the activation statistics are wrong and the quantizer throws away precision in the wrong places.

Fix: Use calibration text that matches the target domain and language, then verify quality on held-out target tasks.

Confusing weight-only and full quantization

Symptom: A design doc claims GPTQ or AWQ makes the entire model 4-bit.

Cause: GPTQ and AWQ are weight-only methods. Activations usually stay at FP16 or BF16, and accumulation happens in higher precision.

Fix: Write the precision contract explicitly. W4A16 means 4-bit stored weights and 16-bit activations, not full W4A4 inference.
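One lightweight way to keep the contract explicit is to parse the label in tooling instead of trusting a "4-bit" tag. The helper names below are hypothetical, and the weight-only test (weights narrower than activations, activations at 16 bits or more) is a convention assumed for this sketch.

```python
import re

def parse_precision_contract(label):
    """Parse labels like 'W4A16' into (weight_bits, activation_bits)."""
    match = re.fullmatch(r"W(\d+)A(\d+)", label.strip().upper())
    if match is None:
        raise ValueError(f"not a WxAy precision label: {label!r}")
    return int(match.group(1)), int(match.group(2))

def is_weight_only(label):
    """Assumed convention: low-bit weights with activations kept at 16 bits or more."""
    weight_bits, activation_bits = parse_precision_contract(label)
    return weight_bits < activation_bits and activation_bits >= 16

for label in ["W4A16", "W8A8", "W4A4"]:
    print(label, parse_precision_contract(label), "weight-only:", is_weight_only(label))
```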

The speed fallacy

Symptom: You quantize to 4-bit expecting a 4x speedup, but tokens per second barely improve.

Cause: 4-bit weights save memory bandwidth, but the kernel still has to dequantize them into higher precision before the matrix multiply. If the dequantization code path is slow or the GPU isn't memory-bound to begin with, the speedup shrinks.

Fix: Measure end-to-end tokens per second on your exact hardware and batch size. Bandwidth savings are real, but they only translate to speed when the runtime is optimized for your GPU.
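A back-of-the-envelope bound shows where the "4x" expectation comes from and why it is only a ceiling. The sketch assumes single-stream decoding that reads every weight once per token, and a hypothetical accelerator with 1,000 GB/s of memory bandwidth. It ignores KV-cache reads, dequantization cost, and compute limits, all of which pull real throughput below the bound.

```python
def decode_tokens_per_second_bound(weight_bytes, bandwidth_bytes_per_s):
    """Upper bound when each generated token must stream all weights once."""
    return bandwidth_bytes_per_s / weight_bytes

params = 7_000_000_000
bandwidth = 1_000e9  # hypothetical 1,000 GB/s accelerator

fp16_bound = decode_tokens_per_second_bound(params * 2.0, bandwidth)
int4_bound = decode_tokens_per_second_bound(params * 0.5, bandwidth)

print(f"FP16 bandwidth bound: {fp16_bound:.1f} tokens/s")
print(f"INT4 bandwidth bound: {int4_bound:.1f} tokens/s")
print(f"ideal ratio: {int4_bound / fp16_bound:.1f}x")
```

If measured throughput sits far below the INT4 bound, the gap is the cost of the kernel path, batching regime, or other traffic, not a failure of quantization itself.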

Relying on perplexity alone

Symptom: The quantized model still chats politely, but it hallucinates order statuses or generates invalid JSON for your shipping API.

Cause: Perplexity on held-out text is a fast sanity check, but a model can show only a small perplexity increase while regressing sharply on structured tasks like code generation or multi-step reasoning.

Fix: Always pair perplexity with task-specific benchmarks. For a logistics model, run your own production eval set that includes the exact output formats the model must produce.
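A minimal sketch of such a pairing computes perplexity from token log-probabilities and, separately, the share of outputs that parse as JSON with the required fields. The sample log-probabilities, outputs, and field names are invented for illustration; a real gate would use your own eval set.

```python
import json
import math

def perplexity(token_logprobs):
    """exp of the negative mean log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def json_validity_rate(outputs, required_keys):
    """Fraction of outputs that parse as a JSON object containing every required key."""
    valid = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and set(required_keys).issubset(parsed):
            valid += 1
    return valid / len(outputs)

# Hypothetical measurements from a quantized candidate.
logprobs = [math.log(0.5)] * 4
outputs = [
    '{"order_id": "A1", "status": "shipped"}',
    '{"order_id": "A2"}',
    "status: shipped",
]

ppl = perplexity(logprobs)
rate = json_validity_rate(outputs, {"order_id", "status"})
print(f"perplexity: {ppl:.2f}")
print(f"valid structured outputs: {rate:.0%}")
```

A candidate can look healthy on the first number and still fail the second, which is why the two are reported side by side.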

Confusing GGUF with the quantizer

Symptom: Someone says "we used GGUF quantization" as if that fully specifies the quality and runtime behavior.

Cause: GGUF is the container. Q4_K_M, IQ4_XS, Q5_K_M, and related tensor types describe the actual low-bit encoding inside the file.

Fix: Report both: "GGUF Q4_K_M with 20 GPU layers," "GGUF IQ4_XS with iMatrix," or another concrete artifact/runtime pairing.


How to evaluate a quantized model

Perplexity is a good first sanity check, but it is not enough. A quantized model can show a modest change in perplexity while still regressing on code generation, structured output, or domain decisions. A returns-classification model should therefore be tested on representative extraction and action-format cases, not only general text.

The GPTQ and AWQ papers both report that 4-bit weight-only quantization can preserve language-modeling quality surprisingly well on large models, while more aggressive bit widths degrade more sharply.[3][6]

Three-gate approval workflow for quantized models: broad language drift, task regressions, and serving performance all must pass before deployment.
A strong quantization job checks quality and serving behavior together. Approval depends on target workload gates, not checkpoint size alone.
| Evaluation Axis | What To Measure | Why It Matters |
| --- | --- | --- |
| Language modeling | Held-out perplexity | Fast check that next-token behavior didn't drift too far |
| Reasoning and knowledge | MMLU[9], GSM8K[10] | Catches multi-step failures that perplexity can hide |
| Code generation | HumanEval[11] or your own coding eval | Code is often more brittle than chat completion |
| Systems performance | Tokens/s, VRAM use, max context, cold-start time | Quantization is a systems tradeoff, not just an accuracy number |

Three practical rules:

  • GPTQ and AWQ results support testing 4-bit weight-only artifacts before pushing to more aggressive bit widths.[3][6]
  • Task-specific regressions cannot be inferred from a generic quality score; reasoning, structured output, and tool-use tasks need their own gates.
  • Group size, calibration data, and the serving kernel can matter as much as the headline format name.

quantized-artifact-approval-gate.py
candidates = [
    {"name": "FP16", "task_accuracy": 0.93, "p95_ms": 70, "vram_gib": 14.0},
    {"name": "AWQ-4bit", "task_accuracy": 0.92, "p95_ms": 49, "vram_gib": 4.1},
    {"name": "aggressive-3bit", "task_accuracy": 0.85, "p95_ms": 43, "vram_gib": 3.2},
]
minimum_accuracy = 0.90
maximum_vram_gib = 8.0
approved = [
    c["name"]
    for c in candidates
    if c["task_accuracy"] >= minimum_accuracy and c["vram_gib"] <= maximum_vram_gib
]

print(f"approved artifacts: {approved}")
print("Smaller artifact is rejected when task quality misses the gate.")
Output
approved artifacts: ['AWQ-4bit']
Smaller artifact is rejected when task quality misses the gate.

Decision guide

Choose based on the bottleneck

Decision table mapping deployment bottlenecks to candidate quantization paths: GPU residency to AWQ or GPTQ, local offload to GGUF, native FP8 hardware to FP8, and long-context memory pressure to KV-cache tuning.
Choose candidates from the deployment bottleneck, then benchmark them. A model that fits in VRAM needs different testing from a local model that must offload layers.
| Scenario | Recommended Starting Point | Why |
| --- | --- | --- |
| Production GPU server, model fits in VRAM | Benchmark AWQ or GPTQ | Candidate artifacts for specialized GPU kernels |
| 24 GB GPU, model is too large | GGUF with partial offload | Lets system RAM absorb the overflow |
| CPU or Apple Silicon laptop | GGUF | Common local-runtime artifact path |
| Datacenter accelerator with native FP8 path | FP8 + KV-cache tuning | Better quality/memory tradeoff than jumping straight to INT4 |

Practical rule of thumb: if the whole model fits on GPU, benchmark an AWQ or GPTQ path supported by your runtime. If it does not, benchmark a GGUF partial-offload path.
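The rule of thumb can be written down as a small helper that returns candidates to benchmark rather than a verdict. Everything here is an assumption for illustration: the function name, the flat 2 GiB runtime overhead, and the choice to size weights at exactly 0.5 or 1 byte per parameter without scale metadata.

```python
def candidate_paths(params_billion, vram_gib, overhead_gib=2.0,
                    native_fp8=False, gpu_server=True):
    """Return quantization paths worth benchmarking for a deployment target."""
    if not gpu_server:
        return ["GGUF on a CPU or Apple Silicon runtime"]

    gib_per_billion_bytes = 1e9 / 1024**3
    int4_gib = params_billion * 0.5 * gib_per_billion_bytes
    fp8_gib = params_billion * 1.0 * gib_per_billion_bytes

    candidates = []
    if native_fp8 and fp8_gib + overhead_gib <= vram_gib:
        candidates.append("FP8 with KV-cache tuning")
    if int4_gib + overhead_gib <= vram_gib:
        candidates.append("AWQ or GPTQ on a GPU runtime")
    if not candidates:
        # Even 4-bit weights do not fit, so offload becomes the main constraint.
        candidates.append("GGUF with partial GPU offload")
    return candidates

print(candidate_paths(8, vram_gib=24))
print(candidate_paths(70, vram_gib=24))
print(candidate_paths(8, vram_gib=80, native_fp8=True))
print(candidate_paths(8, vram_gib=0, gpu_server=False))
```

The output is a shortlist, and each entry still has to pass the quality and serving gates on the real workload.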


Mastery check

By the end of this lesson, you should be able to:

  • Distinguish symmetric quantization from asymmetric quantization with scale and zero point.
  • Explain why GPTQ uses approximate second-order information to minimize layer reconstruction error.
  • Describe AWQ's activation-aware channel protection and why activation statistics matter.
  • Explain GGUF as a portable container format, not a quantization algorithm.
  • Predict how calibration data and group size affect quantization quality.
  • Evaluate a quantized model with perplexity, task benchmarks, and serving metrics.
  • Separate weight-only quantization, activation quantization, and KV-cache quantization.

Evaluation rubric

  • Needs work: You can repeat bit widths, but you can't explain why low-bit storage reduces memory traffic first and only improves speed when the runtime can exploit it.
  • Developing: You can describe one format such as GPTQ or GGUF, but you still mix up quantization algorithms, container formats, and runtime offload choices.
  • Solid: You can size a model at FP16, INT8, and INT4, explain why runtime memory exceeds raw weight math, and choose between GPTQ, AWQ, and GGUF for a concrete hardware limit.
  • Strong: You can explain why GPTQ protects layer outputs, why AWQ protects activation-sensitive channels, and why calibration data and kernel support decide whether a quantized checkpoint is actually deployable.
  • Excellent: You can design an approval gate that combines perplexity, task evals, serving metrics, fallback artifacts, and KV-cache checks before routing production traffic to a low-bit model.

Follow-up questions

Your 34B model is 6 GB over the VRAM budget after ordinary FP16 sizing. Which artifact do you try first?

If 4-bit compression would let the full checkpoint stay resident on GPU, benchmark a supported AWQ or GPTQ kernel path. If the model still will not fit after low-bit packing, evaluate GGUF because partial offload is now the main constraint.
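The arithmetic behind that answer is short. The sketch below uses decimal gigabytes to match the question and counts raw weight bytes only, so group scales, KV cache, and runtime overhead still have to come out of the remaining headroom.

```python
params = 34_000_000_000

fp16_gb = params * 2 / 1e9        # raw FP16 weights
vram_budget_gb = fp16_gb - 6      # the question says FP16 is 6 GB over budget
int4_gb = params * 0.5 / 1e9      # raw 4-bit weights, before scales and overhead
headroom_gb = vram_budget_gb - int4_gb

print(f"FP16 weights: {fp16_gb:.0f} GB, budget: {vram_budget_gb:.0f} GB")
print(f"4-bit weights: {int4_gb:.0f} GB, headroom: {headroom_gb:.0f} GB")
```

With that much headroom in this scenario, a fully GPU-resident 4-bit artifact is the first thing to benchmark, and partial offload only becomes relevant if the budget were far tighter.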

Your French customer-support model quantizes badly after calibrating on English Wikipedia. What likely went wrong?

The quantizer protected the wrong directions. GPTQ uses calibration activations to estimate which output errors matter, and AWQ uses them to find activation-sensitive channels. If deployment traffic is French retail support, English Wikipedia gives the wrong activation distribution.

A single outlier channel ruins your per-tensor 4-bit result. Why does per-group or per-channel scaling help?

One global scale forces every weight to share the outlier's range. Per-channel or per-group scaling lets smaller neighborhoods keep their own range, which preserves more local precision while adding moderate metadata overhead.
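A toy comparison shows the effect. The weights and the group size of 4 are invented for this sketch, and the quantizer is plain symmetric round-to-nearest, not any specific production kernel.

```python
def quantize_symmetric(values, bits=4):
    """Round-to-nearest with one shared scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def quantize_per_group(values, group_size, bits=4):
    """Give each consecutive group its own scale."""
    restored = []
    for start in range(0, len(values), group_size):
        restored.extend(quantize_symmetric(values[start:start + group_size], bits))
    return restored

def mean_squared_error(original, restored):
    return sum((a - b) ** 2 for a, b in zip(original, restored)) / len(original)

# Seven small weights and one outlier.
weights = [0.10, -0.20, 0.15, 0.05, -0.12, 0.18, -0.07, 8.00]

mse_tensor = mean_squared_error(weights, quantize_symmetric(weights))
mse_group = mean_squared_error(weights, quantize_per_group(weights, group_size=4))

print(f"per-tensor MSE: {mse_tensor:.5f}")
print(f"per-group MSE:  {mse_group:.5f}")
```

With one global scale, every small weight rounds to zero. With groups of four, the damage is confined to the group that contains the outlier, and the other group keeps its own fine-grained scale.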

After moving to W4A16, your long-context agent still runs out of memory at high concurrency. What lever do you pull next?

Look at the KV cache. Weight quantization shrinks static parameters, but the KV cache grows with context length and concurrent sessions. Once weights fit, the cache often becomes the next bottleneck, so KV-cache compression or shorter context budgets are the next levers to test.

You have hardware with strong native FP8 support and an 8B model that already fits cleanly in VRAM. Why might FP8 beat a jump straight to INT4?

FP8 is a milder compression step than INT4 while still reducing memory pressure. If the accelerator has native FP8 kernels, benchmark it as a candidate before accepting the larger compression risk of INT4.


Summary

  • Quantization basics: quantization stores weights with fewer bits using scales and, sometimes, zero points. The main win is lower memory bandwidth and smaller model artifacts.
  • GPTQ: uses representative activations and approximate second-order information to minimize output error during post-training quantization.[3]
  • AWQ: identifies activation-sensitive channels and rescales them so the quantizer spends precision where it matters most.[6]
  • GGUF: is a portable container for local inference that can store many ggml quantization types and supports CPU/GPU-offload workflows.[2]
  • Deployment choice: benchmark AWQ or GPTQ for fully GPU-resident serving, and benchmark GGUF when portability or partial offload is the main constraint.

Course handoff

You now understand how to shrink model weights for production. You can calculate memory footprints, choose between GPTQ, AWQ, and GGUF based on hardware constraints, and evaluate whether a quantized model is good enough for your task. These decisions are the foundation of cost-effective LLM serving.

The next challenge is local deployment. Even a quantized model needs the right runtime, hardware plan, container, health check, and evaluation gate. In Local LLM Deployment, you'll turn compression choices into a repeatable serving plan.

Next Step
Continue to Local LLM Deployment

Quantization reduces the bytes moved for each model call; local deployment shows how to package that compressed model into a measured service with hardware budgets and rollback.

References

[1] Xiao, G., et al. (2023). SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. ICML 2023.

[2] Gerganov, G. (2023). llama.cpp: Inference of LLaMA model in pure C/C++.

[3] Frantar, E., et al. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023.

[4] Raffel, C., et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR.

[5] Hugging Face (2026). GPTQ.

[6] Lin, J., et al. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. MLSys 2024.

[7] Hugging Face (2026). AWQ.

[8] Micikevicius, P., et al. (2022). FP8 Formats for Deep Learning.

[9] Hendrycks, D., et al. (2021). Measuring Massive Multitask Language Understanding (MMLU). ICLR 2021.

[10] Cobbe, K., et al. (2021). Training Verifiers to Solve Math Word Problems (GSM8K).

[11] Chen, M., et al. (2021). Evaluating Large Language Models Trained on Code (HumanEval). arXiv preprint.