The Short Version
4-bit quantization shrinks model weights from 16-bit floats to 4-bit integers. A Llama 3 8B model goes from ~16 GB to ~4.5 GB of VRAM. A 70B model drops from ~140 GB to ~35 GB – suddenly one GPU instead of two. The quality hit is surprisingly small: expect a perplexity increase of roughly 0.1-0.3 points on standard evaluations.
GPTQ and AWQ are the two dominant post-training quantization methods. Both produce 4-bit models, but they work differently under the hood and have different strengths.
Here’s the fastest way to quantize a model with each method.
Install AutoGPTQ and AutoAWQ
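Both toolkits install from PyPI; a minimal setup sketch, assuming a recent CUDA environment (the extra packages are for loading and saving models, and the transformers version floor is an assumption for Llama 3 support):

```shell
# Core quantization toolkits
pip install auto-gptq autoawq

# Needed to load, calibrate, and save the models being quantized
pip install "transformers>=4.40" accelerate
```

If the prebuilt wheels don't match your CUDA toolkit, both projects also document building from source.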
AutoGPTQ requires a CUDA-capable GPU for quantization. AutoAWQ also needs a GPU, but its installation is less finicky. If you’re on CUDA 12.1+, both packages install cleanly from pip most of the time.
Quantize a Model with GPTQ
GPTQ works by processing the model layer-by-layer, finding optimal 4-bit representations using a calibration dataset. The calibration data matters – use something representative of your actual use case.
For an 8B model, expect quantization to take 30-60 minutes on an A100 or 1-2 hours on an RTX 4090. You need enough VRAM to hold the full fp16 model during quantization – about 16 GB for an 8B model.
Quantize a Model with AWQ
AWQ (Activation-aware Weight Quantization) takes a different approach. Instead of optimizing all weights equally, it identifies which weights matter most based on activation magnitudes and protects those during quantization. This usually gives better quality than GPTQ at the same bit width.
AWQ quantization is faster than GPTQ – typically 15-30 minutes for an 8B model. The version parameter controls which kernel is used at inference time: "GEMM" is faster for batch sizes > 1, while "GEMV" is optimized for single-request inference.
Load and Run Quantized Models
Once a model is quantized, loading it is straightforward.
Load a GPTQ Model
Load an AWQ Model
You can also load pre-quantized models directly from Hugging Face. TheBloke and many model authors publish GPTQ and AWQ variants:
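With a recent transformers release, a pre-quantized checkpoint loads like any other model: the library reads the `quantization_config` in the repo's config.json and dispatches to the right backend (auto-gptq/optimum for GPTQ, autoawq for AWQ). The repo name below is illustrative; substitute any 4-bit variant from the Hub:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "TheBloke/Llama-2-7B-GPTQ"  # illustrative Hub repo name

def load_prequantized(repo_id: str = REPO_ID):
    # No quantization arguments needed: transformers picks the backend
    # from the quantization_config stored in the repo's config.json
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    return model, tokenizer
```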
GPTQ vs AWQ: When to Use Which
| Aspect | GPTQ | AWQ |
|---|---|---|
| Quantization speed | 30-60 min (8B) | 15-30 min (8B) |
| Perplexity (Llama 3 8B, 4-bit) | ~6.2 (vs 5.9 fp16) | ~6.1 (vs 5.9 fp16) |
| Inference throughput | Good with Marlin kernel | Excellent with fused layers |
| vLLM support | Yes | Yes (preferred) |
| Calibration | User-provided dataset | Built-in, automatic |
| Custom calibration | Full control | Limited |
| Model availability | Thousands on HF Hub | Growing fast |
| VRAM (8B, 4-bit) | ~4.5 GB | ~4.5 GB |
My recommendation: use AWQ for most cases. It’s faster to quantize, produces slightly better quality, and has first-class support in vLLM and other serving frameworks. Choose GPTQ when you need fine control over calibration data or when a pre-quantized GPTQ model already exists for your target architecture.
Serve Quantized Models with vLLM
Quantized models really shine when served through an inference engine. vLLM handles GPTQ and AWQ models natively.
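A sketch using the `vllm serve` CLI from recent vLLM releases (older versions use `python -m vllm.entrypoints.openai.api_server` with the same flags); the local path is the assumed AWQ output directory from earlier:

```shell
# Serve the local AWQ checkpoint behind an OpenAI-compatible API
vllm serve ./llama-3-8b-awq-4bit --quantization awq --port 8000
```

For a GPTQ checkpoint, swap in `--quantization gptq`.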
You can also serve pre-quantized Hub models directly:
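For example, with an illustrative Hub repo name (vLLM can usually detect the quantization method from the repo's config, so the explicit flag is a belt-and-braces choice):

```shell
# Repo name is illustrative; any GPTQ or AWQ repo on the Hub works
vllm serve TheBloke/Llama-2-7B-GPTQ --quantization gptq --port 8000
```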
With vLLM, a 4-bit AWQ model typically gets 1.5-2x the throughput of the same model in fp16, because the smaller weights mean more KV cache space and more concurrent requests.
Benchmark VRAM and Throughput
Here’s what to expect on real hardware.
Llama 3.1 8B on RTX 4090 (24 GB)
| Format | VRAM | Tokens/sec (batch=1) | Tokens/sec (batch=16) |
|---|---|---|---|
| fp16 | 16.2 GB | 52 | 380 |
| GPTQ 4-bit | 4.8 GB | 68 | 520 |
| AWQ 4-bit | 4.6 GB | 72 | 550 |
Llama 3.1 70B on A100 80 GB
| Format | VRAM | Tokens/sec (batch=1) | Tokens/sec (batch=8) |
|---|---|---|---|
| fp16 | 140 GB (2x A100) | 18 | 95 |
| GPTQ 4-bit | 36 GB (1x A100) | 24 | 145 |
| AWQ 4-bit | 35 GB (1x A100) | 26 | 155 |
The 70B numbers are the real story. Going from two A100s to one – while actually getting better throughput per GPU – is a significant cost saving. At cloud GPU prices, that’s cutting your inference bill in half.
Common Errors and Fixes
CUDA extension not installed when importing auto_gptq.
This means the CUDA kernels weren’t compiled during installation. Reinstall with the correct CUDA version:
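One way to do this, using the prebuilt-wheel index the AutoGPTQ README points at (adjust the `cuXXX` segment to your toolkit):

```shell
pip uninstall -y auto-gptq
# Pick the index matching your CUDA toolkit: cu118, cu121, or cu124
pip install auto-gptq --no-build-isolation \
  --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu121/
```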
Check your CUDA version with nvcc --version and match the wheel URL accordingly (cu118, cu121, cu124).
RuntimeError: CUDA out of memory during quantization.
GPTQ quantization needs to hold the full fp16 model in memory. For 8B models, you need at least 18-20 GB free. Try closing other GPU processes, or set desc_act=False in the GPTQ config to reduce peak memory:
ValueError: Tokenizer class ... is not supported with AutoAWQ.
Update to the latest version. Newer model architectures (like Llama 3) sometimes need a recent AutoAWQ release:
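Upgrading both packages together avoids version-skew issues between the tokenizer classes transformers ships and what AutoAWQ expects:

```shell
pip install -U autoawq transformers
```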
AssertionError: only support ... bits in auto_gptq.
You’re trying to quantize a model architecture that AutoGPTQ doesn’t support yet. Check the supported models list and update to the latest version.
Quantized model produces garbage output. Almost always a tokenizer mismatch. Make sure you saved and loaded the tokenizer from the same source as the original model. Also verify that the quantization completed without errors – partial quantization can produce corrupted weights.
vLLM fails with Quantization method gptq is not supported for model type.
The quantization metadata in the model’s config.json might be missing or incorrect. Verify the quantization_config field exists in the config file:
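A small helper sketch for the check (the function name is ours, not part of any library):

```python
import json
from pathlib import Path

def check_quantization_config(model_dir: str) -> dict:
    """Raise if config.json lacks the quantization_config vLLM needs."""
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    qc = cfg.get("quantization_config")
    if qc is None:
        raise ValueError(f"{model_dir}/config.json has no quantization_config")
    return qc
```

A healthy GPTQ config contains fields like `"quant_method": "gptq"` and `"bits": 4`; if the field is missing, re-save the model with a current AutoGPTQ/AutoAWQ release, which writes it automatically.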
When Not to Quantize
Quantization isn’t always the right call. Skip it when:
- You’re fine-tuning the model (quantized weights are frozen; use QLoRA instead for memory-efficient training)
- Your task is extremely sensitive to small accuracy differences, like medical diagnosis or legal analysis
- You already have enough VRAM for the full-precision model and don’t need to maximize throughput
- The model is already small (1-3B parameters) – the memory savings don’t justify the quality hit at that scale
For everything else – inference, serving, prototyping with large models on consumer hardware – 4-bit quantization with AWQ or GPTQ is the way to go.
Related Guides
- How to Run LLMs Locally with Ollama and llama.cpp
- How to Deploy DeepSeek R1 on NVIDIA Blackwell with vLLM’s Disaggregated Serving
- How to Speed Up LLM Inference with Speculative Decoding
- How to Optimize Model Inference with ONNX Runtime
- How to Optimize LLM Serving with KV Cache and PagedAttention
- How to Build a Model Serving Cluster with Ray Serve and Docker
- How to Set Up Multi-GPU Training with PyTorch
- How to Build a Model Training Pipeline with Lightning Fabric
- How to Set Up Distributed Training with DeepSpeed and ZeRO
- How to Build a Model Training Dashboard with TensorBoard and Prometheus