The Short Version

4-bit quantization shrinks model weights from 16-bit floats to 4-bit integers. A Llama 3 8B model goes from ~16 GB to ~4.5 GB of VRAM. A 70B model drops from ~140 GB to ~35 GB – suddenly one GPU instead of two. The quality hit is surprisingly small: expect 0.1-0.3 perplexity increase on most benchmarks.
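The arithmetic behind those numbers is easy to sanity-check yourself. Here's a back-of-envelope sketch, weights only: it ignores KV cache and activation memory, and the 0.5 GB overhead term is a rough assumption covering quantization scales, zero-points, and runtime buffers.

```python
def estimate_weight_vram_gb(n_params_billion, bits_per_weight, overhead_gb=0.5):
    """Rough VRAM needed for the weights alone: params * bits / 8 bytes,
    plus a small fudge factor for scales/zero-points and buffers."""
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb

print(estimate_weight_vram_gb(8, 16))   # Llama 3 8B in fp16: ~16.5 GB
print(estimate_weight_vram_gb(8, 4))    # same model at 4-bit: ~4.5 GB
print(estimate_weight_vram_gb(70, 4))   # 70B at 4-bit: ~35.5 GB
```

The same formula explains why a 70B model squeezes onto a single 80 GB GPU at 4-bit but needs two in fp16.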

GPTQ and AWQ are the two dominant post-training quantization methods. Both produce 4-bit models, but they work differently under the hood and have different strengths.

Here’s the fastest way to quantize a model with each method.

Install AutoGPTQ and AutoAWQ

# AutoGPTQ -- install from pip with CUDA support
pip install auto-gptq
# If you hit build errors, use the pre-built wheel for your CUDA version
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu121/

# AutoAWQ -- simpler install
pip install autoawq

# Both need transformers and torch
pip install transformers torch accelerate

AutoGPTQ requires a CUDA-capable GPU for quantization. AutoAWQ also needs a GPU, but its installation is less finicky. If you’re on CUDA 12.1+, both packages install cleanly from pip most of the time.

Quantize a Model with GPTQ

GPTQ works by processing the model layer-by-layer, finding optimal 4-bit representations using a calibration dataset. The calibration data matters – use something representative of your actual use case.

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import torch

model_id = "meta-llama/Llama-3.1-8B-Instruct"
quant_output = "./llama-3.1-8b-gptq-4bit"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Configure quantization: 4-bit, group size 128
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,       # Smaller = better quality, larger model
    desc_act=True,        # Activation order quantization (slower but better)
    damp_percent=0.1,
)

# Load the full-precision model
model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config=quantize_config,
    torch_dtype=torch.float16,
)

# Prepare calibration data (128 samples is a good starting point)
calibration_data = []
texts = [
    "The capital of France is Paris, which is known for",
    "Machine learning models can be trained using",
    "To install Python packages, you typically use pip",
    "The transformer architecture was introduced in the paper",
    # ... add more diverse examples up to 128 samples
]
for text in texts:
    tokenized = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    # AutoGPTQ expects a list of dicts with input_ids and attention_mask,
    # not raw tensors
    calibration_data.append(
        {"input_ids": tokenized.input_ids, "attention_mask": tokenized.attention_mask}
    )

# Quantize -- this takes 30-60 minutes for an 8B model on an A100
model.quantize(calibration_data)

# Save the quantized model
model.save_quantized(quant_output)
tokenizer.save_pretrained(quant_output)
print(f"Quantized model saved to {quant_output}")

For an 8B model, expect quantization to take 30-60 minutes on an A100 or 1-2 hours on an RTX 4090. You need enough VRAM to hold the full fp16 model during quantization – about 16 GB for an 8B model.
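Hand-writing 128 calibration prompts is tedious, and in practice you'll want to slice them out of a corpus that matches your workload. A minimal, hypothetical helper for that (the chunking logic is the point; substitute your own text source for the string below):

```python
def build_calibration_texts(corpus, n_samples=128, chunk_chars=2000):
    """Slice one long corpus string into evenly spaced chunks.
    Each chunk later gets tokenized and truncated to the model's max length."""
    if len(corpus) <= chunk_chars:
        return [corpus]
    step = max(1, (len(corpus) - chunk_chars) // max(1, n_samples - 1))
    chunks = []
    for i in range(n_samples):
        chunk = corpus[i * step : i * step + chunk_chars]
        if not chunk:
            break
        chunks.append(chunk)
    return chunks

# e.g. corpus = open("my_domain_text.txt").read()
corpus = "quantization reduces memory at a small quality cost. " * 2000
texts = build_calibration_texts(corpus, n_samples=128, chunk_chars=2000)
```

Feed the resulting strings through the same tokenize-and-append loop as above.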

Quantize a Model with AWQ

AWQ (Activation-aware Weight Quantization) takes a different approach. Instead of optimizing all weights equally, it identifies which weights matter most based on activation magnitudes and protects those during quantization. This usually gives better quality than GPTQ at the same bit width.
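You can see the core intuition in a toy example: scale the salient weights up before round-to-nearest quantization and scale them back down after. As long as the scaling doesn't disturb the group's max-abs, the group scale stays the same but the effective rounding error on the protected weights shrinks by the scaling factor. This is a simplified sketch of the idea, not the real AWQ algorithm, which searches per-channel scales against calibration activations.

```python
import random

QMAX = 7  # 4-bit symmetric: integer levels in [-7, 7]

def rtn(weights, scale):
    """Round-to-nearest quantize/dequantize with a fixed group scale."""
    return [max(-QMAX, min(QMAX, round(w / scale))) * scale for w in weights]

random.seed(0)
salient = [random.uniform(-0.3, 0.3) for _ in range(64)]  # small but important
group = salient + [1.0]                # one outlier sets the group's max-abs
scale = max(abs(w) for w in group) / QMAX

# Plain RTN: salient weights take the full rounding error of the shared scale
plain_err = sum(abs(q - w) for q, w in zip(rtn(salient, scale), salient))

# AWQ-style protection: scale salient weights up by s, quantize, divide back.
# s=3 keeps 0.3 * 3 = 0.9 below the outlier, so the group scale is unchanged,
# but the effective rounding error on these weights shrinks by ~3x.
s = 3.0
protected = [q / s for q in rtn([w * s for w in salient], scale)]
awq_err = sum(abs(p - w) for p, w in zip(protected, salient))

print(plain_err, awq_err)  # the protected error is noticeably smaller
```

In the real method, "salient" is determined by activation magnitude seen during calibration, and the inverse scale is folded into the previous layer so inference cost doesn't change.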

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
quant_output = "./llama-3.1-8b-awq-4bit"

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Configure quantization
quant_config = {
    "zero_point": True,       # Use zero-point quantization
    "q_group_size": 128,      # Weight grouping size
    "w_bit": 4,               # 4-bit quantization
    "version": "GEMM",        # GEMM kernel for inference speed
}

# Quantize with built-in calibration
# AWQ handles calibration data internally using a default dataset
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized(quant_output)
tokenizer.save_pretrained(quant_output)
print(f"AWQ model saved to {quant_output}")

AWQ quantization is faster than GPTQ – typically 15-30 minutes for an 8B model. The version parameter controls which kernel is used at inference time: "GEMM" is faster for batch sizes > 1, while "GEMV" is optimized for single-request inference.

Load and Run Quantized Models

Once quantized, loading these models is straightforward.

Load a GPTQ Model

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_path = "./llama-3.1-8b-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    device="cuda:0",
    use_safetensors=True,
)

# Generate text
input_ids = tokenizer("Explain quantization in one paragraph:", return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Load an AWQ Model

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "./llama-3.1-8b-awq-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=True,    # Fuse attention layers for faster inference
)

input_ids = tokenizer("Explain quantization in one paragraph:", return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))

You can also load pre-quantized models directly from Hugging Face. TheBloke and many model authors publish GPTQ and AWQ variants:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a pre-quantized AWQ model from Hugging Face
model = AutoModelForCausalLM.from_pretrained(
    "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4")

GPTQ vs AWQ: When to Use Which

| Aspect | GPTQ | AWQ |
|---|---|---|
| Quantization speed | 30-60 min (8B) | 15-30 min (8B) |
| Perplexity (Llama 3 8B, 4-bit) | ~6.2 (vs 5.9 fp16) | ~6.1 (vs 5.9 fp16) |
| Inference throughput | Good with Marlin kernel | Excellent with fused layers |
| vLLM support | Yes | Yes (preferred) |
| Calibration | User-provided dataset | Built-in, automatic |
| Custom calibration | Full control | Limited |
| Model availability | Thousands on HF Hub | Growing fast |
| VRAM (8B, 4-bit) | ~4.5 GB | ~4.5 GB |

My recommendation: use AWQ for most cases. It’s faster to quantize, produces slightly better quality, and has first-class support in vLLM and other serving frameworks. Choose GPTQ when you need fine control over calibration data or when a pre-quantized GPTQ model already exists for your target architecture.

Serve Quantized Models with vLLM

Quantized models really shine when served through an inference engine. vLLM handles GPTQ and AWQ models natively.

# Serve an AWQ model
vllm serve ./llama-3.1-8b-awq-4bit \
  --quantization awq \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9

# Serve a GPTQ model
vllm serve ./llama-3.1-8b-gptq-4bit \
  --quantization gptq \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096

You can also serve pre-quantized Hub models directly:

vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq \
  --max-model-len 8192

With vLLM, a 4-bit AWQ model typically gets 1.5-2x the throughput of the same model in fp16, because the smaller weights mean more KV cache space and more concurrent requests.
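The KV-cache math backs this up. Here's a rough estimate using Llama 3.1 8B's attention shape (32 layers, 8 KV heads via grouped-query attention, head dim 128, fp16 cache), with the simplifying assumption that all VRAM not holding weights is available as cache:

```python
def max_concurrent_seqs(gpu_gb, weight_gb, seq_len,
                        n_layers=32, n_kv_heads=8, head_dim=128, kv_bytes=2):
    """How many full-length sequences fit in KV cache after weights load.
    Per token per layer: 2 (K and V) * n_kv_heads * head_dim * kv_bytes."""
    kv_per_seq = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * seq_len
    free_bytes = (gpu_gb - weight_gb) * 1e9
    return int(free_bytes // kv_per_seq)

# 24 GB GPU, 4096-token sequences
print(max_concurrent_seqs(24, 16.2, 4096))  # fp16 weights: ~14 sequences
print(max_concurrent_seqs(24, 4.6, 4096))   # 4-bit AWQ: ~36 sequences
```

Roughly 2.5x the concurrent sequences from the same card, which is where the batch-throughput gains come from.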

Benchmark VRAM and Throughput

Here’s what to expect on real hardware.

Llama 3.1 8B on RTX 4090 (24 GB)

| Format | VRAM | Tokens/sec (batch=1) | Tokens/sec (batch=16) |
|---|---|---|---|
| fp16 | 16.2 GB | 52 | 380 |
| GPTQ 4-bit | 4.8 GB | 68 | 520 |
| AWQ 4-bit | 4.6 GB | 72 | 550 |

Llama 3.1 70B on A100 80 GB

| Format | VRAM | Tokens/sec (batch=1) | Tokens/sec (batch=8) |
|---|---|---|---|
| fp16 | 140 GB (2x A100) | 18 | 95 |
| GPTQ 4-bit | 36 GB (1x A100) | 24 | 145 |
| AWQ 4-bit | 35 GB (1x A100) | 26 | 155 |

The 70B numbers are the real story. Going from two A100s to one – while actually getting better throughput per GPU – is a significant cost saving. At cloud GPU prices, that’s cutting your inference bill in half.
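You can put numbers on that using the batch-8 throughputs above. The $2/hr A100 price below is a hypothetical placeholder; plug in whatever your provider charges.

```python
def cost_per_million_tokens(n_gpus, price_per_gpu_hr, tokens_per_sec):
    """Dollars per million generated tokens at sustained throughput."""
    tokens_per_hr = tokens_per_sec * 3600
    return n_gpus * price_per_gpu_hr / tokens_per_hr * 1e6

fp16_cost = cost_per_million_tokens(2, 2.0, 95)    # 70B fp16 on 2x A100
awq_cost = cost_per_million_tokens(1, 2.0, 155)    # 70B AWQ on 1x A100
print(fp16_cost, awq_cost)  # roughly $11.70 vs $3.58 per million tokens
```

Halving the GPU count and raising throughput compound, so the per-token cost drops by more than half.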

Common Errors and Fixes

CUDA extension not installed when importing auto_gptq. This means the CUDA kernels weren’t compiled during installation. Reinstall with the correct CUDA version:

pip uninstall auto-gptq -y
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu121/

Check your CUDA version with nvcc --version and match the wheel URL accordingly (cu118, cu121, cu124).
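The wheel tag is just the CUDA major and minor digits run together. A trivial, hypothetical helper if you're scripting the install:

```python
def cuda_wheel_tag(cuda_version):
    """Map a CUDA version string like '12.1' to a wheel-index tag like 'cu121'."""
    major, minor = cuda_version.split(".")[:2]
    return f"cu{major}{minor}"

print(cuda_wheel_tag("12.1"))  # cu121
```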

RuntimeError: CUDA out of memory during quantization. GPTQ quantization needs to hold the full fp16 model in memory. For 8B models, you need at least 18-20 GB free. Try closing other GPU processes, or set desc_act=False in the GPTQ config to reduce peak memory:

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,  # Reduces memory at the cost of slight quality drop
)

ValueError: Tokenizer class ... is not supported with AutoAWQ. Update to the latest version. Newer model architectures (like Llama 3) sometimes need a recent AutoAWQ release:

pip install autoawq --upgrade

AssertionError: only support ... bits in auto_gptq. You’re trying to quantize a model architecture that AutoGPTQ doesn’t support yet. Check the supported models list and update to the latest version.

Quantized model produces garbage output. Almost always a tokenizer mismatch. Make sure you saved and loaded the tokenizer from the same source as the original model. Also verify that the quantization completed without errors – partial quantization can produce corrupted weights.

vLLM fails with Quantization method gptq is not supported for model type. The quantization metadata in the model’s config.json might be missing or incorrect. Verify the quantization_config field exists in the config file:

python3 -c "import json; print(json.load(open('./llama-3.1-8b-gptq-4bit/config.json')).get('quantization_config', 'MISSING'))"

When Not to Quantize

Quantization isn’t always the right call. Skip it when:

  • You’re fine-tuning the model (quantized weights are frozen; use QLoRA instead for memory-efficient training)
  • Your task is extremely sensitive to small accuracy differences, like medical diagnosis or legal analysis
  • You already have enough VRAM for the full-precision model and don’t need to maximize throughput
  • The model is already small (1-3B parameters) – the memory savings don’t justify the quality hit at that scale

For everything else – inference, serving, prototyping with large models on consumer hardware – 4-bit quantization with AWQ or GPTQ is the way to go.