Running GPT-4 or Claude at scale gets expensive fast. Knowledge distillation lets you train a smaller model (GPT-3.5, Llama 3 8B, Mistral 7B) to mimic a larger “teacher” model’s behavior. You get 80-95% of the performance at 1/10th to 1/100th the cost.

Here’s the practical approach: generate training data from your expensive teacher model, then fine-tune a cheaper student model on that data. For open-source models, you can also use logit-based distillation to match the teacher’s probability distributions directly.

API-Based Distillation: Generate Training Data from Teacher Models

This is the easiest approach for commercial APIs like GPT-4 or Claude. You create a dataset by sending prompts to the teacher model and collecting its responses.

import json

from openai import OpenAI
from tqdm import tqdm

# Define your task-specific prompts
prompts = [
    "Explain quantum computing in simple terms",
    "Write a Python function to detect palindromes",
    "Summarize the key benefits of containerization",
    # Add 1000+ diverse prompts covering your use case
]

# Generate training data from GPT-4 (teacher model)
client = OpenAI()  # reads OPENAI_API_KEY from the environment
training_data = []

for prompt in tqdm(prompts):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # Use the same temperature as production
        max_tokens=500,
    )

    training_data.append({
        "prompt": prompt,
        "completion": response.choices[0].message.content,
    })

# Save in OpenAI fine-tuning format (JSONL)
with open("distillation_data.jsonl", "w") as f:
    for item in training_data:
        f.write(json.dumps({
            "messages": [
                {"role": "user", "content": item["prompt"]},
                {"role": "assistant", "content": item["completion"]},
            ]
        }) + "\n")

Critical details:

  • Use 1,000-10,000 examples minimum — more data = better student performance
  • Match your production use case — if you summarize docs, distill summaries
  • Use the same temperature/parameters you’ll use in production
  • Split data 80/20 train/validation to track overfitting
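The 80/20 split from the last bullet takes only standard-library Python. A minimal sketch (the file paths match the JSONL file written above; the function name and seed are my own choices):

```python
import json
import random

def split_dataset(path, train_path, val_path, val_fraction=0.2, seed=42):
    """Shuffle a JSONL dataset and write train/validation splits."""
    with open(path) as f:
        rows = [json.loads(line) for line in f]

    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    n_val = int(len(rows) * val_fraction)

    for out_path, subset in [(val_path, rows[:n_val]), (train_path, rows[n_val:])]:
        with open(out_path, "w") as f:
            for row in subset:
                f.write(json.dumps(row) + "\n")

    return len(rows) - n_val, n_val

# split_dataset("distillation_data.jsonl", "train.jsonl", "val.jsonl")
```

Track validation loss during fine-tuning; if it rises while training loss keeps falling, you are overfitting and should stop early or add data.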

Now fine-tune a smaller model (GPT-3.5-turbo, Llama 3 8B) on this dataset:

# Fine-tune GPT-3.5-turbo via the OpenAI API
# (first upload distillation_data.jsonl via the Files API with purpose
# "fine-tune"; the job takes the returned file ID, not a local path)
openai api fine_tuning.jobs.create \
  -t <training-file-id> \
  -m gpt-3.5-turbo \
  --suffix "distilled-from-gpt4"

# Or fine-tune an open-source model with a Hugging Face Trainer-based
# script (sft.py is a placeholder for your SFT script, e.g. built on TRL)
python sft.py \
  --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
  --train_file distillation_data.jsonl \
  --per_device_train_batch_size 4 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir ./llama3-distilled

The student model learns to approximate the teacher's outputs. On a narrow, well-covered task, expect somewhere around 85-90% of GPT-4's quality at a fraction of the cost.

Logit-Based Distillation with Open-Source Models

If you have access to the teacher model’s logits (probability distributions), you can train the student to match those directly. This works with open-source models like Llama, Mistral, or Qwen.

import torch
from torch.nn.functional import cross_entropy, kl_div, log_softmax, softmax
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load teacher (large) and student (small) models
teacher_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    torch_dtype=torch.bfloat16, device_map="auto"
)
student_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

teacher_model.eval()
student_model.train()

# Temperature for softening probability distributions
temperature = 2.0  # Higher = softer distributions, easier to learn from

# Training loop
optimizer = torch.optim.AdamW(student_model.parameters(), lr=5e-5)

# train_dataloader yields batches of raw text; for brevity, padding
# tokens are not masked out of the losses
for batch in train_dataloader:
    inputs = tokenizer(batch["text"], return_tensors="pt", padding=True)
    inputs = inputs.to(student_model.device)

    # Get teacher's logits (no gradient needed)
    with torch.no_grad():
        teacher_logits = teacher_model(**inputs).logits

    # Get student's logits
    student_logits = student_model(**inputs).logits

    # Distillation loss: KL divergence between temperature-softened
    # distributions, scaled by T^2 to keep gradient magnitude comparable
    loss_distill = kl_div(
        log_softmax(student_logits / temperature, dim=-1),
        softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean"
    ) * temperature ** 2

    # Optional task loss: next-token cross-entropy on the unscaled logits,
    # shifted so position i predicts token i + 1
    labels = inputs["input_ids"][:, 1:].reshape(-1)
    logits_for_ce = student_logits[:, :-1].reshape(-1, student_logits.size(-1))
    loss_task = cross_entropy(logits_for_ce, labels)

    # Combined loss (weight distillation higher)
    loss = 0.7 * loss_distill + 0.3 * loss_task

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Why this works better than API-based:

  • Matches the full probability distribution, not just the top-1 answer
  • Student learns the teacher’s uncertainty and confidence levels
  • Faster training — no API calls, just local GPU inference
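The effect of the temperature is easy to see in isolation. A numpy sketch, using made-up logits for a four-token vocabulary: dividing by T before the softmax flattens the distribution, so the student also sees which wrong answers the teacher considered plausible.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits softened by a temperature."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [8.0, 3.0, 1.0, 0.5]  # made-up teacher logits for four tokens

hard = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=2.0)

# Higher temperature moves probability mass onto the non-top tokens
# while preserving the teacher's relative preferences among them
print(hard.round(3))
print(soft.round(3))
```

With T=1 the top token takes nearly all the mass; with T=2 the runner-up tokens carry enough probability for the student to learn from them.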

Tradeoffs:

  • Requires hosting the teacher model (expensive if it’s 70B+ params)
  • Only works with open-source models where you have logit access
  • More complex training code than simple fine-tuning

Choosing Between API-Based and Logit-Based Distillation

Use API-based when:

  • Teacher is a commercial API (GPT-4, Claude, Gemini)
  • You don’t have GPUs to run large teacher models
  • You need <10k examples and can afford the API cost

Use logit-based when:

  • Both teacher and student are open-source models
  • You have GPUs (A100s or H100s) to run the teacher during training
  • You want maximum performance transfer

I recommend starting with API-based for commercial models — it’s simpler and cheaper upfront. If you’re distilling Llama 3 70B into Llama 3 8B, logit-based gives better results but needs more infrastructure.

Evaluating Your Distilled Model

Don’t trust vibes — measure quality on a held-out test set:

import numpy as np

# Run teacher and student on the same held-out prompts
# (load_test_prompts, call_gpt4, call_distilled_model are your own helpers)
test_prompts = load_test_prompts()  # 200+ examples
teacher_outputs = [call_gpt4(p) for p in test_prompts]
student_outputs = [call_distilled_model(p) for p in test_prompts]

# Semantic similarity (use embeddings)
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer('all-MiniLM-L6-v2')

teacher_embeds = embedder.encode(teacher_outputs)
student_embeds = embedder.encode(student_outputs)

# Cosine similarity per example
similarities = [
    np.dot(t, s) / (np.linalg.norm(t) * np.linalg.norm(s))
    for t, s in zip(teacher_embeds, student_embeds)
]

print(f"Average similarity: {np.mean(similarities):.3f}")
print(f"% outputs >0.9 similarity: {sum(s > 0.9 for s in similarities) / len(similarities):.1%}")

Aim for >0.85 average similarity on your task. If you’re below that, add more training data or increase epochs.

Common Errors and Fixes

“Student model outputs are too generic”

  • Increase temperature when generating teacher data (try 0.8-1.0)
  • Add more diverse prompts — cover edge cases and unusual inputs
  • Use top-p sampling instead of greedy decoding
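In production you would just pass a top_p parameter to your generation call, but the filtering step behind top-p (nucleus) sampling is simple enough to sketch directly. A minimal numpy version, with made-up token probabilities:

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, zero out the rest, and renormalize."""
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(probs)[::-1]           # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # index after last kept token
    kept = order[:cutoff]

    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

probs = [0.55, 0.25, 0.12, 0.05, 0.03]
print(top_p_filter(probs, p=0.9))  # the two rarest tokens get zero probability
```

Sampling from the filtered distribution keeps outputs varied without ever picking from the low-probability tail, which is where generic or degenerate completions tend to come from.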

“Fine-tuning fails with OOM errors”

  • Reduce batch size to 1-2 and use gradient accumulation
  • Use LoRA or QLoRA instead of full fine-tuning (saves 4-8x memory)
  • Enable gradient checkpointing: model.gradient_checkpointing_enable()
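Gradient accumulation from the first bullet works because gradients are linear in the batch: averaging per-micro-batch gradients reproduces the full-batch gradient. A numpy sketch with a least-squares loss on toy data (not an LLM, but the same identity your trainer relies on):

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of mean squared error (1/n)||Xw - y||^2 w.r.t. w."""
    n = len(y)
    return 2.0 / n * X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # full batch of 8 examples
y = rng.normal(size=8)
w = rng.normal(size=3)

full_grad = mse_grad(w, X, y)

# Same gradient, accumulated over 4 micro-batches of 2 examples each
accumulated = np.zeros_like(w)
for i in range(0, 8, 2):
    accumulated += mse_grad(w, X[i:i+2], y[i:i+2])
accumulated /= 4  # average over the micro-batches

print(np.allclose(full_grad, accumulated))  # True
```

So a batch size of 2 with 8 accumulation steps gives the same update as a batch of 16, at a fraction of the peak memory.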

“Student model hallucinates more than teacher”

  • Add a task loss term (cross-entropy on ground truth) alongside distillation loss
  • Lower the distillation temperature (try 1.5 instead of 2.0)
  • Filter teacher outputs for quality before training — remove obvious errors
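Filtering teacher outputs can start as cheap heuristics before any model-based scoring. A sketch; the refusal markers and length thresholds here are illustrative and should be tuned to your task:

```python
REFUSAL_MARKERS = [
    "i can't help", "i cannot help", "as an ai language model",
    "i'm sorry, but", "i am sorry, but",
]

def keep_example(completion, min_chars=40, max_chars=4000):
    """Drop teacher outputs that are too short, too long, or refusals."""
    text = completion.strip()
    if not (min_chars <= len(text) <= max_chars):
        return False
    lowered = text.lower()
    return not any(marker in lowered for marker in REFUSAL_MARKERS)

examples = [
    "Containerization packages an app with its dependencies so it runs the same everywhere.",
    "I'm sorry, but I can't help with that request.",
    "OK.",
]
print([keep_example(e) for e in examples])  # [True, False, False]
```

Even this crude pass removes the worst training signal; for higher stakes, add a second model-based grading step on what survives.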

“Distillation doesn’t improve over baseline fine-tuning”

  • You need 5-10x more distilled examples than traditional supervised data
  • Check if teacher is actually better on your task (run evals first)
  • Try ensembling multiple teacher models (GPT-4 + Claude) for better data

Cost Analysis: Is Distillation Worth It?

Quick math for a chatbot handling 1M queries/month:

  • GPT-4: $30 per 1M input tokens = ~$30k/month at 1k tokens/query
  • Distillation cost: $500 for 10k GPT-4 calls + $200 fine-tuning = $700 one-time
  • GPT-3.5 Turbo (distilled): $2 per 1M tokens = ~$2k/month
  • Savings: $28k/month after initial $700 investment

Payback in <1 day. Even if the distilled model only handles 70% of queries and falls back to GPT-4 for the rest, you still save $19k/month.
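The fallback math above can be reproduced with a small cost model. Prices and volumes are the illustrative figures from this section, not current list prices:

```python
def monthly_cost(queries, tokens_per_query, price_per_m_tokens):
    """API cost per month in dollars."""
    return queries * tokens_per_query / 1_000_000 * price_per_m_tokens

QUERIES = 1_000_000  # queries per month
TOKENS = 1_000       # tokens per query

gpt4_only = monthly_cost(QUERIES, TOKENS, 30.0)      # $30,000
distilled_only = monthly_cost(QUERIES, TOKENS, 2.0)  # $2,000

# 70% handled by the distilled model, 30% falls back to GPT-4
hybrid = (monthly_cost(int(QUERIES * 0.7), TOKENS, 2.0)
          + monthly_cost(int(QUERIES * 0.3), TOKENS, 30.0))

print(f"GPT-4 only:    ${gpt4_only:,.0f}/month")
print(f"70/30 hybrid:  ${hybrid:,.0f}/month")
print(f"Hybrid saving: ${gpt4_only - hybrid:,.0f}/month")  # $19,600
```

Plug in your own traffic, token counts, and fallback rate before committing; the break-even point moves fast with query volume.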

For open-source distillation (e.g., Llama 3 70B → 8B), the cost is mostly GPU time during training. If you’re already serving Llama 3 8B, the distilled version is free to deploy.