Running GPT-4 or Claude at scale gets expensive fast. Knowledge distillation lets you train a smaller model (GPT-3.5, Llama 3 8B, Mistral 7B) to mimic a larger “teacher” model’s behavior. You get 80-95% of the performance at 1/10th to 1/100th the cost.

Here’s the practical approach: generate training data from your expensive teacher model, then fine-tune a cheaper student model on that data. For open-source models, you can also use logit-based distillation to match the teacher’s probability distributions directly.

API-Based Distillation: Generate Training Data from Teacher Models

This is the easiest approach for commercial APIs like GPT-4 or Claude. You create a dataset by sending prompts to the teacher model and collecting its responses.

import json

from openai import OpenAI
from tqdm import tqdm

# Define your task-specific prompts
prompts = [
    "Explain quantum computing in simple terms",
    "Write a Python function to detect palindromes",
    "Summarize the key benefits of containerization",
    # Add 1000+ diverse prompts covering your use case
]

# Generate training data from GPT-4 (teacher model)
client = OpenAI()  # reads OPENAI_API_KEY from the environment
training_data = []

for prompt in tqdm(prompts):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # Use the same temperature as production
        max_tokens=500,
    )

    training_data.append({
        "prompt": prompt,
        "completion": response.choices[0].message.content,
    })

# Save in OpenAI fine-tuning format (JSONL)
with open("distillation_data.jsonl", "w") as f:
    for item in training_data:
        f.write(json.dumps({
            "messages": [
                {"role": "user", "content": item["prompt"]},
                {"role": "assistant", "content": item["completion"]},
            ]
        }) + "\n")

Critical details:

  • Use 1,000-10,000 examples minimum — more data = better student performance
  • Match your production use case — if you summarize docs, distill summaries
  • Use the same temperature/parameters you’ll use in production
  • Split data 80/20 train/validation to track overfitting
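The 80/20 split from the last bullet takes only standard-library Python. A minimal sketch (the file paths match the JSONL file written above; the function name and seed are my own choices):

```python
import json
import random

def split_dataset(path, train_path, val_path, val_fraction=0.2, seed=42):
    """Shuffle a JSONL dataset and write train/validation splits."""
    with open(path) as f:
        rows = [json.loads(line) for line in f]

    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    n_val = int(len(rows) * val_fraction)

    for out_path, subset in [(val_path, rows[:n_val]), (train_path, rows[n_val:])]:
        with open(out_path, "w") as f:
            for row in subset:
                f.write(json.dumps(row) + "\n")

    return len(rows) - n_val, n_val

# split_dataset("distillation_data.jsonl", "train.jsonl", "val.jsonl")
```

Track validation loss during fine-tuning; if it rises while training loss keeps falling, you are overfitting and should stop early or add data.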

Now fine-tune a smaller model (GPT-3.5-turbo, Llama 3 8B) on this dataset:

# Fine-tune GPT-3.5-turbo via the OpenAI API
# (first upload distillation_data.jsonl via the Files API with purpose
# "fine-tune"; the job takes the returned file ID, not a local path)
openai api fine_tuning.jobs.create \
  -t <training-file-id> \
  -m gpt-3.5-turbo \
  --suffix "distilled-from-gpt4"

# Or fine-tune an open-source model with a Hugging Face Trainer-based
# script (sft.py is a placeholder for your SFT script, e.g. built on TRL)
python sft.py \
  --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
  --train_file distillation_data.jsonl \
  --per_device_train_batch_size 4 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir ./llama3-distilled

The student model learns to approximate the teacher's outputs. On a narrow, well-covered task, expect somewhere around 85-90% of GPT-4's quality at a fraction of the cost.

Logit-Based Distillation with Open-Source Models

If you have access to the teacher model’s logits (probability distributions), you can train the student to match those directly. This works with open-source models like Llama, Mistral, or Qwen.

import torch
from torch.nn.functional import cross_entropy, kl_div, log_softmax, softmax
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load teacher (large) and student (small) models
teacher_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    torch_dtype=torch.bfloat16, device_map="auto"
)
student_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

teacher_model.eval()
student_model.train()

# Temperature for softening probability distributions
temperature = 2.0  # Higher = softer distributions, easier to learn from

# Training loop
optimizer = torch.optim.AdamW(student_model.parameters(), lr=5e-5)

# train_dataloader yields batches of raw text; for brevity, padding
# tokens are not masked out of the losses
for batch in train_dataloader:
    inputs = tokenizer(batch["text"], return_tensors="pt", padding=True)
    inputs = inputs.to(student_model.device)

    # Get teacher's logits (no gradient needed)
    with torch.no_grad():
        teacher_logits = teacher_model(**inputs).logits

    # Get student's logits
    student_logits = student_model(**inputs).logits

    # Distillation loss: KL divergence between temperature-softened
    # distributions, scaled by T^2 to keep gradient magnitude comparable
    loss_distill = kl_div(
        log_softmax(student_logits / temperature, dim=-1),
        softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean"
    ) * temperature ** 2

    # Optional task loss: next-token cross-entropy on the unscaled logits,
    # shifted so position i predicts token i + 1
    labels = inputs["input_ids"][:, 1:].reshape(-1)
    logits_for_ce = student_logits[:, :-1].reshape(-1, student_logits.size(-1))
    loss_task = cross_entropy(logits_for_ce, labels)

    # Combined loss (weight distillation higher)
    loss = 0.7 * loss_distill + 0.3 * loss_task

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Why this works better than API-based:

  • Matches the full probability distribution, not just the top-1 answer
  • Student learns the teacher’s uncertainty and confidence levels
  • Faster training — no API calls, just local GPU inference
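The effect of the temperature is easy to see in isolation. A numpy sketch, using made-up logits for a four-token vocabulary: dividing by T before the softmax flattens the distribution, so the student also sees which wrong answers the teacher considered plausible.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits softened by a temperature."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [8.0, 3.0, 1.0, 0.5]  # made-up teacher logits for four tokens

hard = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=2.0)

# Higher temperature moves probability mass onto the non-top tokens
# while preserving the teacher's relative preferences among them
print(hard.round(3))
print(soft.round(3))
```

With T=1 the top token takes nearly all the mass; with T=2 the runner-up tokens carry enough probability for the student to learn from them.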

Tradeoffs:

  • Requires hosting the teacher model (expensive if it’s 70B+ params)
  • Only works with open-source models where you have logit access
  • More complex training code than simple fine-tuning

Choosing Between API-Based and Logit-Based Distillation

Use API-based when:

  • Teacher is a commercial API (GPT-4, Claude, Gemini)
  • You don’t have GPUs to run large teacher models
  • You need <10k examples and can afford the API cost

Use logit-based when:

  • Both teacher and student are open-source models
  • You have GPUs (A100s or H100s) to run the teacher during training
  • You want maximum performance transfer

I recommend starting with API-based for commercial models — it’s simpler and cheaper upfront. If you’re distilling Llama 3 70B into Llama 3 8B, logit-based gives better results but needs more infrastructure.

Evaluating Your Distilled Model

Don’t trust vibes — measure quality on a held-out test set:

import numpy as np

# Run teacher and student on the same held-out prompts
# (load_test_prompts, call_gpt4, call_distilled_model are your own helpers)
test_prompts = load_test_prompts()  # 200+ examples
teacher_outputs = [call_gpt4(p) for p in test_prompts]
student_outputs = [call_distilled_model(p) for p in test_prompts]

# Semantic similarity (use embeddings)
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer('all-MiniLM-L6-v2')

teacher_embeds = embedder.encode(teacher_outputs)
student_embeds = embedder.encode(student_outputs)

# Cosine similarity per example
similarities = [
    np.dot(t, s) / (np.linalg.norm(t) * np.linalg.norm(s))
    for t, s in zip(teacher_embeds, student_embeds)
]

print(f"Average similarity: {np.mean(similarities):.3f}")
print(f"% outputs >0.9 similarity: {sum(s > 0.9 for s in similarities) / len(similarities):.1%}")

Aim for >0.85 average similarity on your task. If you’re below that, add more training data or increase epochs.

Common Errors and Fixes

“Student model outputs are too generic”

  • Increase temperature when generating teacher data (try 0.8-1.0)
  • Add more diverse prompts — cover edge cases and unusual inputs
  • Use top-p sampling instead of greedy decoding
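In production you would just pass a top_p parameter to your generation call, but the filtering step behind top-p (nucleus) sampling is simple enough to sketch directly. A minimal numpy version, with made-up token probabilities:

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, zero out the rest, and renormalize."""
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(probs)[::-1]           # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # index after last kept token
    kept = order[:cutoff]

    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

probs = [0.55, 0.25, 0.12, 0.05, 0.03]
print(top_p_filter(probs, p=0.9))  # the two rarest tokens get zero probability
```

Sampling from the filtered distribution keeps outputs varied without ever picking from the low-probability tail, which is where generic or degenerate completions tend to come from.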

“Fine-tuning fails with OOM errors”

  • Reduce batch size to 1-2 and use gradient accumulation
  • Use LoRA or QLoRA instead of full fine-tuning (saves 4-8x memory)
  • Enable gradient checkpointing: model.gradient_checkpointing_enable()
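Gradient accumulation from the first bullet works because gradients are linear in the batch: averaging per-micro-batch gradients reproduces the full-batch gradient. A numpy sketch with a least-squares loss on toy data (not an LLM, but the same identity your trainer relies on):

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of mean squared error (1/n)||Xw - y||^2 w.r.t. w."""
    n = len(y)
    return 2.0 / n * X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # full batch of 8 examples
y = rng.normal(size=8)
w = rng.normal(size=3)

full_grad = mse_grad(w, X, y)

# Same gradient, accumulated over 4 micro-batches of 2 examples each
accumulated = np.zeros_like(w)
for i in range(0, 8, 2):
    accumulated += mse_grad(w, X[i:i+2], y[i:i+2])
accumulated /= 4  # average over the micro-batches

print(np.allclose(full_grad, accumulated))  # True
```

So a batch size of 2 with 8 accumulation steps gives the same update as a batch of 16, at a fraction of the peak memory.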

“Student model hallucinates more than teacher”

  • Add a task loss term (cross-entropy on ground truth) alongside distillation loss
  • Lower the distillation temperature (try 1.5 instead of 2.0)
  • Filter teacher outputs for quality before training — remove obvious errors
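Filtering teacher outputs can start as cheap heuristics before any model-based scoring. A sketch; the refusal markers and length thresholds here are illustrative and should be tuned to your task:

```python
REFUSAL_MARKERS = [
    "i can't help", "i cannot help", "as an ai language model",
    "i'm sorry, but", "i am sorry, but",
]

def keep_example(completion, min_chars=40, max_chars=4000):
    """Drop teacher outputs that are too short, too long, or refusals."""
    text = completion.strip()
    if not (min_chars <= len(text) <= max_chars):
        return False
    lowered = text.lower()
    return not any(marker in lowered for marker in REFUSAL_MARKERS)

examples = [
    "Containerization packages an app with its dependencies so it runs the same everywhere.",
    "I'm sorry, but I can't help with that request.",
    "OK.",
]
print([keep_example(e) for e in examples])  # [True, False, False]
```

Even this crude pass removes the worst training signal; for higher stakes, add a second model-based grading step on what survives.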

“Distillation doesn’t improve over baseline fine-tuning”

  • You need 5-10x more distilled examples than traditional supervised data
  • Check if teacher is actually better on your task (run evals first)
  • Try ensembling multiple teacher models (GPT-4 + Claude) for better data

Cost Analysis: Is Distillation Worth It?

Quick math for a chatbot handling 1M queries/month:

  • GPT-4: $30 per 1M input tokens = ~$30k/month at 1k tokens/query
  • Distillation cost: $500 for 10k GPT-4 calls + $200 fine-tuning = $700 one-time
  • GPT-3.5 Turbo (distilled): $2 per 1M tokens = ~$2k/month
  • Savings: $28k/month after initial $700 investment

Payback in <1 day. Even if the distilled model only handles 70% of queries and falls back to GPT-4 for the rest, you still save $19k/month.
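The fallback math above can be reproduced with a small cost model. Prices and volumes are the illustrative figures from this section, not current list prices:

```python
def monthly_cost(queries, tokens_per_query, price_per_m_tokens):
    """API cost per month in dollars."""
    return queries * tokens_per_query / 1_000_000 * price_per_m_tokens

QUERIES = 1_000_000  # queries per month
TOKENS = 1_000       # tokens per query

gpt4_only = monthly_cost(QUERIES, TOKENS, 30.0)      # $30,000
distilled_only = monthly_cost(QUERIES, TOKENS, 2.0)  # $2,000

# 70% handled by the distilled model, 30% falls back to GPT-4
hybrid = (monthly_cost(int(QUERIES * 0.7), TOKENS, 2.0)
          + monthly_cost(int(QUERIES * 0.3), TOKENS, 30.0))

print(f"GPT-4 only:    ${gpt4_only:,.0f}/month")
print(f"70/30 hybrid:  ${hybrid:,.0f}/month")
print(f"Hybrid saving: ${gpt4_only - hybrid:,.0f}/month")  # $19,600
```

Plug in your own traffic, token counts, and fallback rate before committing; the break-even point moves fast with query volume.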

For open-source distillation (e.g., Llama 3 70B → 8B), the cost is mostly GPU time during training. If you’re already serving Llama 3 8B, the distilled version is free to deploy.