Running GPT-4 or Claude at scale gets expensive fast. Knowledge distillation lets you train a smaller “student” model (GPT-3.5, Llama 3 8B, Mistral 7B) to mimic a larger “teacher” model’s behavior. You typically get 80-95% of the performance at 1/10th to 1/100th of the cost.
Here’s the practical approach: generate training data from your expensive teacher model, then fine-tune a cheaper student model on that data. For open-source models, you can also use logit-based distillation to match the teacher’s probability distributions directly.
API-Based Distillation: Generate Training Data from Teacher Models
This is the easiest approach for commercial APIs like GPT-4 or Claude. You create a dataset by sending prompts to the teacher model and collecting its responses.
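Here’s a minimal sketch of that collection step, assuming the OpenAI Python SDK (v1+) with an `OPENAI_API_KEY` in the environment; the prompt list, model name, and output filename are placeholders for your own setup:

```python
import json

def build_record(prompt, response):
    """Format one (prompt, teacher response) pair as an OpenAI chat fine-tuning record."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]
    }

if __name__ == "__main__":
    from openai import OpenAI  # assumes the openai SDK (>=1.0) is installed

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompts = ["Summarize this document: ..."]  # your production-style prompts

    with open("teacher_data.jsonl", "w") as f:
        for prompt in prompts:
            resp = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7,  # match your production settings
            )
            f.write(json.dumps(build_record(prompt, resp.choices[0].message.content)) + "\n")
```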
Critical details:
- Use 1,000-10,000 examples minimum — more data = better student performance
- Match your production use case — if you summarize docs, distill summaries
- Use the same temperature/parameters you’ll use in production
- Split data 80/20 train/validation to track overfitting
Now fine-tune a smaller model (GPT-3.5-turbo, Llama 3 8B) on this dataset:
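For the GPT-3.5-turbo route, a sketch of the fine-tuning step (again assuming the OpenAI SDK; file names match the collection script above and are illustrative). It does the 80/20 split, uploads both files, and launches the job:

```python
import json
import random

def split_records(records, train_frac=0.8, seed=0):
    """Shuffle and split records 80/20 so validation loss can flag overfitting."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def write_jsonl(path, records):
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

if __name__ == "__main__":
    from openai import OpenAI  # assumes the openai SDK (>=1.0) is installed

    records = [json.loads(line) for line in open("teacher_data.jsonl")]
    train, val = split_records(records)
    write_jsonl("teacher_train.jsonl", train)
    write_jsonl("teacher_val.jsonl", val)

    client = OpenAI()
    train_file = client.files.create(file=open("teacher_train.jsonl", "rb"), purpose="fine-tune")
    val_file = client.files.create(file=open("teacher_val.jsonl", "rb"), purpose="fine-tune")
    job = client.fine_tuning.jobs.create(
        model="gpt-3.5-turbo",
        training_file=train_file.id,
        validation_file=val_file.id,
    )
    print(job.id)  # poll client.fine_tuning.jobs.retrieve(job.id) until it finishes
```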
The student model learns to approximate the teacher’s outputs. You’ll hit 85-90% of GPT-4’s quality on your specific task at a fraction of the cost.
Logit-Based Distillation with Open-Source Models
If you have access to the teacher model’s logits (the raw scores that define its output probability distribution), you can train the student to match those distributions directly. This works with open-source models like Llama, Mistral, or Qwen.
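A minimal PyTorch sketch of the loss, assuming teacher and student share a tokenizer and vocabulary; the temperature and mixing weight are common starting points, not tuned values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels=None, T=2.0, alpha=0.5):
    """Match the teacher's softened distribution; optionally blend in a task loss.

    Logits are (batch, vocab); for language models, flatten (batch, seq, vocab)
    to (batch * seq, vocab) first.
    """
    # Soften both distributions with temperature T; the T^2 factor keeps
    # gradient magnitudes comparable across temperatures.
    kld = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    if labels is None:
        return kld
    # Blending in cross-entropy on ground truth anchors the student to the task.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * ce + (1 - alpha) * kld

# In the training loop: run the teacher under torch.no_grad(), run the student
# normally, then backprop distillation_loss(student_logits, teacher_logits).
```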
Why this works better than API-based:
- Matches the full probability distribution, not just the top-1 answer
- Student learns the teacher’s uncertainty and confidence levels
- Faster training — no API calls, just local GPU inference
Tradeoffs:
- Requires hosting the teacher model (expensive if it’s 70B+ params)
- Only works with open-source models where you have logit access
- More complex training code than simple fine-tuning
Choosing Between API-Based and Logit-Based Distillation
Use API-based when:
- Teacher is a commercial API (GPT-4, Claude, Gemini)
- You don’t have GPUs to run large teacher models
- You need <10k examples and can afford the API cost
Use logit-based when:
- Both teacher and student are open-source models
- You have GPUs (A100s or H100s) to run the teacher during training
- You want maximum performance transfer
I recommend starting with API-based for commercial models — it’s simpler and cheaper upfront. If you’re distilling Llama 3 70B into Llama 3 8B, logit-based gives better results but needs more infrastructure.
Evaluating Your Distilled Model
Don’t trust vibes — measure quality on a held-out test set:
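As a sketch, one way to score a held-out set. `SequenceMatcher` is a cheap lexical stand-in used here so the example stays dependency-free; an embedding-based cosine similarity (e.g., via sentence-transformers) is a stronger measure for a real evaluation:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Lexical similarity in [0, 1]; swap in embedding cosine similarity for production."""
    return SequenceMatcher(None, a, b).ratio()

def average_similarity(student_outputs, reference_outputs):
    """Mean similarity between student outputs and held-out reference (teacher) outputs."""
    scores = [similarity(s, r) for s, r in zip(student_outputs, reference_outputs)]
    return sum(scores) / len(scores)
```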
Aim for >0.85 average similarity on your task. If you’re below that, add more training data or increase epochs.
Common Errors and Fixes
“Student model outputs are too generic”
- Increase temperature when generating teacher data (try 0.8-1.0)
- Add more diverse prompts — cover edge cases and unusual inputs
- Use top-p sampling instead of greedy decoding
“Fine-tuning fails with OOM errors”
- Reduce batch size to 1-2 and use gradient accumulation
- Use LoRA or QLoRA instead of full fine-tuning (saves 4-8x memory)
- Enable gradient checkpointing: `model.gradient_checkpointing_enable()`
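These fixes combine into one training config; a sketch assuming `peft` and `transformers` are installed, with illustrative hyperparameters:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter config: the attention projections are a common, memory-cheap target set.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

args = TrainingArguments(
    output_dir="distilled-student",
    per_device_train_batch_size=1,   # small per-device batch to avoid OOM...
    gradient_accumulation_steps=16,  # ...accumulated to an effective batch of 16
    gradient_checkpointing=True,     # same effect as gradient_checkpointing_enable()
)
```

Wrap your base model with `get_peft_model(model, lora)` before handing it to the trainer.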
“Student model hallucinates more than teacher”
- Add a task loss term (cross-entropy on ground truth) alongside distillation loss
- Lower the distillation temperature (try 1.5 instead of 2.0)
- Filter teacher outputs for quality before training — remove obvious errors
“Distillation doesn’t improve over baseline fine-tuning”
- You need 5-10x more distilled examples than traditional supervised data
- Check if teacher is actually better on your task (run evals first)
- Try ensembling multiple teacher models (GPT-4 + Claude) for better data
Cost Analysis: Is Distillation Worth It?
Quick math for a chatbot handling 1M queries/month:
- GPT-4: $30 per 1M input tokens = ~$30k/month at 1k tokens/query
- Distillation cost: $500 for 10k GPT-4 calls + $200 fine-tuning = $700 one-time
- GPT-3.5 Turbo (distilled): $2 per 1M tokens = ~$2k/month
- Savings: $28k/month after initial $700 investment
Payback in <1 day. Even if the distilled model only handles 70% of queries and falls back to GPT-4 for the rest, you still save $19k/month.
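The arithmetic above as a quick sanity-check script; the prices are the example figures from this section, not current list prices:

```python
QUERIES_PER_MONTH = 1_000_000
TOKENS_PER_QUERY = 1_000
GPT4_PRICE = 30.0     # $ per 1M tokens (example figure)
STUDENT_PRICE = 2.0   # $ per 1M tokens (example figure for the distilled student)

def monthly_cost(price_per_million, fraction=1.0):
    """Monthly token spend for the share of traffic a model handles."""
    tokens = QUERIES_PER_MONTH * TOKENS_PER_QUERY * fraction
    return tokens / 1_000_000 * price_per_million

all_gpt4 = monthly_cost(GPT4_PRICE)        # ~$30k/month
all_student = monthly_cost(STUDENT_PRICE)  # ~$2k/month
# 70% handled by the student, 30% falling back to GPT-4:
hybrid = monthly_cost(STUDENT_PRICE, 0.7) + monthly_cost(GPT4_PRICE, 0.3)

print(all_gpt4 - all_student)  # savings with a full cutover (~$28k/month)
print(all_gpt4 - hybrid)       # savings with a 30% fallback (~$19.6k/month)
```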
For open-source distillation (e.g., Llama 3 70B → 8B), the cost is mostly GPU time during training. If you’re already serving Llama 3 8B, the distilled version is free to deploy.
Related Guides
- How to Fine-Tune LLMs with LoRA and Unsloth
- How to Fine-Tune Embedding Models for Domain-Specific Search
- How to Fine-Tune LLMs on Custom Datasets with Axolotl
- How to Build Prefix Tuning for LLMs with PEFT and PyTorch
- How to Fine-Tune LLMs with DPO and RLHF
- How to Use Claude Sonnet 4.6’s 1M Token Context Window for Long-Document Reasoning
- How to Build Prompt Versioning and Regression Testing for LLMs
- How to Build a Knowledge Graph from Text with LLMs
- How to Write Effective System Prompts for LLMs
- How to Build Parallel Tool Calling Pipelines with LLMs