Prefix tuning prepends a sequence of trainable continuous vectors – called “virtual tokens” – to the attention computation at every transformer layer. The base model weights stay frozen; only the prefix parameters update during training, which means you’re training less than 0.1% of the total parameters. The result: memory savings of 90% or more compared to full fine-tuning, and you can store dozens of task-specific adapters as tiny checkpoint files alongside a single base model.
Unlike LoRA, which modifies weight matrices with low-rank decompositions, prefix tuning operates in the activation space. It learns task-specific context that steers the model’s attention patterns without touching any existing parameters. This makes it particularly effective for generation tasks where you want to condition the model on a specific style or domain.
Setting Up PEFT for Prefix Tuning
Install the required packages:
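The exact package set depends on your training setup; a typical baseline (package names as commonly used, versions unpinned) is:

```shell
pip install peft transformers datasets accelerate
```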
Now configure PrefixTuningConfig and wrap a model. We’ll use GPT-2 here because it runs on any GPU, but this works the same way with Llama, Mistral, or any causal LM in the Hugging Face ecosystem.
Key parameters in PrefixTuningConfig:
- num_virtual_tokens: The number of prefix tokens prepended at each layer. 20 is a solid starting point. More tokens give the model more capacity to steer behavior, but beyond 50 you hit diminishing returns and slower inference.
- encoder_hidden_size: Set this to the model’s hidden dimension. For GPT-2 it’s 768; for Llama-2-7B it’s 4096.
- prefix_projection: When True, PEFT uses a 2-layer MLP to generate the prefix embeddings instead of optimizing them directly. This stabilizes training significantly – always leave it on.
Training a Prefix-Tuned Model
We’ll fine-tune on a subset of the dair-ai/emotion dataset for text generation conditioned on emotion labels. This is a real dataset with 6 emotion categories.
The saved adapter is tiny – typically under 5MB. The entire base model stays untouched.
A few training notes worth calling out. The learning rate of 3e-4 is higher than what you’d use for LoRA (2e-4). Prefix tuning benefits from slightly more aggressive learning rates because you’re optimizing far fewer parameters. If you see the loss spike or oscillate, drop it to 1e-4.
Running Inference with a Prefix-Tuned Model
Load the saved adapter and generate text:
You can swap adapters at runtime without reloading the base model. This is one of the big practical wins of parameter-efficient methods – serve one base model, keep a library of adapters for different tasks, and load the right one per request.
Comparing Prefix Tuning to LoRA
Both are PEFT methods, but they work differently and have distinct sweet spots.
Prefix tuning prepends virtual tokens to the key-value pairs in attention. It’s best when you want to steer generation style or condition on task context without altering the model’s internal representations. It trains fewer parameters (often 10-100x fewer than LoRA) and produces smaller adapter files. The downside: it adds latency proportional to num_virtual_tokens because the model processes those extra tokens at every layer during inference.
LoRA injects trainable low-rank matrices into the attention and MLP weight matrices. It modifies how the model computes representations at each layer. LoRA is more expressive for the same parameter count and generally achieves better results on complex tasks like code generation or reasoning. It also adds zero inference latency because you can merge the adapter weights into the base model.
When to use which:
- Pick prefix tuning when you need minimal adapter size, have limited GPU memory during training, or want a quick task-specific conditioner for generation.
- Pick LoRA when you need the best possible quality, are fine-tuning for complex tasks, or want zero-overhead inference after merging.
- Both beat full fine-tuning on cost and flexibility. You can always start with prefix tuning as a fast experiment and switch to LoRA if you need more capacity.
Common Errors and Fixes
ValueError: encoder_hidden_size must match the model's hidden size
You set encoder_hidden_size to the wrong value. Check the model config:
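A quick way to read the hidden size from the model config (shown for GPT-2; swap in your model name):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("gpt2")
print(config.hidden_size)  # GPT-2 reports 768
```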
Use whatever this prints as your encoder_hidden_size.
RuntimeError: CUDA out of memory
Even though prefix tuning is memory-efficient, you can still OOM on large models. Reduce batch size first, then reduce num_virtual_tokens:
RuntimeError: expected scalar type Half but found Float
Mixed precision mismatch. Either cast the model explicitly or use torch.autocast:
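Both fixes, sketched for a CUDA setup (the `enabled` flag is only there so the snippet also runs on CPU):

```python
import torch

# Option 1: cast the whole model to a single dtype so every tensor
# the forward pass sees agrees.
# model = model.half()    # everything float16
# model = model.float()   # or everything float32

# Option 2: let autocast choose the dtype per operation.
x = torch.ones(2, 2)
with torch.autocast(device_type="cuda", dtype=torch.float16,
                    enabled=torch.cuda.is_available()):
    y = x @ x  # float16 under autocast on GPU, float32 otherwise
```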
KeyError: 'past_key_values' during generation
Some older versions of PEFT have bugs with prefix tuning and generate(). Update PEFT:
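Upgrading to the latest release usually resolves it:

```shell
pip install -U peft
```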
Training loss doesn’t decrease
Check that prefix_projection=True is set. Without the projection MLP, the prefix embeddings are harder to optimize and training can stall, especially on larger models. Also verify your learning rate isn’t too low – prefix tuning can tolerate 3e-4 to 1e-3 depending on the model size.
Related Guides
- How to Fine-Tune LLMs with LoRA and Unsloth
- How to Build a Knowledge Graph from Text with LLMs
- How to Fine-Tune LLMs with DPO and RLHF
- How to Fine-Tune LLMs on Custom Datasets with Axolotl
- How to Build Prompt Versioning and Regression Testing for LLMs
- How to Distill Large LLMs into Smaller, Cheaper Models
- How to Build Structured Output Parsers with Pydantic and LLMs
- How to Build Multi-Turn Chatbots with Conversation Memory
- How to Build RAG Applications with LangChain and ChromaDB
- How to Build Parallel Tool Calling Pipelines with LLMs