SetFit lets you train a production-grade text classifier with as few as 8 labeled examples per class. It works by fine-tuning a sentence transformer with contrastive learning, then fitting a classification head on top. No prompts, no massive LLM calls, no GPU cluster required.
Here’s the fastest path to a working classifier:
That loads a sentence transformer backbone and wraps it in SetFit’s classification framework. The model won’t be accurate yet – you need to train it on your labeled data first.
Create a Few-Shot Training Dataset
SetFit shines when you have very little labeled data. You can build a training set with just 8 examples per class and get surprisingly strong results. Here’s a customer support ticket classifier with four categories:
Thirty-two examples total, eight per class. That’s the entire training set. With a traditional fine-tuned BERT model, you’d need hundreds or thousands of examples for decent performance.
Train the SetFit Model
Training happens in two phases. First, the sentence transformer body learns to push same-class examples closer together and different-class examples apart (contrastive learning). Then a logistic regression head fits on the resulting embeddings.
The column_mapping parameter tells the trainer which columns in your dataset correspond to the text input and label. If your dataset already uses text and label as column names, you can omit it – but being explicit prevents confusing errors down the road.
Training on 32 examples takes under a minute on a CPU. On a GPU, it’s seconds.
Run Inference and Evaluate
Once trained, prediction is a single method call:
To evaluate on a held-out test set, pass it to the trainer:
For a real evaluation, build a test set of 50-100 examples that the model never saw during training. Track accuracy, precision, recall, and F1.
Push to Hugging Face Hub
Save your trained model locally or push it to the Hub so your team can load it in one line:
The saved model includes both the fine-tuned sentence transformer body and the classification head. Total size is typically 50-130 MB depending on the backbone model.
Why SetFit Over Zero-Shot LLMs
Zero-shot classification with GPT-4 or Claude is convenient for prototyping, but SetFit wins for production workloads:
- Speed: SetFit inference runs in single-digit milliseconds on CPU. An LLM API call takes 500ms-2s.
- Cost: A fine-tuned SetFit model runs on a $5/month server. LLM API calls at scale cost orders of magnitude more.
- Consistency: The same input always produces the same output. No temperature variance, no prompt sensitivity.
- Privacy: Your data never leaves your infrastructure. No third-party API involved.
- Accuracy: Even with just 8 examples per class, SetFit typically matches or beats zero-shot LLM performance on domain-specific tasks.
The tradeoff is upfront labeling effort. If you have zero labeled examples and need a quick prototype, start with zero-shot. Once you have 8+ examples per class, switch to SetFit.
Common Errors and Fixes
ValueError: A column mapping must be provided when the dataset does not contain the following columns: {'text', 'label'}
Your dataset columns don’t match what the trainer expects. Fix it with column_mapping:
Map your actual column names (left side) to what SetFit expects (right side).
RuntimeError: CUDA out of memory
Lower the batch size in your training arguments:
You can also switch to a smaller backbone. BAAI/bge-small-en-v1.5 (33M parameters) uses much less memory than sentence-transformers/all-mpnet-base-v2 (110M parameters).
ValueError: not enough values to unpack during training
This usually means your dataset has mismatched lengths between the text and label columns. Double-check that both arrays have the same number of elements:
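A quick sanity check before building the dataset (the texts and labels here are hypothetical stand-ins for your own):

```python
texts = ["Refund request", "App keeps crashing", "Reset my password"]
labels = ["billing", "technical_issue", "account"]

# Both columns must be the same length, one label per text.
assert len(texts) == len(labels), (
    f"Mismatched lengths: {len(texts)} texts vs {len(labels)} labels"
)
```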
Model predicts the same class for every input
You likely need more training epochs or more examples. Try bumping num_epochs to 20, or add 4-8 more examples per class. Also check that your training labels are actually balanced – if one class has 16 examples and another has 2, the model will be biased toward the larger class.
Related Guides
- How to Build a Named Entity Recognition Pipeline with spaCy and Transformers
- How to Build a Sentiment Analysis API with Transformers and FastAPI
- How to Classify Text with Zero-Shot and Few-Shot LLMs
- How to Build a Text-to-SQL Pipeline with LLMs
- How to Build a Language Detection and Translation Pipeline
- How to Build a Spell Checking and Autocorrect Pipeline with Python
- How to Build a Text Correction and Grammar Checking Pipeline
- How to Build a Text Similarity API with Cross-Encoders
- How to Build a Text Embedding Pipeline with Sentence Transformers and FAISS
- How to Build a Document Chunking Strategy Comparison Pipeline