SetFit lets you train a production-grade text classifier with as few as 8 labeled examples per class. It works by fine-tuning a sentence transformer with contrastive learning, then fitting a classification head on top. No prompts, no massive LLM calls, no GPU cluster required.

Here’s the fastest path to a working classifier:

pip install setfit datasets scikit-learn
from setfit import SetFitModel

model = SetFitModel.from_pretrained("BAAI/bge-small-en-v1.5")
model.labels = ["negative", "positive"]

preds = model.predict(["This product is amazing!", "Terrible experience."])
print(preds)  # Before training, predictions will be random

That loads a sentence transformer backbone and wraps it in SetFit’s classification framework. The model won’t be accurate yet – you need to train it on your labeled data first.

Create a Few-Shot Training Dataset

SetFit shines when you have very little labeled data. You can build a training set with just 8 examples per class and get surprisingly strong results. Here’s a customer support ticket classifier with four categories:

from datasets import Dataset

train_data = {
    "text": [
        # billing (8 examples)
        "I was charged twice for my subscription",
        "Can you refund my last payment?",
        "My invoice shows the wrong amount",
        "How do I update my credit card on file?",
        "I want to cancel and get a prorated refund",
        "Why was I billed after canceling?",
        "The annual price went up without notice",
        "I need a receipt for my last three payments",
        # technical (8 examples)
        "The app crashes every time I open settings",
        "I can't log in after the latest update",
        "The export feature produces a blank CSV file",
        "API returns 500 errors on the /users endpoint",
        "Search results take over 30 seconds to load",
        "Two-factor authentication codes aren't arriving",
        "The dashboard won't render on Firefox 120",
        "Webhook deliveries are failing with timeouts",
        # account (8 examples)
        "How do I change my email address?",
        "I forgot my password and recovery isn't working",
        "Can I merge two accounts into one?",
        "I need to transfer ownership to another user",
        "How do I enable SSO for my organization?",
        "My account was locked after too many login attempts",
        "I want to delete my account and all my data",
        "Can I change my username without losing history?",
        # feature_request (8 examples)
        "It would be great to have dark mode",
        "Please add support for bulk CSV imports",
        "Can you add Slack integration for notifications?",
        "We need role-based access controls for teams",
        "An API endpoint for batch operations would help",
        "Would love to see a mobile app for iOS",
        "Can you add export to PDF for reports?",
        "Please support SAML authentication",
    ],
    "label": [0]*8 + [1]*8 + [2]*8 + [3]*8,
}

train_dataset = Dataset.from_dict(train_data)

label_names = ["billing", "technical", "account", "feature_request"]

Thirty-two examples total, eight per class. That’s the entire training set. With a traditional fine-tuned BERT model, you’d need hundreds or thousands of examples for decent performance.

Train the SetFit Model

Training happens in two phases. First, the sentence transformer body learns to push same-class examples closer together and different-class examples apart (contrastive learning). Then a logistic regression head fits on the resulting embeddings.
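The pairing idea behind the contrastive phase can be pictured with a toy sketch in plain Python. This mirrors the concept, not SetFit's actual sampling code: from a handful of labeled texts, same-label pairs become positives and cross-label pairs become negatives, and the sentence transformer is fine-tuned to embed positives close together and negatives far apart.

```python
from itertools import combinations

# Tiny labeled set: (text, label)
examples = [
    ("refund my payment", "billing"),
    ("invoice is wrong", "billing"),
    ("app crashes on login", "technical"),
    ("API returns 500", "technical"),
]

# Positive pairs share a label; negative pairs don't.
positives = [(a, b) for (a, la), (b, lb) in combinations(examples, 2) if la == lb]
negatives = [(a, b) for (a, la), (b, lb) in combinations(examples, 2) if la != lb]

print(len(positives), len(negatives))  # 2 positive pairs, 4 negative pairs
```

This pair expansion is why so few labels go so far: 8 examples per class yield far more training pairs than training sentences.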

from setfit import SetFitModel, Trainer, TrainingArguments

model = SetFitModel.from_pretrained(
    "BAAI/bge-small-en-v1.5",
    labels=label_names,
)

args = TrainingArguments(
    batch_size=16,
    num_epochs=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    column_mapping={"text": "text", "label": "label"},
)

trainer.train()

The column_mapping parameter tells the trainer which columns in your dataset correspond to the text input and label. If your dataset already uses text and label as column names, you can omit it – but being explicit prevents confusing errors down the road.

Training on 32 examples takes under a minute on a CPU. On a GPU, it’s seconds.

Run Inference and Evaluate

Once trained, prediction is a single method call:

test_texts = [
    "I got billed for a plan I never signed up for",
    "The sync feature keeps failing with error code 42",
    "How do I add another admin to my workspace?",
    "It would be nice to have calendar integration",
    "My payment method expired and I need to update it",
]

predictions = model.predict(test_texts)  # returns label strings, since labels were set on the model

for text, pred in zip(test_texts, predictions):
    print(f"{pred:<20}→ {text}")

Expected output:

billing              → I got billed for a plan I never signed up for
technical            → The sync feature keeps failing with error code 42
account              → How do I add another admin to my workspace?
feature_request      → It would be nice to have calendar integration
billing              → My payment method expired and I need to update it

To evaluate on a held-out test set, pass it to the trainer:

test_data = {
    "text": [
        "Refund my subscription charge",
        "App freezes on the reports page",
        "How do I reset my password?",
        "Add webhook support for events",
    ],
    "label": [0, 1, 2, 3],
}
test_dataset = Dataset.from_dict(test_data)

metrics = trainer.evaluate(test_dataset)
print(metrics)  # {'accuracy': 1.0} on this small test set

For a real evaluation, build a test set of 50-100 examples that the model never saw during training. Track accuracy, precision, recall, and F1.
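As a concrete illustration of those metrics, here is per-class precision, recall, and F1 computed by hand on hypothetical predictions (in practice you would typically reach for scikit-learn's classification_report instead):

```python
# Hypothetical true labels and model predictions for illustration
y_true = ["billing", "technical", "account", "billing", "technical"]
y_pred = ["billing", "technical", "billing", "billing", "technical"]

def prf(label, y_true, y_pred):
    """Precision, recall, and F1 for one class, one-vs-rest."""
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

for label in ["billing", "technical", "account"]:
    p, r, f1 = prf(label, y_true, y_pred)
    print(f"{label:<12} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Note how "account" scores zero across the board here: its one example was misclassified as "billing", which accuracy alone would hide in the aggregate.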

Push to Hugging Face Hub

Save your trained model locally or push it to the Hub so your team can load it in one line:

# Save locally
model.save_pretrained("setfit-support-ticket-classifier")

# Push to Hub (requires `huggingface-cli login` first)
model.push_to_hub("your-username/setfit-support-ticket-classifier")

# Load from Hub anywhere
loaded_model = SetFitModel.from_pretrained("your-username/setfit-support-ticket-classifier")
preds = loaded_model.predict(["I need a refund"])

The saved model includes both the fine-tuned sentence transformer body and the classification head. Total size is typically 50-130 MB depending on the backbone model.

Why SetFit Over Zero-Shot LLMs

Zero-shot classification with GPT-4 or Claude is convenient for prototyping, but SetFit wins for production workloads:

  • Speed: SetFit inference runs in single-digit milliseconds on CPU. An LLM API call takes 500ms-2s.
  • Cost: A fine-tuned SetFit model runs on a $5/month server. LLM API calls at scale cost orders of magnitude more.
  • Consistency: The same input always produces the same output. No temperature variance, no prompt sensitivity.
  • Privacy: Your data never leaves your infrastructure. No third-party API involved.
  • Accuracy: With even 8 examples per class, SetFit typically matches or beats zero-shot LLM performance on domain-specific tasks.

The tradeoff is upfront labeling effort. If you have zero labeled examples and need a quick prototype, start with zero-shot. Once you have 8+ examples per class, switch to SetFit.

Common Errors and Fixes

ValueError: A column mapping must be provided when the dataset does not contain the following columns: {'text', 'label'}

Your dataset columns don’t match what the trainer expects. Fix it with column_mapping:

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    column_mapping={"sentence": "text", "category": "label"},
)

Map your actual column names (left side) to what SetFit expects (right side).

RuntimeError: CUDA out of memory

Lower the batch size in your training arguments:

args = TrainingArguments(
    batch_size=8,  # reduce from 16 or 32
    num_epochs=10,
)

You can also switch to a smaller backbone. BAAI/bge-small-en-v1.5 (33M parameters) uses much less memory than sentence-transformers/all-mpnet-base-v2 (110M parameters).

ValueError: not enough values to unpack during training

This usually means your dataset has mismatched lengths between the text and label columns. Double-check that both arrays have the same number of elements:

assert len(train_data["text"]) == len(train_data["label"]), "Text and label counts must match"

Model predicts the same class for every input

You likely need more training epochs or more examples. Try bumping num_epochs to 20, or add 4-8 more examples per class. Also check that your training labels are actually balanced – if one class has 16 examples and another has 2, the model will be biased toward the larger class.
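A quick way to check class balance, using only the standard library (the label layout here matches the 8-per-class training set built earlier):

```python
from collections import Counter

labels = [0]*8 + [1]*8 + [2]*8 + [3]*8  # same layout as the training set above
counts = Counter(labels)
print(counts)  # Counter({0: 8, 1: 8, 2: 8, 3: 8})

# Flag any class that falls well below the average count
mean = sum(counts.values()) / len(counts)
for label, n in counts.items():
    if n < mean / 2:
        print(f"Warning: class {label} has only {n} examples")
```

Run this before training; a lopsided Counter is the most common cause of a model that collapses onto one class.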