You have a pile of text – support tickets, reviews, user feedback, survey responses – and you need to sort it into categories. The traditional approach is to collect labeled data, train a classifier, tune hyperparameters, and maintain a model. That works, but it takes weeks.
With LLMs, you can classify text in minutes. Zero-shot classification sends text to the model with a list of labels and no examples. Few-shot adds a handful of labeled examples to the prompt. Both approaches skip the entire training pipeline, and for many use cases they are accurate enough to ship to production.
Zero-Shot Classification with OpenAI
The fastest path from nothing to a working classifier. Install the SDK and make a single API call:
That is the entire classifier. The Literal type constrains the model to only output one of your valid labels – it cannot hallucinate a category that does not exist. The Pydantic model guarantees you get structured JSON back, every time.
gpt-4o-mini is the right default here. It is fast, cheap ($0.15 per million input tokens), and handles straightforward classification well. Save gpt-4o for ambiguous cases or when you need chain-of-thought reasoning.
Few-Shot Classification with Examples
Zero-shot works surprisingly well, but it struggles with nuanced distinctions. If your “billing” and “account” categories overlap, the model guesses. Few-shot fixes this by showing the model exactly what each category looks like.
The best place to put examples is in the system prompt or as the Pydantic model docstring. Instructor (the library) popularized the docstring approach, and it works with the native OpenAI SDK too:
Two things matter here. First, the chain_of_thought field forces the model to reason before classifying, which bumps accuracy by roughly 10% on ambiguous inputs. Second, the examples in the docstring act as few-shot demonstrations – two per category is enough for most tasks, though you can go up to 10 per category before the prompt gets unwieldy.
Picking Good Few-Shot Examples
Do not pick the easiest examples. Pick the ones that sit near category boundaries. If “I need a refund” could be either billing or account, include it as a billing example. The model learns from edge cases, not obvious ones.
Keep examples balanced – the same number per category. If you give 5 billing examples and 1 technical example, the model develops a bias toward billing.
Using Instructor for Multi-Label Classification
When a single text can belong to multiple categories, you need multi-label classification. The instructor library makes this clean:
Instructor wraps the OpenAI client and adds automatic retries when Pydantic validation fails. If the model returns an invalid label, instructor sends the validation error back and asks the model to fix it. You get self-correcting classification without writing retry logic.
Batch Classification for Large Datasets
Classifying one ticket at a time works for real-time flows. For backfills or bulk processing, you need concurrency. Use asyncio with the AsyncOpenAI client:
The Semaphore limits concurrent requests to avoid hitting OpenAI’s rate limits. Start with 20 and increase if you have a higher tier. For very large datasets (50,000+ items), use OpenAI’s Batch API instead, which gives a 50% cost discount and processes asynchronously within 24 hours.
When to Use Zero-Shot vs. Few-Shot vs. Fine-Tuning
This is the decision that actually matters in production:
Zero-shot is the right starting point when you have no labeled data and need results today. It handles broad, well-defined categories (sentiment, language detection, topic classification) with 80-90% accuracy out of the box. If your categories are intuitive – things a human could label without examples – zero-shot is often good enough.
Few-shot is worth the effort when zero-shot accuracy falls below your threshold or when your categories have subtle distinctions. Adding 2-5 examples per class typically closes a 5-15% accuracy gap. The cost is minimal (a few hundred extra tokens per request) and the payoff is real.
Fine-tuning still wins on accuracy. Research consistently shows fine-tuned small models outperform zero-shot GPT-4 class models by 10-25 percentage points on benchmarks like BANKING77 (77-class intent classification). Fine-tune when you have 100+ labeled examples per class, need <50ms latency, or are classifying millions of items daily where per-token costs add up.
| Approach | Accuracy (typical) | Labeled Data Needed | Cost per 1K items | Latency |
|---|---|---|---|---|
| Zero-shot (gpt-4o-mini) | 80-90% | None | ~$0.02 | ~500ms |
| Few-shot (gpt-4o-mini) | 85-95% | 2-10 per class | ~$0.04 | ~600ms |
| Fine-tuned (small model) | 90-98% | 100+ per class | ~$0.005 | ~50ms |
My recommendation: start zero-shot, evaluate on 50-100 hand-labeled examples, and only escalate to few-shot or fine-tuning if your accuracy is not meeting requirements. Most classification tasks do not need fine-tuning.
Cost Optimization Strategies
LLM classification adds up fast at scale. Here are the levers that make the biggest difference:
Use gpt-4o-mini aggressively. For classification, it matches gpt-4o accuracy at 10x lower cost. Only escalate to the larger model for genuinely ambiguous cases.
Structure prompts for caching. OpenAI automatically caches prompt prefixes longer than 1,024 tokens. Put your static system prompt and few-shot examples at the beginning, and the variable input text last. Cached tokens cost 50% less.
Batch similar requests. Instead of classifying one text per API call, group 5-10 items in a single prompt:
This reduces overhead tokens (system prompt, examples) from N calls down to 1. For a 10-item batch, you cut costs by roughly 60%.
Use the Batch API for offline work. Upload a JSONL file of requests and get results within 24 hours at 50% off. Perfect for daily classification jobs or backfills.
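A sketch of building that JSONL file for the `/v1/chat/completions` endpoint; the `custom_id` scheme, prompt, and label set are illustrative:

```python
import json

def build_batch_file(texts: list[str], path: str = "batch_input.jsonl") -> None:
    """Write one Batch API request per line."""
    with open(path, "w") as f:
        for i, text in enumerate(texts):
            request = {
                "custom_id": f"ticket-{i}",  # used to match results back to inputs
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "messages": [
                        {"role": "system", "content": "Reply with exactly one label: billing, technical, account, or general."},
                        {"role": "user", "content": text},
                    ],
                },
            }
            f.write(json.dumps(request) + "\n")

# Then upload the file and submit the batch:
# client = OpenAI()
# batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
# client.batches.create(input_file_id=batch_file.id,
#                       endpoint="/v1/chat/completions", completion_window="24h")
```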
Common Errors and Fixes
openai.BadRequestError: Invalid schema
This happens when your Pydantic model has nested objects without additionalProperties: false. The OpenAI SDK handles this automatically for top-level models, but nested models can trigger it. Fix by simplifying your schema or using model_config = ConfigDict(json_schema_extra={"additionalProperties": False}) on nested models.
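A hypothetical nested model showing the fix:

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict

class CategoryDetail(BaseModel):
    # Nested models need additionalProperties: false in the generated schema
    model_config = ConfigDict(json_schema_extra={"additionalProperties": False})

    label: Literal["billing", "technical"]
    confidence: Literal["high", "medium", "low"]

class TicketClassification(BaseModel):
    primary: CategoryDetail
```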
openai.RateLimitError: Rate limit reached
You are sending too many tokens per minute. Reduce concurrency in your async code, add exponential backoff, or request a rate limit increase from OpenAI.
pydantic_core._pydantic_core.ValidationError: label
The model returned “Billing” instead of “billing”. Structured outputs should prevent this, but it happens occasionally with older model versions. Fix by normalizing in a validator:
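A sketch of that validator, assuming Pydantic v2; `mode="before"` runs the normalization before the `Literal` check, so "Billing" passes instead of failing validation:

```python
from typing import Literal

from pydantic import BaseModel, field_validator

class TicketClassification(BaseModel):
    label: Literal["billing", "technical", "account", "general"]

    @field_validator("label", mode="before")
    @classmethod
    def normalize_label(cls, v):
        # Lowercase and strip before the Literal constraint is enforced
        return v.strip().lower() if isinstance(v, str) else v
```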
Empty or None output from output_parsed
This means the response was either truncated (hit the output token limit) or the model refused. Check the response status:
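A small helper sketching that check, assuming the Responses API's `status` and `incomplete_details` fields (the helper name is illustrative):

```python
def explain_empty_parse(response) -> str:
    """Explain why `output_parsed` came back as None."""
    if response.status == "incomplete" and response.incomplete_details:
        # e.g. reason == "max_output_tokens" when the response was truncated
        return f"truncated: {response.incomplete_details.reason}"
    return "model refused or returned no parseable output"

# if response.output_parsed is None:
#     print(explain_empty_parse(response))
```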
For classification tasks, truncation almost never happens (responses are tiny). If you hit this, the input likely triggered content filtering.
Related Guides
- How to Build a Text-to-SQL Pipeline with LLMs
- How to Extract Structured Data from PDFs with LLMs
- How to Build a Text Correction and Grammar Checking Pipeline
- How to Summarize Long Documents with LLMs and Map-Reduce
- How to Build a Text Classification Pipeline with SetFit
- How to Build a Text Style Transfer Pipeline with Transformers
- How to Build a Text Similarity API with Cross-Encoders
- How to Build a Text Embedding Pipeline with Sentence Transformers and FAISS
- How to Build a Text Readability Scoring Pipeline with Python
- How to Build a Named Entity Recognition Pipeline with spaCy and Transformers