You have a pile of text – support tickets, reviews, user feedback, survey responses – and you need to sort it into categories. The traditional approach is to collect labeled data, train a classifier, tune hyperparameters, and maintain a model. That works, but it takes weeks.

With LLMs, you can classify text in minutes. Zero-shot classification sends text to the model with a list of labels and no examples. Few-shot adds a handful of labeled examples to the prompt. Both approaches skip the entire training pipeline, and for many use cases they are accurate enough to ship to production.

Zero-Shot Classification with OpenAI

The fastest path from nothing to a working classifier. Install the SDK and make a single API call:

pip install openai pydantic
from openai import OpenAI
from pydantic import BaseModel
from typing import Literal

client = OpenAI()

class Classification(BaseModel):
    label: Literal["billing", "technical", "account", "general"]
    confidence: float

response = client.responses.parse(
    model="gpt-4o-mini",
    input=[
        {
            "role": "system",
            "content": (
                "Classify the support ticket into exactly one category. "
                "Return a confidence score between 0 and 1."
            ),
        },
        {
            "role": "user",
            "content": "I was charged twice for my subscription last month",
        },
    ],
    text_format=Classification,
)

result = response.output_parsed
print(f"{result.label} (confidence: {result.confidence})")
# billing (confidence: 0.95)

That is the entire classifier. The Literal type constrains the model to only output one of your valid labels – it cannot hallucinate a category that does not exist. The Pydantic model guarantees you get structured JSON back, every time.

gpt-4o-mini is the right default here. It is fast, cheap ($0.15 per million input tokens), and handles straightforward classification well. Save gpt-4o for ambiguous cases or when you need chain-of-thought reasoning.
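One way to wire that up is confidence-based escalation: classify with gpt-4o-mini first and re-run only low-confidence results through gpt-4o. A minimal sketch of the routing logic (the `classify_with_escalation` helper and the 0.7 threshold are my own, not an SDK feature; the classifier callables stand in for API calls like the one above):

```python
from pydantic import BaseModel
from typing import Callable, Literal

class Classification(BaseModel):
    label: Literal["billing", "technical", "account", "general"]
    confidence: float

# A classifier is anything that maps text to a Classification,
# e.g. a closure around client.responses.parse for a given model.
Classifier = Callable[[str], Classification]

def classify_with_escalation(
    text: str,
    cheap: Classifier,       # e.g. a gpt-4o-mini call
    strong: Classifier,      # e.g. the same call against gpt-4o
    threshold: float = 0.7,  # arbitrary cutoff; tune on labeled data
) -> Classification:
    result = cheap(text)
    if result.confidence < threshold:
        # The cheap model is unsure -- pay for the stronger model
        result = strong(text)
    return result
```

Most tickets never touch the expensive model, so the blended cost stays close to gpt-4o-mini pricing.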

Few-Shot Classification with Examples

Zero-shot works surprisingly well, but it struggles with nuanced distinctions. If your “billing” and “account” categories overlap, the model guesses. Few-shot fixes this by showing the model exactly what each category looks like.

The best place to put examples is in the system prompt or in the Pydantic model's docstring. Instructor (the library) popularized the docstring approach, and it works with the native OpenAI SDK too:

from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Literal

client = OpenAI()

class TicketClassification(BaseModel):
    """
    Classify support tickets into categories.

    Examples:
    - "I was charged twice this month" -> billing
    - "My payment method was declined" -> billing
    - "The app crashes when I open settings" -> technical
    - "Search results are not loading" -> technical
    - "I need to change my email address" -> account
    - "How do I reset my password?" -> account
    - "What are your business hours?" -> general
    - "Do you offer student discounts?" -> general
    """

    chain_of_thought: str = Field(
        description="Brief reasoning for the classification"
    )
    label: Literal["billing", "technical", "account", "general"]
    confidence: float = Field(ge=0.0, le=1.0)

response = client.responses.parse(
    model="gpt-4o-mini",
    input=[
        {
            "role": "user",
            "content": "I keep getting logged out every 5 minutes on mobile",
        },
    ],
    text_format=TicketClassification,
)

result = response.output_parsed
print(f"Reasoning: {result.chain_of_thought}")
print(f"Label: {result.label} ({result.confidence})")
# Reasoning: The user is experiencing a session timeout issue on mobile, which is a software bug.
# Label: technical (0.92)

Two things matter here. First, the chain_of_thought field forces the model to reason before classifying, which bumps accuracy by roughly 10% on ambiguous inputs. Second, the examples in the docstring act as few-shot demonstrations – two per category is enough for most tasks, though you can go up to 10 per category before the prompt gets unwieldy.

Picking Good Few-Shot Examples

Do not pick the easiest examples. Pick the ones that sit near category boundaries. If “I need a refund” could be either billing or account, include it as a billing example. The model learns from edge cases, not obvious ones.

Keep examples balanced – the same number per category. If you give 5 billing examples and 1 technical example, the model develops a bias toward billing.

Using Instructor for Multi-Label Classification

When a single text can belong to multiple categories, you need multi-label classification. The instructor library makes this clean:

pip install instructor openai
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Literal

client = instructor.from_openai(OpenAI())

class MultiLabelClassification(BaseModel):
    """
    Assign one or more labels to the support ticket.

    Examples:
    - "I was charged twice and can't log in" -> ["billing", "technical"]
    - "App crashes on the payment page" -> ["technical", "billing"]
    - "How do I update my card on file?" -> ["billing", "account"]
    """

    labels: list[Literal["billing", "technical", "account", "general"]] = Field(
        min_length=1, max_length=3
    )
    reasoning: str

result = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=MultiLabelClassification,
    messages=[
        {
            "role": "user",
            "content": "The checkout page froze and now I see a duplicate charge",
        },
    ],
)

print(f"Labels: {result.labels}")
print(f"Reasoning: {result.reasoning}")
# Labels: ['technical', 'billing']
# Reasoning: The checkout page freezing is a technical issue, and the duplicate charge is a billing concern.

Instructor wraps the OpenAI client and adds automatic retries when Pydantic validation fails. If the model returns an invalid label, instructor sends the validation error back and asks the model to fix it. You get self-correcting classification without writing retry logic.

Batch Classification for Large Datasets

Classifying one ticket at a time works for real-time flows. For backfills or bulk processing, you need concurrency. Use asyncio with the AsyncOpenAI client:

import asyncio
from openai import AsyncOpenAI
from pydantic import BaseModel
from typing import Literal

client = AsyncOpenAI()

class Classification(BaseModel):
    label: Literal["positive", "negative", "neutral"]

async def classify_one(text: str, semaphore: asyncio.Semaphore) -> dict:
    async with semaphore:
        response = await client.responses.parse(
            model="gpt-4o-mini",
            input=[{"role": "user", "content": f"Classify sentiment: {text}"}],
            text_format=Classification,
        )
        return {"text": text, "label": response.output_parsed.label}

async def classify_batch(texts: list[str], max_concurrent: int = 20) -> list[dict]:
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [classify_one(text, semaphore) for text in texts]
    return await asyncio.gather(*tasks)

# Usage
reviews = [
    "Absolutely love this product",
    "Broke after two days",
    "It's fine, nothing special",
    "Best purchase I made this year",
    "Terrible customer support",
]

results = asyncio.run(classify_batch(reviews))
for r in results:
    print(f"{r['label']:>10} | {r['text']}")

The Semaphore limits concurrent requests to avoid hitting OpenAI’s rate limits. Start with 20 and increase if you have a higher tier. For very large datasets (50,000+ items), use OpenAI’s Batch API instead, which gives a 50% cost discount and processes asynchronously within 24 hours.

When to Use Zero-Shot vs. Few-Shot vs. Fine-Tuning

This is the decision that actually matters in production:

Zero-shot is the right starting point when you have no labeled data and need results today. It handles broad, well-defined categories (sentiment, language detection, topic classification) with 80-90% accuracy out of the box. If your categories are intuitive – things a human could label without examples – zero-shot is often good enough.

Few-shot is worth the effort when zero-shot accuracy falls below your threshold or when your categories have subtle distinctions. Adding 2-5 examples per class typically closes a 5-15% accuracy gap. The cost is minimal (a few hundred extra tokens per request) and the payoff is real.

Fine-tuning still wins on accuracy. Research consistently shows fine-tuned small models outperform zero-shot GPT-4 class models by 10-25 percentage points on benchmarks like BANKING77 (77-class intent classification). Fine-tune when you have 100+ labeled examples per class, need <50ms latency, or are classifying millions of items daily where per-token costs add up.

| Approach | Accuracy (typical) | Labeled data needed | Cost per 1K items | Latency |
|---|---|---|---|---|
| Zero-shot (gpt-4o-mini) | 80-90% | None | ~$0.02 | ~500ms |
| Few-shot (gpt-4o-mini) | 85-95% | 2-10 per class | ~$0.04 | ~600ms |
| Fine-tuned (small model) | 90-98% | 100+ per class | ~$0.005 | ~50ms |

My recommendation: start zero-shot, evaluate on 50-100 hand-labeled examples, and only escalate to few-shot or fine-tuning if your accuracy is not meeting requirements. Most classification tasks do not need fine-tuning.
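The evaluation itself is a few lines. A sketch, where `classify` wraps whichever approach you are testing and `examples` are your hand-labeled pairs:

```python
from typing import Callable

def accuracy(
    examples: list[tuple[str, str]],  # (text, gold_label) pairs
    classify: Callable[[str], str],   # wraps a zero-shot or few-shot call
) -> float:
    # Fraction of examples where the predicted label matches the gold label
    correct = sum(1 for text, gold in examples if classify(text) == gold)
    return correct / len(examples)
```

Run it on the same held-out set every time you change the prompt, so numbers stay comparable across zero-shot, few-shot, and model upgrades.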

Cost Optimization Strategies

LLM classification adds up fast at scale. Here are the levers that make the biggest difference:

Use gpt-4o-mini aggressively. For classification, it matches gpt-4o accuracy at 10x lower cost. Only escalate to the larger model for genuinely ambiguous cases.

Structure prompts for caching. OpenAI automatically caches prompt prefixes longer than 1,024 tokens. Put your static system prompt and few-shot examples at the beginning, and the variable input text last. Cached tokens cost 50% less.
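In practice this just means fixing the message order. A sketch (the `build_messages` helper is mine, not an SDK function; caching only kicks in once the shared prefix exceeds 1,024 tokens):

```python
SYSTEM_PROMPT = (
    "Classify the support ticket into exactly one category.\n"
    "Examples:\n"
    '- "I was charged twice this month" -> billing\n'
    '- "The app crashes when I open settings" -> technical\n'
    # ...the full static example list goes here, identical on every call
)

def build_messages(ticket_text: str) -> list[dict]:
    # Static, cacheable prefix first; the per-request text last
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": ticket_text},
    ]
```

The key property is that the prefix is byte-identical across requests; any variation at the front of the prompt defeats the cache.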

Batch similar requests. Instead of classifying one text per API call, group 5-10 items in a single prompt:

from openai import OpenAI
from pydantic import BaseModel
from typing import Literal

client = OpenAI()

class ItemClassification(BaseModel):
    index: int
    label: Literal["positive", "negative", "neutral"]

class BatchResult(BaseModel):
    # A free-form dict would be rejected by structured outputs;
    # use a nested model with explicit fields instead.
    classifications: list[ItemClassification]

texts = [
    "Love the new update!",
    "This is broken",
    "When do you ship to Canada?",
]

prompt = "Classify each numbered text as positive, negative, or neutral:\n"
for i, t in enumerate(texts):
    prompt += f"{i + 1}. {t}\n"

response = client.responses.parse(
    model="gpt-4o-mini",
    input=[{"role": "user", "content": prompt}],
    text_format=BatchResult,
)

This reduces overhead tokens (system prompt, examples) from N calls down to 1. For a 10-item batch, you cut costs by roughly 60%.

Use the Batch API for offline work. Upload a JSONL file of requests and get results within 24 hours at 50% off. Perfect for daily classification jobs or backfills.
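Each line of that JSONL file is one request, with a custom_id you use to match results back. A sketch of building the input file (the filename is arbitrary; uploading and job creation are left as comments since they need a live client):

```python
import json

def batch_line(custom_id: str, text: str) -> str:
    # One Batch API request per JSONL line
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Classify sentiment as positive, negative, or neutral."},
                {"role": "user", "content": text},
            ],
        },
    })

reviews = ["Absolutely love this product", "Broke after two days"]
with open("batch_input.jsonl", "w") as f:
    for i, text in enumerate(reviews):
        f.write(batch_line(f"review-{i}", text) + "\n")

# Then: upload with client.files.create(file=..., purpose="batch") and start
# the job with client.batches.create(input_file_id=..., endpoint="/v1/chat/completions",
# completion_window="24h"); poll the batch until its status is "completed".
```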

Common Errors and Fixes

openai.BadRequestError: Invalid schema

openai.BadRequestError: Invalid schema for response_format
'Classification': In context=(), 'additionalProperties' is required to be
supplied and to be false

This happens when your Pydantic model has nested objects without additionalProperties: false. The OpenAI SDK handles this automatically for top-level models, but nested models can trigger it. Fix by simplifying your schema or using model_config = ConfigDict(json_schema_extra={"additionalProperties": False}) on nested models.
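For example, with a hypothetical nested Address model:

```python
from pydantic import BaseModel, ConfigDict

class Address(BaseModel):
    # Force additionalProperties: false into the nested model's JSON schema
    model_config = ConfigDict(json_schema_extra={"additionalProperties": False})

    city: str
    country: str

class Customer(BaseModel):
    name: str
    address: Address
```

You can verify the schema before sending it by inspecting `Customer.model_json_schema()`.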

openai.RateLimitError: Rate limit reached

openai.RateLimitError: Error code: 429 - Rate limit reached for
gpt-4o-mini in organization org-xxx on tokens per min (TPM)

You are sending too many tokens per minute. Reduce concurrency in your async code, add exponential backoff, or request a rate limit increase from OpenAI:

import asyncio
from tenacity import retry, wait_exponential, stop_after_attempt

@retry(wait=wait_exponential(min=1, max=60), stop=stop_after_attempt(5))
async def classify_with_retry(text: str):
    response = await client.responses.parse(
        model="gpt-4o-mini",
        input=[{"role": "user", "content": f"Classify: {text}"}],
        text_format=Classification,
    )
    return response.output_parsed

pydantic_core._pydantic_core.ValidationError: label

pydantic_core._pydantic_core.ValidationError: 1 validation error for Classification
label
  Input should be 'billing', 'technical', 'account' or 'general'
  [type=literal_error, input_value='Billing', input_type=str]

The model returned “Billing” instead of “billing”. Structured outputs should prevent this, but it happens occasionally with older model versions. Fix by normalizing in a validator:

from pydantic import field_validator

class Classification(BaseModel):
    label: Literal["billing", "technical", "account", "general"]

    @field_validator("label", mode="before")
    @classmethod
    def normalize_label(cls, v: str) -> str:
        return v.lower().strip()

Empty or None output from output_parsed

response.output_parsed  # None

This means the response was either truncated (hit the output token limit) or the model refused. Check the response status:

if response.output_parsed is None:
    output_msg = response.output[0]
    if output_msg.status == "incomplete":
        print("Response truncated -- increase max_output_tokens")
    else:
        print("Model refused to classify this input")

For classification tasks, truncation almost never happens (responses are tiny). If you hit this, the input likely triggered content filtering.