The Problem with Manual Labeling#
You need 10,000 labeled examples for your classifier. A human annotator handles 50-100 per hour at $20/hour. That’s $2,000-4,000 and weeks of waiting – and you still end up arguing about edge cases.
LLM-assisted annotation flips this around. Instead of starting from scratch, you have an LLM pre-label your data, then route only the uncertain or tricky samples to human reviewers. The LLM handles the obvious cases (usually 60-80% of your dataset), and humans focus on what actually requires judgment.
This isn’t about replacing human annotators. It’s about making them 5-10x more productive by eliminating the grunt work.
Pre-Annotating with the OpenAI API#
The simplest approach: send your raw data to an LLM with a structured prompt, get labels back, then review. OpenAI’s structured outputs feature makes this reliable by guaranteeing valid JSON responses.
```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class Annotation(BaseModel):
    label: str
    confidence: str  # "high", "medium", "low"
    reasoning: str


class AnnotationBatch(BaseModel):
    annotations: list[Annotation]


def annotate_batch(texts: list[str], label_set: list[str]) -> list[dict]:
    """Pre-annotate a batch of texts using GPT-4o with structured output."""
    labels_str = ", ".join(label_set)
    numbered = "\n".join(f"{i+1}. {t}" for i, t in enumerate(texts))

    response = client.beta.chat.completions.parse(
        model="gpt-4o-2024-11-20",
        response_format=AnnotationBatch,
        temperature=0.0,
        messages=[
            {
                "role": "system",
                "content": f"""You are a data annotator. Classify each text into
exactly one of these labels: {labels_str}.

For each text, provide:
- label: the classification
- confidence: "high" if obvious, "medium" if reasonable, "low" if uncertain
- reasoning: one sentence explaining your choice""",
            },
            {
                "role": "user",
                "content": f"Classify these texts:\n\n{numbered}",
            },
        ],
    )

    result = response.choices[0].message.parsed
    return [
        {
            "text": text,
            "label": ann.label,
            "confidence": ann.confidence,
            "reasoning": ann.reasoning,
        }
        for text, ann in zip(texts, result.annotations)
    ]


# Example usage
texts = [
    "My order arrived damaged, need a replacement",
    "Love the new feature update, great work!",
    "How do I change my password?",
    "Been waiting 3 weeks and still no refund",
]
labels = ["complaint", "praise", "question", "refund_request"]

results = annotate_batch(texts, labels)
for r in results:
    print(f"[{r['confidence']}] {r['label']}: {r['text'][:50]}")
```
Setting temperature=0.0 is critical here. For annotation tasks, you want deterministic outputs – the same input should always produce the same label. Higher temperatures introduce random variation that kills inter-annotator agreement.
Few-Shot Prompting for Better Accuracy#
Raw zero-shot prompts work for simple tasks, but accuracy drops fast on anything nuanced. Few-shot examples fix this by anchoring the LLM’s understanding of your label definitions.
The sweet spot is 2-3 examples per label. Research shows diminishing returns beyond that, and longer prompts increase token costs.
```python
def build_few_shot_prompt(label_set: dict[str, list[str]]) -> str:
    """Build a few-shot prompt from labeled examples.

    Args:
        label_set: Maps label names to lists of example texts.
    """
    prompt_parts = ["Here are examples of each category:\n"]

    for label, examples in label_set.items():
        prompt_parts.append(f"### {label}")
        for ex in examples[:3]:  # Cap at 3 examples per label
            prompt_parts.append(f'- "{ex}"')
        prompt_parts.append("")

    prompt_parts.append(
        "Classify each new text into one of the categories above. "
        "If a text could fit multiple categories, pick the primary intent."
    )
    return "\n".join(prompt_parts)


# Define your taxonomy with examples
taxonomy = {
    "complaint": [
        "This product broke after one week, terrible quality",
        "I've been on hold for 45 minutes, this is unacceptable",
        "Your app crashes every time I try to checkout",
    ],
    "praise": [
        "Seriously impressed with how fast shipping was",
        "Customer support resolved my issue in under 5 min",
    ],
    "question": [
        "Do you ship internationally?",
        "What's the difference between the pro and basic plan?",
    ],
    "refund_request": [
        "I'd like a refund for order #4521, wrong item received",
        "Please cancel and refund my subscription",
    ],
}

prompt = build_few_shot_prompt(taxonomy)
print(prompt)
```
Routing Low-Confidence Samples to Human Review#
Here’s where the real value comes in. Instead of sending everything to human reviewers, use the LLM’s confidence signal to triage.
```python
def triage_annotations(annotations: list[dict]) -> dict:
    """Split annotations into auto-approved and needs-review buckets."""
    auto_approved = []
    needs_review = []

    for ann in annotations:
        if ann["confidence"] == "high":
            auto_approved.append(ann)
        else:
            needs_review.append(ann)

    total = len(annotations)
    print(f"Auto-approved: {len(auto_approved)}/{total} "
          f"({len(auto_approved)/total*100:.0f}%)")
    print(f"Needs review: {len(needs_review)}/{total} "
          f"({len(needs_review)/total*100:.0f}%)")

    return {"approved": auto_approved, "review": needs_review}
```
In practice, you’ll see 60-80% of samples land in the “high confidence” bucket. That’s 60-80% of your annotation budget saved. The remaining samples are exactly the ambiguous ones that benefit most from human judgment.
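As a rough back-of-the-envelope check, here is a sketch using the figures from the intro (10,000 samples, roughly 75 examples per hour at $20/hour); the function name and the 70% auto-approve rate are illustrative, not measured:

```python
# Hypothetical cost estimate for human review after LLM triage.
def estimated_review_cost(n_samples: int, auto_approve_rate: float,
                          per_hour: int = 75, hourly_rate: float = 20.0) -> float:
    """Cost of human-reviewing only the samples the LLM could not auto-approve."""
    to_review = n_samples * (1 - auto_approve_rate)
    return to_review / per_hour * hourly_rate


print(estimated_review_cost(10_000, 0.7))  # ~$800 instead of ~$2,700 for full review
```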
Integrating with Label Studio#
Label Studio is the most popular open-source annotation platform, and its ML backend system makes LLM integration straightforward. Install it and set up a project that receives LLM pre-annotations.
```bash
pip install label-studio label-studio-ml
label-studio start --port 8080
```
Then create an ML backend that feeds LLM predictions into Label Studio’s review interface:
```python
from label_studio_ml.model import LabelStudioMLBase
from openai import OpenAI


class LLMAnnotator(LabelStudioMLBase):
    """Label Studio ML backend that uses an LLM for pre-annotation."""

    def setup(self):
        self.client = OpenAI()

    def predict(self, tasks, **kwargs):
        predictions = []
        for task in tasks:
            text = task["data"].get("text", "")

            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                temperature=0.0,
                messages=[
                    {
                        "role": "system",
                        "content": "Classify the text as: complaint, "
                                   "praise, question, or refund_request. "
                                   "Respond with only the label.",
                    },
                    {"role": "user", "content": text},
                ],
            )
            label = response.choices[0].message.content.strip()

            predictions.append(
                {
                    "result": [
                        {
                            "from_name": "label",
                            "to_name": "text",
                            "type": "choices",
                            "value": {"choices": [label]},
                        }
                    ],
                    "score": 0.85,
                }
            )
        return predictions
```
Start the backend with `label-studio-ml start ./my_backend --port 9090`, then connect it in Label Studio under Settings > Model. Every new task gets an LLM pre-annotation that your annotators can accept, reject, or correct with a single click.
Using Argilla for LLM Feedback Loops#
Argilla (v2.x) takes a different approach – it’s built specifically for LLM-centric workflows and integrates tightly with Hugging Face. It’s a strong choice if you’re already in that ecosystem.
```python
import argilla as rg

client = rg.Argilla(api_url="http://localhost:6900", api_key="admin.apikey")

# Define your annotation schema
settings = rg.Settings(
    fields=[rg.TextField(name="text")],
    questions=[
        rg.LabelQuestion(
            name="category",
            labels=["complaint", "praise", "question", "refund_request"],
        ),
        rg.RatingQuestion(name="llm_accuracy", values=[1, 2, 3, 4, 5]),
    ],
)

dataset = rg.Dataset(name="support_tickets", settings=settings)
dataset.create()

# Push LLM-annotated records with suggestions
records = [
    rg.Record(
        fields={"text": ann["text"]},
        suggestions=[
            rg.Suggestion(
                question_name="category",
                value=ann["label"],
                agent="gpt-4o",
            )
        ],
    )
    for ann in results  # results from annotate_batch() above
]

dataset.records.log(records)
print(f"Logged {len(records)} records with LLM suggestions")
```
Argilla’s Suggestion concept is key – the LLM’s labels show up as pre-filled suggestions that human reviewers accept or override. This captures both the final label and whether the LLM got it right, which you can feed back into prompt improvement.
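One way to close that loop, sketched under the assumption that you have exported the reviewed records to a list of dicts with a `text` field, the human-chosen `category`, and the LLM's `suggested_category` (the exact export shape depends on your Argilla setup, so adapt the field access): collect the texts the LLM mislabeled and fold them back into the few-shot taxonomy from earlier.

```python
def update_taxonomy_from_reviews(reviewed: list[dict],
                                 taxonomy: dict[str, list[str]]) -> dict[str, list[str]]:
    """Add texts the LLM mislabeled as few-shot examples under the correct label.

    Assumes each reviewed record is a dict with "text", "category" (human label),
    and "suggested_category" (LLM suggestion) -- hypothetical field names.
    """
    updated = {label: list(examples) for label, examples in taxonomy.items()}
    for rec in reviewed:
        human, suggested = rec["category"], rec["suggested_category"]
        if human != suggested and rec["text"] not in updated.get(human, []):
            updated.setdefault(human, []).append(rec["text"])
    return updated


# Rebuild the few-shot prompt with the corrected examples:
# new_prompt = build_few_shot_prompt(update_taxonomy_from_reviews(reviewed, taxonomy))
```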
Common Errors and How to Fix Them#
LLM returns labels not in your label set. This happens more than you’d expect, especially with open-ended prompts. The LLM decides “billing_inquiry” is a better label than your defined “question.”
Fix: Use structured outputs (Pydantic models with Literal types) to constrain the response. Alternatively, add a validation step:
```python
from difflib import get_close_matches


def validate_label(predicted: str, valid_labels: list[str]) -> str:
    """Map predicted label to closest valid label."""
    predicted_lower = predicted.lower().strip()

    for label in valid_labels:
        if label.lower() == predicted_lower:
            return label

    # Fuzzy match fallback
    matches = get_close_matches(
        predicted_lower,
        [l.lower() for l in valid_labels],
        n=1,
        cutoff=0.6,
    )
    if matches:
        idx = [l.lower() for l in valid_labels].index(matches[0])
        return valid_labels[idx]

    return "UNKNOWN"  # Flag for human review
```
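If you go the structured-outputs route instead, the schema change is small. Here is a minimal sketch, reusing the Pydantic approach from `annotate_batch()` above but constraining `label` and `confidence` with `Literal` so the parsed response can only contain values from your taxonomy (the `Strict*` class names are illustrative):

```python
from typing import Literal

from pydantic import BaseModel


class StrictAnnotation(BaseModel):
    # Literal types restrict the parsed output to the exact label set
    label: Literal["complaint", "praise", "question", "refund_request"]
    confidence: Literal["high", "medium", "low"]
    reasoning: str


class StrictAnnotationBatch(BaseModel):
    annotations: list[StrictAnnotation]


# Pass StrictAnnotationBatch as response_format in the parse() call
# from annotate_batch() -- the rest of the code stays the same.
```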
Batch size mismatches. You send 20 texts but get back 19 annotations. The LLM miscounted or merged two items. Always verify the annotation count matches your input count, and fall back to one-at-a-time processing when it doesn’t.
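A sketch of that guard, wrapping the `annotate_batch()` function from earlier (the `annotate_batch_safe` name is just for illustration):

```python
def annotate_batch_safe(texts: list[str], label_set: list[str]) -> list[dict]:
    """Annotate a batch, falling back to per-item calls on a count mismatch."""
    results = annotate_batch(texts, label_set)
    if len(results) == len(texts):
        return results

    # Count mismatch: the LLM dropped or merged items. Re-annotate one
    # text at a time so every input gets exactly one label.
    fixed = []
    for text in texts:
        fixed.extend(annotate_batch([text], label_set))
    return fixed
```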
Rate limit errors from the API. When annotating thousands of samples, you’ll hit rate limits fast.
```
openai.RateLimitError: Error code: 429 - Rate limit reached for gpt-4o
```
Fix: Add exponential backoff with tenacity:
```python
from tenacity import retry, wait_exponential, stop_after_attempt


@retry(wait=wait_exponential(min=1, max=60), stop=stop_after_attempt(5))
def annotate_with_retry(texts, labels):
    return annotate_batch(texts, labels)
```
Inconsistent annotations across batches. The LLM labels a text as “complaint” in batch 1 but “refund_request” in batch 5. Lower the temperature to 0.0, use the exact same system prompt for every call, and consider running each sample through the LLM twice – if the labels disagree, route it to human review.
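A minimal sketch of that double-pass check, reusing `annotate_batch()`: run the batch twice and downgrade any sample whose two labels disagree, so the triage step routes it to a human (the `double_check` name is illustrative):

```python
def double_check(texts: list[str], label_set: list[str]) -> list[dict]:
    """Annotate twice; mark samples where the two runs disagree as low confidence."""
    first = annotate_batch(texts, label_set)
    second = annotate_batch(texts, label_set)

    checked = []
    for a, b in zip(first, second):
        if a["label"] != b["label"]:
            # Disagreement between runs -> force human review via triage
            a = {
                **a,
                "confidence": "low",
                "reasoning": f"Runs disagreed: {a['label']} vs {b['label']}",
            }
        checked.append(a)
    return checked
```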
Measuring Annotation Quality#
Don’t just trust the LLM. Measure agreement between LLM labels and a gold-standard human-labeled subset.
```python
from sklearn.metrics import cohen_kappa_score, classification_report

# Compare LLM labels vs human labels on a sample
human_labels = ["complaint", "praise", "question", "complaint", "refund_request"]
llm_labels = ["complaint", "praise", "question", "refund_request", "refund_request"]

kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Cohen's Kappa: {kappa:.3f}")
# > 0.8   = strong agreement
# 0.6-0.8 = moderate agreement
# < 0.6   = weak — reconsider your prompt or use more human review

print(classification_report(human_labels, llm_labels))
```
If your Cohen’s Kappa drops below 0.6, your prompt needs work. Add more few-shot examples, clarify the label definitions, or narrow your taxonomy. A weak LLM annotator is worse than no annotator because it gives you false confidence in bad data.
When to Skip LLM Annotation#
LLM-assisted labeling doesn’t work for everything. Skip it when:
- Your labels require domain expertise the LLM lacks (medical diagnosis, legal judgment)
- The annotation task is visual (bounding boxes, segmentation masks – use specialized tools instead)
- Your label taxonomy has more than 20 classes – LLM accuracy degrades significantly with fine-grained taxonomies
- You need perfect accuracy – even with human review, some LLM biases will leak through
For everything else – sentiment, intent classification, topic labeling, NER on common entities – LLM pre-annotation cuts your labeling time by 60-80% while maintaining quality above 90% agreement with human annotators.