When Synthetic Data Makes Sense

Real labeled data is expensive. A human annotator labels maybe 50-100 examples per hour. At $20/hour, 10,000 labeled examples cost $2,000-4,000 and take weeks to produce.

LLMs can generate thousands of labeled examples in minutes for pennies. The catch: synthetic data isn’t a replacement for real data; it’s a supplement. Use it to bootstrap a dataset, handle edge cases, or augment underrepresented classes.

The Core Pattern

Generate diverse examples by varying the prompt parameters systematically.

import anthropic
import json
import random

client = anthropic.Anthropic()

def generate_examples(
    category: str,
    sentiment: str,
    num_examples: int = 10,
) -> list[dict]:
    """Generate synthetic labeled examples for text classification."""

    prompt = f"""Generate {num_examples} realistic customer support messages.

Category: {category}
Sentiment: {sentiment}

Requirements:
- Each message should be 1-3 sentences
- Use natural, varied language (typos and informal tone are OK)
- Make them realistic — these should look like real customer messages
- Each message should be unique and different from the others

Return a JSON array of objects with "text", "category", and "sentiment" fields.
Use "category": "{category}" and "sentiment": "{sentiment}" for every object.

Return ONLY the JSON array, no markdown fences."""

    response = client.messages.create(
        model="claude-sonnet-4-5-20250514",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )

    return json.loads(response.content[0].text)
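
A quick sanity check before generating at scale (this assumes the model returns the requested fields; the category/sentiment values here are just sample inputs):

# Smoke test: generate a handful and eyeball them before committing to a full run
examples = generate_examples("billing", "negative", num_examples=5)
for ex in examples:
    print(f'{ex["category"]}/{ex["sentiment"]}: {ex["text"]}')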

Generate a Balanced Dataset

The key to useful synthetic data is systematic coverage. Generate examples across all your categories and sentiments.

categories = ["billing", "technical", "account", "shipping"]
sentiments = ["positive", "negative", "neutral"]

dataset = []

for category in categories:
    for sentiment in sentiments:
        print(f"Generating: {category} / {sentiment}")
        examples = generate_examples(category, sentiment, num_examples=20)
        for ex in examples:
            # Record what we asked for — ground truth for the validation step later
            ex["expected_category"] = category
        dataset.extend(examples)

print(f"Total examples: {len(dataset)}")

# Shuffle to mix categories
random.shuffle(dataset)

# Save to JSONL (one example per line — standard ML format)
with open("training_data.jsonl", "w") as f:
    for example in dataset:
        f.write(json.dumps(example) + "\n")

print(f"Saved {len(dataset)} examples to training_data.jsonl")

This generates 240 labeled examples (4 categories x 3 sentiments x 20 examples) in a few minutes. At 50-100 examples per hour, a human annotator would need roughly 2.5-5 hours for the same batch.

Improving Quality with Seed Examples

Give the LLM a few real examples to anchor the style and quality.

def generate_with_seeds(
    seed_examples: list[str],
    category: str,
    num_examples: int = 20,
) -> list[dict]:
    """Generate synthetic data using real examples as seeds."""

    seeds_text = "\n".join(f"- {ex}" for ex in seed_examples)

    prompt = f"""Here are real customer messages in the "{category}" category:

{seeds_text}

Generate {num_examples} NEW messages in the same style and category.
- Match the tone, length, and vocabulary of the real examples
- Don't copy the examples — create novel variations
- Include realistic typos and informal language
- Vary the specific issues mentioned

Return a JSON array with "text" and "category" fields.
Return ONLY the JSON array."""

    response = client.messages.create(
        model="claude-sonnet-4-5-20250514",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )

    return json.loads(response.content[0].text)


# Use 5 real examples as seeds
real_billing_examples = [
    "i was charged twice last month can i get a refund??",
    "My invoice shows $49.99 but my plan is supposed to be $29.99",
    "how do i update my credit card info",
    "Can I get a receipt for my last 3 payments",
    "why was i charged after canceling my subscription",
]

synthetic = generate_with_seeds(real_billing_examples, "billing", num_examples=50)
print(f"Generated {len(synthetic)} examples")

Seed examples dramatically improve the realism of synthetic data. Even 5-10 real examples are enough to anchor the generation.

Validation Pipeline

Never trust synthetic data blindly. Build a validation step to catch garbage.

def validate_example(example: dict, expected_category: str) -> list[str]:
    """Check a synthetic example for common quality issues."""
    issues = []

    text = example.get("text", "")

    # Check minimum length
    if len(text.split()) < 5:
        issues.append("Too short (under 5 words)")

    # Check maximum length
    if len(text.split()) > 100:
        issues.append("Too long (over 100 words)")

    # Check for template artifacts
    template_markers = ["[", "]", "{category}", "{sentiment}", "example"]
    for marker in template_markers:
        if marker.lower() in text.lower():
            issues.append(f"Contains template artifact: '{marker}'")

    # Check category matches
    if example.get("category") != expected_category:
        issues.append(f"Wrong category: got '{example.get('category')}', expected '{expected_category}'")

    return issues


# Validate the whole dataset
valid = []
rejected = []

for example in dataset:
    # Compare against the category we requested at generation time,
    # not the category the model claims — otherwise the check is circular
    issues = validate_example(example, example.get("expected_category", ""))
    if issues:
        rejected.append({"example": example, "issues": issues})
    else:
        valid.append(example)

print(f"Valid: {len(valid)}, Rejected: {len(rejected)}")

# Show rejection reasons
for r in rejected[:5]:
    print(f"  Rejected: {r['issues']}{r['example']['text'][:60]}...")

Deduplication

LLMs sometimes generate near-duplicates. Remove them before training.

from difflib import SequenceMatcher

def deduplicate(examples: list[dict], threshold: float = 0.85) -> list[dict]:
    """Remove near-duplicate examples based on text similarity."""
    unique = []

    for example in examples:
        is_duplicate = False
        for existing in unique:
            similarity = SequenceMatcher(
                None,
                example["text"].lower(),
                existing["text"].lower(),
            ).ratio()
            if similarity > threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            unique.append(example)

    removed = len(examples) - len(unique)
    print(f"Removed {removed} near-duplicates ({removed/len(examples)*100:.1f}%)")
    return unique


dataset = deduplicate(valid)
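
SequenceMatcher makes this O(n^2) in the number of examples, which is fine for a few thousand but slow beyond that. For larger datasets, a cheap exact-match pre-pass (a sketch keyed on normalized text) removes the easy duplicates before the fuzzy pass:

def drop_exact_duplicates(examples: list[dict]) -> list[dict]:
    """Remove exact duplicates (case- and whitespace-insensitive) in O(n)."""
    seen = set()
    unique = []
    for example in examples:
        key = " ".join(example["text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique

# Cheap exact pass first, fuzzy pass on what remains
dataset = deduplicate(drop_exact_duplicates(valid))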

Common Issues

All examples sound the same. Keep temperature high (the Anthropic API defaults to 1.0; if you've set it lower for other tasks, raise it back, e.g. temperature=0.9) and explicitly ask for variety in the prompt. Generate in smaller batches (10 at a time) with slightly different prompts, as in the sketch below.
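
One way to vary batches, sketched here (the style_hints list and the voice line are illustrative additions, not part of the original function):

import random

style_hints = ["terse and annoyed", "polite and detailed", "rushed, with typos"]

# Inside generate_examples, replace the existing API call with a higher-variety
# version: a random voice hint per batch, temperature near the top of its
# 0-1 range (the Anthropic API default is 1.0)
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    temperature=0.9,
    messages=[{
        "role": "user",
        "content": prompt + f"\nWrite them in a {random.choice(style_hints)} voice.",
    }],
)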

JSON parsing errors. LLMs sometimes wrap JSON in markdown fences. Strip them before parsing:

import re

def parse_json_response(text: str) -> list:
    """Parse JSON from an LLM response, handling markdown fences."""
    # Remove markdown code fences if present
    text = re.sub(r"```json?\s*", "", text)
    text = re.sub(r"```\s*$", "", text)
    return json.loads(text.strip())
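
Then swap it in for the raw json.loads call in the generators, e.g. return parse_json_response(response.content[0].text).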

Class imbalance. Generate equal numbers per class. If your real data is 90% “billing” and 10% “technical,” synthetic data lets you balance the training set without oversampling.
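
A quick balance check (and top-up) with collections.Counter; the "neutral" sentiment for the extra examples is an arbitrary choice for this sketch:

from collections import Counter

counts = Counter(ex.get("category") for ex in dataset)
print(counts)

# Generate extra examples for any class below the largest one
target = max(counts.values())
for category, count in counts.items():
    if count < target:
        dataset.extend(generate_examples(category, "neutral", num_examples=target - count))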

Cost Estimate

Dataset Size       Model           Approximate Cost   Time
1,000 examples     Claude Sonnet   ~$0.50             2 min
10,000 examples    Claude Sonnet   ~$5                15 min
100,000 examples   Claude Haiku    ~$10               1 hour

Compare that to $2,000-4,000 for human annotation of a 10,000-example dataset. Synthetic data isn’t perfect, but it’s hundreds of times cheaper and available immediately.
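
To estimate cost for your own workload, a back-of-envelope sketch; the per-million-token prices and token counts below are assumptions, so check current pricing:

def estimate_cost(
    num_examples: int,
    examples_per_call: int = 20,
    prompt_tokens_per_call: int = 200,     # rough size of the generation prompt
    tokens_per_example: int = 40,          # rough output size per example
    input_price_per_mtok: float = 3.00,    # assumed Sonnet input price (USD)
    output_price_per_mtok: float = 15.00,  # assumed Sonnet output price (USD)
) -> float:
    """Rough dollar cost of generating num_examples synthetic examples."""
    calls = num_examples / examples_per_call
    input_tokens = calls * prompt_tokens_per_call
    output_tokens = num_examples * tokens_per_example
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

print(f"${estimate_cost(10_000):.2f}")  # ~$6 under these assumptions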