When Synthetic Data Makes Sense#
Real labeled data is expensive. A human annotator labels maybe 50-100 examples per hour. At $20/hour, 10,000 labeled examples cost $2,000-4,000 and take weeks to produce.
LLMs can generate thousands of labeled examples in minutes for pennies. The catch: synthetic data isn’t a replacement for real data; it’s a supplement. Use it to bootstrap a dataset, handle edge cases, or augment underrepresented classes.
The Core Pattern#
Generate diverse examples by varying the prompt parameters systematically.
```python
import anthropic
import json
import random

client = anthropic.Anthropic()

def generate_examples(
    category: str,
    sentiment: str,
    num_examples: int = 10,
) -> list[dict]:
    """Generate synthetic labeled examples for text classification."""
    prompt = f"""Generate {num_examples} realistic customer support messages.

Category: {category}
Sentiment: {sentiment}

Requirements:
- Each message should be 1-3 sentences
- Use natural, varied language (typos and informal tone are OK)
- Make them realistic — these should look like real customer messages
- Each message should be unique and different from the others

Return a JSON array of objects with "text", "category", and "sentiment" fields.
Use "category": "{category}" and "sentiment": "{sentiment}" for every object.
Return ONLY the JSON array, no markdown fences."""

    response = client.messages.create(
        model="claude-sonnet-4-5-20250514",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)
```
Generate a Balanced Dataset#
The key to useful synthetic data is systematic coverage. Generate examples across all your categories and sentiments.
```python
categories = ["billing", "technical", "account", "shipping"]
sentiments = ["positive", "negative", "neutral"]

dataset = []
for category in categories:
    for sentiment in sentiments:
        print(f"Generating: {category} / {sentiment}")
        examples = generate_examples(category, sentiment, num_examples=20)
        dataset.extend(examples)

print(f"Total examples: {len(dataset)}")

# Shuffle to mix categories
random.shuffle(dataset)

# Save to JSONL (one example per line — standard ML format)
with open("training_data.jsonl", "w") as f:
    for example in dataset:
        f.write(json.dumps(example) + "\n")

print(f"Saved {len(dataset)} examples to training_data.jsonl")
```
This generates 240 labeled examples (4 categories x 3 sentiments x 20 examples) in under a minute. With human annotators at 50-100 examples per hour, the same set would take several hours of work.
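As a quick sanity check, you can reload the JSONL file and count examples per label pair. A minimal sketch, assuming the training_data.jsonl file written above:

```python
import json
from collections import Counter

# Reload the JSONL file written above and count examples per (category, sentiment) pair
with open("training_data.jsonl") as f:
    examples = [json.loads(line) for line in f]

counts = Counter((ex["category"], ex["sentiment"]) for ex in examples)
for (category, sentiment), n in sorted(counts.items()):
    print(f"{category:10s} {sentiment:10s} {n}")
```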
Improving Quality with Seed Examples#
Give the LLM a few real examples to anchor the style and quality.
```python
def generate_with_seeds(
    seed_examples: list[str],
    category: str,
    num_examples: int = 20,
) -> list[dict]:
    """Generate synthetic data using real examples as seeds."""
    seeds_text = "\n".join(f"- {ex}" for ex in seed_examples)
    prompt = f"""Here are real customer messages in the "{category}" category:

{seeds_text}

Generate {num_examples} NEW messages in the same style and category.
- Match the tone, length, and vocabulary of the real examples
- Don't copy the examples — create novel variations
- Include realistic typos and informal language
- Vary the specific issues mentioned

Return a JSON array with "text" and "category" fields.
Return ONLY the JSON array."""

    response = client.messages.create(
        model="claude-sonnet-4-5-20250514",
        max_tokens=4096,  # 50 JSON objects can overflow 2048 tokens and truncate mid-array
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)

# Use 5 real examples as seeds
real_billing_examples = [
    "i was charged twice last month can i get a refund??",
    "My invoice shows $49.99 but my plan is supposed to be $29.99",
    "how do i update my credit card info",
    "Can I get a receipt for my last 3 payments",
    "why was i charged after canceling my subscription",
]

synthetic = generate_with_seeds(real_billing_examples, "billing", num_examples=50)
print(f"Generated {len(synthetic)} examples")
```
Seed examples dramatically improve the realism of synthetic data. Even 5-10 real examples are enough to anchor the generation.
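If you have more real messages than you want to paste into a single prompt, one option is to sample a fresh seed subset for each batch, which also helps keep batches from converging on the same phrasing. A sketch, assuming a larger pool of real examples and the generate_with_seeds function above:

```python
import random

def generate_in_batches(
    real_examples: list[str],
    category: str,
    total: int = 100,
    batch_size: int = 20,
    seeds_per_batch: int = 5,
) -> list[dict]:
    """Generate `total` examples in small batches, sampling a fresh seed subset each time."""
    generated: list[dict] = []
    while len(generated) < total:
        seeds = random.sample(real_examples, k=min(seeds_per_batch, len(real_examples)))
        batch = generate_with_seeds(seeds, category, num_examples=batch_size)
        generated.extend(batch)
    return generated[:total]
```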
Validation Pipeline#
Never trust synthetic data blindly. Build a validation step to catch garbage.
```python
def validate_example(example: dict, expected_category: str) -> list[str]:
    """Check a synthetic example for common quality issues."""
    issues = []
    text = example.get("text", "")

    # Check minimum length
    if len(text.split()) < 5:
        issues.append("Too short (under 5 words)")

    # Check maximum length
    if len(text.split()) > 100:
        issues.append("Too long (over 100 words)")

    # Check for template artifacts
    template_markers = ["[", "]", "{category}", "{sentiment}", "example"]
    for marker in template_markers:
        if marker.lower() in text.lower():
            issues.append(f"Contains template artifact: '{marker}'")

    # Check category matches
    if example.get("category") != expected_category:
        issues.append(f"Wrong category: got '{example.get('category')}', expected '{expected_category}'")

    return issues

# Validate the whole dataset
valid = []
rejected = []
for example in dataset:
    issues = validate_example(example, example.get("category", ""))
    if issues:
        rejected.append({"example": example, "issues": issues})
    else:
        valid.append(example)

print(f"Valid: {len(valid)}, Rejected: {len(rejected)}")

# Show rejection reasons
for r in rejected[:5]:
    print(f"  Rejected: {r['issues']} — {r['example']['text'][:60]}...")
```
Deduplication#
LLMs sometimes generate near-duplicates. Remove them before training.
```python
from difflib import SequenceMatcher

def deduplicate(examples: list[dict], threshold: float = 0.85) -> list[dict]:
    """Remove near-duplicate examples based on text similarity."""
    unique = []
    for example in examples:
        is_duplicate = False
        for existing in unique:
            similarity = SequenceMatcher(
                None,
                example["text"].lower(),
                existing["text"].lower(),
            ).ratio()
            if similarity > threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            unique.append(example)

    removed = len(examples) - len(unique)
    print(f"Removed {removed} near-duplicates ({removed/len(examples)*100:.1f}%)")
    return unique

dataset = deduplicate(valid)
```
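The pairwise SequenceMatcher comparison is O(n²), which is fine for a few thousand examples but gets slow beyond that. A cheap pre-pass, sketched here as an optional extra rather than part of the pipeline above, is to drop exact duplicates first so the fuzzy pass has less work to do:

```python
def drop_exact_duplicates(examples: list[dict]) -> list[dict]:
    """Remove exact duplicates (ignoring case and surrounding whitespace)."""
    seen: set[str] = set()
    unique = []
    for example in examples:
        key = example["text"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique

# Run it before the fuzzy pass: dataset = deduplicate(drop_exact_duplicates(valid))
```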
Common Issues#
All examples sound the same. Increase temperature in the API call (add temperature=0.9) and explicitly ask for variety in the prompt. Generate in smaller batches (10 at a time) with slightly different prompts.
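Here is a sketch of that batching approach, assuming the client set up earlier; the scenario hints are made-up illustrations, not prompts from the original pipeline:

```python
# Illustrative only: rotate a "scenario" hint across small batches to push for variety.
scenario_hints = [
    "The customer is writing from a mobile phone.",
    "The customer is a long-time user who is frustrated.",
    "The customer is confused and asking a basic question.",
]

varied = []
for hint in scenario_hints:
    response = client.messages.create(
        model="claude-sonnet-4-5-20250514",
        max_tokens=1024,
        temperature=0.9,  # higher temperature for more varied phrasing
        messages=[{
            "role": "user",
            "content": (
                "Generate 10 realistic customer support messages about billing. "
                f"{hint} "
                'Return ONLY a JSON array of objects with "text" and "category" fields.'
            ),
        }],
    )
    varied.extend(json.loads(response.content[0].text))
```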
JSON parsing errors. LLMs sometimes wrap JSON in markdown fences. Strip them before parsing:
````python
import re
import json

def parse_json_response(text: str) -> list:
    """Parse JSON from an LLM response, handling markdown fences."""
    # Remove markdown code fences if present (``` or ```json)
    text = re.sub(r"```(?:json)?\s*", "", text)
    text = re.sub(r"```\s*$", "", text)
    return json.loads(text.strip())
````
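You can then use it in place of the bare json.loads call in generate_examples and generate_with_seeds.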
Class imbalance. Generate equal numbers per class. If your real data is 90% “billing” and 10% “technical,” synthetic data lets you balance the training set without oversampling.
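A back-of-the-envelope sketch of how many synthetic examples to generate per class; the real-data counts here are made up for illustration:

```python
# Hypothetical real-data counts, for illustration only
real_counts = {"billing": 900, "technical": 100, "account": 250, "shipping": 150}

target = max(real_counts.values())  # bring every class up to the largest one
to_generate = {cls: target - n for cls, n in real_counts.items()}
print(to_generate)  # {'billing': 0, 'technical': 800, 'account': 650, 'shipping': 750}
```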
Cost Estimate#
| Dataset Size | Model | Approximate Cost | Time |
|---|---|---|---|
| 1,000 examples | Claude Sonnet | ~$0.50 | 2 min |
| 10,000 examples | Claude Sonnet | ~$5 | 15 min |
| 100,000 examples | Claude Haiku | ~$10 | 1 hour |
Compare that to the $2,000-4,000 a human-annotated 10,000-example set costs. Synthetic data isn’t perfect, but it’s hundreds of times cheaper and available immediately.
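These figures depend on how many tokens each example consumes and on current API pricing, so treat them as rough. A sketch of the arithmetic, with the token counts and per-million-token prices as assumptions to replace with your own measurements and the current price sheet:

```python
def estimate_cost(
    num_examples: int,
    output_tokens_per_example: float = 40.0,  # assumption: short messages plus JSON overhead
    input_tokens_per_call: float = 250.0,     # assumption: prompt length per request
    examples_per_call: int = 20,
    input_price_per_mtok: float = 3.0,        # assumption: check current pricing for your model
    output_price_per_mtok: float = 15.0,      # assumption: check current pricing for your model
) -> float:
    """Back-of-the-envelope generation cost in dollars."""
    calls = num_examples / examples_per_call
    input_cost = calls * input_tokens_per_call / 1e6 * input_price_per_mtok
    output_cost = num_examples * output_tokens_per_example / 1e6 * output_price_per_mtok
    return input_cost + output_cost
```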