The Short Answer

Install distilabel, datasets, and sentence-transformers, define a pipeline that generates diverse examples through an LLM, then run deduplication and diversity checks before the data ever touches a training loop. Skipping validation is how you end up feeding a model its own degraded outputs.

pip install distilabel[hf-inference-endpoints] datasets sentence-transformers

Hugging Face’s Synthetic Data Generator is the no-code entry point — it uses distilabel under the hood. For anything beyond experimentation you want the Python API, which gives you full control over generation parameters, validation, and filtering.

What Model Collapse Actually Looks Like

Model collapse happens when synthetic data reinforces the model’s existing biases instead of expanding its knowledge. The failure mode is gradual: outputs get blander, rare-event examples disappear, and variance collapses toward a central mode. By the time you notice, several training cycles have already baked in the damage.

Three early warning signs:

  • Generated text that sounds fluent but repeats the same phrasings and sentence structures across examples
  • Rare labels in your classification dataset shrinking to near-zero frequency across generations
  • Embedding clusters collapsing — all your examples drift toward a single dense cluster in vector space

The fix is not to avoid synthetic data. It’s to validate diversity and anchor every generation run against real data. Research from 2025 confirms that accumulating synthetic data alongside a non-shrinking real-data corpus avoids collapse; pure replacement of real data with synthetic data nearly always collapses.
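The accumulate-versus-replace distinction can be illustrated with a toy sketch (all names hypothetical; plain lists stand in for datasets):

```python
# Toy illustration of the two regimes: "accumulate" keeps the real corpus plus
# every prior synthetic batch; "replace" swaps the corpus for the newest
# synthetic batch each round, discarding the real anchor.
real_corpus = ["real_1", "real_2", "real_3", "real_4"]

def accumulate(real, synthetic_batches):
    """Training corpus grows: real data plus all synthetic batches so far."""
    corpus = list(real)
    for batch in synthetic_batches:
        corpus += batch
    return corpus

def replace(real, synthetic_batches):
    """Training corpus is only the latest synthetic batch -- the collapse trap."""
    return list(synthetic_batches[-1]) if synthetic_batches else list(real)

batches = [["syn_1a", "syn_1b"], ["syn_2a"]]
print(len(accumulate(real_corpus, batches)))  # 7: real data never leaves
print(len(replace(real_corpus, batches)))     # 1: real anchor is gone
```

The point is structural: under accumulation the real-data fraction shrinks but never hits zero, so the training distribution stays anchored across generations.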

Generating a Text Classification Dataset

The distilabel pipeline approach gives you a reproducible, auditable generation process. Here’s a complete pipeline that generates a sentiment classification dataset for product reviews.

from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from distilabel.llms import InferenceEndpointsLLM
from datasets import Dataset
import os

# Seed prompts — these drive diversity; make them varied
SEED_INSTRUCTIONS = [
    {"instruction": "Write a detailed positive product review for a wireless keyboard. Focus on typing feel and build quality."},
    {"instruction": "Write a negative product review for a wireless keyboard with connectivity issues. Be specific about the problems."},
    {"instruction": "Write a neutral product review for a wireless keyboard. Mention both pros and cons without strong emotion."},
    {"instruction": "Write a positive product review from a programmer's perspective. Emphasize key travel and latency."},
    {"instruction": "Write a negative review from a casual user who found setup confusing."},
    {"instruction": "Write a short, frustrated one-paragraph negative review mentioning battery life."},
    {"instruction": "Write an enthusiastic positive review mentioning the product replaced an older model."},
    {"instruction": "Write a mixed review where the user likes the product but would not repurchase."},
]

with Pipeline(name="sentiment-classification-pipeline") as pipeline:
    loader = LoadDataFromDicts(
        data=SEED_INSTRUCTIONS,
        batch_size=4,
    )

    generator = TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            generation_kwargs={
                "temperature": 0.85,   # higher temp = more lexical diversity
                "max_new_tokens": 256,
                "do_sample": True,
            },
        ),
        system_prompt=(
            "You are a real customer writing authentic product reviews. "
            "Write naturally, with varying sentence lengths and vocabulary. "
            "Never use the same opening phrase twice."
        ),
        num_generations=5,       # 5 variations per seed instruction
        group_generations=True,  # collect all 5 into one row's "generations" list
        input_batch_size=4,
    )

    loader >> generator

distiset = pipeline.run(use_cache=True)

# Flatten to a standard dataset
raw_records = []
for row in distiset["default"]["train"]:
    for gen in row["generations"]:
        raw_records.append({
            "text": gen,
            "instruction": row["instruction"],
        })

raw_dataset = Dataset.from_list(raw_records)
print(f"Generated {len(raw_dataset)} raw examples")

The num_generations=5 parameter is the key lever for diversity. Each seed instruction produces 5 different completions, and temperature=0.85 makes them likely to differ meaningfully rather than being near-copies of each other.

Auto-Labeling with a Second LLM Pass

Generated text isn’t labeled yet. Run a second LLM call to assign sentiment labels, then verify label distribution matches your target.

import anthropic
from datasets import Dataset

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from environment

def label_sentiment(text: str) -> str:
    """Use Claude to assign a sentiment label. Returns 'positive', 'negative', or 'neutral'."""
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=16,
        messages=[
            {
                "role": "user",
                "content": (
                    f"Classify the sentiment of this product review as exactly one of: "
                    f"positive, negative, neutral.\n\nReview: {text}\n\nLabel:"
                ),
            }
        ],
    )
    label = response.content[0].text.strip().lower()
    if label not in {"positive", "negative", "neutral"}:
        return "neutral"
    return label


# Apply labeling — in production, batch this and use async
labeled_records = []
for row in raw_dataset:
    label = label_sentiment(row["text"])
    labeled_records.append({"text": row["text"], "label": label})

labeled_dataset = Dataset.from_list(labeled_records)

# Check label distribution immediately
from collections import Counter
label_counts = Counter(labeled_dataset["label"])
print(label_counts)
# Counter({'positive': 23, 'neutral': 11, 'negative': 6})
# If any class is under 15% of total, your seeds were not diverse enough

A healthy distribution for a balanced 3-class problem has no class below roughly 20% of the total; treat anything under 15% as a failure. If neutral or negative examples are being under-generated, add more seed instructions explicitly targeting those labels.
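That distribution check can be automated so the pipeline fails fast on skewed seeds. A minimal sketch (function name hypothetical):

```python
from collections import Counter

def underrepresented_labels(labels, min_fraction=0.2):
    """Return {label: fraction} for labels below min_fraction of the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        label: count / total
        for label, count in counts.items()
        if count / total < min_fraction
    }

# Matches the example distribution above: 23 positive, 11 neutral, 6 negative
labels = ["positive"] * 23 + ["neutral"] * 11 + ["negative"] * 6
print(underrepresented_labels(labels))  # {'negative': 0.15}
```

Wire this into the pipeline right after labeling and raise an error on any non-empty result, rather than discovering the skew after training.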

Quality Checks: Deduplication and Diversity

This is where most synthetic data pipelines fail. Training directly on raw generated output, without deduplication or diversity checks, is the fastest route to model collapse.

from sentence_transformers import SentenceTransformer
from datasets import Dataset
import numpy as np

def deduplicate_by_embedding(
    dataset: Dataset,
    text_column: str = "text",
    similarity_threshold: float = 0.92,
    model_name: str = "all-MiniLM-L6-v2",
) -> Dataset:
    """
    Remove near-duplicate examples using cosine similarity on sentence embeddings.
    Threshold of 0.92 catches paraphrases; lower it to 0.85 to be more aggressive.
    """
    model = SentenceTransformer(model_name)
    texts = dataset[text_column]
    embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)

    # Normalize for cosine similarity via dot product
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    embeddings_norm = embeddings / norms

    keep_indices = []
    dropped = 0

    for i in range(len(embeddings_norm)):
        if not keep_indices:
            keep_indices.append(i)
            continue

        # Compare against all kept embeddings
        kept_embs = embeddings_norm[keep_indices]
        sims = kept_embs @ embeddings_norm[i]  # dot product = cosine similarity

        if sims.max() < similarity_threshold:
            keep_indices.append(i)
        else:
            dropped += 1

    print(f"Deduplication: kept {len(keep_indices)}, dropped {dropped} near-duplicates")
    return dataset.select(keep_indices)


def measure_diversity(dataset: Dataset, text_column: str = "text") -> dict:
    """
    Compute diversity metrics to catch collapse before training.
    Low entropy or high mean similarity = warning sign.
    """
    model = SentenceTransformer("all-MiniLM-L6-v2")
    texts = dataset[text_column]
    embeddings = model.encode(texts, batch_size=64, show_progress_bar=False)

    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    embeddings_norm = embeddings / norms

    # Pairwise cosine similarity — expensive for large datasets
    sim_matrix = embeddings_norm @ embeddings_norm.T
    upper_tri = sim_matrix[np.triu_indices_from(sim_matrix, k=1)]

    mean_sim = upper_tri.mean()
    p95_sim = np.percentile(upper_tri, 95)

    # Vocabulary diversity: unique unigrams / total unigrams
    all_words = " ".join(texts).lower().split()
    vocab_diversity = len(set(all_words)) / len(all_words)

    return {
        "mean_pairwise_cosine_similarity": round(float(mean_sim), 4),
        "p95_pairwise_cosine_similarity": round(float(p95_sim), 4),
        "vocabulary_diversity_ratio": round(vocab_diversity, 4),
        "num_examples": len(texts),
    }


# Run the full validation pipeline
deduped = deduplicate_by_embedding(labeled_dataset, similarity_threshold=0.92)
metrics = measure_diversity(deduped)
print(metrics)
# {'mean_pairwise_cosine_similarity': 0.3812, 'p95_pairwise_cosine_similarity': 0.7203,
#  'vocabulary_diversity_ratio': 0.1847, 'num_examples': 34}

# Abort training if diversity is too low
if metrics["mean_pairwise_cosine_similarity"] > 0.65:
    raise ValueError(
        f"Dataset diversity too low (mean cosine sim = {metrics['mean_pairwise_cosine_similarity']}). "
        "Increase temperature, add more varied seed prompts, or expand label coverage before training."
    )

The numbers to watch: mean pairwise cosine similarity above 0.65 is a collapse warning. Above 0.80, your dataset is effectively one example repeated with minor variations — do not train on it.

Filtering Low-Quality Generations

Not every generated example is usable. Filter out short outputs, examples with placeholder text, and anything the labeler couldn’t confidently classify.

def filter_low_quality(dataset: Dataset, min_words: int = 30) -> Dataset:
    """Drop examples that are too short or contain generation artifacts."""

    def is_quality(example):
        text = example["text"].strip()

        # Too short to be a real review
        if len(text.split()) < min_words:
            return False

        # Common generation artifacts
        artifacts = [
            "as an ai",
            "i cannot",
            "i'm unable",
            "[review]",
            "product name:",
            "insert",
        ]
        text_lower = text.lower()
        if any(a in text_lower for a in artifacts):
            return False

        return True

    before = len(dataset)
    filtered = dataset.filter(is_quality)
    after = len(filtered)
    print(f"Quality filter: kept {after}/{before} examples ({100*after/before:.1f}%)")
    return filtered


clean_dataset = filter_low_quality(deduped, min_words=30)

# Push to Hub for versioning and team access
clean_dataset.push_to_hub(
    "your-org/product-sentiment-synthetic-v1",
    token=os.environ["HF_TOKEN"],
)

Preventing Model Collapse in the Training Loop

Validated synthetic data is necessary but not sufficient. You also need to protect the training loop itself.

Mix synthetic and real data. Never train on 100% synthetic data. A 70/30 or 80/20 real-to-synthetic ratio is safe for most fine-tuning tasks. The real data anchors the distribution.
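The ratio math is worth making explicit. A back-of-envelope helper (name hypothetical) that caps the synthetic count for a target real-data fraction:

```python
def max_synthetic_examples(n_real, real_fraction=0.7):
    """Upper bound on synthetic examples so real data stays at real_fraction.

    Derived from n_real / (n_real + n_syn) >= real_fraction, which gives
    n_syn <= n_real * (1 - real_fraction) / real_fraction.
    """
    return round(n_real * (1 - real_fraction) / real_fraction)

print(max_synthetic_examples(7000))       # 3000 -> a 70/30 real-to-synthetic mix
print(max_synthetic_examples(7000, 0.8))  # 1750 -> an 80/20 mix
```

If your validated synthetic dataset is larger than this cap, subsample it rather than relaxing the ratio.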

Version your synthetic datasets. Push every generation run to the Hub with a version tag (v1, v2). When you retrain after a cycle, check the diversity metrics on the new synthetic batch against the baseline metrics from v1. A drop in vocabulary diversity ratio across versions is the first measurable collapse signal.
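A minimal sketch of that version-over-version comparison, assuming metric dicts shaped like the output of measure_diversity() above (function name and tolerance are hypothetical):

```python
def diversity_regression(baseline, current, tolerance=0.05):
    """Flag metrics that regressed beyond `tolerance` versus the baseline.

    For vocabulary_diversity_ratio a drop is bad; for the pairwise similarity
    metrics a rise is bad, so the comparison direction flips per metric.
    """
    higher_is_better = {"vocabulary_diversity_ratio"}
    lower_is_better = {
        "mean_pairwise_cosine_similarity",
        "p95_pairwise_cosine_similarity",
    }
    warnings = {}
    for key in higher_is_better | lower_is_better:
        if key not in baseline or key not in current:
            continue
        delta = current[key] - baseline[key]
        if key in higher_is_better and delta < -tolerance:
            warnings[key] = delta
        if key in lower_is_better and delta > tolerance:
            warnings[key] = delta
    return warnings

v1 = {"vocabulary_diversity_ratio": 0.1847, "mean_pairwise_cosine_similarity": 0.3812}
v2 = {"vocabulary_diversity_ratio": 0.1120, "mean_pairwise_cosine_similarity": 0.4100}
print(diversity_regression(v1, v2))  # flags the vocabulary_diversity_ratio drop
```

Run this as a gate in CI before any retraining job that pulls a new dataset version from the Hub.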

Regenerate from scratch, don’t resample from existing synthetic data. When you need more data, run the generation pipeline again against your original seed instructions. Never use the previous generation’s outputs as seed prompts for the next run — that’s the recursive trap that causes collapse.

Using the No-Code Interface

If you’re prototyping and want results in 10 minutes, the Hugging Face Space handles text classification and chat SFT datasets without any code. It generates ~50 samples/minute for classification and ~20 for chat, and pushes the result directly to the Hub and Argilla for review.

The tradeoff is that the Space gives you less control over generation parameters and no automated diversity metrics. Use it for quick prototyping, then switch to the distilabel Python pipeline when you need production-quality data.

Common Errors

InferenceEndpointsLLM throws 401 Unauthorized. Set HF_TOKEN in your environment or pass api_key=os.environ["HF_TOKEN"] directly to the LLM constructor. Serverless Inference endpoints require a write-access token.

Pipeline produces identical outputs across seeds. Your temperature is too low, or the system prompt is too restrictive. Raise temperature to at least 0.8 and diversify the system prompt wording. If you are using a small model like an 8B, switch to a 70B for better generation diversity.

SentenceTransformer.encode is slow on CPU. Pass the device when constructing the model: SentenceTransformer("all-MiniLM-L6-v2", device="cuda"). For CPU-only environments, all-MiniLM-L6-v2 is already the fastest useful model — avoid larger models for deduplication at scale.

Label distribution is 90% positive, 10% everything else. Your seed prompts are skewed. Add 3-4 explicit negative and neutral seed instructions to balance generation. The LLM defaults toward positive/agreeable outputs when the prompt is ambiguous about sentiment target.
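One way to avoid that skew from the start is to construct seeds by cycling every label through a set of aspect templates, so no class depends on the LLM's default tone. A sketch with hypothetical names:

```python
# Build an explicitly balanced seed list: every sentiment label is paired
# with every product aspect, so each class gets equal generation pressure.
ASPECTS = ["battery life", "build quality", "setup process", "connectivity"]
SENTIMENTS = ["positive", "negative", "neutral"]

def balanced_seeds(aspects, sentiments):
    return [
        {"instruction": f"Write a {sentiment} product review for a wireless "
                        f"keyboard focusing on {aspect}. Be specific."}
        for aspect in aspects
        for sentiment in sentiments
    ]

seeds = balanced_seeds(ASPECTS, SENTIMENTS)
print(len(seeds))  # 12 seeds, 4 per sentiment label
```

These dicts drop straight into the SEED_INSTRUCTIONS list used by the pipeline above.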