You have a dataset you can’t share. Maybe it contains customer records, medical histories, or financial transactions. You need realistic data for development, testing, or training ML models, but privacy regulations make sharing the original impossible. CTGAN solves this by learning the statistical distributions and correlations in your real data, then generating entirely new rows that look real but aren’t tied to any actual person.
The SDV (Synthetic Data Vault) library wraps CTGAN in a clean API that handles metadata detection, model training, and quality evaluation. Here’s the fastest path from real data to synthetic data.
Generate Synthetic Data in 30 Seconds
Install SDV first:
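A standard pip install pulls in SDV together with its CTGAN dependency:

```shell
pip install sdv
```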
Now generate synthetic data from a sample dataset:
That’s the whole pipeline. CTGAN uses a generative adversarial network architecture specifically designed for tabular data – it handles mixed column types (continuous, discrete, categorical) out of the box, which is something vanilla GANs struggle with.
Defining Metadata with SingleTableMetadata
Automatic detection works for quick experiments, but you’ll get better results by telling SDV exactly what each column represents. The metadata object controls how CTGAN treats each feature during training.
Getting metadata right matters. If SDV misidentifies a zip code column as numerical, CTGAN will generate zip codes like 53782.4 instead of valid five-digit strings. Mark those as sdtype="categorical" or sdtype="id" depending on your use case.
Training CTGAN with CTGANSynthesizer
CTGAN has several hyperparameters worth tuning. The defaults are reasonable, but for production pipelines you’ll want to adjust them based on your dataset size and complexity.
A few practical notes on training. CTGAN benefits from more epochs when your dataset has complex multi-modal distributions. If you see the loss oscillating wildly, reduce the learning rate by passing generator_lr=1e-4 and discriminator_lr=1e-4. For datasets under 1,000 rows, consider using GaussianCopulaSynthesizer instead – GANs need enough examples to learn meaningful patterns.
Evaluating Synthetic Data Quality
Generating data is only half the job. You need to verify the synthetic output actually preserves the statistical properties of the original.
A quality score above 80% is solid for most use cases. The column shapes metric checks whether individual distributions match (using Kolmogorov-Smirnov for numerical, total variation distance for categorical). Column pair trends checks whether correlations between columns are preserved – this is where CTGAN really shines compared to simpler approaches like random sampling with per-column distributions.
Adding Constraints
Real data has business rules. Ages must be positive. End dates come after start dates. CTGAN doesn’t automatically know these rules, but SDV lets you enforce them.
Constraints are applied as a post-processing step – SDV rejects and regenerates rows that violate them. If your constraints are very tight relative to the learned distribution, generation can slow down significantly. Keep constraints to genuine business rules rather than trying to micromanage distributions.
Conditional Generation
Sometimes you need synthetic data that matches specific conditions. Want 500 rows where loan_approved is True? Use conditional sampling.
This is particularly useful for addressing class imbalance. If your real dataset has 95% approved and 5% denied, you can generate a balanced synthetic dataset by creating conditions for each class separately.
Common Errors and Fixes
InvalidDataError: The provided data does not match the metadata
Your dataframe columns don’t match the metadata definition. This happens when you define metadata manually and misspell a column name or forget to include one. Run metadata.validate_data(data) before fitting to catch mismatches early.
ValueError: DataProcessor has not been fitted
You’re calling sample() before fit(). CTGAN needs to train on real data before it can generate anything. Make sure synthesizer.fit(real_data) completes without errors first.
ConstraintsNotMetError during sampling
Your constraints conflict with each other or are too restrictive for the learned distribution. SDV tries multiple times to generate valid rows and gives up after a threshold. Loosen the constraints or increase max_tries_per_batch if available.
Poor quality scores (below 60%)
Several things to check:
- Increase epochs – CTGAN may need more training time
- Check your metadata types – a miscategorized column tanks the score
- Make sure your real dataset has enough rows (at least a few hundred)
- For small datasets, try GaussianCopulaSynthesizer instead of CTGAN
RuntimeError: CUDA out of memory
CTGAN uses GPU if available. Reduce batch_size or force CPU with:
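Two ways to keep CTGAN off the GPU; the environment variable must be set before torch is first imported, which in practice means before importing SDV:

```python
import os

# Option 1: hide all GPUs from PyTorch -- run before importing SDV/torch
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# Option 2: ask the synthesizer for CPU explicitly (SDV 1.x)
# synthesizer = CTGANSynthesizer(metadata, cuda=False)
```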
Run this before importing SDV.
Synthetic data has unrealistic outliers
CTGAN can extrapolate beyond the range of your training data. Add ScalarRange constraints for columns that have hard boundaries (like age, percentages, or scores with known min/max values).
Related Guides
- How to Create Synthetic Training Data with LLMs
- How to Generate Synthetic Training Data with Hugging Face’s Synthetic Data Generator Without Triggering Model Collapse
- How to Build a Data Augmentation Pipeline for Tabular Data
- How to Build a Feature Importance and Selection Pipeline with Scikit-Learn
- How to Build a Data Annotation Pipeline with Argilla
- How to Build a Dataset Bias Detection Pipeline with Python
- How to Build a Data Versioning Pipeline with Delta Lake for ML
- How to Build a Data Labeling Pipeline with Label Studio
- How to Build a Data Slicing and Stratification Pipeline for ML
- How to Build a Data Sampling Pipeline for Large-Scale ML Training