You have a dataset you can’t share. Maybe it contains customer records, medical histories, or financial transactions. You need realistic data for development, testing, or training ML models, but privacy regulations make sharing the original impossible. CTGAN solves this by learning the statistical distributions and correlations in your real data, then generating entirely new rows that look real but aren’t tied to any actual person.
The SDV (Synthetic Data Vault) library wraps CTGAN in a clean API that handles metadata detection, model training, and quality evaluation. Here’s the fastest path from real data to synthetic data.
Generate Synthetic Data in 30 Seconds
Install SDV first:
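A standard pip install pulls in SDV together with its CTGAN dependency:

```shell
pip install sdv
```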
Now generate synthetic data from a sample dataset:
That’s the whole pipeline. CTGAN uses a generative adversarial network architecture specifically designed for tabular data – it handles mixed column types (continuous, discrete, categorical) out of the box, which is something vanilla GANs struggle with.
Defining Metadata with SingleTableMetadata
Automatic detection works for quick experiments, but you’ll get better results by telling SDV exactly what each column represents. The metadata object controls how CTGAN treats each feature during training.
Getting metadata right matters. If SDV misidentifies a zip code column as numerical, CTGAN will generate zip codes like 53782.4 instead of valid five-digit strings. Mark those as sdtype="categorical" or sdtype="id" depending on your use case.
Training CTGAN with CTGANSynthesizer
CTGAN has several hyperparameters worth tuning. The defaults are reasonable, but for production pipelines you’ll want to adjust them based on your dataset size and complexity.
A few practical notes on training. CTGAN benefits from more epochs when your dataset has complex multi-modal distributions. If you see the loss oscillating wildly, reduce the learning rate by passing generator_lr=1e-4 and discriminator_lr=1e-4. For datasets under 1,000 rows, consider using GaussianCopulaSynthesizer instead – GANs need enough examples to learn meaningful patterns.
Evaluating Synthetic Data Quality
Generating data is only half the job. You need to verify the synthetic output actually preserves the statistical properties of the original.
A quality score above 80% is solid for most use cases. The column shapes metric checks whether individual distributions match (using Kolmogorov-Smirnov for numerical, total variation distance for categorical). Column pair trends checks whether correlations between columns are preserved – this is where CTGAN really shines compared to simpler approaches like random sampling with per-column distributions.
Adding Constraints
Real data has business rules. Ages must be positive. End dates come after start dates. CTGAN doesn’t automatically know these rules, but SDV lets you enforce them.
Constraints are applied as a post-processing step – SDV rejects and regenerates rows that violate them. If your constraints are very tight relative to the learned distribution, generation can slow down significantly. Keep constraints to genuine business rules rather than trying to micromanage distributions.
Conditional Generation
Sometimes you need synthetic data that matches specific conditions. Want 500 rows where loan_approved is True? Use conditional sampling.
This is particularly useful for addressing class imbalance. If your real dataset has 95% approved and 5% denied, you can generate a balanced synthetic dataset by creating conditions for each class separately.
Common Errors and Fixes
InvalidDataError: The provided data does not match the metadata
Your dataframe columns don’t match the metadata definition. This happens when you define metadata manually and misspell a column name or forget to include one. Run metadata.validate_data(data) before fitting to catch mismatches early.
ValueError: DataProcessor has not been fitted
You’re calling sample() before fit(). CTGAN needs to train on real data before it can generate anything. Make sure synthesizer.fit(real_data) completes without errors first.
ConstraintsNotMetError during sampling
Your constraints conflict with each other or are too restrictive for the learned distribution. SDV tries multiple times to generate valid rows and gives up after a threshold. Loosen the constraints or increase max_tries_per_batch if available.
Poor quality scores (below 60%)
Several things to check:
- Increase epochs – CTGAN may need more training time
- Check your metadata types – a miscategorized column tanks the score
- Make sure your real dataset has enough rows (at least a few hundred)
- For small datasets, try GaussianCopulaSynthesizer instead of CTGAN
RuntimeError: CUDA out of memory
CTGAN uses GPU if available. Reduce batch_size or force CPU with:
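Two ways to keep CTGAN off the GPU; the environment variable must be set before torch is first imported, which in practice means before importing SDV:

```python
import os

# Option 1: hide all GPUs from PyTorch -- run before importing SDV/torch
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# Option 2: ask the synthesizer for CPU explicitly (SDV 1.x)
# synthesizer = CTGANSynthesizer(metadata, cuda=False)
```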
Run this before importing SDV.
Synthetic data has unrealistic outliers
CTGAN can extrapolate beyond the range of your training data. Add ScalarRange constraints for columns that have hard boundaries (like age, percentages, or scores with known min/max values).
Related Guides
- How to Create Synthetic Training Data with LLMs
- How to Generate Synthetic Training Data with Hugging Face’s Synthetic Data Generator Without Triggering Model Collapse
- How to Build a Data Augmentation Pipeline for Tabular Data
- How to Build a Feature Importance and Selection Pipeline with Scikit-Learn
- How to Build a Data Annotation Pipeline with Argilla
- How to Build a Dataset Bias Detection Pipeline with Python
- How to Build a Data Versioning Pipeline with Delta Lake for ML
- How to Build a Data Labeling Pipeline with Label Studio
- How to Build a Data Slicing and Stratification Pipeline for ML
- How to Build a Data Sampling Pipeline for Large-Scale ML Training