Data augmentation is standard practice in computer vision and NLP, but it’s surprisingly underused for tabular data. If you’re working with a small dataset or dealing with class imbalance, augmentation can be the difference between a model that generalizes and one that memorizes. Here’s how to build a pipeline that combines SMOTE oversampling, SDV synthetic data generation, and noise injection into a single reusable workflow.

Install the dependencies first:

pip install imbalanced-learn sdv pandas numpy scikit-learn

Oversampling with SMOTE

SMOTE (Synthetic Minority Oversampling Technique) creates new samples by interpolating between existing minority class examples. It’s the go-to fix for imbalanced classification problems where your minority class has too few samples for the model to learn meaningful patterns.
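The interpolation step itself is simple enough to sketch in a few lines of NumPy. This is a toy illustration of the mechanism, not imbalanced-learn's actual implementation: a synthetic point lands at a random position on the segment between a minority sample and one of its nearest minority-class neighbors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two minority-class points; SMOTE's interpolation step is
# new_point = x + lam * (neighbor - x), with lam drawn uniformly from [0, 1]
x = np.array([3.0, 2.0])
neighbor = np.array([3.5, 2.4])

lam = rng.uniform(0, 1)
new_point = x + lam * (neighbor - x)

# The synthetic sample always lies on the segment between x and its neighbor
print(new_point)
```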

import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE

# Create a sample imbalanced dataset
np.random.seed(42)
n_majority = 500
n_minority = 50

data = pd.DataFrame({
    "feature_1": np.concatenate([
        np.random.normal(0, 1, n_majority),
        np.random.normal(3, 1, n_minority)
    ]),
    "feature_2": np.concatenate([
        np.random.normal(0, 1, n_majority),
        np.random.normal(2, 1, n_minority)
    ]),
    "feature_3": np.concatenate([
        np.random.uniform(0, 10, n_majority),
        np.random.uniform(5, 15, n_minority)
    ]),
    "label": np.concatenate([
        np.zeros(n_majority, dtype=int),
        np.ones(n_minority, dtype=int)
    ])
})

print(f"Before SMOTE: {data['label'].value_counts().to_dict()}")

X = data.drop(columns=["label"])
y = data["label"]

smote = SMOTE(sampling_strategy="auto", random_state=42, k_neighbors=5)
X_resampled, y_resampled = smote.fit_resample(X, y)

augmented_data = pd.DataFrame(X_resampled, columns=X.columns)
augmented_data["label"] = y_resampled

print(f"After SMOTE: {augmented_data['label'].value_counts().to_dict()}")
# Before SMOTE: {0: 500, 1: 50}
# After SMOTE: {0: 500, 1: 500}

The k_neighbors parameter controls how many nearest neighbors SMOTE uses to generate synthetic points. Lower values (3 to 5) produce samples closer to existing data points; higher values create more diverse samples but risk crossing decision boundaries. For very small minority classes (under 20 samples), drop k_neighbors to 2 or 3, since SMOTE needs strictly more minority samples than neighbors.
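A small helper (hypothetical, not part of imbalanced-learn) makes that adjustment automatic: because SMOTE needs more minority samples than neighbors, cap k at one less than the minority count.

```python
# Hypothetical helper: choose a usable k_neighbors for SMOTE.
# SMOTE requires k_neighbors to be strictly less than the minority class size.
def safe_k_neighbors(minority_count, default_k=5):
    return max(1, min(default_k, minority_count - 1))

print(safe_k_neighbors(50))  # 5 -- plenty of samples, keep the default
print(safe_k_neighbors(5))   # 4 -- tiny class, shrink k automatically
```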

Generating Synthetic Data with SDV

SMOTE works well for balancing classes, but it only interpolates between existing points. SDV (Synthetic Data Vault) learns the statistical distributions and correlations in your data, then generates entirely new rows that preserve those relationships. This is better when you need to expand your entire dataset, not just one class.
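The copula idea can be sketched in plain NumPy. This is a toy illustration of the principle, not SDV's implementation: treat each column's marginal distribution separately, capture cross-column dependence with correlated Gaussian samples, then map those samples back through the empirical marginals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" data: two correlated columns with very different marginal shapes
n = 2_000
latent = rng.normal(size=n)
real = np.column_stack([
    np.exp(0.5 * latent + rng.normal(scale=0.2, size=n)),  # skewed column
    3 * latent + rng.normal(scale=1.0, size=n),            # roughly normal column
])

# Copula idea in miniature:
# 1) estimate the cross-column correlation,
corr = np.corrcoef(real, rowvar=False)
# 2) draw new correlated Gaussian samples,
z = rng.multivariate_normal(np.zeros(2), corr, size=n)
# 3) map each column back through its empirical marginal (quantile mapping).
u = (z.argsort(axis=0).argsort(axis=0) + 0.5) / n  # Gaussian scores -> uniforms via ranks
synthetic = np.column_stack([
    np.quantile(real[:, j], u[:, j]) for j in range(2)
])

# Synthetic columns keep both the marginal shapes and the cross-correlation
print(np.corrcoef(synthetic, rowvar=False)[0, 1].round(2))
```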

import pandas as pd
import numpy as np
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Create sample customer data
np.random.seed(42)
original_data = pd.DataFrame({
    "age": np.random.randint(18, 70, size=200),
    "income": np.random.normal(55000, 15000, size=200).round(2),
    "credit_score": np.random.randint(300, 850, size=200),
    "category": np.random.choice(["A", "B", "C"], size=200),
    "churned": np.random.choice([0, 1], size=200, p=[0.8, 0.2])
})

# Detect metadata from the dataframe
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(original_data)

# Train the synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(original_data)

# Generate 500 new synthetic rows
synthetic_data = synthesizer.sample(num_rows=500)

print(f"Original shape: {original_data.shape}")
print(f"Synthetic shape: {synthetic_data.shape}")
print(f"\nOriginal means:\n{original_data[['age', 'income', 'credit_score']].mean()}")
print(f"\nSynthetic means:\n{synthetic_data[['age', 'income', 'credit_score']].mean()}")

The synthetic data should have similar distributions to the original. SDV’s GaussianCopulaSynthesizer models each column’s marginal distribution independently and captures inter-column correlations with a Gaussian copula. It handles mixed types well – numerical and categorical columns in the same table are fine.

You can also update the metadata manually if auto-detection gets something wrong:

metadata.update_column(column_name="churned", sdtype="categorical")
metadata.update_column(column_name="age", sdtype="numerical")

Adding Noise Injection

Sometimes you don’t need entirely new rows. You just need slightly perturbed versions of existing ones. Noise injection adds small random changes to numerical features and occasional random swaps for categorical ones. This teaches your model to be less sensitive to minor variations.

import pandas as pd
import numpy as np

def inject_noise(df, numerical_cols, categorical_cols, noise_level=0.05, swap_prob=0.1, n_copies=2, seed=42):
    """Create augmented copies of a dataframe with noise injection.

    Args:
        df: Original dataframe
        numerical_cols: Columns to add Gaussian noise to
        categorical_cols: Columns to randomly swap values in
        noise_level: Standard deviation of noise as fraction of column std
        swap_prob: Probability of swapping a categorical value
        n_copies: Number of noisy copies to generate
        seed: Random seed for reproducibility
    """
    rng = np.random.default_rng(seed)
    augmented_frames = [df.copy()]

    for i in range(n_copies):
        noisy = df.copy()

        # Add Gaussian noise to numerical columns
        for col in numerical_cols:
            col_std = df[col].std()
            noise = rng.normal(0, noise_level * col_std, size=len(df))
            noisy[col] = noisy[col] + noise

        # Random swaps for categorical columns
        for col in categorical_cols:
            unique_vals = df[col].unique()
            mask = rng.random(size=len(df)) < swap_prob
            random_vals = rng.choice(unique_vals, size=mask.sum())
            noisy.loc[mask, col] = random_vals

        augmented_frames.append(noisy)

    return pd.concat(augmented_frames, ignore_index=True)


# Use the same customer data from before
np.random.seed(42)
original_data = pd.DataFrame({
    "age": np.random.randint(18, 70, size=100),
    "income": np.random.normal(55000, 15000, size=100).round(2),
    "credit_score": np.random.randint(300, 850, size=100),
    "category": np.random.choice(["A", "B", "C"], size=100),
    "churned": np.random.choice([0, 1], size=100, p=[0.8, 0.2])
})

augmented = inject_noise(
    original_data,
    numerical_cols=["age", "income", "credit_score"],
    categorical_cols=["category"],
    noise_level=0.05,
    swap_prob=0.1,
    n_copies=3
)

print(f"Original rows: {len(original_data)}")
print(f"Augmented rows: {len(augmented)}")
# Original rows: 100
# Augmented rows: 400

Keep noise_level low – 0.02 to 0.10 is usually the sweet spot. Go too high and you distort the signal your model needs to learn. The same goes for swap_prob on categoricals. A 5-10% swap rate adds variety without destroying class relationships.
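You can sanity-check a noise level before committing to it. For independent Gaussian noise scaled to a fraction of the column's standard deviation, the noisy column's std grows by roughly sqrt(1 + noise_level**2), so 5% noise barely changes the spread while 50% inflates it visibly:

```python
import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(55000, 15000, size=10_000)

ratios = {}
for noise_level in (0.05, 0.50):
    noisy = values + rng.normal(0, noise_level * values.std(), size=values.size)
    # For independent Gaussian noise, std grows by about sqrt(1 + noise_level**2)
    ratios[noise_level] = noisy.std() / values.std()
    print(f"noise_level={noise_level}: std ratio = {ratios[noise_level]:.3f}")
```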

Building the Complete Pipeline

Now tie everything together into a single pipeline class that chains SMOTE, SDV, and noise injection in sequence:

import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer


class TabularAugmentationPipeline:
    def __init__(self, target_col, numerical_cols, categorical_cols):
        self.target_col = target_col
        self.numerical_cols = numerical_cols
        self.categorical_cols = categorical_cols

    def apply_smote(self, df, sampling_strategy="auto", k_neighbors=5):
        """Balance classes using SMOTE."""
        X = df.drop(columns=[self.target_col])
        y = df[self.target_col]

        # Encode categoricals for SMOTE (it needs numeric input)
        X_encoded = pd.get_dummies(X, columns=self.categorical_cols)

        smote = SMOTE(sampling_strategy=sampling_strategy, k_neighbors=k_neighbors, random_state=42)
        X_res, y_res = smote.fit_resample(X_encoded, y)

        result = pd.DataFrame(X_res, columns=X_encoded.columns)
        result[self.target_col] = y_res
        return result

    def apply_sdv(self, df, num_synthetic_rows=None):
        """Generate synthetic rows preserving statistical properties."""
        if num_synthetic_rows is None:
            num_synthetic_rows = len(df)

        metadata = SingleTableMetadata()
        metadata.detect_from_dataframe(df)
        synthesizer = GaussianCopulaSynthesizer(metadata)
        synthesizer.fit(df)
        synthetic = synthesizer.sample(num_rows=num_synthetic_rows)
        return pd.concat([df, synthetic], ignore_index=True)

    def apply_noise(self, df, noise_level=0.05, swap_prob=0.1, n_copies=1):
        """Add Gaussian noise to numericals, random swaps to categoricals."""
        rng = np.random.default_rng(42)
        frames = [df.copy()]

        # Identify which numerical cols are actually in the dataframe
        num_cols = [c for c in self.numerical_cols if c in df.columns]
        cat_cols = [c for c in self.categorical_cols if c in df.columns]

        for _ in range(n_copies):
            noisy = df.copy()
            for col in num_cols:
                std = df[col].std()
                noise = rng.normal(0, noise_level * std, size=len(df))
                noisy[col] = noisy[col] + noise
            for col in cat_cols:
                unique_vals = df[col].unique()
                mask = rng.random(size=len(df)) < swap_prob
                noisy.loc[mask, col] = rng.choice(unique_vals, size=mask.sum())
            frames.append(noisy)

        return pd.concat(frames, ignore_index=True)

    def run(self, df, use_smote=True, use_sdv=True, use_noise=True, sdv_rows=None):
        """Run the full augmentation pipeline."""
        result = df.copy()
        print(f"Starting rows: {len(result)}")

        if use_smote:
            result = self.apply_smote(result)
            print(f"After SMOTE: {len(result)}")

        if use_sdv:
            result = self.apply_sdv(result, num_synthetic_rows=sdv_rows)
            print(f"After SDV: {len(result)}")

        if use_noise:
            result = self.apply_noise(result)
            print(f"After noise injection: {len(result)}")

        return result


# Example usage
np.random.seed(42)
df = pd.DataFrame({
    "age": np.random.randint(18, 70, size=200),
    "income": np.random.normal(55000, 15000, size=200).round(2),
    "credit_score": np.random.randint(300, 850, size=200),
    "category": np.random.choice(["A", "B", "C"], size=200),
    "churned": np.random.choice([0, 1], size=200, p=[0.85, 0.15])
})

pipeline = TabularAugmentationPipeline(
    target_col="churned",
    numerical_cols=["age", "income", "credit_score"],
    categorical_cols=["category"]
)

augmented = pipeline.run(df, use_smote=True, use_sdv=True, use_noise=True, sdv_rows=300)

You can toggle each step independently. For a dataset that’s just imbalanced but large enough, skip SDV and noise – use SMOTE alone. For a genuinely small dataset with balanced classes, skip SMOTE and use SDV plus noise to bulk it up.

Validating Augmented Data

Always check that your augmented data preserves the original distributions. If augmentation shifts the distribution too far, your model learns from data that doesn’t match reality.

import pandas as pd
import numpy as np

def validate_augmentation(original, augmented, numerical_cols):
    """Compare distributions between original and augmented data."""
    print("Column            | Orig Mean  | Aug Mean   | Orig Std   | Aug Std")
    print("-" * 72)
    for col in numerical_cols:
        orig_mean = original[col].mean()
        aug_mean = augmented[col].mean()
        orig_std = original[col].std()
        aug_std = augmented[col].std()
        print(f"{col:<18}| {orig_mean:>10.2f} | {aug_mean:>10.2f} | {orig_std:>10.2f} | {aug_std:>10.2f}")

# Compare original vs augmented
validate_augmentation(df, augmented, ["age", "income", "credit_score"])

If the means drift by more than 10-15% or the standard deviations diverge significantly, reduce the noise level or generate fewer synthetic rows. The goal is a larger dataset that still looks like your original data.
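To turn that eyeball check into a pass/fail gate, a small helper (hypothetical, assuming nonzero column means) can flag columns whose mean drifts past a threshold:

```python
import pandas as pd

def drift_report(original, augmented, numerical_cols, max_drift=0.15):
    """Hypothetical helper: flag columns whose mean drifted more than max_drift.

    Assumes column means are nonzero (drift is measured relative to the mean).
    """
    flagged = []
    for col in numerical_cols:
        orig_mean = original[col].mean()
        drift = abs(augmented[col].mean() - orig_mean) / abs(orig_mean)
        if drift > max_drift:
            flagged.append((col, round(drift, 3)))
    return flagged

# Toy check: an augmented copy with income inflated by 30% should be flagged
orig = pd.DataFrame({"age": [30, 40, 50], "income": [40000.0, 55000.0, 70000.0]})
aug = orig.copy()
aug["income"] = aug["income"] * 1.3
print(drift_report(orig, aug, ["age", "income"]))
# [('income', 0.3)]
```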

Common Errors and Fixes

ValueError: Expected n_neighbors <= n_samples from SMOTE – Your minority class has fewer samples than the requested k_neighbors allows. SMOTE needs at least k_neighbors + 1 minority samples, so lower k_neighbors to less than your smallest class count. If you have 5 minority samples, k_neighbors can be at most 4; 3 leaves some margin.

InvalidDataError from SDV metadata detection – SDV sometimes misidentifies column types. Manually set the sdtype after calling detect_from_dataframe:

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df)
metadata.update_column(column_name="zip_code", sdtype="categorical")

SMOTE fails with categorical features – SMOTE only works with numerical data. Either one-hot encode categoricals before applying SMOTE (as the pipeline class does), or use SMOTENC from imblearn.over_sampling which handles mixed types directly:

from imblearn.over_sampling import SMOTENC

# categorical_features is a boolean mask indicating which columns are categorical
smotenc = SMOTENC(categorical_features=[False, False, False, True], random_state=42)
X_res, y_res = smotenc.fit_resample(X, y)

Augmented data has NaN values – The usual culprit is concatenating frames with mismatched columns (for example, one-hot encoded SMOTE output alongside the original categorical columns): pd.concat fills the missing cells with NaN. Noise injection can also push integer columns out of their valid range. Align column sets before concatenating, then cast back to the original dtype after augmentation and clip values to valid ranges:

augmented["age"] = augmented["age"].clip(lower=0).round().astype(int)
augmented["credit_score"] = augmented["credit_score"].clip(lower=300, upper=850).round().astype(int)

SDV sample() is slow for large datasets – GaussianCopulaSynthesizer scales with the number of columns more than rows. If you have 100+ columns, consider fitting on a subset of the most important features, or switch to CTGANSynthesizer which handles high-dimensional data better at the cost of longer training time.