Machine Learning, Copula and Synthetic Data

Synthetic data is one of those ideas that sounds like it shouldn’t work — if the data is fake, how can a model trained on it generalize to real data? The answer is that “fake” here doesn’t mean arbitrary. Synthetic data generated properly preserves the statistical structure of the original: the marginal distributions of each variable, and the dependence structure between them. If you get that right, the downstream model often can’t tell the difference.

Why bother? Two reasons come up constantly: small samples and privacy constraints. When you don’t have enough data to train a reliable model, generating additional synthetic observations from the learned distribution can help. When your data contains sensitive information that limits how freely it can be shared or used, synthetic data gives you a safe stand-in.

Copulas are the right tool for this. A copula separates two things that are usually bundled together in a multivariate distribution: the marginal distribution of each individual variable, and the dependence structure among them. This separation matters because real data is rarely multivariate normal — the marginals may follow different parametric families, and the correlations in the tails may differ from correlations in the center. Copulas handle both.
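Sklar's theorem makes this separation concrete: any continuous joint distribution can be written as a copula applied to its marginals. A minimal numpy/scipy sketch of the recipe (illustrative parameters, not the Cars data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 1. Sample from a bivariate normal with correlation rho.
#    This fixes the dependence structure (a Gaussian copula).
rho = 0.8
cov = [[1.0, rho], [rho, 1.0]]
z = rng.multivariate_normal([0.0, 0.0], cov, size=10_000)

# 2. Push each coordinate through the standard normal CDF.
#    The result is uniform on [0, 1] but still correlated.
u = stats.norm.cdf(z)

# 3. Apply whatever inverse CDFs you like, here a gamma and an exponential.
#    The marginals change; the copula (dependence) does not.
x = stats.gamma.ppf(u[:, 0], a=2.0)
y = stats.expon.ppf(u[:, 1])

print(stats.spearmanr(x, y)[0])
```

Because the CDF and quantile transforms are strictly monotone, rank-based dependence (Spearman, Kendall) is preserved exactly while the marginals change completely.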

To give a concrete example, let's use a few variables from the classic Auto MPG dataset, loaded via seaborn.

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from copulas.multivariate import GaussianMultivariate
from copulas.univariate import ParametricType, Univariate

df = sns.load_dataset("mpg")
df = df.drop(columns=['origin', 'name'])
df = df.dropna()

df = df[['horsepower', 'weight', 'acceleration', 'mpg']]
df.describe()
horsepower weight acceleration mpg
count 392.000000 392.000000 392.000000 392.000000
mean 104.469388 2977.584184 15.541327 23.445918
std 38.491160 849.402560 2.758864 7.805007
min 46.000000 1613.000000 8.000000 9.000000
25% 75.000000 2225.250000 13.775000 17.000000
50% 93.500000 2803.500000 15.500000 22.750000
75% 126.000000 3614.750000 17.025000 29.000000
max 230.000000 5140.000000 24.800000 46.600000

Let’s look at the distribution of each variable, the pairwise scatter plots, and the correlations.

def corrdot(*args, **kwargs):
    """Draw a correlation 'dot' (sized and colored by Pearson r) in a PairGrid cell."""
    corr_r = args[0].corr(args[1], 'pearson')
    corr_text = f"{corr_r:2.2f}".replace("0.", ".")
    ax = plt.gca()
    ax.set_axis_off()
    marker_size = abs(corr_r) * 1000
    ax.scatter([.5], [.5], marker_size, [corr_r], alpha=0.6, cmap="coolwarm",
               vmin=-1, vmax=1, transform=ax.transAxes)
    font_size = abs(corr_r) * 10 + 5
    ax.annotate(corr_text, (.5, .5), xycoords="axes fraction",
                ha='center', va='center', fontsize=font_size)

sns.set(style='white', font_scale=1)
g = sns.PairGrid(df[['horsepower', 'weight', 'acceleration']], aspect=1, diag_sharey=False)
g.map_lower(sns.regplot, lowess=True, ci=None, line_kws={'color': 'black'})
g.map_diag(sns.kdeplot, color='black')
g.map_upper(corrdot)
plt.show()

A few things stand out. The variables are strongly correlated in some pairs. The distributions are different: acceleration is close to normal, but horsepower and weight have heavier right tails. A Gaussian copula with normal marginals would miss that — which is why we want the copula to handle marginals separately.
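A rough way to confirm the right skew without plotting is Pearson's second skewness coefficient, 3 * (mean - median) / std, using the values from df.describe() above (rounded here for the sketch):

```python
# (mean, median, std) rounded from the df.describe() output above.
summary = {
    "horsepower":   (104.47, 93.50, 38.49),
    "weight":       (2977.58, 2803.50, 849.40),
    "acceleration": (15.54, 15.50, 2.76),
}
for name, (mean, median, std) in summary.items():
    # Pearson's second skewness coefficient: positive => right-skewed.
    skew = 3 * (mean - median) / std
    print(f"{name:>12}: {skew:+.2f}")
# horsepower +0.86, weight +0.61, acceleration +0.04
```

Horsepower and weight are clearly right-skewed; acceleration is close to symmetric, consistent with the KDE panels.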

Before generating synthetic data, let’s fit a simple linear regression on the original data as a baseline.

y = df.pop('mpg')
X = df

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

print(model.score(X_test, y_test))
0.6501833421053667

Now we generate a synthetic training set using a Gaussian copula, with the marginals estimated separately as parametric distributions (not KDE). The copula model selects the best-fitting parametric family for each variable, then learns the correlation structure among them.

# Select the best PARAMETRIC univariate (no KDE)
univariate = Univariate(parametric=ParametricType.PARAMETRIC)


def create_synthetic(X, y):
    """
    This function combines X and y into a single dataset D, models it
    using a Gaussian copula, and generates a synthetic dataset S. It
    returns the new, synthetic versions of X and y.
    """
    dataset = np.concatenate([X, np.expand_dims(y, 1)], axis=1)

    distribs = GaussianMultivariate(distribution=univariate)
    distribs.fit(dataset)

    synthetic = distribs.sample(len(dataset))

    X = synthetic.values[:, :-1]
    y = synthetic.values[:, -1]

    return X, y, distribs

X_synthetic, y_synthetic, dist = create_synthetic(X_train, y_train)

Let’s inspect what the copula model learned — the fitted distribution for each variable:

parameters = dist.to_dict()
parameters['univariates']
[{'a': np.float64(2.505456580509649),
  'loc': np.float64(44.28600428264269),
  'scale': np.float64(24.186079118156652),
  'type': 'copulas.univariate.gamma.GammaUnivariate'},
 {'loc': np.float64(1604.636580448248),
  'scale': np.float64(3779.05677499984),
  'a': np.float64(1.470861609942495),
  'b': np.float64(2.520267107805836),
  'type': 'copulas.univariate.beta.BetaUnivariate'},
 {'loc': np.float64(1.0632403337723968),
  'scale': np.float64(73.46060125357005),
  'a': np.float64(20.470245095113114),
  'b': np.float64(83.54399672283503),
  'type': 'copulas.univariate.beta.BetaUnivariate'},
 {'loc': np.float64(9.84592394053172),
  'scale': np.float64(40.487634662917245),
  'a': np.float64(1.6832256330594189),
  'b': np.float64(3.231729039577079),
  'type': 'copulas.univariate.beta.BetaUnivariate'}]
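As a sanity check, the first entry (the gamma fitted to horsepower) implies a mean and standard deviation we can compute by hand, using parameters rounded from the output above:

```python
# Gamma parameters rounded from the fitted horsepower distribution above.
# For a gamma with shape a: mean = loc + a * scale, std = sqrt(a) * scale.
a, loc, scale = 2.5055, 44.2860, 24.1861
mean = loc + a * scale
std = a ** 0.5 * scale
print(round(mean, 2), round(std, 2))   # ~104.88 and ~38.28
```

Close to the sample mean (104.47) and std (38.49) from df.describe(); the small gap is expected since the copula was fitted on the training split only.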

And the learned correlation matrix that defines the joint structure:

parameters['correlation']
[[1.0, 0.8473079530880548, -0.720090874761445, -0.8315209336313288],
 [0.8473079530880548, 1.0, -0.42047354869738274, -0.8311150310370441],
 [-0.720090874761445, -0.42047354869738274, 1.0, 0.4357973452281482],
 [-0.8315209336313288, -0.8311150310370441, 0.4357973452281482, 1.0]]
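These entries are correlations of the normal scores, not of the raw variables. For a Gaussian copula they translate into Spearman rank correlations of the data via rho_S = (6 / pi) * arcsin(rho / 2), whatever the marginals are. A quick sketch using rounded values from the matrix above:

```python
import numpy as np

# Correlation matrix learned by the Gaussian copula, rounded from
# parameters['correlation'] above.
R = np.array([
    [1.0000,  0.8473, -0.7201, -0.8315],
    [0.8473,  1.0000, -0.4205, -0.8311],
    [-0.7201, -0.4205, 1.0000,  0.4358],
    [-0.8315, -0.8311, 0.4358,  1.0000],
])

# Spearman rank correlation implied by a Gaussian copula,
# independent of the marginal distributions.
spearman = (6.0 / np.pi) * np.arcsin(R / 2.0)
print(np.round(spearman, 3))
```

The rank correlations are slightly smaller in magnitude than the copula correlations, which is always the case for a Gaussian copula.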

Now compare the synthetic data against the original — same summary statistics, same pair plots.

syntDF = pd.DataFrame(
    np.concatenate([X_synthetic, np.expand_dims(y_synthetic, 1)], axis=1),
    columns=['horsepower', 'weight', 'acceleration', 'mpg'],
)

syntDF.describe()
horsepower weight acceleration mpg
count 274.000000 274.000000 274.000000 274.000000
mean 102.424491 2912.768322 15.671238 24.481033
std 37.827007 840.384299 2.789104 8.293174
min 48.911529 1626.293586 6.994269 10.803486
25% 74.760740 2239.351073 13.897932 18.162106
50% 92.855345 2765.721090 15.449325 22.936829
75% 122.545193 3508.569304 17.330007 29.866392
max 290.482675 5093.271119 24.064613 45.514577

The descriptive statistics are close. A pair plot of the synthetic data (built the same way as before) shows similar but not identical shapes; the tails in particular may differ somewhat, which is expected given the Gaussian copula assumption on the dependence structure. If tail behavior mattered critically here, a Clayton copula (lower-tail dependence) or Gumbel copula (upper-tail dependence) would be more appropriate.
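To see what tail dependence means concretely, here is a small numpy sketch (not using the copulas library) that samples a Clayton copula via the standard gamma-frailty construction and checks its lower-tail dependence, which in theory equals 2**(-1/theta):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0          # Clayton parameter; lower-tail dependence = 2**(-1/theta)
n = 100_000

# Gamma-frailty (Marshall-Olkin) sampling: a shared gamma variable V
# couples the coordinates; each U_i = (1 + E_i / V)**(-1/theta).
v = rng.gamma(shape=1.0 / theta, scale=1.0, size=n)
e = rng.exponential(size=(n, 2))
u = (1.0 + e / v[:, None]) ** (-1.0 / theta)   # Clayton-distributed uniforms

# Empirical lower-tail dependence: P(U2 < q | U1 < q) for small q.
q = 0.01
both = np.mean((u[:, 0] < q) & (u[:, 1] < q))
print(both / q, 2.0 ** (-1.0 / theta))   # empirical vs theoretical ~0.707
```

A Gaussian copula, by contrast, has zero tail dependence at any correlation below 1: joint extremes become independent as you move further into the tail, which is exactly the behavior a Clayton or Gumbel copula fixes.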

Finally, re-fit the same linear regression on the synthetic training data and evaluate on the original test set:

model = LinearRegression()
model.fit(X_synthetic, y_synthetic)

print(model.score(X_test, y_test))
0.6236318551606799

The R² is very close to the one from the original data. Not identical — there’s some information loss in the synthetic generation process — but close enough to demonstrate the point: the copula has preserved the statistical structure that the regression model actually needs to make good predictions.

This matters in practice. Synthetic data generated this way isn’t just noise with the right summary statistics; it captures the correlations and distributional shapes that determine model behavior. For augmentation, privacy-preserving data sharing, or testing model robustness, that’s exactly what you need.