Ahmed T. Hammad

Machine Learning, Copula and Synthetic Data

Copulas and synthetic data play pivotal roles in statistical modeling, offering solutions to a range of challenges in Machine Learning. Here, I will focus on the use of copulas for synthetic data generation.

Copulas are mathematical constructs used to model the dependence structure between random variables. Unlike traditional correlation measures, copulas separate the marginal distributions from the dependence structure, providing a more flexible and nuanced approach to capturing complex relationships. They are particularly useful in scenarios where traditional models fail to capture the intricate dependencies between variables. In the specific case of synthetic data generation, what we need is something that mimics the statistical properties of real-world data. But why do we need this “fake” data in the first place? Synthetic data are invaluable when obtaining sufficient real data is challenging (small sample sizes) or when privacy concerns limit access to the actual data. By creating synthetic datasets we can augment the available data, facilitating better model generalization and robustness.

Going back to the statistical properties mentioned earlier, we are interested in the parameters governing the distribution of each variable separately (the marginals) and the dependency structure between them (the copula). Once these are known, we can generate new data from the same distribution and with the same correlation.
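To make the idea concrete, here is a minimal sketch of the Gaussian copula recipe using scipy. The correlation value and the Gamma/Beta marginals below are purely illustrative choices, not taken from any dataset: the correlated uniform scores come from the copula, and the marginals are applied afterwards through their inverse CDFs.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 1) Dependence structure: draw from a bivariate normal with a chosen
#    correlation, then push each coordinate through the normal CDF.
#    The result is a pair of correlated uniforms -- the Gaussian copula part.
corr = 0.8
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, corr], [corr, 1.0]], size=1000)
u = stats.norm.cdf(z)

# 2) Marginals: map the uniforms through the inverse CDFs of whatever
#    univariate distributions we want (here a Gamma and a Beta).
x1 = stats.gamma(a=2.5, scale=20).ppf(u[:, 0])
x2 = stats.beta(a=2.0, b=5.0, loc=1500, scale=3500).ppf(u[:, 1])

# (x1, x2) now has Gamma and Beta marginals but inherits the dependence encoded by corr.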

To give a simple example, let’s take a few variables from the classic Auto MPG (cars) dataset.

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from copulas.multivariate import GaussianMultivariate
from copulas.univariate import ParametricType, Univariate

# Load the Auto MPG dataset, drop the non-numeric columns and any missing rows
df = sns.load_dataset("mpg")
df = df.drop(columns=['origin', 'name'])
df = df.dropna()
df.columns

df = df[['horsepower', 'weight', 'acceleration', 'mpg']]
df.describe()
horsepower weight acceleration mpg
count 392.000000 392.000000 392.000000 392.000000
mean 104.469388 2977.584184 15.541327 23.445918
std 38.491160 849.402560 2.758864 7.805007
min 46.000000 1613.000000 8.000000 9.000000
25% 75.000000 2225.250000 13.775000 17.000000
50% 93.500000 2803.500000 15.500000 22.750000
75% 126.000000 3614.750000 17.025000 29.000000
max 230.000000 5140.000000 24.800000 46.600000

Let’s plot the kernel density of three of the variables, the scatter plot of each pair, and the corresponding correlations.

def corrdot(*args, **kwargs):
    # Annotate each upper panel with the Pearson correlation, sized by its magnitude
    corr_r = args[0].corr(args[1], 'pearson')
    corr_text = f"{corr_r:2.2f}".replace("0.", ".")
    ax = plt.gca()
    ax.set_axis_off()
    marker_size = abs(corr_r) * 1000
    ax.scatter([.5], [.5], marker_size, [corr_r], alpha=0.6, cmap="coolwarm",
               vmin=-1, vmax=1, transform=ax.transAxes)
    font_size = abs(corr_r) * 10 + 5
    ax.annotate(corr_text, [.5, .5], xycoords="axes fraction",
                ha='center', va='center', fontsize=font_size)

sns.set(style='white', font_scale=1)
g = sns.PairGrid(df[['horsepower', 'weight', 'acceleration']], aspect=1, diag_sharey=False)
g.map_lower(sns.regplot, lowess=True, ci=None, line_kws={'color': 'black'})
g.map_diag(sns.kdeplot, color='black')  # sns.distplot is deprecated in recent seaborn
g.map_upper(corrdot)
plt.show()

We can see all sorts of things here. Aside from the strong correlation among some of the variables, we see that they have different distributions. For example, acceleration looks approximately normally distributed, but the same cannot be said for the other two variables.

Before anything else, let’s try a simple model to predict mpg.

y = df.pop('mpg')
X = df

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# R^2 on the held-out test set
print(model.score(X_test, y_test))
0.6501833421053663

Now, can we simulate something so similar to the actual data that we would get the same score? Yes, we can, thanks to copulas! We can generate a synthetic dataset with the same underlying structure.

# Select the best PARAMETRIC univariate for each marginal (no KDE)
univariate = Univariate(parametric=ParametricType.PARAMETRIC)


def create_synthetic(X, y):
    """
    This function combines X and y into a single dataset D, models it
    using a Gaussian copula, and generates a synthetic dataset S. It
    returns the new, synthetic versions of X and y.
    """
    dataset = np.concatenate([X, np.expand_dims(y, 1)], axis=1)

    distribs = GaussianMultivariate(distribution=univariate)
    distribs.fit(dataset)

    synthetic = distribs.sample(len(dataset))

    X = synthetic.values[:, :-1]
    y = synthetic.values[:, -1]

    return X, y, distribs


X_synthetic, y_synthetic, dist = create_synthetic(X_train, y_train)

Let’s look at the individual distributions fitted by the algorithm.

parameters = dist.to_dict()
parameters['univariates']
[{'a': 2.505456580509649,
  'loc': 44.28600428264269,
  'scale': 24.186079118156652,
  'type': 'copulas.univariate.gamma.GammaUnivariate'},
 {'loc': 1604.6365783320787,
  'scale': 3779.0567202878065,
  'a': 1.4708615880361795,
  'b': 2.5202670202239155,
  'type': 'copulas.univariate.beta.BetaUnivariate'},
 {'loc': 1.0632403337723968,
  'scale': 73.46060125357005,
  'a': 20.470245065728314,
  'b': 83.5439968070011,
  'type': 'copulas.univariate.beta.BetaUnivariate'},
 {'loc': 9.84592394053172,
  'scale': 40.487634662917245,
  'a': 1.6832256330594189,
  'b': 3.2317290395770817,
  'type': 'copulas.univariate.beta.BetaUnivariate'}]

We see the fitted distributions (a Gamma and three Betas) and their corresponding parameters, such as location and scale. We can also take a look at the correlation matrix that defines the joint distribution.

parameters['correlation']
[[1.0, 0.8473079532001859, -0.7200908747599617, -0.8315209336313284],
 [0.8473079532001859, 1.0, -0.42047354940020193, -0.831115031494809],
 [-0.7200908747599617, -0.42047354940020193, 1.0, 0.43579734522625596],
 [-0.8315209336313284, -0.831115031494809, 0.43579734522625596, 1.0]]
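A nice practical side effect is that this dictionary fully describes the fitted model. If I read the copulas API correctly, its from_dict constructor is the counterpart of to_dict, so we can store the parameters and later rebuild an equivalent copula and keep sampling without refitting on the original data:

# Rebuild an equivalent model from the exported parameters and sample again
restored = GaussianMultivariate.from_dict(parameters)
more_synthetic = restored.sample(500)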

Now it is time to look at the synthetic variables and compare them with the original ones, looking at the same things: a summary of the dataset and the plots of the three variables.

syntDF = pd.DataFrame(
    np.concatenate([X_synthetic, np.expand_dims(y_synthetic, 1)], axis=1),
    columns=['horsepower', 'weight', 'acceleration', 'mpg']
)

syntDF.describe()
horsepower weight acceleration mpg
count 274.000000 274.000000 274.000000 274.000000
mean 103.393953 3007.334850 15.565402 24.093801
std 36.227720 822.287724 2.765889 8.281429
min 49.084379 1650.853730 8.469497 10.183055
25% 76.679656 2299.624685 13.538932 17.291204
50% 95.181303 2933.743847 15.416592 22.956229
75% 119.586826 3537.785385 17.370376 30.233671
max 227.592858 5142.078155 23.603403 45.263817

The descriptive statistics are remarkably similar, reflecting the statistical properties emphasized earlier, which is exactly what we were after. However, a closer examination of the individual variable distributions and their correlations reveals some disparities: the kernel densities have visibly changed, and although the correlations keep the same sign and order of magnitude, they are not identical.

g = sns.PairGrid(syntDF[['horsepower', 'weight', 'acceleration']], aspect=1, diag_sharey=False)
g.map_lower(sns.regplot, lowess=True, ci=None, line_kws={'color': 'black'})
g.map_diag(sns.kdeplot, color='black')
g.map_upper(corrdot)
plt.show()
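To go beyond eyeballing the plots and summary tables, we can also quantify how close each synthetic marginal is to its real counterpart, for example with a two-sample Kolmogorov–Smirnov test. A quick sketch, where the "real" data is simply the training split rebuilt with its target column:

from scipy.stats import ks_2samp

# Rebuild the real training data with its target column for a fair comparison
realDF = pd.concat([X_train, y_train], axis=1)

# Small KS statistics (and large p-values) mean the synthetic marginal is
# hard to tell apart from the real one.
for col in ['horsepower', 'weight', 'acceleration', 'mpg']:
    stat, pval = ks_2samp(realDF[col], syntDF[col])
    print(f"{col:>12}: KS statistic = {stat:.3f}, p-value = {pval:.3f}")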

Now that we have seen similarities and differences, let’s try to run the same simple linear model on the synthetic data.

model = LinearRegression()
model.fit(X_synthetic, y_synthetic)   # train on synthetic data only

print(model.score(X_test, y_test))    # evaluate on the real, held-out test set
0.6068245805913557

Looking at the results, they are highly comparable, even though we restricted the candidate marginals to simple parametric forms and used only three predictors. This suggests that the Gaussian copula has captured the statistical characteristics of the dataset that matter for this regression problem.
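As a final note, restricting the marginals to parametric families was a deliberate simplification. If the constraint is dropped, the copulas library can, as far as I understand its model selection, also consider non-parametric (KDE-based) marginals, which may follow awkwardly shaped variables more closely at the cost of a less compact description:

# A possible variation: let the univariate selection consider all candidate
# distributions (including KDE) instead of parametric ones only.
flexible_univariate = Univariate()  # no ParametricType restriction
flexible_copula = GaussianMultivariate(distribution=flexible_univariate)
flexible_copula.fit(pd.concat([X_train, y_train], axis=1))
richer_synthetic = flexible_copula.sample(len(X_train))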