1. Motivations
Linear Regression is the most widely used algorithm in Data Science and, more generally, in any field using statistical analysis.
Many advanced algorithms that we will see in the coming days can be more easily understood using Linear Regression as a reference (Tree-based algorithms and Neural Networks).
Linear Regression is the main example of white-box models:
inherently transparent
easy to interpret and communicate
Linear Regression will help us analyse:
What features impact an outcome of interest
How to control for confounding factors.
2. Simple Linear Regression (visual approach with seaborn)
The mpg (miles per gallon) dataset
🥋 Let’s take an example!
🚗 The mpg dataset
👉 Contains statistics on ~400 car models from 1970 to 1982
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
mpg = sns.load_dataset("mpg").dropna()
mpg.head()
mpg cylinders displacement ... model_year origin name
0 18.0 8 307.0 ... 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 ... 70 usa buick skylark 320
2 18.0 8 318.0 ... 70 usa plymouth satellite
3 16.0 8 304.0 ... 70 usa amc rebel sst
4 17.0 8 302.0 ... 70 usa ford torino
[5 rows x 9 columns]
The full description of the dataset is available here 🏎️
The columns we will focus on are:
mpg.describe().map(lambda x: round(x))
mpg cylinders displacement ... weight acceleration model_year
count 392 392 392 ... 392 392 392
mean 23 5 194 ... 2978 16 76
std 8 2 105 ... 849 3 4
min 9 3 68 ... 1613 8 70
25% 17 4 105 ... 2225 14 73
50% 23 4 151 ... 2804 16 76
75% 29 8 276 ... 3615 17 79
max 47 8 455 ... 5140 25 82
[8 rows x 7 columns]
Regress weight on horsepower?
sns.scatterplot(x='horsepower', y='weight', data=mpg);
plt.show()
Objective
Find a regression line \(\hat{y} = \beta_0 + \beta_1 \, horsepower\) that is the closest to the weights
🤓 Find \(\beta = (\beta_0, \beta_1)\) that minimizes the norm \(\Vert weights - (\beta_0 + \beta_1 \, horsepowers)\Vert\)
Ordinary Least Squares (OLS) regression
\[\underset{\beta}{\mathrm{argmin}} \sum_{i=1}^{392} \left(weight_i - (\beta_0 + \beta_1 \, horsepower_i) \right)^2\]
sns.regplot(x='horsepower', y='weight', data=mpg);
plt.show()
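Before reaching for a dedicated library, here is a minimal sketch (an illustrative addition, using numpy's polyfit) of solving this least-squares problem directly:
import numpy as np
# fit a degree-1 polynomial by least squares; returns [slope, intercept]
slope, intercept = np.polyfit(mpg['horsepower'], mpg['weight'], deg=1)
print(intercept, slope)  # should match the statsmodels coefficients below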
Statsmodels
👉 statsmodels.org
pip install statsmodels
import statsmodels.formula.api as smf
model = smf.ols(formula='weight ~ horsepower', data=mpg).fit()
print(model.params)
Intercept 984.500327
horsepower 19.078162
dtype: float64
Interpretation
Intercept 984.500327
horsepower 19.078162
dtype: float64
👉 Horsepower \(\beta_1\): “For each increase of 1 horsepower, a car’s weight increases on average by ~19 lbs (pounds)”
👉 Intercept \(\beta_0\): “a car with 0 horsepower should weigh ~984 lbs”
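As a quick sanity check of this interpretation, we can query the fitted model at a hypothetical horsepower value (the 100 hp below is purely illustrative):
# predicted weight ≈ 984.5 + 19.08 * 100 ≈ 2892 lbs for a hypothetical 100-hp car
model.predict(pd.DataFrame({'horsepower': [100]}))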
❓Am I confident this relationship generalizes well to all car models in the world❓
Measured by 👇
confidence intervals
p-values associated with hypothesis testing
3. Inferential Analysis: can I trust my coefficients \(\beta\)?
📖 Under certain conditions (randomness of sampling, etc…):
The distribution of plausible values for the real \(\beta\) can be estimated from the sample mpg
\(\beta_1\) = 19.07, with a 95% confidence interval of [17.9, 20.1]
\[\hat{\beta}_1 \sim \mathcal{T}_{n=390} \approx \mathcal{N}(\mathrm{coef} = 19.07,\ \mathrm{std\ err} = 0.562)\]
Where does the std err of the slope \(\beta_1\) come from?
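These quantities can be read directly from the fitted statsmodels results; a minimal sketch using the model fitted above:
model.bse        # standard errors of the coefficients (the std err quoted above for horsepower)
model.conf_int() # 95% confidence intervals for each coefficient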
p-value
= “probability that what you observed is just due to pure chance”
= Proba of observing a sample slope of \(\beta_1\) =19.0728 or bigger…
assuming that H0 is true
i.e. assuming that the real slope \(\beta_1\) was actually 0
Since \(n\) = 390 is large, we can approximate the Student distribution with a Gaussian
= Proba of observing \(\beta_1>19.0728\) if it was sampled from a distribution \(N(0,0.562)\)
= Proba of observing \(t > 33.0927\) from a standard normal distribution \(\mathcal{N}(0,1)\) \(\approx 0\)
p-value \(\approx 0 < 0.05\)
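The t-statistics and p-values are also exposed by statsmodels; a minimal sketch:
model.tvalues  # t-statistic of each coefficient (coefficient / std err)
model.pvalues  # associated p-values under H0: the coefficient is 0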
F-statistic
F-statistic = overall statistical significance of the regression
The F-statistic represents the combined significance of all your coefficients
It tests the null hypothesis \(H_0\): all coefficients are zero
\(F \in [0, +\infty)\)
\(F \approx 1 \implies H_0\) cannot be ruled out
\(F \gg 1 \implies\) at least one coefficient has a p-value < 0.05
\(F \gg 1 \implies\) the regression is statistically significant
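statsmodels exposes the F-statistic and its p-value, and summary() gathers all of these diagnostics in a single table; a minimal sketch:
model.fvalue    # F-statistic of the regression
model.f_pvalue  # p-value of the F-test (H0: all slopes are zero)
model.summary() # coefficients, std errors, t, p-values, confidence intervals, R-squared, F-statistic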
4. Checking the assumptions for inferential analysis
✔︎ Random sampling
✔︎ Independent sampling (sample with replacement, or n < 10% global pop.)
⚠️ Residuals normally distributed and of equal variance
Are residuals normally distributed?
predicted_weights = model.predict(mpg['horsepower'])
predicted_weights
# OR
# model.fittedvalues
0 3464.661329
1 4132.396983
2 3846.224560
3 3846.224560
4 3655.442944
...
393 2625.222220
394 1976.564728
395 2587.065897
396 2491.675089
397 2548.909574
Length: 392, dtype: float64
residuals = mpg['weight'] - predicted_weights
residuals
# OR available via
# model.resid
0        39.338671
1      -439.396983
2      -410.224560
3      -413.224560
4      -206.442944
...
393     164.777780
394     153.435272
395    -292.065897
396     133.324911
397     171.090426
Length: 392, dtype: float64
# visual check
sns.histplot(residuals, kde=True, edgecolor='w');
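Beyond this visual check, a Q-Q plot and a normality test can help; a minimal sketch (using statsmodels' qqplot and scipy's Shapiro-Wilk test, not part of the original demo):
import statsmodels.api as sm
from scipy import stats
# Q-Q plot: points should lie close to the reference line if residuals are roughly normal
sm.qqplot(residuals, line='s')
plt.show()
# Shapiro-Wilk test: a very small p-value suggests a departure from normality
stats.shapiro(residuals)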
Are residuals of equal variance?
# Check with a Residuals vs. Fitted scatterplot
sns.scatterplot(x=predicted_weights, y=residuals)
plt.xlabel('Predicted weight')
plt.ylabel('Residual weight');
plt.show()
👀 Beware of heteroscedasticity
👀 Beware also of autoregressive residuals
👉 If a pattern is seen, a factor might be missing in your model!
👉 Frequent issue in Time Series (ex: inflation, weekly patterns etc…)
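Formal checks exist for both issues; a minimal sketch using statsmodels' Breusch-Pagan test (heteroscedasticity) and Durbin-Watson statistic (autocorrelation):
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
# Breusch-Pagan: a small p-value suggests heteroscedastic residuals
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(lm_pvalue)
# Durbin-Watson: values far from 2 suggest autocorrelated residuals
print(durbin_watson(model.resid))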
What if my residuals are really not random?
✅ R-squared remains perfectly valid (it is a deterministic coefficient)
⚠️ However, the inferential statistics on the coefficients (p-values, confidence intervals) cannot be trusted
💡 Fixes?
Try to create/add new features that explain the residual patterns?
Try to model a transformed version of Y instead (e.g. log(Y); see the sketch after this list)?
Try other statistical representations than linear ones (next module…)
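For instance, a minimal sketch of modelling log(weight) instead of weight (numpy functions can be used directly inside statsmodels formulas):
import numpy as np
# regress log(weight) on horsepower; the coefficients are now on the log scale
model_log = smf.ols(formula='np.log(weight) ~ horsepower', data=mpg).fit()
model_log.params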
5. Multivariate Linear Regressions
✏️ Let’s run a second OLS model where we regress variable weight
on both horsepower
and cylinders
\[ weight∼\beta_0+\beta_1 horsepower+\beta_2 cylinders\]
# run OLS model
model2 = smf.ols(formula='weight ~ horsepower + cylinders', data=mpg).fit()
model2.rsquared
R-squared
84% of the variance in the cars’ weights can be explained by the combined variations in horsepower and cylinders
In order for the \(R^2\) to be meaningful, the regression must contain an “intercept” (i.e. the matrix X of features must contain a column vector of ones)
Contrary to simple linear regression, \(R^2 \neq Corr(Y, X_i)^2\)
\[R^{2}= 1 - {\frac {\sum (y_{i}-{\hat{y}_i})^{2}}{\sum (y_{i}-{\overline {y}})^{2}}} = 1-{\frac {\color {blue}{SS_{\text{residuals}}}}{\color {red}{SS_{\text{mean}}}}}\]
\(R^2\): by how much is my model better than a “simple mean” prediction ❓
🚀 \(R^2=1\): best case scenario where the target is 100% explained by the features
🤨 \(R^2=0\): the model is no better than predicting the simple mean
😱 \(R^2<0\) can exist: in this worst case scenario, predicting the mean would be even better than the Linear Model!
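A minimal sketch verifying this formula by hand on the multivariate model fitted above:
y = mpg['weight']
ss_residuals = ((y - model2.fittedvalues) ** 2).sum()  # squared errors of the model
ss_mean = ((y - y.mean()) ** 2).sum()                  # squared errors of a "simple mean" prediction
1 - ss_residuals / ss_mean                             # should match model2.rsquared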
model2.params
Intercept 528.876711
horsepower 8.231070
cylinders 290.356425
dtype: float64
Each increase of 1 horsepower increases the weight by ~8 lbs, holding the number of cylinders constant.
Controlling for the number of cylinders, each increase of 1 horsepower increases the weight by ~8 lbs.
Categorical features?
mpg['origin'].unique()
array(['usa', 'japan', 'europe'], dtype=object)
# Use C(variable) in the formula
model3 = smf.ols(formula='weight ~ C(origin)', data=mpg).fit()
model3.params
Intercept 2433.470588
C(origin)[T.japan] -212.242740
C(origin)[T.usa] 939.019208
dtype: float64
A car made in Japan is on average 212 lbs lighter than a European one
When passing a categorical variable, statsmodels uses the first category (in alphabetical order) as the reference.
The intercept is equal to the mean of the reference (here origin == europe)
Each coefficient corresponds to the difference with the mean of the reference
mpg.groupby('origin').agg({'weight': 'mean'})
weight
origin
europe 2433.470588
japan 2221.227848
usa 3372.489796
# Drop the intercept if you want to
model3 = smf.ols(formula='weight ~ C(origin) - 1', data=mpg).fit()
model3.params
C(origin)[europe] 2433.470588
C(origin)[japan] 2221.227848
C(origin)[usa] 3372.489796
dtype: float64
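If you prefer another reference category, patsy's Treatment coding lets you set it explicitly; a minimal sketch (choosing usa as the reference is purely illustrative):
# set 'usa' as the reference instead of the alphabetical default ('europe')
model4 = smf.ols(formula='weight ~ C(origin, Treatment(reference="usa"))', data=mpg).fit()
model4.params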
Interaction Between Variables
An interaction occurs when an independent variable has a different effect on the outcome depending on the values of another independent variable.
In more complex study areas, the independent variables might interact with each other.
The relationship between an independent and dependent variable changes depending on the value of a third variable.
Interaction effects in practice (cont*cat)
# run OLS model
model2 = smf.ols(formula='weight ~ horsepower * C(origin)', data=mpg).fit()
model2.params
Intercept 1231.344663
C(origin)[T.japan] -255.910544
C(origin)[T.usa] 151.340370
horsepower 14.922337
horsepower:C(origin)[T.japan] 0.682182
horsepower:C(origin)[T.usa] 1.791832
dtype: float64
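Reading these coefficients: the horsepower slope for the reference (europe) is ~14.9 lbs, while for usa it is ~14.9 + 1.8 ≈ 16.7 lbs. A quick way to visualize the interaction is to fit one regression line per origin; a minimal sketch with seaborn's lmplot (not part of the original demo):
# one regression line per origin: different slopes reveal the interaction
sns.lmplot(x='horsepower', y='weight', hue='origin', data=mpg)
plt.show()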
Interaction effects in practice (cont*cont)
# run OLS model
model2 = smf.ols(formula='weight ~ horsepower * cylinders', data=mpg).fit()
model2.params
Intercept -727.793925
horsepower 22.111233
cylinders 482.754926
horsepower:cylinders -1.987972
dtype: float64
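Reading the interaction term: the marginal effect of horsepower on weight now depends on the number of cylinders:
\[\frac{\partial\, weight}{\partial\, horsepower} = 22.11 - 1.99 \times cylinders\]
i.e. roughly 14.2 lbs per horsepower for a 4-cylinder car, but only about 6.2 lbs for an 8-cylinder car.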
Higher-order interactions
A three-way interaction has three variables in the term, such as \(X_1 * X_2 * X_3\) \((horsepower * cylinders * origin)\).
In this case, the relationship between weight and horsepower depends on both cylinders and origin.
However, this type of effect is challenging to interpret. In practice, analysts use them infrequently.
However, in some models, they might be necessary to provide an adequate fit and to capture non-linear relationships in the data.
Tree-based algorithms leverage the power of non-parametric modelling to capture non-linear relationships by finding interactions between variables.
Regression Diagnostic Cheat Sheet
Goodness-of-fit
The model explains a good deal of the observed variance of the dependent variable
R-squared
Statistical significance
Can we trust the regression coefficients of the model - do they generalize?
p-values and F-statistic
Inference conditions
Random Residuals: zero-mean, constant variance, not correlated