1. Motivations
Linear Regression is the most widely used algorithm in Data Science and, more generally, in any field using statistical analysis.
Many advanced algorithms that we will see in the coming days can be more easily understood using Linear Regression as a reference (Tree-based algorithms and Neural Networks).
Linear Regression is the main example of white-box models:
inherently transparent
easy to interpret and communicate
Linear Regression will help us analyse:
What features impact an outcome of interest
How to control for confounding factors.
2. Simple Linear Regression (visual approach with seaborn)
The mpg (miles per gallon) dataset
🥋 Let’s take an example!
🚗 The mpg dataset
👉 Contains statistics on ~400 car models from 1970 to 1982
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
mpg = sns.load_dataset("mpg").dropna()
mpg.head()
mpg cylinders displacement ... model_year origin name
0 18.0 8 307.0 ... 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 ... 70 usa buick skylark 320
2 18.0 8 318.0 ... 70 usa plymouth satellite
3 16.0 8 304.0 ... 70 usa amc rebel sst
4 17.0 8 302.0 ... 70 usa ford torino
[5 rows x 9 columns]
The full description of the dataset is available here 🏎️
The columns we will focus on are:
mpg.describe().map(lambda x: round(x))
mpg cylinders displacement ... weight acceleration model_year
count 392 392 392 ... 392 392 392
mean 23 5 194 ... 2978 16 76
std 8 2 105 ... 849 3 4
min 9 3 68 ... 1613 8 70
25% 17 4 105 ... 2225 14 73
50% 23 4 151 ... 2804 16 76
75% 29 8 276 ... 3615 17 79
max 47 8 455 ... 5140 25 82
[8 rows x 7 columns]
Regress weight on horsepower?
sns.scatterplot(x='horsepower', y='weight', data=mpg);
plt.show()
Objective
Find a regression line \(\hat{y} = \beta_0 + \beta_1 \, horsepower\) that is the closest to the weights
🤓 Find \(\beta = (\beta_0, \beta_1)\) that minimizes the norm \(\Vert weights - (\beta_0 + \beta_1 \, horsepowers)\Vert\)
Ordinary Least Squares (OLS) regression
\[\underset{\beta}{\mathrm{argmin}} \sum_{i=1}^{392} \left(weight_i - (\beta_0 + \beta_1 \, horsepower_i) \right)^2\]
sns.regplot(x='horsepower', y='weight', data=mpg);
plt.show()
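Before reaching for a dedicated library, here is a minimal sketch (an illustrative addition, using numpy's polyfit) of solving this least-squares problem directly:
import numpy as np
# fit a degree-1 polynomial by least squares; returns [slope, intercept]
slope, intercept = np.polyfit(mpg['horsepower'], mpg['weight'], deg=1)
print(intercept, slope)  # should match the statsmodels coefficients below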
Statsmodels
👉 statsmodels.org
pip install statsmodels
import statsmodels.formula.api as smf
model = smf.ols(formula='weight ~ horsepower', data=mpg).fit()
print(model.params)
Intercept 984.500327
horsepower 19.078162
dtype: float64
Interpretation
Intercept 984.500327
horsepower 19.078162
dtype: float64
👉 Horsepower \(\beta_1\): “For each increase of 1 horsepower, a car’s weight increases on average by ~19 lbs (pounds)”
👉 Intercept \(\beta_0\): “a car with 0 horsepower should weigh ~984 lbs”
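As a quick sanity check of this interpretation, we can query the fitted model at a hypothetical horsepower value (the 100 hp below is purely illustrative):
# predicted weight ≈ 984.5 + 19.08 * 100 ≈ 2892 lbs for a hypothetical 100-hp car
model.predict(pd.DataFrame({'horsepower': [100]}))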
❓Am I confident this relationship generalizes well to all car models in the world❓
Measured by 👇
confidence intervals
p-values associated with hypothesis testing
3. Inferential Analysis: can I trust my coefficients \(\beta\)?
📖 Under certain conditions (randomness of sampling, etc…):
The distribution of plausible values for the real \(\beta\) can be estimated from the sample mpg
\(\beta_1\) = 19.07, with a 95% confidence interval of [17.9, 20.1]
\[\hat{\beta}_1 \sim \mathcal{T}_{n=390} \approx \mathcal{N}(\mathrm{coef} = 19.07,\ \mathrm{std\ err} = 0.562)\]
Where does the std err of the slope \(\beta_1\) come from?
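These quantities can be read directly from the fitted statsmodels results; a minimal sketch using the model fitted above:
model.bse        # standard errors of the coefficients (the std err quoted above for horsepower)
model.conf_int() # 95% confidence intervals for each coefficient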
p-value
= “probability that what you observed is just due to pure chance”
= Proba of observing a sample slope of \(\beta_1\) =19.0728 or bigger…
assuming that H0 is true
i.e. assuming that the real slope \(\beta_1\) was actually 0
Since \(n\) = 390 is large, we can approximate the Student distribution with a Gaussian
= Proba of observing \(\beta_1>19.0728\) if it was sampled from a distribution \(N(0,0.562)\)
= Proba of observing \(t > 33.0927\) from a standard normal distribution \(\mathcal{N}(0,1)\) \(\approx 0\)
p-value \(\approx 0 < 0.05\)
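The t-statistics and p-values are also exposed by statsmodels; a minimal sketch:
model.tvalues  # t-statistic of each coefficient (coefficient / std err)
model.pvalues  # associated p-values under H0: the coefficient is 0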
F-statistic
F-statistic = overall statistical significance of the regression
The F-statistic represents the combined significance of all your coefficients
It tests the null hypothesis \(H_0\): all coefficients are zero
\(F \in [0, +\infty)\)
\(F \approx 1 \implies H_0\) cannot be ruled out
\(F \gg 1 \implies\) at least one coefficient has a p-value < 0.05
\(F \gg 1 \implies\) the regression is statistically significant
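statsmodels exposes the F-statistic and its p-value, and summary() gathers all of these diagnostics in a single table; a minimal sketch:
model.fvalue    # F-statistic of the regression
model.f_pvalue  # p-value of the F-test (H0: all slopes are zero)
model.summary() # coefficients, std errors, t, p-values, confidence intervals, R-squared, F-statistic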
4. Checking the assumptions for inferential analysis
✔︎ Random sampling
✔︎ Independent sampling (sample with replacement, or n < 10% global pop.)
⚠️ Residuals normally distributed and of equal variance
Are residuals normally distributed?
predicted_weights = model.predict(mpg['horsepower'])
predicted_weights
# OR
# model.fittedvalues
0 3464.661329
1 4132.396983
2 3846.224560
3 3846.224560
4 3655.442944
...
393 2625.222220
394 1976.564728
395 2587.065897
396 2491.675089
397 2548.909574
Length: 392, dtype: float64
residuals = mpg['weight'] - predicted_weights
residuals
# OR available via
# model.resid
0        39.338671
1      -439.396983
2      -410.224560
3      -413.224560
4      -206.442944
...
393     164.777780
394     153.435272
395    -292.065897
396     133.324911
397     171.090426
Length: 392, dtype: float64
# visual check
sns.histplot(residuals, kde=True, edgecolor='w');
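Beyond this visual check, a Q-Q plot and a normality test can help; a minimal sketch (using statsmodels' qqplot and scipy's Shapiro-Wilk test, not part of the original demo):
import statsmodels.api as sm
from scipy import stats
# Q-Q plot: points should lie close to the reference line if residuals are roughly normal
sm.qqplot(residuals, line='s')
plt.show()
# Shapiro-Wilk test: a very small p-value suggests a departure from normality
stats.shapiro(residuals)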
Are residuals of equal variance?
# Check with a Residuals vs. Fitted scatterplot
sns.scatterplot(x=predicted_weights, y=residuals)
plt.xlabel('Predicted weight')
plt.ylabel('Residual weight');
plt.show()
👀 Beware of heteroscedasticity
👀 Beware also of autoregressive residuals
👉 If a pattern is seen, a factor might be missing in your model!
👉 Frequent issue in Time Series (ex: inflation, weekly patterns etc…)
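Formal checks exist for both issues; a minimal sketch using statsmodels' Breusch-Pagan test (heteroscedasticity) and Durbin-Watson statistic (autocorrelation):
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
# Breusch-Pagan: a small p-value suggests heteroscedastic residuals
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(lm_pvalue)
# Durbin-Watson: values far from 2 suggest autocorrelated residuals
print(durbin_watson(model.resid))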
What if my residuals are really not random?
✅ R-squared remains perfectly valid (it is a deterministic coefficient)
⚠️ However, the inferential statistics on the coefficients (p-values, confidence intervals) cannot be trusted
💡 Fixes?
Try to create/add new features that explain the residual patterns?
Try to model a transformed version of Y instead (e.g. log(Y); see the sketch after this list)?
Try other statistical representations than linear ones (next module…)
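For instance, a minimal sketch of modelling log(weight) instead of weight (numpy functions can be used directly inside statsmodels formulas):
import numpy as np
# regress log(weight) on horsepower; the coefficients are now on the log scale
model_log = smf.ols(formula='np.log(weight) ~ horsepower', data=mpg).fit()
model_log.params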
5. Multivariate Linear Regressions
✏️ Let’s run a second OLS model where we regress variable weight
on both horsepower
and cylinders
\[ weight∼\beta_0+\beta_1 horsepower+\beta_2 cylinders\]
# run OLS model
model2 = smf.ols(formula='weight ~ horsepower + cylinders', data=mpg).fit()
model2.rsquared
R-squared
84% of the variance in the cars’ weights can be explained by the combined variations in horsepower and cylinders
In order for the \(R^2\) to be meaningful, the regression must contain an “intercept” (i.e. the matrix X of features must contain a column vector of ones)
Contrary to simple linear regression, \(R^2 \neq Corr(Y, X_i)^2\)
\[R^{2}= 1 - {\frac {\sum (y_{i}-{\hat{y}_i})^{2}}{\sum (y_{i}-{\overline {y}})^{2}}} = 1-{\frac {\color {blue}{SS_{\text{residuals}}}}{\color {red}{SS_{\text{mean}}}}}\]
\(R^2\): by how much is my model better than a “simple mean” prediction ❓
🚀 \(R^2=1\): best case scenario where the target is 100% explained by the features
🤨 \(R^2=0\): the model is no better than predicting the simple mean
😱 \(R^2<0\) can exist: in this worst case scenario, predicting the mean would be even better than the Linear Model!
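A minimal sketch verifying this formula by hand on the multivariate model fitted above:
y = mpg['weight']
ss_residuals = ((y - model2.fittedvalues) ** 2).sum()  # squared errors of the model
ss_mean = ((y - y.mean()) ** 2).sum()                  # squared errors of a "simple mean" prediction
1 - ss_residuals / ss_mean                             # should match model2.rsquared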
model2.params
Intercept 528.876711
horsepower 8.231070
cylinders 290.356425
dtype: float64
Each increase of 1 horsepower increases the weight by ~8 lbs, holding the number of cylinders constant.
Controlling for the number of cylinders, each increase of 1 horsepower increases the weight by ~8 lbs.
Categorical features?
mpg['origin'].unique()
array(['usa', 'japan', 'europe'], dtype=object)
# Use C(variable) in the formula
model3 = smf.ols(formula='weight ~ C(origin)', data=mpg).fit()
model3.params
Intercept 2433.470588
C(origin)[T.japan] -212.242740
C(origin)[T.usa] 939.019208
dtype: float64
A car made in Japan is on average 212 lbs lighter than a European one
When passing a categorical variable, statsmodels uses the first category (in alphabetical order) as the reference.
The intercept is equal to the mean of the reference (here origin == europe)
Each coefficient corresponds to the difference with the mean of the reference
mpg.groupby('origin').agg({'weight': 'mean'})
weight
origin
europe 2433.470588
japan 2221.227848
usa 3372.489796
# Drop the intercept if you want to
model3 = smf.ols(formula='weight ~ C(origin) - 1', data=mpg).fit()
model3.params
C(origin)[europe] 2433.470588
C(origin)[japan] 2221.227848
C(origin)[usa] 3372.489796
dtype: float64
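If you prefer another reference category, patsy's Treatment coding lets you set it explicitly; a minimal sketch (choosing usa as the reference is purely illustrative):
# set 'usa' as the reference instead of the alphabetical default ('europe')
model4 = smf.ols(formula='weight ~ C(origin, Treatment(reference="usa"))', data=mpg).fit()
model4.params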
Interaction Between Variables
An interaction occurs when an independent variable has a different effect on the outcome depending on the values of another independent variable.
In more complex study areas, the independent variables might interact with each other.
The relationship between an independent and dependent variable changes depending on the value of a third variable.
Interaction effects in practice (cont*cat)
# run OLS model
model2 = smf.ols(formula='weight ~ horsepower * C(origin)', data=mpg).fit()
model2.params
Intercept 1231.344663
C(origin)[T.japan] -255.910544
C(origin)[T.usa] 151.340370
horsepower 14.922337
horsepower:C(origin)[T.japan] 0.682182
horsepower:C(origin)[T.usa] 1.791832
dtype: float64
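Reading these coefficients: the horsepower slope for the reference (europe) is ~14.9 lbs, while for usa it is ~14.9 + 1.8 ≈ 16.7 lbs. A quick way to visualize the interaction is to fit one regression line per origin; a minimal sketch with seaborn's lmplot (not part of the original demo):
# one regression line per origin: different slopes reveal the interaction
sns.lmplot(x='horsepower', y='weight', hue='origin', data=mpg)
plt.show()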
Interaction effects in practice (cont*cont)
# run OLS model
model2 = smf.ols(formula='weight ~ horsepower * cylinders', data=mpg).fit()
model2.params
Intercept -727.793925
horsepower 22.111233
cylinders 482.754926
horsepower:cylinders -1.987972
dtype: float64
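Reading the interaction term: the marginal effect of horsepower on weight now depends on the number of cylinders:
\[\frac{\partial\, weight}{\partial\, horsepower} = 22.11 - 1.99 \times cylinders\]
i.e. roughly 14.2 lbs per horsepower for a 4-cylinder car, but only about 6.2 lbs for an 8-cylinder car.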
Higher-order interactions
A three-way interaction has three variables in the term, such as \(X_1 * X_2 * X_3\) \((horsepower * cylinders * origin)\).
In this case, the relationship between weight and horsepower depends on both cylinders and origin.
However, this type of effect is challenging to interpret. In practice, analysts use them infrequently.
However, in some models, they might be necessary to provide an adequate fit and to capture non-linear relationships in the data.
Tree-based algorithms leverage the power of non-parametric modelling to capture non-linear relationships by finding interactions between variables.
Regression Diagnostic Cheat Sheet
Goodness-of-fit
The model explains a good deal of the observed variance of the dependent variable
R-squared
Statistical significance
Can we trust the regression coefficients of the model - do they generalize?
p-values and F-statistic
Inference conditions
Random Residuals: zero-mean, constant variance, not correlated