Linear Regression

Plan

  1. Motivations for Linear Regression

  2. Linear regression (visual approach with seaborn)

  3. Linear regression (with statsmodels)

  4. Conditions for Inference

  5. Multivariate Linear Regression

1. Motivations

Linear Regression is the most widely used algorithm in Data Science and, more generally, in any field relying on statistical analysis.

Many advanced algorithms that we will see in the coming days (tree-based algorithms, Neural Networks) are easier to understand with Linear Regression as a reference point.

Linear Regression is the main example of white-box models.

  • inherently transparent
  • easy to interpret and communicate

Linear Regression will help us analyse:

  • What features impact an outcome of interest
  • How to control for confounding factors.

2. Simple Linear Regression (visual approach with seaborn)

The mpg (miles per gallon) dataset

🥋 Let’s take an example!

🚗 The mpg dataset

👉 Contains statistics on ~400 car models from 1970 to 1982

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

mpg = sns.load_dataset("mpg").dropna()
mpg.head()
    mpg  cylinders  displacement  ...  model_year  origin                       name
0  18.0          8         307.0  ...          70     usa  chevrolet chevelle malibu
1  15.0          8         350.0  ...          70     usa          buick skylark 320
2  18.0          8         318.0  ...          70     usa         plymouth satellite
3  16.0          8         304.0  ...          70     usa              amc rebel sst
4  17.0          8         302.0  ...          70     usa                ford torino

[5 rows x 9 columns]

Full description of the dataset is here 🏎️

The columns we will focus on are:

  • mpg: miles per gallon

  • cylinders

  • displacement: volume swept by all the pistons (in cubic inches)

  • horsepower

  • weight: pounds (lbs)

  • acceleration: time from zero to sixty miles per hour (in seconds)

mpg.describe().map(round)  # DataFrame.applymap is deprecated; map is its replacement
       mpg  cylinders  displacement  ...  weight  acceleration  model_year
count  392        392           392  ...     392           392         392
mean    23          5           194  ...    2978            16          76
std      8          2           105  ...     849             3           4
min      9          3            68  ...    1613             8          70
25%     17          4           105  ...    2225            14          73
50%     23          4           151  ...    2804            16          76
75%     29          8           276  ...    3615            17          79
max     47          8           455  ...    5140            25          82

[8 rows x 7 columns]


Regress weight on horsepower?

sns.scatterplot(x='horsepower', y='weight', data=mpg);
plt.show()

Objective

Find a regression line \(\hat{y} = \beta_0 + \beta_1 \cdot horsepower\) that is as close as possible to the observed weights

🤓 Find

\(\beta = (\beta_0, \beta_1)\)

that minimizes the norm

\(\Vert weight - (\beta_0 + \beta_1 \, horsepower)\Vert\), where \(weight\) and \(horsepower\) are the vectors of observed values

Ordinary Least Squares (OLS) regression

  • uses the “natural” \(L_2\) Euclidean distance

  • solves for the \(\beta\) that minimizes the Sum of Squared Residuals (SSR)

\[\underset{\beta}{\mathrm{argmin}} \sum_{i=1}^{392} \left(weight_i - (\beta_0 + \beta_1 \, horsepower_i) \right)^2\]
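🤓 For a single feature, the OLS solution even has a closed form: \(\beta_1 = \frac{cov(x, y)}{var(x)}\) and \(\beta_0 = \bar{y} - \beta_1 \bar{x}\). A minimal sketch, assuming the mpg DataFrame loaded above:

import numpy as np

x = mpg['horsepower']
y = mpg['weight']

# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
beta_1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta_0 = y.mean() - beta_1 * x.mean()
print(beta_0, beta_1)  # ≈ 984.5 and ≈ 19.08, matching the statsmodels fit below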

👉 Dynamic visualization

⚠️ OLS is very sensitive to outliers!

sns.regplot(x='horsepower', y='weight', data=mpg);
plt.show()

3. Linear Regression (with statsmodels)

👉 statsmodels.org

pip install statsmodels

  • Simple Linear ML models + Statistical Inference

  • Very easy to use

  • ≈ brings R-style statistical modeling to Python

import statsmodels.formula.api as smf
model = smf.ols(formula = 'weight ~ horsepower', data=mpg).fit()
print(model.params)
Intercept     984.500327
horsepower     19.078162
dtype: float64

Interpretation


👉 Horsepower \(\beta_1\): “for each increase of 1 horsepower, a car’s weight increases on average by 19 lbs (pounds)”

👉 Intercept \(\beta_0\): “a car with 0 horsepower should weigh 984 lbs”
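🧮 Quick sanity check: predict the weight of a hypothetical 100-horsepower car (100 hp is just an illustrative value):

new_car = pd.DataFrame({'horsepower': [100]})  # hypothetical car
print(model.predict(new_car))  # 984.5 + 19.08 * 100 ≈ 2892 lbs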

print(model.rsquared)
0.7474254996898198

R² (proportion of variance explained)

  • the % of the variance of weight that is explained by the variance of horsepower

  • ranges from 0 (explains nothing) to 1 (perfect linear relationship)

📚 Interpret R-squared - Statistics by Jim
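🤓 R² can be recomputed by hand from the fitted model, as 1 minus the ratio of residual variance to total variance. A minimal sketch reusing model from above:

# R^2 = 1 - SSR / SST
ssr = (model.resid ** 2).sum()                              # sum of squared residuals
sst = ((mpg['weight'] - mpg['weight'].mean()) ** 2).sum()   # total sum of squares
print(1 - ssr / sst)  # ≈ 0.747, same as model.rsquared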

Am I confident this relationship generalizes well to all car models in the world❓

Measured by 👇

  • confidence intervals
  • p-values associated with hypothesis testing

Inferential Analysis: can I trust my coefficients \(\beta\)?

📖 Under certain conditions (randomness of sampling, etc…):

  • The distribution of plausible values for the true \(\beta\) can be estimated from the mpg sample

  • \(\beta_1 = 19.07\), with 95% confidence interval [17.9, 20.1]

  • \[\hat{\beta}_1 \sim \mathcal{T}_{df=390} \approx \mathcal{N}(coef = 19.07,\ std\ err = 0.562)\]
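🤓 statsmodels exposes these intervals directly; a quick check on the fitted model:

print(model.conf_int(alpha=0.05))  # 95% confidence intervals for Intercept and horsepower
# model.summary() shows coefficients, std errors, t-values, p-values and CIs in one table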

What is the std err on the slope \(\beta_1\)?

p-value = “the probability that what you observed is due to pure chance alone”

= probability of observing a sample slope of \(\beta_1 = 19.0728\) or bigger…

assuming that H0 is true

  • i.e. assuming that the real slope \(\beta_1\) was actually 0
  • Since the degrees of freedom (\(n - 2 = 390\)) are large, the Student distribution is well approximated by a Gaussian

= probability of observing \(\beta_1 > 19.0728\) if it were sampled from a distribution \(\mathcal N(0, 0.562)\)

= probability of observing \(t > 33.0927\) under the standard normal distribution \(\mathcal N(0,1)\) \(\approx 0\)

p-value \(\approx 0 < 0.05\)

  • It is extremely unlikely that we would observe such a slope if the feature were not correlated with the target variable

  • The relationship between weight and horsepower is statistically significant
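🤓 The t-value and p-value can be recomputed by hand from the fitted model; a minimal sketch using scipy:

from scipy import stats

# t = coef / std err, then two-sided p-value under H0: beta_1 = 0
t_stat = model.params['horsepower'] / model.bse['horsepower']
p_value = 2 * stats.t.sf(abs(t_stat), df=model.df_resid)
print(t_stat, p_value)  # statsmodels stores the same in model.tvalues / model.pvalues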

F-statistic

F-statistic = overall statistical significance of the regression

  • The F-statistic represents the combined significance of all your coefficients

  • It tests the null hypothesis \(H_0\): all coefficients are zero

  • \(F \in [0, +\infty)\)

  • \(F \approx 1 \implies H_0\) cannot be ruled out

  • \(F \gg 1 \implies\) at least one coefficient has a p-value < 0.05

  • \(F \gg 1 \implies\) the regression is statistically significant
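🤓 Both numbers are available on the fitted results object:

print(model.fvalue)    # F-statistic of the regression
print(model.f_pvalue)  # p-value of the F-test (H0: all coefficients are zero)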

4. Checking the assumptions for inferential analysis

✔︎ Random sampling

✔︎ Independent sampling (sample with replacement, or n < 10% global pop.)

⚠️ Residuals normally distributed and of equal variance

Are the residuals normally distributed?

predicted_weights = model.predict(mpg['horsepower'])  # equivalently: model.fittedvalues
predicted_weights
0      3464.661329
1      4132.396983
2      3846.224560
3      3846.224560
4      3655.442944
          ...     
393    2625.222220
394    1976.564728
395    2587.065897
396    2491.675089
397    2548.909574
Length: 392, dtype: float64
residuals = mpg['weight'] - predicted_weights  # equivalently: model.resid
residuals
0       39.338671
1     -439.396983
2     -410.224560
3     -413.224560
4     -206.442944
          ...    
393    164.777780
394    153.435272
395   -292.065897
396    133.324911
397    171.090426
Length: 392, dtype: float64

# visual check
sns.histplot(residuals, kde=True, edgecolor='w');
plt.show()
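🤓 Beyond the histogram, a QQ-plot is a sharper visual check of normality; a sketch using statsmodels' built-in helper:

import statsmodels.api as sm

# Residuals are standardized (fit=True); points close to the 45° line suggest normality
sm.qqplot(model.resid, line='45', fit=True)
plt.show()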

Are residuals of equal variance?

# Check with Residuals vs. Fitted scatterplot
sns.scatterplot(x=predicted_weights, y=residuals)
plt.xlabel('Predicted weight')
plt.ylabel('Residual weight');
plt.show()

👀 Beware of heteroscedasticity (residuals whose spread changes with the fitted values)!
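🤓 Heteroscedasticity can also be tested formally, e.g. with statsmodels' Breusch-Pagan test (\(H_0\): residuals have constant variance):

from statsmodels.stats.diagnostic import het_breuschpagan

# Returns (LM statistic, LM p-value, F statistic, F p-value);
# a small p-value is evidence of heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(lm_pvalue)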