Linear Regression is the most widely used algorithm in Data Science, and more generally in any field that relies on statistical analysis.
Many of the more advanced algorithms we will see in the coming days (tree-based algorithms and Neural Networks) are easier to understand with Linear Regression as a reference.
Linear Regression is the main example of a white-box model:
- inherently transparent
- easy to interpret and communicate
Linear Regression will help us analyse:
- which features impact an outcome of interest
- how to control for confounding factors
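The second point can be illustrated with a small simulation. Below is a minimal sketch (synthetic data, all variable names made up for illustration): a confounder `z` drives both `x` and `y`, so a naive regression of `y` on `x` finds a large spurious slope, which shrinks toward zero once `z` is added to the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)                      # confounder
x = z + rng.normal(scale=0.5, size=n)       # x is correlated with z
y = 2 * z + rng.normal(scale=0.5, size=n)   # y is driven by z, not by x

# Naive fit: y ~ x (omits the confounder)
X_naive = np.column_stack([np.ones(n), x])
naive = np.linalg.lstsq(X_naive, y, rcond=None)[0]

# Adjusted fit: y ~ x + z (controls for the confounder)
X_adj = np.column_stack([np.ones(n), x, z])
adjusted = np.linalg.lstsq(X_adj, y, rcond=None)[0]

print(naive[1])     # large spurious slope on x
print(adjusted[1])  # slope on x close to 0 once z is controlled for
```

"Controlling" for `z` here simply means including it as an extra column in the design matrix, so the coefficient on `x` is estimated holding `z` fixed.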
2. Simple Linear Regression (visual approach with seaborn)
The mpg (miles per gallon) dataset
🥋 Let’s take an example!
🚗 The mpg dataset
👉 Contains statistics on ~400 car models from 1970 to 1982
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

mpg = sns.load_dataset("mpg").dropna()
mpg.head()
mpg cylinders displacement ... model_year origin name
0 18.0 8 307.0 ... 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 ... 70 usa buick skylark 320
2 18.0 8 318.0 ... 70 usa plymouth satellite
3 16.0 8 304.0 ... 70 usa amc rebel sst
4 17.0 8 302.0 ... 70 usa ford torino
[5 rows x 9 columns]
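For the visual approach, seaborn can draw the scatterplot and the fitted regression line in a single call. A minimal sketch, assuming the `horsepower` and `weight` columns of the `mpg` dataset loaded above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

mpg = sns.load_dataset("mpg").dropna()

# regplot overlays an OLS regression line (with a confidence band) on the scatterplot
sns.regplot(data=mpg, x="horsepower", y="weight")
plt.show()
```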
Are residuals of equal variance?
# Check with a Residuals vs. Fitted scatterplot
sns.scatterplot(x=predicted_weights, y=residuals)
plt.xlabel('Predicted weight')
plt.ylabel('Residual weight')
plt.show()
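The fit that produced `predicted_weights` and `residuals` is not shown in this excerpt. A minimal sketch of one way to obtain them, assuming `weight` was regressed on `horsepower` with a degree-1 `np.polyfit`:

```python
import numpy as np
import seaborn as sns

mpg = sns.load_dataset("mpg").dropna()

# Fit weight ~ horsepower by ordinary least squares (degree-1 polynomial)
slope, intercept = np.polyfit(mpg["horsepower"], mpg["weight"], deg=1)
predicted_weights = slope * mpg["horsepower"] + intercept
residuals = mpg["weight"] - predicted_weights
```

If the variance of the residuals is roughly constant across the range of predicted weights (homoscedasticity), the point cloud in the plot should look like a horizontal band with no funnel shape.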