Let’s remember Linear Regression:
\[Y = \beta_0 + \beta_1 X_1 + \epsilon\]
❓What about non-linear behavior as below?
👉 If we add a new transformed feature \(X_1^2\) we have a better fit (in red)
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + ... + \epsilon\]
🤔 We could engineer very complex features: \(X_1^3, X_1^6, (X_1^2 \cdot X_2^5), ...\)
Our \(R^2\) will keep increasing with every additional feature!
We just solved Data Science!
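👉 A quick sanity check (a sketch on synthetic data, not the dataset from these slides): the training \(R^2\) only goes up as we add higher-degree polynomial features
# Sketch with synthetic data (assumed): training R^2 keeps climbing with polynomial degree
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)  # non-linear signal + noise

for degree in [1, 2, 5, 10, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    print(f"degree={degree:2d}  train R^2={model.score(X, y):.3f}")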
What’s wrong with the first model? With the third?
The first one does not capture all the information in the data (high bias model)
The third one finds signals that aren’t there (noise); it will not generalize well to new data points (high variance model)
One of the most important concepts in Data Science!
\[\text{Total Error} = \color{red}{\text{Bias}^2} + \color{teal}{\text{Variance}} + \color{grey}{\text{Irreducible Error}}\]
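Written out formally, with \(\hat{f}_D\) the model fitted on a random training set \(D\), \(y = f(x) + \epsilon\), and noise variance \(\sigma^2_\epsilon\):
\[\mathbb{E}\big[(y - \hat{f}_D(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}_D(x)] - f(x)\big)^2}_{\color{red}{\text{Bias}^2}} + \underbrace{\mathbb{E}\big[(\hat{f}_D(x) - \mathbb{E}[\hat{f}_D(x)])^2\big]}_{\color{teal}{\text{Variance}}} + \underbrace{\sigma^2_\epsilon}_{\color{grey}{\text{Irreducible Error}}}\]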
Error due to Bias: The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict.
Of course you only have one model so talking about expected or average prediction values might seem a little strange. However, imagine you could repeat the whole model building process more than once: each time you gather new data and run a new analysis creating a new model.
Due to randomness in the underlying data sets, the resulting models will have a range of predictions. Bias measures how far off in general these models’ predictions are from the correct value.
Error due to Variance: The error due to variance is taken as the variability of a model prediction for a given data point. Again, imagine you can repeat the entire model building process multiple times. The variance is how much the predictions for a given point vary between different realizations of the model.
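A small simulation of that thought experiment (synthetic data assumed, not from the slides): refit an underfitting and an overfitting model on many freshly drawn datasets, then measure how far off the average prediction is (bias) and how much the predictions spread (variance) at a single point.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(x)      # the "correct value" we are trying to predict
x0 = np.array([[1.5]])            # the point at which we study the predictions

for degree in [1, 3, 9]:          # under-fit, balanced, over-fit
    preds = []
    for _ in range(200):          # repeat the whole model-building process
        X = rng.uniform(-3, 3, size=(30, 1))
        y = true_f(X).ravel() + rng.normal(scale=0.3, size=30)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
        preds.append(model.predict(x0)[0])
    preds = np.array(preds)
    bias2 = (preds.mean() - true_f(x0)[0, 0]) ** 2  # squared error of the average prediction
    variance = preds.var()                          # spread of predictions across realizations
    print(f"degree={degree}  bias^2={bias2:.3f}  variance={variance:.3f}")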
An optimal balance of bias and variance leads to a model that is neither overfit nor underfit.
❓How do we find the optimal model complexity?
👉 Simple models for simple questions
Complex models for complex questions
👉 Use Regularization to decrease the complexity of the model
👉 Use Cross-Validation to tune the hyperparameters of the model
The model that minimizes the test error on an unseen dataset.
Simplify your model relative to your data
Get more observations
Feature selection (manual or automated)
Dimensionality reduction (Unsupervised Learning)
Early stopping (Deep Learning)
Regularization of your Loss function
Regularization means adding a penalty term to the Loss that increases with \(\beta\)
\[\text{Regularized Loss} = Loss(X, y, \color{blue}{\beta}) + Penalty(\color{blue}{\beta})\]
👉 Penalizes large values for \(\beta_i\)
👉 Forces model to shrink certain coefficients or even select less features
👉 Prevents overfitting
\[\hat{y} = \color{blue}{\beta_0} + \color{blue}{\beta_1} X_1 + \color{blue}{\beta_2} X_1^2 + \color{blue}{\beta_3} X_1^3 + ...\]
The two most famous Regularization penalties are:
\[Ridge(\color{green}{X},\color{magenta}{y},\color{blue}{\beta}) = \underbrace{\sum_{i=0}^{n-1}\big(\color{magenta}{y_i}-(\color{blue}{\beta_0} + \color{blue}{\beta_1} \color{green}{x_{i1}} + ... + \color{blue}{\beta_p} \color{green}{x_{ip}} )\big)^2}_{Loss(\color{green}{X},\color{magenta}{y},\color{blue}{\beta})} + \color{red}{\alpha} \sum_{\color{green}{j=1}}^{\color{green}{p}} \color{blue}{\beta_j}^{2}\]
\[Lasso(\color{green}{X},\color{magenta}{y},\color{blue}{\beta}) = \underbrace{\sum_{i=0}^{n-1}\big(\color{magenta}{y_i}-(\color{blue}{\beta_0} + \color{blue}{\beta_1} \color{green}{x_{i1}} + ... + \color{blue}{\beta_p} \color{green}{x_{ip}} )\big)^2}_{Loss(\color{green}{X},\color{magenta}{y},\color{blue}{\beta})} + \color{red}{\alpha} \sum_{\color{green}{j=1}}^{\color{green}{p}} |\color{blue}{\beta_j}|\]
age sex bmi ... s4 s5 s6
0 0.038076 0.050680 0.061696 ... -0.002592 0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474 ... -0.039493 -0.068332 -0.092204
2 0.085299 0.050680 0.044451 ... -0.002592 0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595 ... 0.034309 0.022688 -0.009362
4 0.005383 -0.044642 -0.036385 ... -0.002592 -0.031988 -0.046641
[5 rows x 10 columns]
from sklearn.linear_model import Ridge, Lasso, LinearRegression
linreg = LinearRegression().fit(X, y)
ridge = Ridge(alpha=0.2).fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)
import pandas as pd

coefs = pd.DataFrame({
    "coef_linreg": pd.Series(linreg.coef_, index=X.columns),
    "coef_ridge": pd.Series(ridge.coef_, index=X.columns),
    "coef_lasso": pd.Series(lasso.coef_, index=X.columns)})

coefs\
    .applymap(int)\
    .style.applymap(lambda x: 'color: red' if x == 0 else 'color: black')
❗️ Warning ❗️
When regularizing a Regression with a L2 penalty or with a L1 penalty:
We regularize \(\color{blue}{\beta_\color{green}{1}}, ..., \color{blue}{\beta_{\color{green}{p}}}\)
We do NOT regularize the intercept \(\color{blue}{\beta_0}\)
Keep in mind that Ridge reduces the influence of features whose coefficients are not significant
and that Lasso shrinks such coefficients all the way to zero
But \(\color{blue}{\beta_0}\) is not associated with any feature!
New hyper-parameter \(\color{red}{\alpha}\)
Dictates how much the model is regularized
Large \(\color{red}{\alpha}\) values force model complexity to decrease 👉 ⤵ variance, ⤴ bias
The sum starts from j = 1, we do not penalize the intercept coefficient
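A tiny sketch of this effect (assuming the diabetes data shown earlier): as \(\color{red}{\alpha}\) grows, Ridge shrinks the feature coefficients while the unpenalized intercept barely moves.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True, as_frame=True)

for alpha in [0.01, 1, 100]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:6}  max |beta_j|={abs(ridge.coef_).max():8.2f}  "
          f"intercept={ridge.intercept_:8.2f}")  # the intercept is never penalized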
Ridge regression augments the OLS loss function so as to not only minimize the sum of squared residuals but also penalize the size of the parameter estimates, shrinking them towards zero.
The unique feature of Ridge regression, compared to other penalization techniques, is that coefficients can be shrunk again and again but will never reach 0. In other words, all covariates always remain in the model: Ridge does not perform any form of variable selection.
Lasso penalizes the sum of the coefficients' absolute values (L1 penalty). As a result, for high values of \(\color{red}{\alpha}\), many coefficients are exactly zeroed under Lasso.
Coefficients that are shrunk to a point where they exactly equal 0 are clearly excluded from the model.
Elastic Net combines both penalties, weighted by a mixing ratio \(\color{orange}{\lambda}\):
\[L=\|y-\hat{y}\|^{2}+ \color{red}{\alpha}\,\big(\color{orange}{\lambda}\|\beta\|_1+(1 - \color{orange}{\lambda})\|\beta\|_2^2\big)\]
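In scikit-learn, ElasticNet exposes this trade-off through alpha and l1_ratio (playing the role of \(\color{orange}{\lambda}\), up to scaling constants in the library's own loss). A minimal sketch, assuming the same X and y as above:
from sklearn.linear_model import ElasticNet

# l1_ratio=1 -> pure L1 (Lasso-like) penalty, l1_ratio=0 -> pure L2 (Ridge-like) penalty
enet = ElasticNet(alpha=0.2, l1_ratio=0.5).fit(X, y)
print(enet.coef_)       # some coefficients can be exactly 0 thanks to the L1 part
print(enet.intercept_)  # the intercept is still not regularized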
☝️ Regularization tends to penalize features that are not statistically significant, but DO NOT use p-values here (do not even look at them): they cannot be correctly interpreted for a regularized model.
At this stage we no longer care about p-values. The fundamental difference between classic statistics and ML is that in the former we try to infer something from the data, while in the latter we try to predict something as precisely and efficiently as possible.
👉 Regularize when you think you are overfitting (e.g. Learning Curves not converging)
👉 Ridge when you believe all coefficients may have an impact
👉 Lasso as a feature selection tool (much better for interpretability!)
✅ Regularization is almost always appropriate
Ridge (L2) regularization is turned on by default in many Machine Learning models
You just have to tune the regularization parameter
❓ How to choose the best hyper-parameters (e.g: alpha)
Explore different combinations of hyperparameter values to find the one that optimizes performance
Hold out a validation set (never use the test set for model tuning!)
Select which grid of values of hyper-parameters to try out
For each combination of values, measure your performance on the validation set
Select hyper-parameters that produce the best performance
# Train and CV-score a model for each combination of hyperparameters
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score

hyperparams = [(alpha, l1_ratio)
               for alpha in [0.01, 0.1, 1]
               for l1_ratio in [0.2, 0.5, 0.8]]

for (alpha, l1_ratio) in hyperparams:
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    r2 = cross_val_score(model, X_train, y_train, cv=5).mean()  # default scoring is R^2
    print(f"alpha: {alpha}, l1_ratio: {l1_ratio}, r2: {r2:.3f}")
Randomly split your training set into k folds of the same size
Make fold #1 the val_set, train the model on the other k-1 folds & measure the val_score
Make fold #2 the val_set and repeat
…
Compute the average val_score over all folds (see the sketch below)
☝️ This is your cross-validated score for one given set of hyper-parameters
Repeat for each combination of hyper-parameter values to test
Save the test set for final evaluation only (AFTER hyper-params are chosen)
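A minimal sketch of those steps with scikit-learn's KFold (assuming X_train and y_train are pandas objects and one candidate value of alpha):
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # randomly split into k folds

val_scores = []
for train_idx, val_idx in kf.split(X_train):
    # fold #i becomes the val_set, the model is trained on the other k-1 folds
    model = Ridge(alpha=0.2).fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    val_scores.append(model.score(X_train.iloc[val_idx], y_train.iloc[val_idx]))

print(np.mean(val_scores))  # cross-validated score for this set of hyper-parameters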
GridSearchCV
from sklearn.model_selection import GridSearchCV, train_test_split
# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
# Instantiate model
model = ElasticNet()
# Hyperparameter Grid
grid = {
'alpha': [0.01, 0.1, 1],
'l1_ratio': [0.2, 0.5, 0.8]
}
# Instantiate Grid Search
search = GridSearchCV(
model,
grid,
scoring = 'r2',
cv = 5,
n_jobs=-1 # parallelize computation
)
# Fit data to Grid Search
search.fit(X_train, y_train);
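Once fitted, the search object stores the tuning results:
search.best_params_     # best combination found, e.g. {'alpha': ..., 'l1_ratio': ...}
search.best_score_      # its mean cross-validated R^2
search.best_estimator_  # an ElasticNet refitted on the whole training set with those params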
👎 Limitations of Grid Search:
Computationally costly
The optimal hyperparameter value can be missed
Can overfit hyperparameters to the training set if too many combinations are tried out for too small a dataset
Randomly explore hyperparameter values by specifying:
A hyperparameter space (lists or distributions) to randomly sample from
The number of candidate combinations to be tested
RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy import stats
# Instantiate model
model = ElasticNet()
# Hyperparameter Grid
grid = {'l1_ratio': stats.uniform(0, 1), 'alpha': [0.001, 0.01, 0.1, 1]}
# Instantiate Randomized Search
search = RandomizedSearchCV(
model,
grid,
scoring='r2',
n_iter=100, # number of draws
cv=5, n_jobs=-1
)
# Fit data to Randomized Search
search.fit(X_train, y_train)
search.best_estimator_
Choose a probability distribution for each hyperparameter:
Distributions can be generated with scipy.stats
from scipy import stats
import matplotlib.pyplot as plt

plt.figure(figsize=(5, 3))
# Pick ONE distribution depending on your prior knowledge:
dist = stats.norm(10, 2)          # if you have a best guess (say: 10)
dist = stats.randint(1, 100)      # if you have no idea (integers)
dist = stats.uniform(1, 100)      # same, but continuous
dist = stats.loguniform(0.01, 1)  # coarse-grain search across orders of magnitude
r = dist.rvs(size=10000)  # random draws
plt.hist(r);
loguniform is great for coarse-grain search across several orders of magnitude
e.g. loguniform(0.01, 1) searches over \([10^{-2}, 10^{-1}, 10^{0}]\)
RandomizedSearch vs GridSearch
👉 Randomized Search: for the same compute budget, it explores the space more broadly and can sample from continuous distributions
In any case:
✅ Always start with a coarse grain approach (can use Grid or RandomSearch)
✅ Then afterward, fine-tune your search
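One possible coarse-to-fine workflow (a sketch with illustrative value ranges, not prescribed by the slides): a RandomizedSearchCV over a loguniform space first, then a tight GridSearchCV around the best value found.
from scipy.stats import loguniform
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# 1. Coarse grain: random draws across several orders of magnitude
coarse = RandomizedSearchCV(
    ElasticNet(), {'alpha': loguniform(1e-4, 1e1)},
    n_iter=30, cv=5, scoring='r2', n_jobs=-1
).fit(X_train, y_train)
best_alpha = coarse.best_params_['alpha']

# 2. Fine tune: a tight grid around the best coarse value
fine = GridSearchCV(
    ElasticNet(), {'alpha': [best_alpha / 2, best_alpha, best_alpha * 2]},
    cv=5, scoring='r2', n_jobs=-1
).fit(X_train, y_train)
print(fine.best_params_)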
Bayesian Optimization provides a principled technique based on Bayes Theorem to direct a search of a global optimization problem that is efficient and effective.
It works by building a probabilistic model of the objective function, called the surrogate function (posterior probability), that is then searched efficiently with an acquisition function before candidate samples are chosen for evaluation on the real objective function.
Bayes theorem provides a framework that can be used to quantify the beliefs about the “location” of the optimal parameter set in the search-space.
Tuning hyperparameters using Bayesian logic helps reduce the time required to obtain an optimal parameter set.
pip install hyperopt
Currently three algorithms are implemented in hyperopt: Random Search, Tree of Parzen Estimators (TPE), and Adaptive TPE.
Tree-structured Parzen Estimator (TPE): a sequential model-based optimization (SMBO) algorithm that finds the inputs to an objective function which maximize the expected improvement.
from sklearn.model_selection import train_test_split,cross_val_score
from hyperopt import hp, fmin, tpe, Trials
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from hyperopt.pyll import scope
import pandas as pd
import numpy as np
# Load raw data
data = pd.read_csv('https://wagon-public-datasets.s3.amazonaws.com/houses_train_raw.csv', index_col="Id")
# Only keep numerical columns and rows without NaN
data = data.select_dtypes(include=np.number).dropna()
X = data.drop(columns=['SalePrice'])
y = data['SalePrice']
# Perform train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=2021)
scaler = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(scaler.transform(X_train),
columns = X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test),
columns = X_test.columns)
# Define the objective function to minimize (negative mean cross-validated R^2 score)
def objective(params):
# KNN regressor with hyperparameters to optimize
knn = KNeighborsRegressor(
n_neighbors=params['n_neighbors'])
# Cross-validated R^2 score
score = np.mean(cross_val_score(knn, X_train_scaled, y_train, cv=5, scoring='r2'))
# We want to maximize R^2, so we minimize the negative score
return -score
# Define parameter space using hyperopt random variables
# Define the hyperparameter search space
space = {
'n_neighbors': hp.choice('n_neighbors', np.arange(1, 11,1)) # Number of neighbors
#"n_neighbors": scope.int(hp.quniform("n_neighbors", 2, 30, 1)),
}
# Set up trials for tracking
trials = Trials()
# Use the Tree Parzen Estimator (TPE) algorithm for optimization
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50,
trials=trials,
rstate=np.random.default_rng(123))  # fix the random state for reproducibility
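☝️ With hp.choice, fmin returns the index of the selected option, not the value itself; hyperopt's space_eval recovers the actual hyper-parameter:
from hyperopt import space_eval

print(best)                     # {'n_neighbors': index into the choice array}
print(space_eval(space, best))  # {'n_neighbors': actual number of neighbors}
print(min(trials.losses()))     # best objective value found (negative cross-validated R^2)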
PyCaret is a simple, easy to learn, low-code machine learning library in Python. With PyCaret, you spend less time coding and more time on analysis.
pip install pycaret
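A minimal PyCaret regression sketch (assuming the same houses DataFrame data used above):
from pycaret.regression import setup, compare_models

# One call to configure preprocessing, one call to train & rank many models by CV score
s = setup(data=data, target='SalePrice', session_id=123)
best_model = compare_models()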