Model Tuning

Plan

Model Complexity
Regularization
Model Tuning: Grid Search & Random Search
Bayesian Hyperparameter optimization
PYCARET

The Tuning Stage

1. Model Complexity

Let’s remember Linear Regression:

\[Y = \beta_0 + \beta_1 X_1 + \epsilon\]

❓What about non-linear behavior as below?

👉 If we add a new transformed feature \(X_1^2\) we have a better fit (in red)

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + ... + \epsilon\]

🤔 We could engineer very complex features: \(X^3, X^6, (X_1^2 * X_2^5), ...\)

Our \(R^2\) will keep increasing with every additional feature!

We just solved Data Science!

def train_best_model_ever(X,y):
    while True:
        model = LinearRegression()
        model.fit(X, y)
        if calculate_r2(model, X, y) < 0.999999:
            X = add_more_crazy_features(X)
        else:
            return model

Overfitting and Underfitting

What’s wrong with the first model? With the third?

The first one does not capture all the information in the data (high bias model)
The third one finds signals that aren’t there (noise); it will not generalize well to new data points (high variance model)

The Bias-Variance Tradeoff

One of the most important concepts in Data Science!

\[TotalError = \color{red}{Bias^2} + \color{teal}{Variance} + \color{grey}{Irreducible Error}\]

Conceptual Definition

Error due to Bias: The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict.

Of course you only have one model so talking about expected or average prediction values might seem a little strange. However, imagine you could repeat the whole model building process more than once: each time you gather new data and run a new analysis creating a new model.

Due to randomness in the underlying data sets, the resulting models will have a range of predictions. Bias measures how far off in general these models’ predictions are from the correct value.

Error due to Variance: The error due to variance is taken as the variability of a model prediction for a given data point. Again, imagine you can repeat the entire model building process multiple times. The variance is how much the predictions for a given point vary between different realizations of the model.

An optimal balance of bias and variance leads to a model that is neither overfit nor underfit.

📚 Great read

❓How do we find the optimal model complexity?

👉 Simple models for simple questions
Complex models for complex questions

👉 Use Regularization to decrease the complexity of the model
👉 Use Cross-Validation to tune the hyperparameters of the model

The model that minimizes the test error on an unseen dataset.

Solutions for Overfitting

Simplify your model relative to your data

Get more observations
Feature selection (manual or automated)
Dimensionality reduction (Unsupervised Learning)
Early stopping (Deep Learning)
Regularization of your Loss function

2. Regularization

Regularization means adding a penalty term to the Loss that increases with \(\beta\)

\[Regularized Loss = Loss(X,y, \color{blue}{\beta})\ + Penalty(\color{blue}{\beta})\]

👉 Penalizes large values for \(\beta_i\)
👉 Forces model to shrink certain coefficients or even select less features
👉 Prevents overfitting

\[\hat{y} = \color{blue}{\beta_0} + \color{blue}{\beta_1} X_1 + \color{blue}{\beta_2} X_1^2 + \color{blue}{\beta_3} X_1^3 + ...\]

The two most famous Regularization penalties are:

\[Ridge(\color{green}{X},\color{magenta}{y},\color{blue}{\beta}) = \underbrace{\sum_{i=0}^{n-1}(\color{magenta}{y_i}-(\color{blue}{\beta_0} + \color{blue}{\beta_1} \color{green}{x_1} + ... \color{blue}{\beta_p} \color{green}{x_p} ))^2}_{Loss(\color{green}{X},\color{magenta}{y},\color{blue}{\beta})} + \color{red}{\alpha} \sum_{\color{green}{j=1}}^{\color{green}{p}} \color{blue}{\beta_j}^{2}\]

\[Lasso(\color{green}{X},\color{magenta}{y},\color{blue}{\beta}) = \underbrace{\sum_{i=0}^{n-1}(\color{magenta}{y_i}-(\color{blue}{\beta_0} + \color{blue}{\beta_1} \color{green}{x_1} + ... \color{blue}{\beta_p} \color{green}{x_p} ))^2}_{Loss(\color{green}{X},\color{magenta}{y},\color{blue}{\beta})} + \color{red}{\alpha} \sum_{\color{green}{j=1}}^{\color{green}{p}} |\color{blue}{\beta_j}|\]

from sklearn import datasets
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X.head()

        age       sex       bmi  ...        s4        s5        s6
0  0.038076  0.050680  0.061696  ... -0.002592  0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068332 -0.092204
2  0.085299  0.050680  0.044451  ... -0.002592  0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022688 -0.009362
4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031988 -0.046641

[5 rows x 10 columns]

from sklearn.linear_model import Ridge, Lasso, LinearRegression

linreg = LinearRegression().fit(X, y)
ridge = Ridge(alpha=0.2).fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)

coefs = pd.DataFrame({
    "coef_linreg": pd.Series(linreg.coef_, index = X.columns),
    "coef_ridge": pd.Series(ridge.coef_, index = X.columns),
    "coef_lasso": pd.Series(lasso.coef_, index= X.columns)})\

coefs\
    .applymap(lambda x: int(x))\
    .style.applymap(lambda x: 'color: red' if x == 0 else 'color: black')

❗️ Warning ❗️

When regularizing a Regression with a L2 penalty or with a L1 penalty:

We regularize \(\color{blue}{\beta_\color{green}{1}}, ..., \color{blue}{\beta_{\color{green}{p}}}\)
We do NOT regularize the intercept \(\color{blue}{\beta_0}\)
- Keep in mind that Ridge reduces the influence of features with a non-significant coefficient
- Keep in mind that Lasso shrinks them to zero
- But \(\color{blue}{\beta_0}\) is not associated with any feature!

New hyper-parameter \(\color{red}{\alpha}\)

Dictates how much the model is regularized
Large \(\color{red}{\alpha}\) values force model complexity to decrease 👉 ⤵ variance, ⤴ bias

The sum starts from j = 1, we do not penalize the intercept coefficient

Ridge regression

Ridge regression augments the OLS loss function as to not only minimize the sum of squared residuals, but also penalize the size of the parameter estimates, shrinking them towards zero.

The unique feature of Ridge regression, as compared to other penalization techniques, is that coefficients can be shrunk over and over but will never reach 0. In other words, all covariates will always remain in the model, and Ridge does not provide any form of variable selection

Lasso regression

Lasso penalizes the sum of their absolute values (L1 penalty). As a results, for high values of λ, many coefficients are exactly zeroed under lasso.

Coefficients that are shrunk to a point where they exactly equal 0 are clearly excluded from the model.

ElasticNet = Lasso & Ridge Weighted Average

\[L=\|y-\hat{y} \|^{2}+ \color{red}{\alpha} (\color{orange}{\lambda}|\beta |+(1 - \color{orange}{\lambda})\|\beta \|^2)\]

☝️ Regularization tends to penalize features that are not statistically significant but DO NOT use p-values (do not even look at them). They cannot be correctly interpreted in this case.

ML vs stats

At this stage we do not care anymore about p-values. The fundamental difference between classic stats and ML is that in the former we were trying to infer something from the data, in the latter we are trying to predict something in the most precise and efficient way.

Conclusions

👉 Regularize when you think you are overfitting (e.g. Learning Curves not converging)

👉 Ridge when you believe all coefficients may have an impact

👉 Lasso as a feature selection tool (much better for interpretability!)

✅ Regularization is almost always appropriate

Ridge is often turned on by default in most Machine Learning models
You just have to tune the regularization parameter

3. Model Tuning

❓ How to choose the best hyper-parameters (e.g: alpha)

The Grid Search Method

Explores different hyperparameter value combinations to find the combination which optimizes performance

Hold out a validation set (never use the test set for model tuning!)
Select which grid of values of hyper-parameters to try out
For each combination of values, measure your performance on the validation set
Select hyper-parameters that produce the best performance

💻 Let’s (manually) fine-tune a linear model with ElasticNet regularization

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.20, random_state=1)

# Select hyperparam values to try

alphas = [0.01, 0.1, 1] # L1 + L2 
l1_ratios = [0.2, 0.5, 0.8] # L1 / L2 ratio

# create all combinations [(0.01, 0.2), (0.01, 0.5), (...)]
import itertools
hyperparams = itertools.product(alphas, l1_ratios)

# Train and CV-score model for each combination
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

for hyperparam in hyperparams:
    alpha = hyperparam[0]
    l1_ratio = hyperparam[1]

    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)

    r2 = cross_val_score(model, X_train, y_train, cv=5).mean()

    print(f"alpha: {alpha}, l1_ratio: {l1_ratio},   r2: {r2}")

🔥 Grid Search CV

Randomy split your training set into k folds of same size
Make fold #1 a val_set, train model on other k-1 folds & mesure val_score
Make fold #2 a val_set and repeat
…
Compute average val_score over all folds

☝️ This is your cross-validated score for one given set of hyper-parameters

Repeat for each value of hyper-param to test
Save the test set for final evaluation only (AFTER hyper-params are chosen)

Welcome to Sklearn `GridSearchCV` 🚀

from sklearn.model_selection import GridSearchCV

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# Instantiate model
model = ElasticNet()

# Hyperparameter Grid
grid = {
    'alpha': [0.01, 0.1, 1], 
    'l1_ratio': [0.2, 0.5, 0.8]
}

# Instantiate Grid Search
search = GridSearchCV(
    model,
    grid, 
    scoring = 'r2',
    cv = 5,
    n_jobs=-1 # parallelize computation
) 

# Fit data to Grid Search
search.fit(X_train, y_train);

# Best score
search.best_score_

# Best Params
search.best_params_

# Best estimator
search.best_estimator_

👎 Limitations of Grid Search:

Computationally costly
The optimal hyperparameter value can be missed
Can overfit hyperparameters to the training set if too many combinations are tried out for too small a dataset

Random Search

Randomly explore hyperparameter values from:

A hyperparameter space to randomly sample from
The specified number of samples to be tested

Sklearn’s `RandomizedSearchCV`

from sklearn.model_selection import RandomizedSearchCV
from scipy import stats

# Instantiate model
model = ElasticNet()

# Hyperparameter Grid
grid = {'l1_ratio': stats.uniform(0, 1), 'alpha': [0.001, 0.01, 0.1, 1]}

# Instantiate Grid Search
search = RandomizedSearchCV(
    model,
    grid, 
    scoring='r2',
    n_iter=100,  # number of draws
    cv=5, n_jobs=-1
)

# Fit data to Grid Search
search.fit(X_train, y_train)
search.best_estimator_

Choose hyperparameter probability distribution:

Can be generated with scipy.stats.distributions

from scipy import stats

plt.figure(figsize=(5, 3))

dist = stats.norm(10, 2) # if you have a best guess (say: 10)

dist = stats.randint(1,100) # if you have no idea
dist = stats.uniform(1, 100) # same

dist = stats.loguniform(0.01, 1) # Coarse grain search

r = dist.rvs(size=10000) # Random draws
plt.hist(r);

loguniform is great for coarse-grain search across several orders of magnitude

e.g. loguniform(0.01, 1) search over [10−210−2, 10−110−1, 100100]

`RandomizedSearch` vs `GridSearch`

👉 Randomized Search:

Less typing, if you want to try many values
Control for the number of combinations to try & time spent searching
Useful when some hyper-parameters are more important than others

In any case:
✅ Always start with a coarse grain approach (can use Grid or RandomSearch)
✅ Then afterward, fine-tune your search

4. Bayesian Hyperparameter optimization

Bayesian Optimization provides a principled technique based on Bayes Theorem to direct a search of a global optimization problem that is efficient and effective.

It works by building a probabilistic model of the objective function, called the surrogate function (posterior probability), that is then searched efficiently with an acquisition function before candidate samples are chosen for evaluation on the real objective function.

Bayes theorem provides a framework that can be used to quantify the beliefs about the “location” of the optimal parameter set in the search-space.

Tuning hyperparameters using Bayesian logic—helps reduce the time required to obtain an optimal parameter set.