Model Tuning

Plan

  1. Model Complexity
  2. Regularization
  3. Model Tuning: Grid Search & Random Search
  4. Bayesian Hyperparameter optimization
  5. PYCARET

The Tuning Stage

1. Model Complexity

Let’s remember Linear Regression:

\[Y = \beta_0 + \beta_1 X_1 + \epsilon\]

❓What about non-linear behavior as below?

👉 If we add a new transformed feature \(X_1^2\) we have a better fit (in red)

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + ... + \epsilon\]

🤔 We could engineer very complex features: \(X^3, X^6, (X_1^2 * X_2^5), ...\)

Our \(R^2\) will keep increasing with every additional feature!

We just solved Data Science!

def train_best_model_ever(X,y):
    while True:
        model = LinearRegression()
        model.fit(X, y)
        if calculate_r2(model, X, y) < 0.999999:
            X = add_more_crazy_features(X)
        else:
            return model

Overfitting and Underfitting

What’s wrong with the first model? With the third?

  • The first one does not capture all the information in the data (high bias model)

  • The third one finds signals that aren’t there (noise); it will not generalize well to new data points (high variance model)

The Bias-Variance Tradeoff

One of the most important concepts in Data Science!

\[TotalError = \color{red}{Bias^2} + \color{teal}{Variance} + \color{grey}{Irreducible Error}\]

Conceptual Definition

Error due to Bias: The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict.


Of course you only have one model so talking about expected or average prediction values might seem a little strange. However, imagine you could repeat the whole model building process more than once: each time you gather new data and run a new analysis creating a new model.


Due to randomness in the underlying data sets, the resulting models will have a range of predictions. Bias measures how far off in general these models’ predictions are from the correct value.

Error due to Variance: The error due to variance is taken as the variability of a model prediction for a given data point. Again, imagine you can repeat the entire model building process multiple times. The variance is how much the predictions for a given point vary between different realizations of the model.

An optimal balance of bias and variance leads to a model that is neither overfit nor underfit.

📚 Great read

❓How do we find the optimal model complexity?

👉 Simple models for simple questions
Complex models for complex questions

👉 Use Regularization to decrease the complexity of the model
👉 Use Cross-Validation to tune the hyperparameters of the model

The model that minimizes the test error on an unseen dataset.

Solutions for Overfitting

Simplify your model relative to your data

  • Get more observations

  • Feature selection (manual or automated)

  • Dimensionality reduction (Unsupervised Learning)

  • Early stopping (Deep Learning)

  • Regularization of your Loss function

2. Regularization

Regularization means adding a penalty term to the Loss that increases with \(\beta\)

\[Regularized Loss = Loss(X,y, \color{blue}{\beta})\ + Penalty(\color{blue}{\beta})\]

👉 Penalizes large values for \(\beta_i\)
👉 Forces model to shrink certain coefficients or even select less features
👉 Prevents overfitting

\[\hat{y} = \color{blue}{\beta_0} + \color{blue}{\beta_1} X_1 + \color{blue}{\beta_2} X_1^2 + \color{blue}{\beta_3} X_1^3 + ...\]

The two most famous Regularization penalties are:

\[Ridge(\color{green}{X},\color{magenta}{y},\color{blue}{\beta}) = \underbrace{\sum_{i=0}^{n-1}(\color{magenta}{y_i}-(\color{blue}{\beta_0} + \color{blue}{\beta_1} \color{green}{x_1} + ... \color{blue}{\beta_p} \color{green}{x_p} ))^2}_{Loss(\color{green}{X},\color{magenta}{y},\color{blue}{\beta})} + \color{red}{\alpha} \sum_{\color{green}{j=1}}^{\color{green}{p}} \color{blue}{\beta_j}^{2}\]

\[Lasso(\color{green}{X},\color{magenta}{y},\color{blue}{\beta}) = \underbrace{\sum_{i=0}^{n-1}(\color{magenta}{y_i}-(\color{blue}{\beta_0} + \color{blue}{\beta_1} \color{green}{x_1} + ... \color{blue}{\beta_p} \color{green}{x_p} ))^2}_{Loss(\color{green}{X},\color{magenta}{y},\color{blue}{\beta})} + \color{red}{\alpha} \sum_{\color{green}{j=1}}^{\color{green}{p}} |\color{blue}{\beta_j}|\]

from sklearn import datasets
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X.head()
        age       sex       bmi  ...        s4        s5        s6
0  0.038076  0.050680  0.061696  ... -0.002592  0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068332 -0.092204
2  0.085299  0.050680  0.044451  ... -0.002592  0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022688 -0.009362
4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031988 -0.046641

[5 rows x 10 columns]

from sklearn.linear_model import Ridge, Lasso, LinearRegression

linreg = LinearRegression().fit(X, y)
ridge = Ridge(alpha=0.2).fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)

coefs = pd.DataFrame({
    "coef_linreg": pd.Series(linreg.coef_, index = X.columns),
    "coef_ridge": pd.Series(ridge.coef_, index = X.columns),
    "coef_lasso": pd.Series(lasso.coef_, index= X.columns)})\

coefs\
    .applymap(lambda x: int(x))\
    .style.applymap(lambda x: 'color: red' if x == 0 else 'color: black')

❗️ Warning ❗️

When regularizing a Regression with a L2 penalty or with a L1 penalty:

  • We regularize \(\color{blue}{\beta_\color{green}{1}}, ..., \color{blue}{\beta_{\color{green}{p}}}\)

  • We do NOT regularize the intercept \(\color{blue}{\beta_0}\)

    • Keep in mind that Ridge reduces the influence of features with a non-significant coefficient

    • Keep in mind that Lasso shrinks them to zero

    • But \(\color{blue}{\beta_0}\) is not associated with any feature!

New hyper-parameter \(\color{red}{\alpha}\)

  • Dictates how much the model is regularized

  • Large \(\color{red}{\alpha}\) values force model complexity to decrease 👉 ⤵ variance, ⤴ bias

The sum starts from j = 1, we do not penalize the intercept coefficient

Ridge regression

Ridge regression augments the OLS loss function as to not only minimize the sum of squared residuals, but also penalize the size of the parameter estimates, shrinking them towards zero.


The unique feature of Ridge regression, as compared to other penalization techniques, is that coefficients can be shrunk over and over but will never reach 0. In other words, all covariates will always remain in the model, and Ridge does not provide any form of variable selection

Lasso regression

Lasso penalizes the sum of their absolute values (L1 penalty). As a results, for high values of λ, many coefficients are exactly zeroed under lasso.


Coefficients that are shrunk to a point where they exactly equal 0 are clearly excluded from the model.

ElasticNet = Lasso & Ridge Weighted Average

\[L=\|y-\hat{y} \|^{2}+ \color{red}{\alpha} (\color{orange}{\lambda}|\beta |+(1 - \color{orange}{\lambda})\|\beta \|^2)\]

☝️ Regularization tends to penalize features that are not statistically significant but DO NOT use p-values (do not even look at them). They cannot be correctly interpreted in this case.

ML vs stats

At this stage we do not care anymore about p-values. The fundamental difference between classic stats and ML is that in the former we were trying to infer something from the data, in the latter we are trying to predict something in the most precise and efficient way.

Conclusions

👉 Regularize when you think you are overfitting (e.g. Learning Curves not converging)

👉 Ridge when you believe all coefficients may have an impact

👉 Lasso as a feature selection tool (much better for interpretability!)

✅ Regularization is almost always appropriate

  • Ridge is often turned on by default in most Machine Learning models

  • You just have to tune the regularization parameter

3. Model Tuning

❓ How to choose the best hyper-parameters (e.g: alpha)

The Grid Search Method

Explores different hyperparameter value combinations to find the combination which optimizes performance

  1. Hold out a validation set (never use the test set for model tuning!)

  2. Select which grid of values of hyper-parameters to try out

  3. For each combination of values, measure your performance on the validation set

  4. Select hyper-parameters that produce the best performance

💻 Let’s (manually) fine-tune a linear model with ElasticNet regularization

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.20, random_state=1)

# Select hyperparam values to try

alphas = [0.01, 0.1, 1] # L1 + L2 
l1_ratios = [0.2, 0.5, 0.8] # L1 / L2 ratio

# create all combinations [(0.01, 0.2), (0.01, 0.5), (...)]
import itertools
hyperparams = itertools.product(alphas, l1_ratios) 

# Train and CV-score model for each combination
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

for hyperparam in hyperparams:
    alpha = hyperparam[0]
    l1_ratio = hyperparam[1]

    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)

    r2 = cross_val_score(model, X_train, y_train, cv=5).mean()

    print(f"alpha: {alpha}, l1_ratio: {l1_ratio},   r2: {r2}")

🔥 Grid Search CV

  1. Randomy split your training set into k folds of same size

  2. Make fold #1 a val_set, train model on other k-1 folds & mesure val_score

  3. Make fold #2 a val_set and repeat

  4. Compute average val_score over all folds

☝️ This is your cross-validated score for one given set of hyper-parameters

  • Repeat for each value of hyper-param to test

  • Save the test set for final evaluation only (AFTER hyper-params are chosen)

Welcome to Sklearn GridSearchCV 🚀

from sklearn.model_selection import GridSearchCV

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# Instantiate model
model = ElasticNet()

# Hyperparameter Grid
grid = {
    'alpha': [0.01, 0.1, 1], 
    'l1_ratio': [0.2, 0.5, 0.8]
}

# Instantiate Grid Search
search = GridSearchCV(
    model,
    grid, 
    scoring = 'r2',
    cv = 5,
    n_jobs=-1 # parallelize computation
) 

# Fit data to Grid Search
search.fit(X_train, y_train);

# Best score
search.best_score_

# Best Params
search.best_params_

# Best estimator
search.best_estimator_

👎 Limitations of Grid Search:

  • Computationally costly

  • The optimal hyperparameter value can be missed

  • Can overfit hyperparameters to the training set if too many combinations are tried out for too small a dataset

Sklearn’s RandomizedSearchCV

from sklearn.model_selection import RandomizedSearchCV
from scipy import stats

# Instantiate model
model = ElasticNet()

# Hyperparameter Grid
grid = {'l1_ratio': stats.uniform(0, 1), 'alpha': [0.001, 0.01, 0.1, 1]}

# Instantiate Grid Search
search = RandomizedSearchCV(
    model,
    grid, 
    scoring='r2',
    n_iter=100,  # number of draws
    cv=5, n_jobs=-1
)

# Fit data to Grid Search
search.fit(X_train, y_train)
search.best_estimator_

Choose hyperparameter probability distribution:

Can be generated with scipy.stats.distributions

from scipy import stats

plt.figure(figsize=(5, 3))

dist = stats.norm(10, 2) # if you have a best guess (say: 10)

dist = stats.randint(1,100) # if you have no idea
dist = stats.uniform(1, 100) # same

dist = stats.loguniform(0.01, 1) # Coarse grain search

r = dist.rvs(size=10000) # Random draws
plt.hist(r);

loguniform is great for coarse-grain search across several orders of magnitude

e.g. loguniform(0.01, 1) search over [10−210−2, 10−110−1, 100100]

RandomizedSearch vs GridSearch

👉 Randomized Search:

  • Less typing, if you want to try many values
  • Control for the number of combinations to try & time spent searching
  • Useful when some hyper-parameters are more important than others

In any case:
✅ Always start with a coarse grain approach (can use Grid or RandomSearch)
✅ Then afterward, fine-tune your search

4. Bayesian Hyperparameter optimization

Bayesian Optimization provides a principled technique based on Bayes Theorem to direct a search of a global optimization problem that is efficient and effective.


It works by building a probabilistic model of the objective function, called the surrogate function (posterior probability), that is then searched efficiently with an acquisition function before candidate samples are chosen for evaluation on the real objective function.

Bayes theorem provides a framework that can be used to quantify the beliefs about the “location” of the optimal parameter set in the search-space.

Tuning hyperparameters using Bayesian logic—helps reduce the time required to obtain an optimal parameter set.

📚 Great read

HyperOpt

pip install hyperopt

Currently three algorithms are implemented in hyperopt:

  • Random Search
  • Tree of Parzen Estimators (TPE)
  • Adaptive TPE

TREE-STRUCTURED PARZEN ESTIMATOR (TPE) The Tree-structured Parzen Estimator approach is a sequential model-based optimization (SMBO) algorithm that finds the inputs to an objective function that maximizes expected improvement.

from sklearn.model_selection import train_test_split,cross_val_score
from hyperopt import hp, fmin, tpe, Trials
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from hyperopt.pyll import scope
import pandas as pd
import numpy as np

# Load raw data
data = pd.read_csv('https://wagon-public-datasets.s3.amazonaws.com/houses_train_raw.csv', index_col="Id")

# Only keep numerical columns and raws without NaN
data = data.select_dtypes(include=np.number).dropna()

X = data.drop(columns=['SalePrice'])
y = data['SalePrice']

# Perform train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=2021)

scaler = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(scaler.transform(X_train),
                              columns = X_train.columns)

X_test_scaled = pd.DataFrame(scaler.transform(X_test),
                              columns = X_test.columns)

Objective function

# Define the objective function to minimize (negative mean cross-validated R^2 score)
def objective(params):
    # KNN regressor with hyperparameters to optimize
    knn = KNeighborsRegressor(
        n_neighbors=params['n_neighbors'])

    # Cross-validated R^2 score
    score = np.mean(cross_val_score(knn, X_train_scaled, y_train, cv=5, scoring='r2'))
    
    # We want to maximize R^2, so we minimize the negative score
    return -score

Search Space

# Define parameter space using hyperopt random variables
# Define the hyperparameter search space
space = {
    'n_neighbors': hp.choice('n_neighbors', np.arange(1, 11,1))  # Number of neighbors
    #"n_neighbors": scope.int(hp.quniform("n_neighbors", 2, 30, 1)),

}

# Set up trials for tracking
trials = Trials()

# Use the Tree Parzen Estimator (TPE) algorithm for optimization
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50,
            trials=trials,
            rstate=np.random.default_rng(123)) # fixing random state for the reproducibility)

# Print the best hyperparameters found
print("Best Hyperparameters:")
print(best)

5. PYCARET

PyCaret is a simple, easy to learn, low-code machine learning library in Python. With PyCaret, you spend less time coding and more time on analysis.



pip install pycaret

### load sample dataset from pycaret dataset module
from pycaret.datasets import get_data
data = get_data('insurance')
# import pycaret regression and init setup
from pycaret.regression import *
s = setup(data, target = 'charges', session_id = 123)

# compare baseline models
best = compare_models()
# predict on test set
holdout_pred = predict_model(best)

💻 Tutorial