Tag: scikit-learn

Tuning Hyper-Parameters using Grid Search

Hyper-parameter tuning is a common but time-consuming task that aims to select the hyper-parameter values that maximise the accuracy of the model. Normally, cross validation is used to support hyper-parameter tuning: the data set is split into a training set used to fit the learner and a validation set used to evaluate the model. The Python scikit-learn package provides the GridSearchCV class, which simplifies this task for machine learning practitioners.
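To make the mechanics concrete, here is a minimal sketch of 3-fold cross validation with scikit-learn's cross_val_score, using a synthetic data set and a placeholder estimator (neither is part of the predictive maintenance example below):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# toy data, just to show the cross-validation mechanics
X_demo, y_demo = make_classification(n_samples=500, random_state=10)

# each fold trains on 2/3 of the rows and scores on the held-out 1/3
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=3, scoring='roc_auc')
print(scores.mean())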

This blog post introduces how to use the GridSearchCV class to tune hyper-parameters, using a predictive maintenance dataset as an example.

Firstly, we need to import the required packages. Apart from scikit-learn, we also need to import pandas for the data preprocessing, and the LightGBM package for the GBDT model we are going to use.

import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

Then, we need to load and preprocess the training data.

# read data
train = pd.read_csv(r'E:\Data\predicitivemaintance_processed.csv')

# preparing the predictor set and target set
train = train.drop(['Date', 'FailureDate'],axis=1)
target = 'FailNextWeek'
feature_categorical = ['Model']
train = pd.get_dummies(train, columns=feature_categorical)
X = train.drop(columns=[target])
y = train[target]
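Here, pd.get_dummies one-hot encodes the categorical Model column, producing one binary indicator column per distinct machine model. A quick standalone illustration (the model names below are invented for the sketch):

demo = pd.DataFrame({'Model': ['M1', 'M2', 'M1']})
# yields one indicator column per distinct value: Model_M1, Model_M2
print(pd.get_dummies(demo, columns=['Model']))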

Next, we create a GBDT model with a set of baseline parameters.

model = lgb.LGBMClassifier(
    boosting_type="gbdt",
    is_unbalance=True,      # compensate for the imbalanced failure labels
    random_state=10,
    n_estimators=50,        # number of boosting rounds
    num_leaves=30,          # maximum number of leaves per tree
    max_depth=8,
    feature_fraction=0.5,   # fraction of features sampled for each tree
    bagging_fraction=0.8,   # fraction of rows sampled for bagging
    bagging_freq=15,        # perform bagging every 15 iterations
    learning_rate=0.01,
)
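One caveat: feature_fraction, bagging_fraction and bagging_freq are LightGBM's native parameter names, which the scikit-learn wrapper accepts as keyword aliases but may warn about. If you prefer to avoid the alias warnings, the same model can be written with the wrapper's own parameter names, as in this equivalent sketch:

# same model, using the scikit-learn wrapper's parameter names
model = lgb.LGBMClassifier(
    boosting_type="gbdt",
    is_unbalance=True,
    random_state=10,
    n_estimators=50,
    num_leaves=30,
    max_depth=8,
    colsample_bytree=0.5,   # alias of feature_fraction
    subsample=0.8,          # alias of bagging_fraction
    subsample_freq=15,      # alias of bagging_freq
    learning_rate=0.01,
)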

Now, we move to the key part: the hyper-parameter tuning. In this example, we will tune the 'n_estimators' and 'num_leaves' hyper-parameters. We need to configure the range of values to cover for each of them, and the grid search will cover the Cartesian product of the two sets of values. For each parameter pair, 3-fold cross validation is conducted, with AUC used for scoring. Here, range(200, 600, 80) yields five candidate values for 'n_estimators' and range(20, 60, 10) yields four for 'num_leaves', so 20 parameter pairs are evaluated with 3 folds each: 60 fits in total.

params_opt = {'n_estimators': range(200, 600, 80), 'num_leaves': range(20, 60, 10)}
gridSearchCV = GridSearchCV(estimator=model,
    param_grid=params_opt,
    scoring='roc_auc',
    n_jobs=4,
    verbose=1,
    cv=3)
gridSearchCV.fit(X, y)
# cv_results_ holds the per-pair scores (grid_scores_ was removed in newer scikit-learn)
gridSearchCV.cv_results_, gridSearchCV.best_params_, gridSearchCV.best_score_
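The cv_results_ attribute is a dict of arrays with one entry per parameter pair. A convenient way to read off the mean AUC for every pair is to load it into a DataFrame and pivot it, as in this small sketch (assuming the grid above):

results = pd.DataFrame(gridSearchCV.cv_results_)
# rows: n_estimators candidates, columns: num_leaves candidates,
# cells: mean cross-validated AUC
print(results.pivot(index='param_n_estimators',
                    columns='param_num_leaves',
                    values='mean_test_score'))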

After the code has run, we get the mean AUC value for each parameter pair.

[Figure: table of mean AUC scores for each (n_estimators, num_leaves) pair]

As we can see, the best hyper-parameter value for 'num_leaves' is 30 and for 'n_estimators' is 360. We can then use those values to replace the baseline values in the GBDT model and continue tuning the other hyper-parameters.
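For instance, a next round could fix the winners and grid-search another group of hyper-parameters; in this sketch the candidate ranges for max_depth and learning_rate are illustrative choices, not values from the original post:

# fix the tuned values, then search the next group of parameters
model.set_params(n_estimators=360, num_leaves=30)
params_opt = {'max_depth': [6, 8, 10], 'learning_rate': [0.005, 0.01, 0.05]}
gridSearchCV = GridSearchCV(estimator=model,
    param_grid=params_opt,
    scoring='roc_auc',
    n_jobs=4,
    verbose=1,
    cv=3)
gridSearchCV.fit(X, y)
print(gridSearchCV.best_params_, gridSearchCV.best_score_)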