Tag: Machine Learning / Data Mining

Scaffolding Azure Machine Learning Experiments

*please download the source code here

Microsoft has released the public preview of its newest data science service, Azure Machine Learning, which contains a collection of components that support end-to-end machine learning solutions. The Azure Machine Learning Workbench and the Azure Machine Learning Experimentation service are the two main components offered to machine learning practitioners to support exploratory data analysis, feature engineering, and model selection and tuning.

This blog post describes how to conduct machine learning experiments with the support of the Azure Machine Learning Workbench and the Azure Machine Learning Experimentation service. As the term “Experiment” implies, building a machine learning model is not a waterfall process but an iterative one that involves multiple rounds of exploratory analysis, feature engineering, model selection and parameter tuning. To simplify the iterative process and keep the experiment code neatly structured, we can create some scaffolding code that takes care of the operations repeated in every iteration. By combining the scaffolding code with the job run history dashboard and the version control feature offered by Azure Machine Learning, machine learning practitioners can conduct their experiments in a more organised way. There are many ways and patterns to construct the scaffolding code. This blog post gives one example, and you can design your own scaffolding code based on your use cases.

Set Up the Azure Machine Learning Environment

Firstly, we need to set up the Azure Machine Learning environment, which includes creating an experimentation account in Azure Machine Learning and installing the required development tools on your computer. You can find detailed guides in the official Microsoft documentation here.

At the end of the setup, you should have the experimentation account created in your Azure tenant, and Azure Machine Learning Workbench, Visual Studio Code Tools for AI, the CLI tool and Python installed on your computer. In this blog post, I will use the Titanic survival dataset as the example; the goal is to predict the survival chance of a passenger based on a set of attributes of that passenger. You can find the dataset here.

Create Scaffolding Code and Make the Baseline (Iteration 0) Run

In this example, the following files will be created to support the iterative experiment:

  • EDA & Preprocessing Jupyter notebook for EDA, data preprocessing and feature engineering
  • Experiment file for conducting the model evaluation, parameter tuning and output results to the job run dashboard
  • Individual model files to create the candidate model instance and the parameter options for tuning. In this example, three models are used as candidates, including Logistic Regression, Random Forest, and GBDT.


EDA & Preprocessing.ipynb

In the scaffolding version of the EDA & Preprocessing notebook, we include only the minimum data handling needed to support the baseline run. As you can see from the snapshot below, only one-hot encoding is conducted, and rows with null values are simply dropped.
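The minimal preprocessing described above can be sketched as follows. This is only an illustration under stated assumptions, not the notebook's actual code: the tiny inline DataFrame stands in for the raw Titanic training data, and the helper name preprocess_baseline is hypothetical.

```python
import pandas as pd

# Tiny stand-in for the raw Titanic training data (values are illustrative only).
raw = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Embarked": ["S", "C", "S", None],
})

def preprocess_baseline(df, categorical_columns):
    """Hypothetical helper: drop rows with any null, then one-hot encode."""
    cleaned = df.dropna()
    return pd.get_dummies(cleaned, columns=categorical_columns)

train_processed = preprocess_baseline(raw, ["Sex", "Embarked"])
print(train_processed.columns.tolist())
```

In the real notebook, the result would then be written out (for example with to_csv) to the Data/train_processed.csv file that the experiment code reads.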


In this example, we will experiment with three models: logistic regression, random forest, and GBDT. We create a separate Python file for each model with a single function, getModel(). This function returns the model name, the model object, the dictionary of parameter options for the randomised search cross-validation, and the number of iterations of the random search.


from sklearn.linear_model import LogisticRegression

def getModel():
    # create logistic regression classifier
    # (the liblinear solver supports both the 'l1' and 'l2' penalties searched below)
    lr = LogisticRegression(solver='liblinear', random_state = 2)

    # create parameter distribution for parameter tuning
    param_dist = {'penalty': ['l1','l2'], 
                  'C': [0.001,0.01,0.1,1,10,100,1000]}

    # return model dict
    return {'name':"Logistic Regression", 'model':lr, 'param_dist':param_dist, 'n_iter': 10}


from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

def getModel():
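    # create random forest classifier
    rf = RandomForestClassifier(n_estimators=20)

    # create parameter distribution for parameter tuning
    # ('sqrt' replaces the older 'auto' option, which meant the same thing for
    # classifiers and has been removed from recent scikit-learn releases)
    param_dist = {"max_depth": randint(6,9),
                  "max_features": ['sqrt', 12],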
                  'n_estimators': [20, 50, 100, 150, 200],
                  "min_samples_split": randint(2, 10),
                  "min_samples_leaf": randint(2, 8),
                  "bootstrap": [True, False],
                  "criterion": ["gini", "entropy"]}

    # return model dict
    return {'name':"Random Forest", 'model':rf, 'param_dist':param_dist, 'n_iter': 20}


import lightgbm as lgb
from scipy.stats import randint

def getModel():
    # create GBDT model
    gbm = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', is_unbalance=True, random_state=2, n_jobs=5)

    # create parameter distribution for parameter tuning
    param_dist = {
        'learning_rate': [0.005, 0.01, 0.1],
        'n_estimators': randint(50,300),
        'num_leaves': randint(20, 80),
        'feature_fraction':[0.5, 0.6, 0.7, 0.8],
        'bagging_fraction':[0.5, 0.6,0.7,0.8],
        'bagging_freq': randint(10,20)
    }

    # return model dict
    return {'name':"GBDT", 'model':gbm, 'param_dist':param_dist, 'n_iter': 20}

Optional: to each model file you can also append the following code, which lets you tune the parameters of that model on its own by running the individual Python file directly.

import pandas as pd
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score

if __name__ == '__main__':
    # load preprocessed training dataset
    train = pd.read_csv('Data/train_processed.csv')

    # specify predictors and target columns
    target = "Survived"
    predictors =  [x for x in train.columns if x not in [target]]

    # fit model with random parameter search
    model = getModel()
    random_search = RandomizedSearchCV(model['model'], param_distributions=model['param_dist'], n_iter=model['n_iter'])
    random_search.fit(train[predictors], train[target])

    # Print top 5 scores and related param options
    results = random_search.cv_results_
    for i in range(1, 6):
        scores = np.flatnonzero(results['rank_test_score'] == i)
        for score in scores:
            print("Rank: {0}".format(i))
            print("score - mean: {0:.3f}, std: {1:.3f}".format(
                results['mean_test_score'][score],
                results['std_test_score'][score]))
            print("Parameters: {0}".format(results['params'][score]))


The experiment file loads the data output by the EDA & Preprocessing notebook and fits the models loaded from the model_lr, model_RF, and model_GBDT files. RandomizedSearchCV is used to search for the best parameters for each model (from the pre-defined parameter options). The best score for each model is then logged to the job run history dashboard.

import pandas as pd
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score

from azureml.logging import get_azureml_logger
run_logger = get_azureml_logger()

import model_GBDT
import model_lr
import model_RF 

def runExperiment():
    # load preprocessed training dataset
    train = pd.read_csv('Data/train_processed.csv')

    # specify predictors and target columns
    target = "Survived"
    predictors =  [x for x in train.columns if x not in [target]]

    # get models from model files
    models = [model_GBDT.getModel(), model_lr.getModel(), model_RF.getModel()]

    # fit models with random parameter search and log the best score for each model to AML job run dashboard
    for model in models:
        random_search = RandomizedSearchCV(model['model'], param_distributions=model['param_dist'], n_iter=model['n_iter'])
        random_search.fit(train[predictors], train[target])
        results = random_search.cv_results_
        scores = np.flatnonzero(results['rank_test_score'] == 1)
        score = results['mean_test_score'][scores[0]]
        run_logger.log(model['name'], round(score, 3))

if __name__ == '__main__':
    runExperiment()

In the Azure Machine Learning Workbench, we can run the Experiment file. The job run history dashboard will show the results for each experiment iteration.  The snapshot below shows the results after the baseline (iteration 0) run.


Experiment – Iteration 1…n

After the scaffolding code is in place and the baseline evaluation scores are available, we can start the formal experiment iterations to improve model performance. In each iteration, we may make various changes to the data preprocessing, feature engineering and parameter tuning, and then run the Experiment file to generate the results on the job run history dashboard.
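As an illustration of what one iteration might change, the sketch below replaces the baseline's "drop nulls" step with median imputation and adds one engineered feature. The helper name preprocess_iteration_1, the FamilySize feature and the inline sample values are assumptions made for this example; they are not taken from the original experiment.

```python
import pandas as pd

# Illustrative stand-in rows; Age, SibSp and Parch are Titanic dataset columns.
raw = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, None, 26.0, 35.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 2, 0],
})

def preprocess_iteration_1(df):
    """Hypothetical iteration-1 preprocessing: impute Age, add FamilySize."""
    out = df.copy()
    # keep rows with a missing Age by filling in the median instead of dropping them
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # engineered feature: siblings/spouses + parents/children + the passenger
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    return out

train_iter1 = preprocess_iteration_1(raw)
print(train_iter1)
```

Because the Experiment file only depends on the processed dataset, a change like this can be evaluated by re-running the same experiment code and comparing the new scores with the baseline on the dashboard.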


All experiment iteration job runs are version controlled by the Azure Machine Learning Experimentation service. You can restore the code for any previous experiment iteration.


Exploratory Data Analysis in Python

I have written a Jupyter notebook describing the Exploratory Data Analysis using Python as shown below:


Questions to Ask when Starting a Predictive Maintenance Project

One of the major use cases of industrial IoT is predictive maintenance, which continuously monitors the condition and performance of equipment during normal operation and predicts future equipment failure based on previous failure and maintenance history. With accurate failure prediction, organisations can reduce the cost of unplanned breakdowns and unnecessary preventive maintenance. Attracted by the prospect of large cost savings, many organisations are interested in deploying predictive maintenance solutions.

When starting a predictive maintenance project, a number of questions need to be put to the business to help make the solution design decisions.

Firstly, we need to know what type of prediction the organisation is aiming at. There are three types of prediction we can normally do for predictive maintenance:

RUL (Remaining Useful Life) – This is a regression-type prediction that estimates the remaining usable time of a piece of equipment before it runs to failure. This type of prediction is suitable for equipment that does not run in a fixed time pattern.

Failure within next period – This is a two-class classification prediction that estimates whether or not the equipment will fail within the next period (e.g., next week). This type of prediction can alert the engineers to a potential failure so that they can arrange maintenance in time to avoid it.

Failure within which next period – This is a multi-class classification prediction. Instead of predicting whether the equipment will fail within the next period, it estimates within which of the next periods (e.g., next week, next two weeks, or next month) the equipment will fail.
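To make the three prediction types concrete, the sketch below derives one target column for each of them from a known failure date. The daily records, the failure date and the window boundaries are invented for illustration; only the FailNextWeek column name echoes the dataset used later on this page.

```python
import pandas as pd

# Hypothetical daily records for one machine that fails on 2017-01-20.
records = pd.DataFrame({"Date": pd.date_range("2017-01-01", periods=20, freq="D")})
failure_date = pd.Timestamp("2017-01-20")

# 1. RUL: regression target, number of days remaining until the failure
records["RUL"] = (failure_date - records["Date"]).dt.days

# 2. Failure within next period: binary target, 1 if the failure is at most 7 days away
records["FailNextWeek"] = (records["RUL"] <= 7).astype(int)

# 3. Failure within which next period: multi-class target built from RUL buckets
records["FailWindow"] = pd.cut(
    records["RUL"],
    bins=[-1, 7, 14, float("inf")],
    labels=["next_week", "next_two_weeks", "later"],
)

print(records.tail(10))
```

All three targets come from the same failure history; the choice between them only changes how the time to failure is encoded.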

Secondly, we need to ask what time window (e.g., hour, day, or week) to use for the prediction. This question helps us decide the granularity of the training dataset. Depending on the type of equipment and the way it is used, some equipment failures may be predictable weeks before they happen, while others can only be predicted hours before. Therefore, we need to choose a suitable granularity for the time windows and aggregate the raw per-sensor reading data accordingly.
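Aggregating raw readings to the chosen time window can be sketched with pandas resampling. The hourly SensorA values below are synthetic, and the daily window and the chosen statistics are assumptions for the example rather than recommendations.

```python
import numpy as np
import pandas as pd

# Synthetic hourly readings covering two days (values 0..47).
readings = pd.DataFrame(
    {"SensorA": np.arange(48, dtype=float)},
    index=pd.date_range("2017-01-01", periods=48, freq=pd.Timedelta(hours=1)),
)

# Aggregate the hourly readings into one row per day.
daily = readings["SensorA"].resample("D").agg(["mean", "std", "min", "max"])
print(daily)
```

Switching the window only means changing the resampling rule, so the granularity can be revisited once the business has answered the question above.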

Based on the answers to the first two questions, we can work out a list of prerequisites for the predictive maintenance solution. Some historical data has to be available before we can train the predictive model, for example:

  • Historical data of equipment states (e.g., the measurement values of the components and unusual events such as liquid leaks)
  • Equipment reference data (e.g., the normal value range of a component state, such as the minimum and maximum temperature in normal condition). We need the reference data to extract the exception states of the equipment that may contribute to the predictive model
  • Equipment failure history. This data is essential for predictive maintenance modelling; without it we cannot establish the relationship between the equipment states and the failure events.
  • Equipment maintenance history. We need to know how long it has been since the machine was last maintained, which can be an important predictor of a potential failure. In addition, the frequency of equipment maintenance can be a candidate indicator of the health status of the equipment.
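Two of the items above can be turned into features with a few lines of pandas: an exception flag derived from the reference range, and the time since the last maintenance. The column names, the temperature range and the dates in this sketch are all hypothetical.

```python
import pandas as pd

# Hypothetical equipment state history and maintenance history.
states = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-05", "2017-01-12", "2017-01-20"]),
    "Temperature": [70.0, 95.0, 72.0],
})
maintenance = pd.DataFrame({
    "MaintenanceDate": pd.to_datetime(["2017-01-01", "2017-01-15"]),
})

# Assumed reference data: the normal operating temperature range.
TEMP_MIN, TEMP_MAX = 60.0, 90.0

# Exception state: 1 when the reading falls outside the normal range.
states["TempOutOfRange"] = (~states["Temperature"].between(TEMP_MIN, TEMP_MAX)).astype(int)

# Attach the most recent maintenance on or before each reading (both frames are sorted by date).
states = pd.merge_asof(
    states, maintenance,
    left_on="Date", right_on="MaintenanceDate",
    direction="backward",
)
states["DaysSinceMaintenance"] = (states["Date"] - states["MaintenanceDate"]).dt.days
print(states)
```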

Missing historical data can be a deal-breaker. If that happens, we need to go back to square one and start planning the data collection systematically.


Evaluate Feature Importance using Tree-based Model

Tree-based models can be used to evaluate the importance of features. In this blog post I go through the steps of evaluating feature importance using the GBDT model in LightGBM. LightGBM is the gradient boosting framework released by Microsoft, known for its high accuracy and speed (some tests show that LightGBM can produce predictions as accurate as XGBoost's while running up to 25x faster).

Firstly, we import the required packages: pandas for the data preprocessing, LightGBM for the GBDT model, and matplotlib for building the feature importance bar chart.

import pandas as pd
import matplotlib.pylab as plt
import lightgbm as lgb

Then, we need to load and preprocess the training data. In this example, we use a predictive maintenance dataset.

# read data
train = pd.read_csv(r'E:\Data\predicitivemaintance_processed.csv')

# drop the columns that are not used for the model
train = train.drop(['Date', 'FailureDate'],axis=1)

# set the target column
target = 'FailNextWeek'

# One-hot encoding
feature_categorical = ['Model']
train = pd.get_dummies(train, columns=feature_categorical)

Next, we train the GBDT model with the training data.

lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'num_leaves': 30,
    'num_round': 360,
    'learning_rate': 0.01,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.8,
    'bagging_freq': 12
}
# build the LightGBM dataset from the predictors and the target column
lgb_train = lgb.Dataset(train.drop(target, axis=1), train[target])
model = lgb.train(lgb_params, lgb_train)

After the model is trained, we can pass it to LightGBM's plot_importance function to plot the importance of the features.

lgb.plot_importance(model, max_num_features=30)
plt.title("Feature importances")
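When a ranked table is more convenient than a chart, the importances can also be collected into a sorted pandas Series. The sketch below deliberately swaps in scikit-learn's GradientBoostingClassifier and synthetic data so that it runs without the predictive maintenance dataset; it is not the LightGBM model trained above, and the feature names are made up. With LightGBM itself, the trained Booster exposes comparable numbers through its feature_importance() method.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: the target depends on "signal" only; the other columns are noise.
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
y = (X["signal"] > 0).astype(int)

# Stand-in gradient boosting model (scikit-learn, not LightGBM).
model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Rank the features by importance, highest first.
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```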


Tuning Hyper-Parameters using Grid Search

Hyper-parameter tuning is a common but time-consuming task that aims to select the hyper-parameter values that maximise the performance of the model. Normally, cross validation is used to support hyper-parameter tuning: it splits the dataset into a training set for fitting the learner and a validation set for testing the model. The Python scikit-learn package provides the GridSearchCV class, which simplifies this task for machine learning practitioners.

This blog post shows how to use the GridSearchCV class to tune hyper-parameters, using a predictive maintenance dataset as the example.

Firstly, we need to import the required packages. Apart from scikit-learn, we also need to import pandas for the data preprocessing and the LightGBM package for the GBDT model we are going to tune.

import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

Then, we need to load and preprocess the training data.

# read data
train = pd.read_csv(r'E:\Data\predicitivemaintance_processed.csv')

# preparing the predictor set and target set
train = train.drop(['Date', 'FailureDate'],axis=1)
target = 'FailNextWeek'
feature_categorical = ['Model']
train = pd.get_dummies(train, columns=feature_categorical)
X = train.drop(target, axis=1)
y = train[target]

Next, we create a GBDT model with a set of baseline parameters.

model = lgb.LGBMClassifier( 

Now, we move to the key part of the hyper-parameter tuning. In this example, we will tune the hyper-parameters ‘n_estimators’ and ‘num_leaves’. We need to configure the range of values we want to cover for each of them, and the grid search will cover the Cartesian product of those two sets of values. For each parameter pair, 3-fold cross validation is conducted with AUC used for scoring. In this example, 20 parameter pairs are evaluated with 3 folds each, so 60 fits are conducted in total.

params_opt = {'n_estimators':range(200, 600, 80), 'num_leaves':range(20,60,10)}
gridSearchCV = GridSearchCV(estimator = model, 
    param_grid = params_opt, 
    scoring = 'roc_auc',
    cv = 3)
gridSearchCV.fit(X, y)
gridSearchCV.cv_results_, gridSearchCV.best_params_, gridSearchCV.best_score_

After the code is run, we get the mean AUC value for each pair of parameters.


As we can see, the best value for ‘num_leaves’ is 30 and for ‘n_estimators’ is 360. We can then use those values to replace the baseline values used in the GBDT model and continue to tune other hyper-parameters.
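The number of fits quoted above can be checked by enumerating the grid by hand. This sketch uses only the standard library; the variable names grid, n_folds and total_fits are introduced for the example.

```python
from itertools import product

# Same parameter ranges as in the grid search above.
params_opt = {"n_estimators": range(200, 600, 80), "num_leaves": range(20, 60, 10)}
n_folds = 3

# Cartesian product of the two value sets: one dict per candidate pair.
names = list(params_opt)
grid = [dict(zip(names, values)) for values in product(*params_opt.values())]

# Each candidate is fitted once per cross-validation fold.
total_fits = len(grid) * n_folds
print(len(grid), total_fits)
```

The product grows multiplicatively with every added hyper-parameter, which is one reason to tune a couple of parameters at a time as done here.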

Extracting Features from IoT Sensor Data using R

In my previous blog post I introduced the common patterns for extracting features from IoT sensor data using Python. Although R is not my primary machine learning language, it is becoming ubiquitous in Microsoft’s data analytics ecosystem after the company acquired Revolution Analytics, the major commercial distributor of R. Considering the increasing popularity of R on Microsoft data platforms, I will create the R version of the IoT data feature extraction code in this post.

This blog post is also organised around the three common patterns for extracting features from IoT sensor data:

  • Window-based descriptive statistics
  • Seasonal pattern
  • Trend pattern

Also, the examples use the same IoT sample data, which stores the hourly readings from sensor A.


  1. Window-based descriptive statistics

We can use the rollapply function in the zoo library to calculate descriptive statistics over a rolling window. As there is no skewness function in the core R packages, we use the e1071 library, which contains the skewness and kurtosis functions.

library(dplyr)   # provides %>% and mutate
library(zoo)     # provides rollapply
library(e1071)   # provides skewness

data <- data %>%
        mutate(SensorA_Mean_12h=rollapply(SensorA, width=12, FUN=mean, by=1, fill=NA, align='right'),
               SensorA_SD_12h=rollapply(SensorA, width=12, FUN=sd, by=1, fill=NA, align='right'),
               SensorA_Skew_12h=rollapply(SensorA, width=12, FUN=skewness, by=1, fill=NA, align='right'),
               SensorA_Mean_24h=rollapply(SensorA, width=24, FUN=mean, by=1, fill=NA, align='right'),
               SensorA_SD_24h=rollapply(SensorA, width=24, FUN=sd, by=1, fill=NA, align='right'),
               SensorA_Skew_24h=rollapply(SensorA, width=24, FUN=skewness, by=1, fill=NA, align='right'),
               SensorA_Mean_72h=rollapply(SensorA, width=72, FUN=mean, by=1, fill=NA, align='right'),
               SensorA_SD_72h=rollapply(SensorA, width=72, FUN=sd, by=1, fill=NA, align='right'),
               SensorA_Skew_72h=rollapply(SensorA, width=72, FUN=skewness, by=1, fill=NA, align='right'))
tail(data, 5)

The code above will generate the following features:


  2. Seasonal pattern

A date and time is represented in R as an object of class POSIXct. Once we convert the Date column into POSIXct, we can easily extract the parts of the datetime.

data$Date <- as.POSIXct(data$Date, format="%Y-%m-%dT%H:%M:%S", tz="UTC")

data$DayOfWeek <- as.numeric(format(data$Date, "%u"))
data$IsWeekend <- ifelse (data$DayOfWeek>5, 1, 0)
data$Hour <- as.numeric(format(data$Date, "%H"))
data$IsWorkingHour <- ifelse (data$Hour>=9 & data$Hour<=17, 1, 0)
data$Year <- as.numeric(format(data$Date, "%Y"))
data$Month <- as.numeric(format(data$Date, "%m"))
data$DayOfMonth <- as.numeric(format(data$Date, "%d"))
tail(data, 5)


  3. Trend pattern

In Python, we can use the shift function to extract features that represent the trend pattern in a time-series dataset. In R, a similar function is slide, provided by the DataCombine library.

library(DataCombine)   # provides slide

data <- slide(data, Var = "SensorA", slideBy = -1:-7, 
      NewVar=c('SensorA_lag_1h', 'SensorA_lag_2h', 'SensorA_lag_3h', 'SensorA_lag_4h',
               'SensorA_lag_5h', 'SensorA_lag_6h', 'SensorA_lag_7h'))
tail(data, 5)

We can see the output as:
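For comparison, the Python shift approach mentioned in this section can be sketched with pandas as follows. The SensorA values are invented, and the column names simply mirror the ones used in the R code.

```python
import pandas as pd

# Invented hourly readings from sensor A.
data = pd.DataFrame({"SensorA": [10, 11, 13, 12, 15, 18, 17, 19, 21, 20]})

# One lag column per hour, from 1 to 7 hours back.
for lag in range(1, 8):
    data["SensorA_lag_{}h".format(lag)] = data["SensorA"].shift(lag)

print(data.tail(5))
```

As in the R version, the earliest rows have no history to look back on, so their lag columns hold missing values.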