Category: Python/R

Scaffolding Azure Machine Learning Experiments

*please download the source code here

Microsoft has released the public preview of their newest data science service, Azure Machine Learning, that contains a collection of components to support the end-to-end machine learning solution. The Azure Machine Learning Workbench and the Azure Machine Learning Experimentation service are the two main components offered to machine learning practitioners to support them on exploratory data analysis, feature engineering and model selection and tuning.

This blog post describes how to conduct machine learning experiments with the supports of Azure Machine Learning Workbench and Azure Machine Learning Experimentation service. As the term “Experiment” implies, the process of building a machine learning model is not a waterfall process but instead an iterative process that involves multiple iteration of exploratory analysis, feature engineering, model selection and parameter tuning. To simplify the iterative experiment process and keep the experiment code in a neat structure, we can create some scaffolding code that takes care of the repeated operations for each iteration. Combining the scaffolding code and the job run history dashboard and version control feature offered by Azure Machine Learning, machine learning practitioners can conduct their experiments in a more organised style. There are many ways and patterns to construct the scaffolding code. This blog post will give an example and you can design your scaffolding code based on your own use cases.

Setup Azure Machine Learning environment

Firstly, we need to setup Azure Machine Learning environment, including creating experimentation accounts in Azure Machine Learning and installing required development tools on your computer. You can find the detailed guides from Microsoft official documentations here.

At the end of the setup, you should have the experimentation account created in your Azure tenant and installed Azure Machine Learning Workbench, Visual Studio Code Tools for AI, CLI tool and Python on your computer. In this blog post, I will use the Titanic survival dataset as the example that aims to predict the survival chance of a passenger based on a set of attributes of this passenger. You can find the dataset here.

Create Scaffolding Code and Make the Baseline (Iteration 0) Run

In this example, the following python files will be created to support the iterative experiment, including:

  • EDA & Preprocessing Jupyter notebook for EDA, data preprocessing and feature engineering
  • Experiment file for conducting the model evaluation, parameter tuning and output results to the job run dashboard
  • Individual model files to create the candidate model instance and the parameter options for tuning. In this example, three models are used as candidates, including Logistic Regression, Random Forest, and GBDT.

2

EDA & Preprocessing.ipynb

In the scaffolding version of the EDA & Preprocessing notebook, we only include the minimum data handling that is just enough to support the baseline run. As you can see from the snapshot below, only one-hot encoding is conducted, and the null values are just simply dropped.

1t1

In this example, we will experiment on three models, logistic regression, random forest, and GBDT. We create a separate python file for each model with a single function getModel(). This function will return the model name, model object, the dictionary of parameter options for randomised search cross-validation, and the number of iteration of the random searches.

model_lr.py

from sklearn.linear_model import LogisticRegression
from scipy.stats import randint

def getModel():
    # create logistic regression classifier
    lr = LogisticRegression(random_state = 2)

    # create parameter distribution for parameter tuning
    param_dist = {'penalty': ['l1','l2'], 
                  'C': [0.001,0.01,0.1,1,10,100,1000]}

    # return model dict
    return {'name':"Logistic Regression", 'model':lr, 'param_dist':param_dist, 'n_iter': 10}

model_RF.py

from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

def getModel():
    # create random forest classifer
    rf = RandomForestClassifier(n_estimators=20)

    # create parameter distribution for parameter tuning
    param_dist = {"max_depth": randint(6,9),
                  "max_features": ['auto', 12],
                  'n_estimators': [20, 50, 100, 150, 200],
                  "min_samples_split": randint(2, 10),
                  "min_samples_leaf": randint(2, 8),
                  "bootstrap": [True, False],
                  "criterion": ["gini", "entropy"]}

    # return model dict
    return {'name':"Random Forest", 'model':rf, 'param_dist':param_dist, 'n_iter': 20}

model_GBDT.py

import lightgbm as lgb
from scipy.stats import randint

def getModel():
    # create GBDT model
    gbm = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', is_unbalance=True, random_state=2, n_jobs=5)

    # create parameter distribution for parameter tuning
    param_dist = {
        'learning_rate': [0.005, 0.01, 0.1],
        'n_estimators': randint(50,300),
        'num_leaves': randint(20, 80),
        'feature_fraction':[0.5, 0.6, 0.7, 0.8],
        'bagging_fraction':[0.5, 0.6,0.7,0.8],
        'bagging_freq': randint(10,20)
    }

    # return model dict
    return {'name':"GBDT", 'model':gbm, 'param_dist':param_dist, 'n_iter': 20}

Optional – for each model file, you can also append the following code that enables you to perform the parameter tuning individually on each model through directly running of the individual python file.

import pandas as pd
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score

if __name__ == '__main__':
    # load preprocessed training dataset
    train = pd.read_csv('Data/train_processed.csv')

    # specify predictors and target columns
    target = "Survived"
    predictors =  [x for x in train.columns if x not in [target]]

    # fit model with random parameter search
    model = getModel()
    random_search = RandomizedSearchCV(model['model'], param_distributions=model['param_dist'], n_iter=model['n_iter'])
    random_search.fit(train[predictors], train[target])

    # Print top 5 scores and related param options
    results = random_search.cv_results_
    for i in range(1, 6):
        scores = np.flatnonzero(results['rank_test_score'] == i)
        for score in scores:
            print("Rank: {0}".format(i))
            print("score - mean: {0:.3f}, std: {1:.3f}".format(
                  results['mean_test_score'][score],
                  results['std_test_score'][score]))
            print("Parameters: {0}".format(results['params'][score]))

Experiment.py

The experiment file loads the data outputted from the EDA & Preprocessing notebook and fits into the models loaded from model_lr, model_RF, and model_GBDT files. RandomizedSearchCV is used to search the best parameters for each model (from the pre-defined parameter options). The best score for each model will then be logged into the job run history dashboard.

import pandas as pd
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score

from azureml.logging import get_azureml_logger
run_logger = get_azureml_logger()

import model_GBDT
import model_lr
import model_RF 

def runExperiment():
    # load preprocessed training dataset
    train = pd.read_csv('Data/train_processed.csv')

    # specify predictors and target columns
    target = "Survived"
    predictors =  [x for x in train.columns if x not in [target]]

    # get models from model files
    models = [model_GBDT.getModel(), model_lr.getModel(), model_RF.getModel()]

    # fit models with random parameter search and log the best score for each model to AML job run dashboard
    for model in models:
        random_search = RandomizedSearchCV(model['model'], param_distributions=model['param_dist'], n_iter=model['n_iter'])
        random_search.fit(train[predictors], train[target])
        results = random_search.cv_results_
        scores = np.flatnonzero(results['rank_test_score'] == 1)
        score = results['mean_test_score'][scores[0]]
        run_logger.log(model['name'], round(score, 3))


if __name__ == '__main__':
    runExperiment()

In the Azure Machine Learning Workbench, we can run the Experiment file. The job run history dashboard will show the results for each experiment iteration.  The snapshot below shows the results after the baseline (iteration 0) run.

1t12

Experiment – Iteration 1…n

After the scaffolding code is in place and the baseline evaluation scores are available, we can start our formal experiment iterations to improve the model performances. For each iteration, we may conduct various operations on data preprocessing, feature engineering and parameters tuning, and we can then run the Experiment file to generate the result on the job run history dashboard.

1t12

All the experiment iteration job run will be version controlled by the Azure Machine Learning Experimentation service. You can restore the code for any previous experiment iteration.

4

Exploratory Data Analysis in Python

I have written a Jupyter notebook describing the Exploratory Data Analysis using Python as shown below:

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

R Visual – Building Component Cycle Timeline

One common approach to detect exceptions of a machine is to monitor the correlative status of components in the machine. For example, in normal condition, two or more components should be running at the same time, or some components should be running in sequential order. When the components are not running in the way as expected, that indicates potential issues with the machine which need attentions from engineers.

Thanks to IoT sensors, it is easy to capture the data of component status in a machine. To help engineers to easily pick up the potential issues, a machine components timeline chart will be very helpful. However, Power BI does not provide this kind of timeline chart, and it can be time consuming to build custom Javascript-based visuals. Fortunately, we have R visual in Power BI. With a little help from ggplot2 library, we can easily build a timeline chart (as the one shown below) in four lines code.

f1

Firstly, we need add a R visual onto Power BI canvas, and set the data fields required for the chart.

f2

The minimal fields required are the name/code of the component, start date and end date of the component running cycle, such as:

ComponentCode Start DateTime End DateTime
Pump 2017-01-01 09:12:35 2017-01-01 09:18:37
Motor 2017-01-01 09:12:35 2017-01-01 09:18:37

In the R script editor, we use the geom_segment to draw each component running cycles. The x axis is for the time of component running, and the y axis is for the component name/code.

f3

Apart from showing a Power BI timeline chart for engineers to detect the machine issue as this blog post introduced, we can also send alerts to engineers using Azure Functions.

Extracting Features from IoT Sensor Data using R

In my previous blog I introduced the common patterns to extract features from IoT sensor data using Python. Although R is not my primary machine learning language it is becoming ubiquitous in Microsoft’s data analytics ecosystem after they acquired Revolution Analytics, the major commercial distributor of R. Considering the increasing popularity of R on Microsoft data platforms, I will create the R version of code for IoT data feature extraction in this blog.

This blog post is also organised based on the three common patterns for extracting feature from IoT sensor data:

  • Window-based descriptive statistics
  • Seasonal pattern
  • Trend pattern

Also, the examples use the same IoT sample data that stores the hourly reading from sensor A.

a1

  1. Window-based descriptive statistics

We can use the rollapply function in the zoo library to calculate the descriptive statistics values in a rolling window. As there is no function for Skewness in the core R packages we have to use the e1071 library that contains the Skewness and Kurtosis function.

data <- data %>%
        mutate(SensorA_Mean_12h=rollapply(SensorA, width=12, FUN=mean, by=1, fill=NA, align='right'),
               SensorA_SD_12h=rollapply(SensorA, width=12, FUN=sd, by=1, fill=NA, align='right'),
               SensorA_Skew_12h=rollapply(SensorA, width=12, FUN=skewness, by=1, fill=NA, align='right'),
               SensorA_Mean_24h=rollapply(SensorA, width=24, FUN=mean, by=1, fill=NA, align='right'),
               SensorA_SD_24h=rollapply(SensorA, width=24, FUN=sd, by=1, fill=NA, align='right'),
               SensorA_Skew_24h=rollapply(SensorA, width=24, FUN=skewness, by=1, fill=NA, align='right'),
               SensorA_Mean_72h=rollapply(SensorA, width=72, FUN=mean, by=1, fill=NA, align='right'),
               SensorA_SD_72h=rollapply(SensorA, width=72, FUN=sd, by=1, fill=NA, align='right'),
               SensorA_Skew_72h=rollapply(SensorA, width=72, FUN=skewness, by=1, fill=NA, align='right')
               ) 
tail(data, 5)

The code above will generate the following features:

a1

  1. Seasonal pattern

A date + time is represented in R as an object of class POSIXct. Once we convert the DateTime column into POSIXct, we can easily extract the parts of the datatime.

data$Date <- as.POSIXct(data$Date, "%Y-%m-%dT%H:%M:%S", tz="UTC")

data$DayOfWeek <- as.numeric(format(data$Date, "%u"))
data$IsWeekend <- ifelse (data$DayOfWeek>5, 1, 0)
data$Hour <- as.numeric(format(data$Date, "%H"))
data$IsWorkingHour <- ifelse (data$Hour>=9 & data$Hour<=17, 1, 0)
data$Year <- as.numeric(format(data$Date, "%Y"))
data$Month <- as.numeric(format(data$Date, "%m"))
data$DayOfMonth <- as.numeric(format(data$Date, "%d"))
tail(data, 5)

a2

  1. Trend pattern

In Python, we can use shift function to extract the features for representing the trend pattern in a time-series dataset. In R, a similar function is slide provided by DataCombine library.

data <- slide(data, Var = "SensorA", slideBy = -1:-7, 
      NewVar=c('SensorA_lag_1h', 'SensorA_lag_2h', 'SensorA_lag_3h', 'SensorA_lag_4h',
               'SensorA_lag_5h', 'SensorA_lag_6h', 'SensorA_lag_7h')
               )                                 
tail(data, 5)

We can the output as:
a3

Extracting Features from IoT Sensor Data using Python

The previous blog post discusses three common patterns for extracting feature from IoT sensor data:

  • Window-based descriptive statistics
  • Seasonal pattern
  • Trend pattern

This blog post introduces how to implement those three patterns in Python.

  1. Window-based descriptive statistics

There are three main types of descriptive statistics based on what they describe: distribution (e.g., skewness and kurtosis), central tendency (e.g., mean, median, and mode) and dispersion (e.g., standard deviation, variance, and Range). Python pandas package provides functions to a comprehensive list of descriptive statistics. You can find the reference to those functions here.

The descriptive statistics need to be calculated within a time window context, e.g., the last 12, 24, 72 hours. We can use the rolling method in pandas to get the rolling time window.

For example, we have the hourly reading data from sensor A:a1

We can get the rolling window sizing as 12, 24, 72 hours and calculate the mean, sd, and skew of each window size.

data['SensorA_mean_12h'] = data['SensorA'].rolling(12).mean()
data['SensorA_sd_12h'] = data['SensorA'].rolling(12).std()
data['SensorA_skew_12h'] = data['SensorA'].rolling(12).skew()
data['SensorA_mean_24h'] = data['SensorA'].rolling(24).mean()
data['SensorA_sd_24h'] = data['SensorA'].rolling(24).std()
data['SensorA_skew_24h'] = data['SensorA'].rolling(24).skew()
data['SensorA_mean_72h'] = data['SensorA'].rolling(72).mean()
data['SensorA_sd_72h'] = data['SensorA'].rolling(72).std()
data['SensorA_skew_72h'] = data['SensorA'].rolling(72).skew()
data.tails(5)

The python code above will generate the features as:

a1

  1. Seasonal pattern

As discussed in last blog post, the features representing seasonal pattern can be extracted from the timestamp of the IoT sensor data using the built-in Python datatime class, such as:

data['DayOfWeek']=data['DateTime'].dt.weekday
data['IsWeekend']=np.where(data['DateTime'].dt.weekday>4, 1, 0)
data['IsWorkingHour']=np.where((data['DateTime'].dt.hour>=9) & (data['DateTime'].dt.hour<=17), 1, 0)
data['Year']=data['DateTime'].dt.year
data['Month']=data['DateTime'].dt.month
data['DayOfMonth']=data['DateTime'].dt.day
data.tail(5)

We can get the output as:

a2

  1. Trend pattern

We can use shift function to extract the features for representing the trend pattern in a time-series dataset.

data['SensorA_lag_1h'] = data['SensorA'].shift(1)
data['SensorA_lag_2h'] = data['SensorA'].shift(2)
data['SensorA_lag_3h'] = data['SensorA'].shift(3)
data['SensorA_lag_4h'] = data['SensorA'].shift(4)
data['SensorA_lag_5h'] = data['SensorA'].shift(5)
data['SensorA_lag_6h'] = data['SensorA'].shift(6)
data['SensorA_lag_7h'] = data['SensorA'].shift(7)
data.tail(5)

We can the output as:
a3