Category: Machine Learning / Data Mining

dqops Data Quality Rules (Part 2) – CFD, Machine Learning


The previous blog post introduced a list of basic data quality rules that have been developed for my R&D data quality improvement initiative. Those rules are fundamental and essential for detecting data quality problems. However, they have been around for a long, long time and are neither innovative nor exciting. More importantly, those rules are only capable of detecting the Type I, “syntax” data quality problems and are not able to find the Type II, “semantic” data quality problems.

Type II “Semantic” Data Quality Problems

A data quality rule is expected to be an effective tool for automating the process of detecting and assessing data quality issues, especially for large datasets that are labour-intensive to check manually. The basic rules introduced in the previous blog post are effective at detecting the Type I problems that require “know-what” knowledge of the data, such as the problems falling into the completeness, consistency, uniqueness and validity data quality dimensions.

Type II data quality problems are the “semantic” problems that require “know-how” domain knowledge and experience to detect and solve. This type of problem is more enigmatic: the data looks fine on the surface and serves well most of the time. However, the damage it causes to the business can be much bigger.

Martin Doyle’s 80:20 rule states that around 80% of data quality problems are Type I problems and only 20% are Type II problems. However, the Type II problems consume 80% of the effort and budget to solve, while the Type I problems only take 20%.

Under the pressure of limited project time and budgets, and because Type II data quality problems are normally less obvious than Type I problems, they are often overlooked and remain unnoticed until the bad data has already damaged the business. The later the problems are found, the more severe the damage can be. Based on the 1:10:100 rule initially developed by George Labovitz and Yu Sang Chang in 1992, the cost of solving quality problems increases exponentially over time. While it might only cost you $1 to detect and solve a data quality problem at an early stage, the cost of failure caused by hidden data quality problems could be $100 or more.

Therefore, it is important to detect and solve Type II problems before they damage the business. Unfortunately, solving Type II data quality problems is not easy: it requires domain experts to invest a significant amount of time to study the data, identify patterns and locate the issues. This can become mission impossible as the data volume grows larger and larger. Not only is the current manual approach expensive and time-consuming, it is also highly dependent on the domain experts’ knowledge of and experience with the business processes, including the “devil in the details” type of knowledge that only exists in those experts’ heads.

So, can we find an alternative approach to solving Type II problems without depending on domain experts’ knowledge and experience? Can we identify the patterns and relations existing in a dataset automatically, using the computational power of modern computers to exhaustively scan through all possibilities? Machine Learning and Data Mining techniques might provide an answer.

Based on my literature review, a number of ML/DM techniques are promising for handling Type II data quality problems. In my dqops R&D project, I have been experimenting with those techniques and exploring potential ways to adapt them in a software app for non-technical DQ analysts. In the rest of this blog post, I describe the approaches I have been exploring in my toy dqops app, focusing on Conditional Functional Dependency and self-service supervised machine learning, with a brief introduction to the anomaly detection and PII detection features.

Conditional Functional Dependency (CFD)

CFD, originally introduced by Professor Wenfei Fan from the University of Edinburgh, is an extension of traditional functional dependencies (FD). An FD is a class of constraints where the values of a set of columns X determine the values of another set of columns Y, represented as X → Y. CFD extends FD by incorporating bindings of semantically related values, making it capable of expressing the semantics of data attribute dependencies. CFD is one of the most promising constraints for detecting and repairing inconsistencies in a dataset, and its use makes it possible to automatically identify context-dependent quality rules. Please refer to my previous blog post for a detailed explanation of CFD.

CFD has demonstrated its potential for commercial adoption in the real world. However, to use it commercially, effective and efficient methods for discovering the CFDs in a dataset are required. Even for a relatively small dataset, the workload of manually identifying and validating all the CFDs is overwhelming. Therefore, the CFD discovery process has to be automated. A number of CFD discovery algorithms have been developed since the concept of CFD was introduced. In my dqops app, I have implemented the discovery algorithm from Chiang & Miller in Python. Please refer to their paper for details.

The snapshot below shows the dependencies discovered from the Kaggle Adult Income dataset.

Based on the discovered dependencies, DQ analysts can create DQ rules to detect any data row that does not comply with a dependency. In dqops DQ studio, two rule templates can be used to define the dependency constraint: the “Business Logics” rule and the “Custom SQL” rule. The snapshot below shows the definition panel of the “Business Logics” rule.

Self-service Supervised Machine Learning

One traditional but effective method for detecting data quality problems is to use a “Golden Dataset” as a reference and search for records that conflict with the patterns present in the “Golden Dataset”. This method is feasible for a “narrow” dataset with a manageable number of columns/attributes. However, for a dataset with a large number of attributes per record, it becomes impossible to manually identify the patterns in the dataset. Fortunately, this happens to be what machine learning is good at.

In my dqops R&D project, I have been exploring the potential approaches to take advantage of Machine Learning techniques to identify Type II data quality problems. There are two prerequisites required for this idea to work:

  • A “Golden Dataset”
  • A self-service machine learning model builder

The main steps for this idea are:

  1. DQ analysts prepare the “Golden Dataset”
  2. DQ analysts train and verify a regression or classification model with the “Golden Dataset”
  3. DQ analysts apply the trained model to test target datasets

In the dqops app, I have developed a self-service machine learning model builder that allows DQ analysts without machine learning skills to train machine learning models by themselves. Under the hood, the auto-sklearn package is used for the model training.
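To illustrate what happens under the hood, below is a minimal sketch of training a classification model with auto-sklearn. The file name, target column and time budgets are assumptions for illustration, not the actual dqops implementation.

import pandas as pd
import autosklearn.classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# load the "Golden Dataset" (hypothetical file and target column)
golden = pd.read_csv('golden_dataset.csv')
target = 'income'

# encode categorical features numerically for simplicity
X = pd.get_dummies(golden.drop(columns=[target]))
y = golden[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# auto-sklearn searches algorithms and hyperparameters within the given time budget
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=600,  # total search budget in seconds
    per_run_time_limit=60         # budget per candidate model
)
automl.fit(X_train, y_train)

# evaluate the best ensemble found on the hold-out split
print(accuracy_score(y_test, automl.predict(X_test)))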

I have been trying my best to make the self-service model builder as easy to use as possible. DQ analysts just need to upload the “Golden Dataset”, specify the target column (i.e. the dependent variable) and select the task type, regression or classification. Some advanced settings are also available for further tuning by advanced users.

Once submitted, the model is trained in the backend. When the model is ready for use, DQ analysts can create an “ML Regression” rule on datasets with the same schema as the “Golden Dataset”.

When the rule executes, it loads the model, predicts the target column (dependent variable) and calculates the tolerance range of the expected value. When a field value falls outside the expected value range, that value is treated as inconsistent.
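A simplified sketch of that check is shown below; the model persistence format, tolerance value and column names are assumptions for illustration rather than the actual rule implementation.

import joblib
import pandas as pd

# load a previously trained model (hypothetical path and persistence format)
model = joblib.load('models/ml_regression_rule.pkl')

def find_inconsistent_rows(df: pd.DataFrame, target: str, tolerance: float = 0.1) -> pd.Series:
    """Flag rows whose target value falls outside the predicted tolerance range."""
    predictors = [c for c in df.columns if c != target]
    predicted = model.predict(df[predictors])
    lower = predicted * (1 - tolerance)
    upper = predicted * (1 + tolerance)
    return (df[target] < lower) | (df[target] > upper)  # True = inconsistent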

The decisive factor for this idea to work is the accuracy of the models trained by the self-service builder. Auto-sklearn is able to choose the best-performing algorithms and optimal hyperparameters automatically based on cross-validation results. Therefore, the quality of the “Golden Dataset” decides whether the trained model will work or not.

Anomaly Detection

Anomaly detection is a popular use case for commercial AI-enabled data quality monitoring software. It logs the history of data quality observations and detects anomalies. Anomaly detection is an effective technique for raising awareness of data quality issues. I have also implemented an anomaly detection feature in the dqops app. The approach I have been using is to conduct a time-series analysis of the historical observations using the fbprophet package and check whether or not the current observation falls within the predicted value range.
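The sketch below shows the general idea, assuming the historical observations are stored with the ds/y column names that fbprophet expects; the file name and the new observation value are hypothetical.

import pandas as pd
from fbprophet import Prophet

# historical data quality observations: ds = observation date, y = observed metric value
history = pd.read_csv('dq_observations.csv', parse_dates=['ds'])

# fit a time-series model and use its 95% uncertainty interval as the expected range
m = Prophet(interval_width=0.95)
m.fit(history)

# forecast one step ahead and compare the latest observation with the predicted range
future = m.make_future_dataframe(periods=1)
forecast = m.predict(future)
expected = forecast.iloc[-1]

current_observation = 123.0  # the new observation to check (hypothetical value)
is_anomaly = not (expected['yhat_lower'] <= current_observation <= expected['yhat_upper'])
print(is_anomaly)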

PII Detection

Strictly speaking, data privacy is a regulatory issue rather than a data quality issue. However, PII handling is often put into the data quality assurance process and treated as a data cleansing task. In the dqops app, I created a PII detection rule that can be applied to a text column to detect any person names. The implementation of the PII detection rule is rather simple with the spaCy package.
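A minimal sketch of such a rule is shown below; it assumes the en_core_web_sm model is installed, and the column name in the usage comment is hypothetical.

import spacy

# load a small English pipeline with named entity recognition
nlp = spacy.load('en_core_web_sm')

def contains_person_name(text: str) -> bool:
    """Return True when spaCy's NER finds a PERSON entity in the text."""
    doc = nlp(str(text))
    return any(ent.label_ == 'PERSON' for ent in doc.ents)

# example: flag rows of a text column that contain person names
# df['pii_detected'] = df['comment'].apply(contains_person_name)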

Data Quality Improvement – Conditional Functional Dependency (CFD)


To fulfil the promise I made before, I dedicate this blog post to the topic of Conditional Functional Dependency (CFD). The reason I dedicate a whole blog post to this topic is that CFD is one of the most promising constraints for detecting and repairing inconsistencies in a dataset. The use of CFD makes it possible to automatically identify context-dependent quality rules, which makes it a promising tool for automatic data quality rule discovery. CFD was originally developed by Professor Wenfei Fan from the University of Edinburgh. I have listed some of his papers on CFD [1][4][5] in the reference section at the end of this post.

Conditional Functional Dependency (CFD)

CFD is an extension of traditional functional dependencies (FD). An FD is a class of constraints where the values of a set of columns X determine the values of another set of columns Y, represented as X → Y. FD, which was developed mainly for schema design [1], is required to hold on the entire relation(s), which makes it less useful for detecting data inconsistencies in real-world datasets. CFD extends FD by incorporating bindings of semantically related values, making it capable of expressing the semantics of data that are fundamental to data cleaning [1].

Here I will borrow the classic ‘cust’ relation example from [1], which I found to be possibly the simplest way to explain the CFD concept. Figure 1 shows an instance r0 of the ‘cust’ relation that specifies a customer in terms of the customer’s phone (country code (CC), area code (AC), phone number (PN)), name (NM), and address (street (STR), city (CT), zip code (ZIP)).

Figure 1. An instance of the cust relation
Traditional Functional Dependency (FD)

From the customer records shown in Figure 1, we can see that all the customers with the same country code (CC) and area code (AC) are also located in the same city (CT). For example, all customers with CC ’01’ and AC ‘908’ are located in the CT ‘MH’ (t1, t2, and t4). Here we have the functional dependency f1: [CC, AC] → [CT], i.e. CC and AC determine the value of CT. As this dependency applies to the entire relation, it is a traditional functional dependency. In addition, we can find another traditional functional dependency in the relation, f2: [CC, AC, PN] → [STR, CT, ZIP], which represents the dependency that phone country code, area code and number determine the street name, city, and zip code.

Conditional Functional Dependency (CFD)

Let’s take a look at another example, f: [ZIP] → [STR]. The customers (t5 and t6) with the same ZIP code ‘EH4 1DT’ do have the same STR value ‘High St.’. Now let’s check the dependency on the entire relation for all customers. We can see that the dependency applies to t1 and t2, with ZIP ‘07974’ and STR ‘Tree Ave.’. However, it does not apply to t4, which has ZIP ‘07974’ but STR ‘Elm Str.’. In this case, the ZIP code ‘07974’ cannot determine a unique STR value. Therefore, the constraint [ZIP] → [STR] does not hold on the entire relation, so it is not a functional dependency. However, the constraint [ZIP] → [STR] does hold for the customers with country code (CC) ’44’, i.e. a UK address. In other words, the postcode can determine the street in the UK address system but not in the US address system, and the constraint [ZIP] → [STR] only holds when the country code is 44. This type of constraint is a conditional functional dependency (CFD), which can be notated as ([CC, ZIP] → STR, (44, _ || _ )), where [CC, ZIP] → STR is the functional dependency and (44, _ || _ ) specifies the condition, i.e. when CC is 44, ZIP uniquely determines STR. The notation of a CFD can be generalised as (X → A, tp), where X → A is an FD and tp is a tuple pattern over the attributes in X and A that defines the subset on which the FD holds.
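To make the notion concrete, here is a small sketch that checks the CFD ([CC, ZIP] → STR, (44, _ || _ )) against a pandas DataFrame holding the cust relation. The column names follow Figure 1, and the check is a simplification for illustration rather than a general CFD engine.

import pandas as pd

def cfd_violations(cust: pd.DataFrame) -> pd.DataFrame:
    """Return the tuples that violate ([CC, ZIP] -> STR, (44, _ || _))."""
    # restrict to the tuples matching the condition CC = 44 (UK addresses)
    uk = cust[cust['CC'].astype(str) == '44']
    # within that subset, each ZIP must map to exactly one STR value
    conflicting_zips = uk.groupby('ZIP')['STR'].nunique()
    conflicting_zips = conflicting_zips[conflicting_zips > 1].index
    return uk[uk['ZIP'].isin(conflicting_zips)]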

Constant Conditional Functional Dependency (CCFD)

As mentioned above, a CFD can be expressed as (X → A, tp) where tp is a tuple pattern, such as the earlier example (44, _ || _ ), where the symbol ‘||’ separates the left-hand side (LHS) from the right-hand side (RHS) of the dependency and the symbol ‘_’ represents any possible value, i.e. ‘_’ acts as a variable in programming terms. One special case of CFD is where all the attributes defining the tuple pattern are constants. For example, for the CFD ([CC, AC] → CT, (01, 908 || MH)), all the attributes defining the tuple pattern are constants: CC=01, AC=908, CT=MH. In plain English, the phone country code 01 and phone area code 908 of a customer determine the city where this customer is located to be MH. Compared to a normal CFD, a CCFD attracts more attention from the data quality community as it expresses the semantics at the most detailed level and is easy to use for checking data consistency.

CFD Discovery

Compared to FD, CFD is more effective for detecting and repairing inconsistencies in data [2][3]. CFD has demonstrated its potential for commercial adoption in the real world. However, to use it commercially, effective and efficient methods for discovering the CFDs in a dataset are required. Even for a relatively small dataset, the workload of manually identifying and validating all the CFDs is overwhelming. Therefore, the CFD discovery process has to be automated. A number of CFD discovery algorithms [2][5][6] have been developed since the concept of CFD was introduced.

In this section, I will introduce the CFD discovery algorithm developed by Chiang & Miller [2]. This method has been shown by Chiang & Miller [2] to be effective at capturing semantic data quality rules that enforce a broad range of domain constraints. In addition, as redundant candidates are pruned as early as possible, the search space for the set of minimal CFDs is reduced, as is the computation time for discovering the CFDs.

Chiang & Miller used 12 double-column pages in their paper [2] to elaborate their algorithm. That paper is informative and worth reading in detail if you are interested in data quality rule discovery. In the rest of this blog post, I will try my best to use simple language to explain how this algorithm works.

Firstly, let’s assume we have a relation with four attributes, A, B, C, D, and we want to find the CFDs in the relation. The result set of CFDs needs to be minimal, with redundant CFDs disregarded. In addition, we expect the algorithm to take as little computation time as possible. For the four attributes A, B, C, D, the search space for the CFDs can be illustrated as an attribute search lattice (Figure 2). Each node in the lattice represents an attribute set; for example, ‘AC’ represents the attribute set containing attribute ‘A’ and attribute ‘C’. A single-direction edge between two nodes represents a candidate CFD; for example, the edge (AC, ABC) represents ([A, C] → B, (x, x || x)), where the pattern tuples are unknown and need to be mined by the algorithm.

Figure 2: Attribute search lattice

The candidate CFDs are evaluated by traversing the lattice using a breadth-first search (BFS), i.e. starting at the root and visiting all nodes at the present depth level before moving on to the next level. In our example, we first traverse the first level (k = 1), which consists of the single-attribute sets ([A], [B], [C], [D]), followed by the second level (k = 2), which consists of the 2-attribute sets ([A, B], [A, C], [A, D], [B, C], [B, D], [C, D]), and so on until level k = total levels - 1 or until all CFDs in a minimal set have been found.
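The level-wise enumeration can be sketched as follows. This only illustrates the traversal order and candidate generation, not the pruning logic of Chiang & Miller’s full algorithm.

from itertools import combinations

attributes = ['A', 'B', 'C', 'D']

# breadth-first, level-wise enumeration of the attribute search lattice;
# each pair (X, X union {A}) corresponds to a candidate CFD X -> A
for k in range(1, len(attributes)):
    for lhs in combinations(attributes, k):
        for rhs in sorted(set(attributes) - set(lhs)):
            print(f"level {k}: candidate {list(lhs)} -> {rhs}")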

To optimise the algorithm’s efficiency, not all nodes in the lattice are visited. Instead, the algorithm prunes nodes that are supersets of already discovered rules and nodes that do not satisfy the threshold. In this way, the search space is reduced by skipping the evaluation of descendant nodes. For example, when the candidate ([A, C] → B, (A=’x’, _ || _)) is found to be a CFD, there is no point in evaluating the candidate ([A, C, D] → B, (A=’x’, _, _ || _)), as the CFD ([A, C] → B, (A=’x’, _ || _)) already covers its semantics.

To evaluate whether or not a candidate in the lattice is a valid CFD, we can identify the values of the left-hand side (LHS) of the candidate and check whether the same LHS values map to the same value of the right-hand side (RHS). To model this programmatically, we group the tuples by the LHS attribute values: for a valid CFD, the tuples in an LHS group must not split into two or more RHS groups. Generally speaking, the more tuples fall into a group that represents a valid CFD, the more preferable that CFD is. A threshold can be imposed on the number of tuples a CFD needs to hold on for it to qualify as a valid CFD.
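This group-and-count idea can be sketched as below. The support threshold and return format are assumptions, and this is a simplification for illustration rather than the full discovery algorithm.

import pandas as pd

def valid_patterns(df: pd.DataFrame, lhs: list, rhs: str, min_support: int = 3) -> list:
    """Return the LHS value patterns for which lhs -> rhs holds with enough support.

    A pattern is valid when its tuples map to exactly one RHS value and the
    group size meets the support threshold.
    """
    grouped = df.groupby(lhs)[rhs].agg(['nunique', 'count'])
    valid = grouped[(grouped['nunique'] == 1) & (grouped['count'] >= min_support)]
    return valid.index.tolist()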

Tuple groups that fail the CFD validity test can still be refined by considering additional variable and conditional attributes.

Figure 3: A portion of the attribute search lattice

For example, as shown in Figure 3, assume that no CFD can be materialised on the candidate [A, C] → B of edge (AC, ABC). We can generate a new candidate in the next level of the lattice by considering an additional attribute such as D, giving the new candidate ([A, C, D] → B), shown as the edge (X’, Y’) in Figure 3. If the new candidate still cannot materialise into a valid CFD after adding a variable attribute, we can condition on an additional attribute; for example, we can condition on attribute D on top of an existing conditional attribute such as A. In Chiang & Miller’s algorithm, these new candidates are added to a global candidate list and the corresponding lattice edges are marked. The algorithm only searches the nodes with marked edges to ensure that only minimal rules are returned.

References

[1] P. Bohannon, W. Fan, F. Geerts, X. Jia and A. Kementsietsidis, “Conditional Functional Dependencies for Data Cleaning,” 2007 IEEE 23rd International Conference on Data Engineering, 2007, pp. 746-755, doi: 10.1109/ICDE.2007.367920.

[2] F. Chiang and R. J. Miller, “Discovering data quality rules,” Proc. VLDB Endow., vol. 1, pp. 1166-1177, 2008.

[3] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma, “Improving data quality: Consistency and accuracy,” in VLDB, 2007.

[4] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional functional dependencies for capturing data inconsistencies,” TODS, vol. 33, no. 2, June, 2008.

[5] W. Fan, F. Geerts, J. Li and M. Xiong, “Discovering Conditional Functional Dependencies” in IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 5, pp. 683-698, May 2011, doi: 10.1109/TKDE.2010.154.

[6] M. Li, H. Wang and J. Li, “Mining conditional functional dependency rules on big data,” in Big Data Mining and Analytics, vol. 3, no. 1, pp. 68-84, March 2020, doi: 10.26599/BDMA.2019.9020019.

[7] V.S. Santana, F.S. Lopes, (2019) Method for the Assessment of Semantic Accuracy Using Rules Identified by Conditional Functional Dependencies. In: Garoufallou E., Fallucchi F., William De Luca E. (eds) Metadata and Semantic Research. MTSR 2019. Communications in Computer and Information Science, vol 1057. Springer, Cham. https://doi.org/10.1007/978-3-030-36599-8_25

Scaffolding Azure Machine Learning Experiments

*please download the source code here

Microsoft has released the public preview of its newest data science service, Azure Machine Learning, which contains a collection of components to support end-to-end machine learning solutions. The Azure Machine Learning Workbench and the Azure Machine Learning Experimentation service are the two main components offered to machine learning practitioners to support them in exploratory data analysis, feature engineering, and model selection and tuning.

This blog post describes how to conduct machine learning experiments with the support of the Azure Machine Learning Workbench and the Azure Machine Learning Experimentation service. As the term “experiment” implies, the process of building a machine learning model is not a waterfall process but an iterative one that involves multiple rounds of exploratory analysis, feature engineering, model selection and parameter tuning. To simplify the iterative experiment process and keep the experiment code in a neat structure, we can create some scaffolding code that takes care of the operations repeated in each iteration. Combining the scaffolding code with the job run history dashboard and the version control feature offered by Azure Machine Learning, machine learning practitioners can conduct their experiments in a more organised style. There are many ways and patterns to construct the scaffolding code; this blog post gives one example, and you can design your own scaffolding code based on your use cases.

Set up the Azure Machine Learning Environment

Firstly, we need to set up the Azure Machine Learning environment, including creating an experimentation account in Azure Machine Learning and installing the required development tools on your computer. You can find the detailed guides in Microsoft’s official documentation here.

At the end of the setup, you should have the experimentation account created in your Azure tenant and have Azure Machine Learning Workbench, Visual Studio Code Tools for AI, the CLI tool and Python installed on your computer. In this blog post, I will use the Titanic survival dataset as the example, which aims to predict the survival chance of a passenger based on a set of attributes of that passenger. You can find the dataset here.

Create Scaffolding Code and Make the Baseline (Iteration 0) Run

In this example, the following Python files will be created to support the iterative experiment:

  • EDA & Preprocessing Jupyter notebook for EDA, data preprocessing and feature engineering
  • Experiment file for conducting the model evaluation, parameter tuning and output results to the job run dashboard
  • Individual model files to create the candidate model instance and the parameter options for tuning. In this example, three models are used as candidates, including Logistic Regression, Random Forest, and GBDT.


EDA & Preprocessing.ipynb

In the scaffolding version of the EDA & Preprocessing notebook, we only include the minimum data handling needed to support the baseline run. As you can see from the snapshot below, only one-hot encoding is conducted, and the null values are simply dropped.
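The baseline preprocessing amounts to something like the sketch below. The file paths and the columns dropped are assumptions based on the standard Titanic dataset, not the exact notebook contents.

import pandas as pd

# load the raw Titanic training data (hypothetical path)
train = pd.read_csv('Data/train.csv')

# drop identifier-like columns that carry little signal for the baseline
train = train.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

# one-hot encode the categorical columns and simply drop rows with nulls
train = pd.get_dummies(train, columns=['Sex', 'Embarked'])
train = train.dropna()

# save the processed data for the experiment scripts
train.to_csv('Data/train_processed.csv', index=False)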


In this example, we will experiment with three models: logistic regression, random forest, and GBDT. We create a separate Python file for each model with a single function getModel(). This function returns the model name, the model object, the dictionary of parameter options for randomised search cross-validation, and the number of iterations for the random search.

model_lr.py

from sklearn.linear_model import LogisticRegression
from scipy.stats import randint

def getModel():
    # create logistic regression classifier
    lr = LogisticRegression(random_state = 2)

    # create parameter distribution for parameter tuning
    param_dist = {'penalty': ['l1','l2'], 
                  'C': [0.001,0.01,0.1,1,10,100,1000]}

    # return model dict
    return {'name':"Logistic Regression", 'model':lr, 'param_dist':param_dist, 'n_iter': 10}

model_RF.py

from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

def getModel():
    # create random forest classifer
    rf = RandomForestClassifier(n_estimators=20)

    # create parameter distribution for parameter tuning
    param_dist = {"max_depth": randint(6,9),
                  "max_features": ['auto', 12],
                  'n_estimators': [20, 50, 100, 150, 200],
                  "min_samples_split": randint(2, 10),
                  "min_samples_leaf": randint(2, 8),
                  "bootstrap": [True, False],
                  "criterion": ["gini", "entropy"]}

    # return model dict
    return {'name':"Random Forest", 'model':rf, 'param_dist':param_dist, 'n_iter': 20}

model_GBDT.py

import lightgbm as lgb
from scipy.stats import randint

def getModel():
    # create GBDT model
    gbm = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', is_unbalance=True, random_state=2, n_jobs=5)

    # create parameter distribution for parameter tuning
    param_dist = {
        'learning_rate': [0.005, 0.01, 0.1],
        'n_estimators': randint(50,300),
        'num_leaves': randint(20, 80),
        'feature_fraction':[0.5, 0.6, 0.7, 0.8],
        'bagging_fraction':[0.5, 0.6,0.7,0.8],
        'bagging_freq': randint(10,20)
    }

    # return model dict
    return {'name':"GBDT", 'model':gbm, 'param_dist':param_dist, 'n_iter': 20}

Optional – for each model file, you can also append the following code, which enables you to perform the parameter tuning individually on each model by running the individual Python file directly.

import pandas as pd
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score

if __name__ == '__main__':
    # load preprocessed training dataset
    train = pd.read_csv('Data/train_processed.csv')

    # specify predictors and target columns
    target = "Survived"
    predictors =  [x for x in train.columns if x not in [target]]

    # fit model with random parameter search
    model = getModel()
    random_search = RandomizedSearchCV(model['model'], param_distributions=model['param_dist'], n_iter=model['n_iter'])
    random_search.fit(train[predictors], train[target])

    # Print top 5 scores and related param options
    results = random_search.cv_results_
    for i in range(1, 6):
        scores = np.flatnonzero(results['rank_test_score'] == i)
        for score in scores:
            print("Rank: {0}".format(i))
            print("score - mean: {0:.3f}, std: {1:.3f}".format(
                  results['mean_test_score'][score],
                  results['std_test_score'][score]))
            print("Parameters: {0}".format(results['params'][score]))

Experiment.py

The experiment file loads the data output by the EDA & Preprocessing notebook and fits it to the models loaded from the model_lr, model_RF, and model_GBDT files. RandomizedSearchCV is used to search for the best parameters for each model (from the pre-defined parameter options). The best score for each model is then logged to the job run history dashboard.

import pandas as pd
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score

from azureml.logging import get_azureml_logger
run_logger = get_azureml_logger()

import model_GBDT
import model_lr
import model_RF 

def runExperiment():
    # load preprocessed training dataset
    train = pd.read_csv('Data/train_processed.csv')

    # specify predictors and target columns
    target = "Survived"
    predictors =  [x for x in train.columns if x not in [target]]

    # get models from model files
    models = [model_GBDT.getModel(), model_lr.getModel(), model_RF.getModel()]

    # fit models with random parameter search and log the best score for each model to AML job run dashboard
    for model in models:
        random_search = RandomizedSearchCV(model['model'], param_distributions=model['param_dist'], n_iter=model['n_iter'])
        random_search.fit(train[predictors], train[target])
        results = random_search.cv_results_
        scores = np.flatnonzero(results['rank_test_score'] == 1)
        score = results['mean_test_score'][scores[0]]
        run_logger.log(model['name'], round(score, 3))


if __name__ == '__main__':
    runExperiment()

In the Azure Machine Learning Workbench, we can run the Experiment file. The job run history dashboard will show the results for each experiment iteration.  The snapshot below shows the results after the baseline (iteration 0) run.


Experiment – Iteration 1…n

After the scaffolding code is in place and the baseline evaluation scores are available, we can start our formal experiment iterations to improve model performance. For each iteration, we may conduct various operations on data preprocessing, feature engineering and parameter tuning, and we can then run the Experiment file to generate the results on the job run history dashboard.


All the experiment iteration job runs will be version-controlled by the Azure Machine Learning Experimentation service. You can restore the code from any previous experiment iteration.


Exploratory Data Analysis in Python

I have written a Jupyter notebook describing exploratory data analysis in Python.


Questions to Ask when Starting a Predictive Maintenance Project

One of the major use cases of industrial IoT is predictive maintenance, which continuously monitors the condition and performance of equipment during normal operation and predicts future equipment failures based on previous failure and maintenance history. With accurate equipment failure prediction, organisations can reduce the costs of unplanned breakdowns and unnecessary preventive maintenance. Driven by the temptation of large cost savings, many organisations are interested in deploying their own predictive maintenance solutions.

When starting a predictive maintenance project, a number of questions need to be raised with the business to help make the solution design decisions.

Firstly, we need to know what type of prediction the organisation is aiming for. There are three types of prediction we can normally make for predictive maintenance:

RUL (Remaining Useful Life) – This is a regression-type prediction that estimates the remaining usable time of a piece of equipment before it runs to failure. This type of prediction is suitable for equipment that does not run in a fixed time pattern.

Failure within the next period – This is a two-class classification prediction that estimates whether or not the equipment will fail within the next period (e.g., next week). This type of prediction can alert engineers to a potential failure so that they can arrange maintenance in time to avoid it.

Failure within which next period – This is a multi-class classification prediction. Instead of predicting whether the equipment will fail within the next period, it estimates within which of the next periods (e.g., next week, next two weeks, or next month) the equipment will fail.

Secondly, we need to ask what time window (e.g., hour, day, or week) to use for the prediction. The reason for asking this question is to help us decide the granularity of the training dataset. Depending on the type of equipment and the way it is used, some equipment failures may be predictable weeks before they happen, while others can only be predicted hours before. Therefore, we need to choose a suitable level of granularity for the time windows and aggregate the raw per-sensor readings accordingly.
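For example, the raw readings could be rolled up into daily windows along the lines of the sketch below; the file name, equipment identifier and sensor columns are assumptions for illustration.

import pandas as pd

# raw sensor readings with a timestamp, an equipment identifier and sensor values (assumed columns)
readings = pd.read_csv('sensor_readings.csv', parse_dates=['timestamp'])

# aggregate to the chosen time window, e.g. one row per equipment per day
daily = (readings
         .set_index('timestamp')
         .groupby('equipment_id')
         .resample('1D')
         .agg({'temperature': ['mean', 'max'], 'vibration': ['mean', 'max']}))

# flatten the column index and restore equipment_id / window as columns
daily.columns = ['_'.join(col) for col in daily.columns]
daily = daily.reset_index()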

Based on the answers to the first two questions we can work out a list of pre-requisites for the predictive maintenance solution. Some history data has to be available before we can train the predictive model, for example:

  • History data of equipment states (e.g., the measurement values of the components and unusual events such as liquid leaks)
  • Equipment reference data (e.g., the normal value range of a component state, such as the min and max temperature levels in normal conditions). We need the reference data to extract the exceptional states of the equipment that may contribute to the predictive model
  • Equipment failure history. This is necessary data for predictive maintenance modelling; otherwise, we cannot establish the relationship between the equipment states and the failure events.
  • Equipment maintenance history. We need to know how long it has been since the machine was last maintained, which can be an important predictor of a potential failure. In addition, the frequency of equipment maintenance can be a candidate indicator of the health status of the equipment.

Missing necessary history data can be a game-killer. If that happens, we need to go back to square one and start to systematically plan the data collection.

 

Evaluate Feature Importance using Tree-based Model

Tree-based models can be used to evaluate the importance of features. In this blog post, I go through the steps of evaluating feature importance using the GBDT model in LightGBM. LightGBM is a gradient boosting framework released by Microsoft offering high accuracy and speed (some tests show that LightGBM can produce predictions as accurate as XGBoost’s while running up to 25x faster).

Firstly, we import the required packages: pandas for the data preprocessing, LightGBM for the GBDT model, and matplotlib for building the feature importance bar chart.

import pandas as pd
import matplotlib.pyplot as plt
import lightgbm as lgb

Then, we need to load and preprocess the training data. In this example, we use a predictive maintenance dataset.

# read data
train = pd.read_csv(r'E:\Data\predicitivemaintance_processed.csv')

# drop the columns that are not used for the model
train = train.drop(['Date', 'FailureDate'],axis=1)

# set the target column
target = 'FailNextWeek'

# One-hot encoding
feature_categorical = ['Model']
train = pd.get_dummies(train, columns=feature_categorical)

Next, we train the GBDT model with the training data.

lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'num_leaves': 30,
    'num_round': 360,
    'max_depth':8,
    'learning_rate': 0.01,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.8,
    'bagging_freq': 12
}
lgb_train = lgb.Dataset(train.drop(target, axis=1), train[target])
model = lgb.train(lgb_params, lgb_train)

After the model is trained, we can call LightGBM’s plot_importance function on the trained model to get the importance of the features.

# plot the top 30 features by importance; pass figsize and title to plot_importance so they take effect
ax = lgb.plot_importance(model, max_num_features=30, figsize=(12, 6), title="Feature importances")
plt.show()
