The raw telemetry data collected from IoT sensor is normally event-based, e.g., a “Device On” message when the device starts to run, and a “Device Off” message when the device stops.
One common data preprocessing task is to transform the raw “On/Off” telemetry data into device cycle records with the start time, end time and the duration of a device’s running cycle.
This post introduces how to implement this kind of data preprocessing task using Azure Data Lake Analytics (ADLA) job.
Firstly, we need to create an U-SQL file in Visual Studio. U-SQL is the SQL-like language for creating ADLA jobs, which takes advantage of both C# and SQL language features.
In the U-SQL file, we can use Extract expression to load the raw telemetry data file.
U-SQL provides the LAG function that accesses to a row at a given offset that comes before the current row in a given order. For the “Device Off” event message, we can use the LAG function, LAG(EventDateTime, 1) OVER(ORDER BY DeviceId, EventDateTime), to location the corresponding “Device On” event message and get the EventDataTime (the time when the event happens, i.e. cycle start time). The cycle end time is the EventDataTime of the “Device Off” message, and cycle duration is the time difference between the cycle start time and the cycle end time.
After the data is processed, we can use U-SQL Output expression to output the results.