
Power Query – Parameterised Files Loading from a Year-Month-Day folder Hierarchy

In one of my previous blog posts, I described an approach to load text files from Azure Data Lake Store into Power BI, filtered by the date period that users specify through the period start and period end parameters. That approach can load text files located in the same folder. However, there is another common folder structure pattern often used to organise the files in Azure Data Lake Store, i.e. the Year-Month-Day folder hierarchy, such as:

    {root folder}/2008/06/09/
    {root folder}/2008/06/10/
    …

This raises a challenge when users want to load files between two dates into Power BI. Firstly, the Power Query query we create needs to be able to load files from different folders. In addition, the query should conditionally load the files based on the date period specified by users, through filtering the Year-Month-Day folder hierarchy. This blog post describes how to create Power Query queries for this type of requirement with the following steps.

  • Step 1 – Define the StartDate parameter and EndDate parameter for users to specify the date period
  • Step 2 – Define a custom function to load the metadata of all the files stored in a folder. The custom function takes a folder path as its argument and returns a table value that stores the file metadata
  • Step 3 – Generate the sequence of dates between the specified StartDate and EndDate
  • Step 4 – Derive the folder path from the generated date
  • Step 5 – Invoke the custom function created in step 2 and expand the returned table field to get the list of files we aim to extract
  • Step 6 – Combine the files and convert to table format

Step 1 – Define parameters

Firstly, we define two Date-type parameters, StartDate and EndDate.


Step 2 – Define custom function

We then define a custom function that takes the path of a folder as its argument and returns a table value with the metadata of all files stored in that folder.

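A minimal sketch of such a function, assuming the function name GetFilesInFolder and using the DataLake.Contents connector:

```
// GetFilesInFolder takes an ADLS folder path and returns a table of file metadata
let
    GetFilesInFolder = (FolderPath as text) as table =>
        let
            // DataLake.Contents returns one row per file, including Name and Content columns
            Files = DataLake.Contents(FolderPath)
        in
            Files
in
    GetFilesInFolder
```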

Step 3 – Generate date list

We then need to create the query that extracts the files from the Year-Month-Day folder hierarchy, filtered by the StartDate and EndDate parameters. Firstly, we generate the sequence of dates between the StartDate and the EndDate using the built-in List.Dates function:

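With illustrative step names, the date sequence can be generated like this:

```
// One entry per day from StartDate to EndDate, inclusive of both ends
DayCount = Duration.Days(EndDate - StartDate) + 1,
DateList = List.Dates(StartDate, DayCount, #duration(1, 0, 0, 0))
```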

Step 4 – Derive the folder path


We then convert the date list into a table and add a “DatePathPart” column that generates the Year/Month/Day part of the folder path, such as “2008/06/09”. We then add a “FolderPath” column that builds the full folder path by concatenating the root folder path and the Year/Month/Day part.
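A sketch of these two steps, assuming DateList comes from the previous step and RootFolderPath holds the root folder path:

```
DateTable = Table.FromList(DateList, Splitter.SplitByNothing(), {"Date"}),
// e.g. #date(2008, 6, 9) -> "2008/06/09"
AddDatePathPart = Table.AddColumn(DateTable, "DatePathPart", each Date.ToText([Date], "yyyy/MM/dd")),
AddFolderPath = Table.AddColumn(AddDatePathPart, "FolderPath", each RootFolderPath & [DatePathPart] & "/")
```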

After this step, we should have a table like this (values shown for illustration):

Date         DatePathPart   FolderPath
09/06/2008   2008/06/09     {root folder}/2008/06/09/
10/06/2008   2008/06/10     {root folder}/2008/06/10/

Step 5 – Invoke the custom function and expand returned table


We can then invoke the custom function defined in step 2, which adds a table-type column. Once we expand that column, we will have the list of all the files we want to query, based on the specified StartDate and EndDate parameters.

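Assuming the previous step produced a table AddFolderPath with a FolderPath column, and the custom function from step 2 is named GetFilesInFolder, these two steps could be sketched as:

```
// Add a table-type column holding the file metadata for each folder, then expand it
AddFiles = Table.AddColumn(AddFolderPath, "Files", each GetFilesInFolder([FolderPath])),
ExpandedFiles = Table.ExpandTableColumn(AddFiles, "Files", {"Name", "Content"})
```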

This is the full piece of code for this query:
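Putting steps 3 – 5 together, the query might look like the following sketch (the step names, RootFolderPath and GetFilesInFolder are illustrative):

```
let
    DayCount = Duration.Days(EndDate - StartDate) + 1,
    DateList = List.Dates(StartDate, DayCount, #duration(1, 0, 0, 0)),
    DateTable = Table.FromList(DateList, Splitter.SplitByNothing(), {"Date"}),
    AddDatePathPart = Table.AddColumn(DateTable, "DatePathPart", each Date.ToText([Date], "yyyy/MM/dd")),
    AddFolderPath = Table.AddColumn(AddDatePathPart, "FolderPath", each RootFolderPath & [DatePathPart] & "/"),
    AddFiles = Table.AddColumn(AddFolderPath, "Files", each GetFilesInFolder([FolderPath])),
    ExpandedFiles = Table.ExpandTableColumn(AddFiles, "Files", {"Name", "Content"})
in
    ExpandedFiles
```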

Step 6 – Combine the files

The last step is to combine the list of files we queried at steps 1 – 5, extract the data records from the binary csv format, and load them into a single table for downstream processing.
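Assuming the table from step 5 is named ExpandedFiles and has a Content column holding the binary csv data, the combine step can be sketched as:

```
// Parse each binary csv file into a table, then combine all tables into one
AddData = Table.AddColumn(ExpandedFiles, "Data", each Csv.Document([Content])),
CombinedTable = Table.Combine(AddData[Data])
```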



Power Query – Extract Multiple Tags Stored in a Single Text Field

Problem

It is not rare to see multiple attributes stored in a single text field, especially for tagging-enabled applications where an unfixed number of tags may be associated with an article or post. Those tags are often stored in a single text field with a delimiter to separate them. When reporting, we often need to categorise the articles or posts by their tags, e.g., counting the articles or posts under each tag.


To fulfil this reporting requirement, we need to reshape our dataset from something like:

PostID   Title    Tags
1        Post A   PowerQuery;M;PowerBI

to something like:

PostID   Title    Tag
1        Post A   PowerQuery
1        Post A   M
1        Post A   PowerBI

Solution

It is actually very easy to conduct this kind of transformation using Power Query with only three lines of M code.

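Based on the line-by-line description below, the three lines might be reconstructed as follows (the delimiter “;” and the step names are illustrative):

```
SplitTags = Table.SplitColumn(Source, "Tags", Splitter.SplitTextByDelimiter(";")),
UnpivotedTags = Table.UnpivotOtherColumns(SplitTags, {"PostID", "Title"}, "Attribute", "Tag"),
RemovedAttribute = Table.RemoveColumns(UnpivotedTags, {"Attribute"})
```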

line 1 – split the tag field into an unfixed number of tag columns using the Splitter.SplitTextByDelimiter function.

line 2 – use the Table.UnpivotOtherColumns function to unpivot all the tag columns. As we don’t have a fixed number of tag columns, we need to use UnpivotOtherColumns and specify the known columns (“PostID”, “Title” in this example) as arguments.

line 3 – remove the column that stores the generated tag column names, which is not used in reporting.

Supported Types in M Language

The M language supports a variety of primitive and complex value types, including the common value types (e.g., Number, Text and DateTime) found in most general-purpose programming languages, and the value types specifically designed for dataset handling (e.g., Table, Record and Binary).

Primitive Values

A primitive value is a single, atomic value of a primitive type. M supports the following primitive value types:

null, logical, number, text, date, time, datetime, datetimezone and duration

Table

A table value is arguably the most important object in a Power Query query. Many dataset transformations operate on table values. A table value is a 2-dimensional structure consisting of an ordered sequence of rows (Records) in the vertical direction and a list of columns (Lists) in the horizontal direction.


A table value is normally created by a data extraction operation that reads data from a data source such as a relational database or a text file. Alternatively, a table value can be created manually using #table.

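For example, a small table can be built manually like this (the values are illustrative):

```
let
    // #table takes a list of column names and a list of row-value lists
    MyTable = #table({"Name", "Age"}, {{"Mike", 39}, {"Anna", 25}})
in
    MyTable
```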

Once a table value is defined, a collection of Table functions can be called to operate on it. For example, we can filter the table by rows using the SelectRows function or by columns using the SelectColumns function. The full list of Table functions can be found here.

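A sketch, assuming a table MyTable with Name and Age columns:

```
// Keep only the rows where Age is at least 18
Adults = Table.SelectRows(MyTable, each [Age] >= 18),
// Keep only the Name column
NamesOnly = Table.SelectColumns(MyTable, {"Name"})
```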

Record

From the table value, we can access a specific row in the table.

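Assuming a table MyTable, a row can be accessed by its zero-based positional index:

```
// Positional index access returns the row as a record value
FirstRow = MyTable{0}
```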

The returned row is of Record type. A record is an ordered sequence of fields, where a field is a key-value pair with the column name as the key and the value of that column in the row as the value, such as [Name = "Mike", Age = 39]. A field value can be of any type, not only the primitive types but also the structured types.

List

From the table value, we can also access a specific column in the table.


If we check the type of the returned column, we find that a column within a table is a List value.


List values can be very useful at the intermediate steps of dataset transformations. For example, we can calculate the average value of a column and use it as benchmark for other calculations.
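A sketch, assuming a table MyTable with an Age column:

```
AgeList = MyTable[Age],               // the column as a list value
AverageAge = List.Average(AgeList),   // benchmark value
// Use the benchmark in a further calculation
AboveAverage = Table.SelectRows(MyTable, each [Age] > AverageAge)
```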

Apart from accessing list values from a table value, we can also create lists from scratch. One common use case is to generate a consecutive list of dates for building the Date dimension table.

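For example, a year of consecutive dates can be generated with List.Dates (the start date and count are illustrative):

```
// 365 dates starting from 1 January 2018, stepping one day at a time
DateList = List.Dates(#date(2018, 1, 1), 365, #duration(1, 0, 0, 0))
```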

Binary

A binary value represents a sequence of bytes and is often used to hold file-format data, such as csv files.


M language supports a collection of functions for transforming the binary value data to table value data, such as Csv.Document.
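A sketch, assuming an illustrative local file path:

```
// File.Contents returns a binary value; Csv.Document parses it into a table
CsvTable = Csv.Document(File.Contents("C:\Data\Sales.csv"))
```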

Function

M is a functional programming language that constructs a program through the evaluation of functions. A function value is a mapping between a set of input arguments and an output value, and can be defined as:

(arg1, arg2) =>
     let
         result = do something
     in
         result

Functions can be very useful when working with complex transformation where some logic needs to be called multiple times.
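For example (the names are illustrative):

```
let
    // Define a function value and invoke it
    AddNumbers = (x as number, y as number) as number => x + y,
    Result = AddNumbers(1, 2)   // evaluates to 3
in
    Result
```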

Power Query – Parameterised Files Loading from Azure Data Lake Store within a Given Date Range

Power BI now supports data load from Azure Data Lake Store. We can connect to a folder in the Azure Data Lake Store and load all files from that folder.


However, we often don’t want to, or aren’t able to, load all the files in the Azure Data Lake Store folder into Power BI due to the volume of the data. Instead, we want to be able to specify a date range and only load the files that fall into that date range.

The Azure Data Lake Store connector in Power BI doesn’t provide direct support for conditional data loading based on given criteria. However, with some help from the M language, we can easily implement this feature by customising the query scripts.

Firstly, we create two Power BI parameters, StartDate (for the start date of the given date range) and EndDate (for the end date of the given date range).


When we connect Power BI to the Azure Data Lake Store folder, DataLake.Contents(“{ADLS folder path}”) returns the metadata of the list of files stored in that folder. We can then use the Table.SelectRows function to go through each file, extract the date string from the file name and convert it to the date type, and then check whether the date falls into the given date range by comparing it to the StartDate and EndDate parameters.

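A sketch of the filtering query, assuming an illustrative ADLS path and file names like “sales_2018-06-09.csv” (the date-extraction logic depends on your actual naming convention):

```
let
    Source = DataLake.Contents("adl://myadls.azuredatalakestore.net/data"),
    FilteredFiles = Table.SelectRows(Source, each
        let
            // "sales_2018-06-09.csv" -> "2018-06-09"
            DateText = Text.Middle([Name], 6, 10),
            FileDate = Date.FromText(DateText)
        in
            FileDate >= StartDate and FileDate <= EndDate)
in
    FilteredFiles
```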

A prerequisite for this solution to work is that the date info needs to be included in the file name. It is common practice (and also good practice for performance reasons) to partition the files stored in Azure Data Lake Store by date.

Now we have loaded only the files we need into Power BI, and we can combine those files into a single table for the downstream visualisation in Power BI. We can open the “Combine Files” dialog to combine the files.



After the files are combined and loaded into the Power BI dataset, we can build Power BI visualisations using the data in the dataset. In future, when we need to load files within other date ranges, we don’t have to edit the query again; instead, we just need to set the StartDate and EndDate parameters and the dataset will be automatically refreshed with data in the newly given date range.


Inspecting SharePoint Metadata using Power Query


Power Query is a self-service ETL tool from Microsoft which is able to extract data from a broad range of data sources, transform the data, and finally load the data into PowerPivot. However, I have recently found that Power Query can also be a very handy SharePoint metadata viewer for inspecting the schema of SharePoint sites, lists, content types… All we need is to use the Power Query OData connection to the SharePoint REST API and use the relation drill-down feature to navigate through the SharePoint objects.

For example, we can connect to a SharePoint site (“https://linxiaodev.sharepoint.com/sites/POC/_api/web”) using the OData connection, and the Power Query will load all attributes of the site and also the links to drill down to the related Record or Table.
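In M, such a connection is a single OData.Feed call:

```
// Connect to the SharePoint REST API endpoint of the site via OData
Source = OData.Feed("https://linxiaodev.sharepoint.com/sites/POC/_api/web")
```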


For example, we can drill down to the Lists table, which loads the metadata of all the lists in the site. One tip here is that you can select which columns of attributes to show instead of showing a long list of all columns.



Then, you can further drill down to the Table or Record related to a list, e.g., the list of related Fields of the ‘Documents’ library list. In one of my previous blog posts, I needed to inspect the Hidden attribute of the LikedBy field, and the tool I was using was SharePoint Manager; now Power Query can be another option.
