End-to-End Azure Data Factory Pipeline for Star Schema ETL (Part 1)

This blog series demonstrates how to build an end-to-end ADF pipeline that extracts data from Azure SQL DB and Azure Data Lake Store and loads it into a star-schema data warehouse database, taking into account SCD (slowly changing dimensions) and incremental loading.

The final pipeline will look like this:

[Screenshot: the final ADF pipeline]

The machine cycle records will be loaded from the csv files stored in an Azure Data Lake Store, and the reference data for customers and machines will be loaded from the Reference DB (Azure SQL DB). The ADF pipeline will first load the data into staging tables in the target DW, and it will then execute SQL stored procedures to transform the data into the star schema and implement the type 2 SCD.
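To make the second step concrete, here is a minimal sketch of the kind of type 2 SCD logic one of those stored procedures could apply for the Customer dimension, based on the staging and dimension tables created later in this post. The procedure name is only an example; the actual procedures will be covered in the next part of the series.

CREATE PROCEDURE [stg].[uspLoadDimCustomer]  -- example name only
AS
BEGIN
    -- Close the current dimension rows whose attributes have changed in the staging data
    UPDATE d
    SET    d.[CurrentRow] = 0,
           d.[ValidTo] = s.[LastModified]
    FROM   [prod].[DimCustomer] AS d
    INNER JOIN [stg].[Customer] AS s ON s.[Code] = d.[Code]
    WHERE  d.[CurrentRow] = 1
       AND d.[Name] <> s.[Name];

    -- Insert a new current row for new customers and for the changed ones closed above
    INSERT INTO [prod].[DimCustomer] ([Code], [Name], [CurrentRow], [ValidTo], [LastModified])
    SELECT s.[Code], s.[Name], 1, NULL, s.[LastModified]
    FROM   [stg].[Customer] AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM [prod].[DimCustomer] AS d
                       WHERE d.[Code] = s.[Code] AND d.[CurrentRow] = 1);
END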

A number of services within the Azure Cortana Intelligence Suite will be employed to build the end-to-end pipeline, including Azure Data Factory, Azure Data Lake Store, Azure Data Lake Analytics (U-SQL), Azure SQL Database, and the development tools (Visual Studio + SSDT). Four Visual Studio projects will be created: the ReferenceDB project for provisioning the Azure SQL reference database, the DW project for provisioning the target data warehouse, the ADLA project for incrementally extracting the machine cycle data, and the ADF project for building the pipeline. You can download the source code here.

This blog post covers the first part of the series, giving a brief introduction to the example used in this demonstration and walking through the steps for preparing the required demo databases, data lake directory and sample data.

Preparation

1. Create Visual Studio solution

Create a Visual Studio solution with the SSDT tool to host all the projects required in this demonstration, including the DB projects for ReferenceDB and the target DW, and the ADF project.

2. Create ReferenceDB project


Create the ReferenceDB project and add the scripts for creating the Customer and Machine tables.

Customer table:

CREATE TABLE [dbo].[Customer]
(
    [Id] INT NOT NULL PRIMARY KEY IDENTITY, 
    [Code] VARCHAR(10) NOT NULL,
    [Name] VARCHAR(50) NOT NULL, 
    [LastModified] DATETIME2 NOT NULL
)

Machine table:

CREATE TABLE [dbo].[Machine]
(
    [Id] INT NOT NULL PRIMARY KEY IDENTITY, 
    [Code] VARCHAR(10) NOT NULL,
    [Name] VARCHAR(50) NOT NULL, 
    [Condition] VARCHAR(10) NOT NULL, 
    [LastModified] DATETIME2 NOT NULL
)

Create a post deployment script to add the sample data:

INSERT INTO [dbo].[Customer]
           ([Code]
           ,[Name]
           ,[LastModified])
     VALUES
           ('C0001','Customer1', '2017-11-08 21:21:30'),
           ('C0002','Customer2', '2017-11-08 21:21:30')

INSERT INTO [dbo].[Machine]
           ([Code]
           ,[Name]
           ,[Condition]
           ,[LastModified])
     VALUES
           ('M0001','Machine1', 'Perfect', '2017-11-08 21:21:30'),
           ('M0002','Machine2', 'Good', '2017-11-08 21:21:30')

Publish the project to the Azure tenant to provision the Azure SQL DB.

3. Create the target DW project


Create the target DW project and add the scripts for schema, staging tables, dimension tables and fact tables:

Schema – prod:

CREATE SCHEMA [prod]

Schema – stg:

CREATE SCHEMA [stg]

Staging – Customer:

CREATE TABLE [stg].[Customer]
(
    [Code] VARCHAR(10) NOT NULL,
    [Name] VARCHAR(50) NOT NULL, 
    [LastModified] DATETIME2 NOT NULL
)

Staging – Machine:

CREATE TABLE [stg].[Machine]
(
    [Code] VARCHAR(10) NOT NULL,
    [Name] VARCHAR(50) NOT NULL, 
    [Condition] VARCHAR(10) NOT NULL, 
    [LastModified] DATETIME2 NOT NULL
)

Staging – MachineCycle:

CREATE TABLE [stg].[MachineCycle]
(
    [CycleId] INT NOT NULL,
    [MachineName] VARCHAR(50) NOT NULL,
    [CustomerName] VARCHAR(50) NOT NULL,
    [StartDateTime] DATETIME2 NOT NULL,
    [EndDateTime] DATETIME2 NOT NULL,
    [LastModified] DATETIME2 NOT NULL
)

Dimension – DimCustomer

CREATE TABLE [prod].[DimCustomer]
(
    [Id] INT NOT NULL PRIMARY KEY IDENTITY,
    [Code] VARCHAR(10) NOT NULL, 
    [Name] VARCHAR(50) NOT NULL, 
    [CurrentRow] BIT NOT NULL,
    [ValidTo] DATETIME2 NULL,
    [LastModified] DATETIME2 NOT NULL
)

Dimension – DimMachine

CREATE TABLE [prod].[DimMachine]
(
    [Id] INT NOT NULL PRIMARY KEY IDENTITY,
    [Code] VARCHAR(10) NOT NULL,
    [Name] VARCHAR(50) NOT NULL, 
    [Condition] VARCHAR(10) NOT NULL, 
    [CurrentRow] BIT NOT NULL,
    [ValidTo] DATETIME2 NULL,
    [LastModified] DATETIME2 NOT NULL
)

Facts – FactMachineCycle

CREATE TABLE [prod].[FactMachineCycle]
(
    [CycleId] INT NOT NULL PRIMARY KEY,
    [MachineId] INT NOT NULL,
    [CustomerId] INT NOT NULL,
    [DateKey] INT NOT NULL,
    [Duration] FLOAT NOT NULL,
    [RowAdded] DATETIME2 NOT NULL
)

In addition to the tables above we also need a Date dimension table and a script to generate it. There are plenty of scripts available online, e.g., https://www.codeproject.com/Articles/647950/Create-and-Populate-Date-Dimension-for-Data-Wareho. The generated DimDate table should contain one row per calendar date:

[Screenshot: sample rows from the generated DimDate table]
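If you prefer not to use the linked script, a minimal sketch for populating a simple date dimension is shown below. It assumes a [prod].[DimDate] table with a DateKey column in yyyyMMdd format (matching the DateKey column of FactMachineCycle) plus a few calendar columns; adjust the column list and date range to your needs.

DECLARE @d DATE = '2015-01-01';
WHILE @d <= '2020-12-31'
BEGIN
    -- DateKey is the date rendered as an integer, e.g. 20171108
    INSERT INTO [prod].[DimDate] ([DateKey], [Date], [Year], [Month], [Day])
    VALUES (CONVERT(INT, FORMAT(@d, 'yyyyMMdd')), @d, YEAR(@d), MONTH(@d), DAY(@d));
    SET @d = DATEADD(DAY, 1, @d);
END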

Finally, we publish the project to Azure to get the database deployed.


4. Create Azure Data Lake directory

Create a directory in Azure Data Lake, e.g., “ADFDW”, and organise the sub-directory hierarchy as year -> month -> day.


The sample csv data includes the columns of cycleID, CustomerCode, MachineCode, StartDateTime, EndDateTime, and EventTime.
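For illustration, a couple of made-up rows in that format might look like this (the values are examples only; the CustomerCode and MachineCode values correspond to the sample reference data created earlier):

cycleID,CustomerCode,MachineCode,StartDateTime,EndDateTime,EventTime
1,C0001,M0001,2017-11-08 08:00:00,2017-11-08 08:35:00,2017-11-08 08:35:02
2,C0002,M0002,2017-11-08 09:10:00,2017-11-08 09:55:00,2017-11-08 09:55:01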


Now that the preparation work is done, the next blog post will cover the steps to build the ADF pipelines for the dimension table ETL.


Workaround of the Power BI Guest User License Assignment Issue

I have recently been working with an insurance client to help them design a solution for distributing a Power BI app to external guest users with Azure AD B2B. I designed the solution based on the Microsoft whitepaper on external distribution of Power BI content: Distribute Power BI content to external guest users using Azure Active Directory B2B. However, I ran into an issue when assigning Power BI Pro licenses to external guest users.

To allow external access to the Power BI app, guest users must have a Power BI Pro license. Power BI supports three approaches to licensing external users: using Power BI Premium, assigning Power BI Pro licenses to the guest users, or using the guest users’ existing Power BI Pro licenses if they already have one. This client decided to assign Power BI Pro licenses to those guest users who do not already have one.

Based on the whitepaper, we should be able to assign the Power BI Pro license to a guest user through the Office 365 admin portal.


However, on the Office 365 admin portal, the product license management option is only available for internal users, not for Azure AD B2B guest users. After some research online, there appears to be a known issue with guest user license assignment through the Office 365 admin portal. For now, the only option seems to be to assign the guest user license through the API.

Here is the script for assigning the Power BI Pro license:

Connect-MsolService
Set-MsolUser -UserPrincipalName "{User principal name}" -UsageLocation GB
Set-MsolUserLicense -UserPrincipalName "{User principal name}" -AddLicenses "{Tenant Name}:POWER_BI_PRO"

The {User principal name} is the guest account name of the invited external user. You can find it in the Username field of the guest user on the Office 365 admin portal.

The -UsageLocation parameter of the Set-MsolUser cmdlet specifies the user’s country as a two-letter ISO code. You can look up the country codes here: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2.

The -AddLicenses parameter of the Set-MsolUserLicense cmdlet specifies the license to assign to the user, which consists of two parts: {Tenant Name} (the name of the tenant) and POWER_BI_PRO (the license code for Power BI Pro).
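If you are unsure of the exact license name in your tenant, you can list the available SKUs first; a quick check (assuming you are already connected with Connect-MsolService) is:

# The AccountSkuId column shows values in the form {Tenant Name}:POWER_BI_PRO
Get-MsolAccountSku | Select-Object AccountSkuId, ActiveUnits, ConsumedUnits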

SSIS in Azure #3 – Schedule and Monitor SSIS Package Execution using ADF V2

*The source code created for this blog post can be found here.

In the previous blog posts in the SSIS in Azure series, we created an SSIS package that periodically ingests data from an Azure SQL database into Azure Data Lake Store and deployed the package to the Azure-SSIS Integration Runtime. Up to this point, we have achieved two goals in the SSIS in Azure series:

  • Using SSIS to move data between cloud storage services
  • Hosting and running SSIS packages in the cloud

The last goal we need to achieve is to schedule the execution of the SSIS package in the cloud. Traditionally, we schedule SSIS package execution by creating a SQL Server Agent job. However, that requires either an on-premises SQL Server instance or a SQL Server VM provisioned in Azure. Thanks to the support for the Stored Procedure activity in ADF, we can now schedule the SSIS package execution using a cloud-based ADF pipeline. This blog post walks through the steps to achieve that.

You can find the source code created for this blog post on my GitHub. The ningago.demo.adf.ssis.adfv2 project contains the JSON and PowerShell files used to create the ADF pipeline.


 

The key to this solution is to call the sp_executesql stored procedure from the RunSSISPackagePipeline pipeline to run the SQL script that triggers the SSIS package execution. We can define a Stored Procedure activity in the pipeline that calls sp_executesql and passes in the SQL script as a parameter.
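As a rough sketch, the activity definition inside RunSSISPackagePipeline.json could look like the following (the activity name is an example only; the linked service is the SSISDBLinkedService deployed later in this post, and sp_executesql receives the generated script through its stmt parameter):

{
    "name": "ExecuteSSISPackage",
    "type": "SqlServerStoredProcedure",
    "linkedServiceName": {
        "referenceName": "SSISDBLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "storedProcedureName": "sp_executesql",
        "storedProcedureParameters": {
            "stmt": {
                "value": "<the generated SSIS execution script goes here>"
            }
        }
    }
}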


We don’t have to author the SQL script from scratch; instead, we can generate it using SSMS. First, we connect to the SSISDB catalog in our Azure-SSIS integration runtime (please refer to SSIS in Azure #2 for how to do that), select the SSIS project or package we want to execute, and click the Execute button to open the Execute Package dialog window.


On the Execute Package dialog window, we select the package to execute and set a value (any value will do) for the package parameter; setting a value ensures that the parameter-setting statement is generated in the SQL script.


We then click the “Script” button at the top-left, which generates the SQL script for executing the selected SSIS package.
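The generated script will look roughly like the one below (the folder, project and package names are placeholders and the exact script will differ): it creates an execution in the SSISDB catalog, sets the package parameter and starts the execution.

DECLARE @execution_id BIGINT
EXEC [SSISDB].[catalog].[create_execution]
     @package_name = N'<YourPackage>.dtsx',
     @execution_id = @execution_id OUTPUT,
     @folder_name = N'<YourFolder>',
     @project_name = N'<YourProject>',
     @use32bitruntime = False,
     @reference_id = NULL

-- Set the package parameter (object_type 30 = package parameter)
DECLARE @var0 SQL_VARIANT = N'2018-01-01'
EXEC [SSISDB].[catalog].[set_execution_parameter_value]
     @execution_id,
     @object_type = 30,
     @parameter_name = N'DateScliceToLoad',
     @parameter_value = @var0

EXEC [SSISDB].[catalog].[start_execution] @execution_id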


We can then copy the script into the stored procedure activity defined in the RunSSISPackagePipeline pipeline.


As you may have noticed, the SSIS package parameter, DateScliceToLoad, has been manually set to an arbitrary value. The next step is to pass the date of the pipeline run into the SQL script so that the SSIS package only moves the data for the given day. First, we need to define the “scheduledRunTime” parameter in the RunSSISPackagePipeline pipeline, which will receive the scheduled pipeline run time from the pipeline trigger we will create later.


In the SQL script we just generated to trigger the SSIS package execution, we replace the hard-coded date with the ADF expression that references the scheduledRunTime parameter of the pipeline.
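A sketch of the two pieces involved (the exact layout of your JSON file may differ): the parameter definition in RunSSISPackagePipeline.json,

"parameters": {
    "scheduledRunTime": {
        "type": "String"
    }
}

and the line in the script where the hard-coded date is replaced with an ADF string-interpolation expression:

DECLARE @var0 SQL_VARIANT = N'@{pipeline().parameters.scheduledRunTime}'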


Next, we need to create the trigger JSON file to schedule the execution of the pipeline. After specifying the pipeline to execute in the trigger file, we pass the trigger’s scheduledTime, which is a system variable, to the scheduledRunTime parameter defined in the pipeline.

{
  "properties": {
    "name": "RunSSISPackageScheduler",
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Hour",
        "interval": 1,
        "startTime": "2018-01-07T00:00:00-22:00"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "type": "PipelineReference",
          "referenceName": "RunSSISPackagePipeline"
        },
        "parameters": {
          "scheduledRunTime": "@trigger().scheduledTime"
        }
      }
    ]
  }
}

After we have created the JSON files defining the linked service, the pipeline and the trigger, we need to deploy them to our data factory instance. As the V2 version of ADF does not yet support deployment through the UI, we create a deployment script using PowerShell.

$DataFactoryName = "ninjago3843adf"
$ResourceGroupName = "SSIS"

Set-AzureRmDataFactoryV2LinkedService -DataFactoryName $DataFactoryName -ResourceGroupName $ResourceGroupName -Name "SSISDBLinkedService" -File ".\SSISDBLinkedService.json"
Set-AzureRmDataFactoryV2Pipeline -DataFactoryName $DataFactoryName -ResourceGroupName $ResourceGroupName -Name "RunSSISPackagePipeline" -DefinitionFile ".\RunSSISPackagePipeline.json"

Stop-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "RunSSISPackageScheduler"
Set-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "RunSSISPackageScheduler" -DefinitionFile ".\RunSSISPackageScheduler.json"
Start-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "RunSSISPackageScheduler"

After the pipeline is deployed and the trigger is started, the pipeline will execute our SSIS package on the defined schedule.


 

SSIS in Azure #2 – Deploy SSIS Packages to Azure-SSIS Integration Runtime in ADF V2

In the first blog post of the SSIS in Azure series, I demonstrated how to create SSIS packages to move data in the cloud, using a common use case that periodically ingests data from an Azure SQL database into Azure Data Lake Store. In the pre-ADF V2 era, we could only deploy SSIS packages to on-premises SQL Servers or to SQL Servers running in Azure VMs. Thanks to the Azure-SSIS integration runtime, available in public preview in Azure Data Factory V2, we can now deploy and execute our SSIS packages directly in Azure without having to provision and manage a SQL Server VM (one thing to note: under the hood, the Azure-SSIS integration runtime is itself a cluster of Azure VMs dedicated to running SSIS packages).

This blog post walks through the steps to deploy, execute and monitor the SSIS package we created in the previous blog post in the Azure-SSIS integration runtime.

First of all, we need to have an Azure-SSIS integration runtime created in our Azure tenant. This can be achieved using the PowerShell scripts provided by Microsoft.
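For reference, a condensed sketch of what that script does is shown below; the runtime name, location, node size and catalog settings are examples only and should be taken from Microsoft’s script for the real run.

$ResourceGroupName = "SSIS"
$DataFactoryName = "ninjago3843adf"
$AzureSSISName = "MyAzureSSISIR"

# Create the ADF V2 instance that will host the Azure-SSIS integration runtime
Set-AzureRmDataFactoryV2 -ResourceGroupName $ResourceGroupName -Name $DataFactoryName -Location "EastUS"

# Create the Azure-SSIS integration runtime; the catalog credential is the admin login of the
# Azure SQL server that will host the SSISDB catalog
$catalogCred = Get-Credential
Set-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName -Name $AzureSSISName -Type Managed -Location "EastUS" `
    -NodeSize "Standard_A4_v2" -NodeCount 2 `
    -CatalogServerEndpoint "<your-sql-server>.database.windows.net" `
    -CatalogAdminCredential $catalogCred -CatalogPricingTier "S1"

# Start the runtime (this typically takes a while)
Start-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName -Name $AzureSSISName -Force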

I hit an issue when running the scripts; here is a hint in case you run into the same one. If the Set-AzureRmDataFactoryV2 and Set-AzureRmDataFactoryV2IntegrationRuntime cmdlets in the PowerShell scripts fail with an “HTTP Status Code: NotFound” error even after you have installed the Azure PowerShell modules (the AzureRM.DataFactoryV2 module is present in the C:\Program Files\WindowsPowerShell\Modules folder), it may be caused by incompatible versions between the AzureRM modules. Because the “Install-Module AzureRM -AllowClobber” cmdlet only installs the Azure modules that are not already installed on your computer, the modules that AzureRM.DataFactoryV2 depends on are not updated. The simplest fix is to remove all AzureRM modules from the C:\Program Files\WindowsPowerShell\Modules folder and then run the Install-Module AzureRM -AllowClobber cmdlet again.
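In PowerShell (run as administrator), that fix boils down to something like:

# Remove the locally installed AzureRM module folders, then reinstall the whole AzureRM module set
Remove-Item "C:\Program Files\WindowsPowerShell\Modules\AzureRM*" -Recurse -Force
Install-Module AzureRM -AllowClobber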

After the PowerShell scripts run successfully, an Azure Data Factory V2 instance and an SSISDB Azure SQL database are created in the resource group we specified in the scripts.


SSISDB is the Azure-hosted SSIS catalog into which you deploy your SSIS packages. We can use SSMS to connect to SSISDB. On the Connect to Server dialog in SSMS, we specify the server name as the Azure SQL server hosting the SSISDB database.


After specifying the server name and the login details, don’t click the Connect button yet; instead, click the Options button to open the Connection Properties tab. On the Connection Properties tab, set the “Connect to database” field to the SSISDB database. This setting allows you to connect to the SSISDB SSIS catalog once you click the Connect button.


We can deploy our SSIS package into the SSISDB catalog either by using the Deploy Project option in SSMS


or by using the Deploy option in your SSIS SSDT project in Visual Studio.


Both options will launch the Integration Services Deployment Wizard dialog to guide us through the steps for the SSIS package deployment.


We can execute the deployed SSIS packages in SSMS by right-clicking an SSIS project or SSIS package to launch the Execute Package dialog. The SSIS package we created to periodically load data from the Azure SQL database to Azure Data Lake requires a DateSclieToLoad parameter that specifies which day of data to move. We can manually set the value of this parameter in the Execute Package dialog.


We can also monitor the SSIS package executions in SSMS through the SSIS catalog reports.


Up to this point, we have created the Azure-SSIS Integration Runtime and deployed our SSIS package into the runtime, and we can execute the package and monitor the execution results manually in SSMS. In the next blog post, I will walk through the steps to schedule the SSIS package execution using an ADF V2 pipeline. The ADF V2 pipeline will be scheduled to execute every day; the date of the current pipeline run will be passed into the SSIS package, which will move the data for that date.

 

 

Power BI – GMROI Measure

GMROI (Gross Margin Return On Investment) is one of the most popular metrics for inventory management in the retail industry. Retailers are short-term investors, and the “buy and hold” strategy does not work in retail. Instead of having cash frozen in inventory, it is crucial to keep cash flowing: continually purchase, mark up, and sell in order to generate profits and expand the business. GMROI is a profitability metric that helps a retailer analyse how efficiently inventory is being converted into cash.

GMROI is defined as the gross profit a retailer makes in return for their investment in inventory.  A common formula used to calculate GMROI is to divide the gross profit by the average inventory cost.

GMROI = Gross Profit / Average Inventory Cost

Gross Profit is calculated by subtracting the COGS (Cost of Goods Sold) from the revenue:

Gross Profit = Revenue – COGS

Inventory cost is a semiadditive measure as it is not additive on the date dimension. Average Inventory Cost is calculated by dividing the sum of the inventory cost over a specified period by the total number of days of the period:

Average Inventory Cost = Sum of Inventory Cost over a Period / Number of Days in the Period

Due to the semiadditive nature of the inventory measures, the GMROI analysis in a BI solution is normally conducted on a periodic snapshot data model, including an inventory fact table, a date dimension table, and a number of other dimension tables that are applicable to the analysis, such as vendors, stores, and products.

The snapshot below shows a sample data schema for GMROI analysis, created in Power BI. The design of the data schema can vary, depending on the retailer’s specific business rules and LOB database design. You can download the demo pbix file here.

[Screenshot: sample Power BI data schema for GMROI analysis]

The Inventory table in the sample data schema records the daily snapshot of the inventory level, the quantity of products sold, and the inventory cost and retail price of the products across the vendor, store and product dimensions.

To create the Gross Profit measure, we calculate the gross profit for each sold item and multiply it by the quantity of items sold in a day. We then use the SUMX function to roll up the total gross profit according to the evaluation context.

Gross Profit = SUMX('Inventory',
                    'Inventory'[Quantity Sold]*
                       ('Inventory'[Retail Price] - 'Inventory'[Cost]) 
                )

To create the Average Inventory Cost measure, we can sum up the inventory cost of all days in a period and divide it by the number of days in the period.

Average Inventory Cost = 
    DIVIDE(
        SUMX('Inventory', 'Inventory'[Inventory Level] * 'Inventory'[Cost]),
        COUNTROWS('Inventory')
    )

After the Gross Profit measure and the Average Inventory Cost measure are created, we can simply calculate the GMROI measure by dividing the Gross Profit measure by the Average Inventory Cost Measure.

GMROI = DIVIDE([Gross Profit], [Average Inventory Cost])

The GMROI measure can be used in different evaluation contexts depending on the specific requirements of the GMROI analysis. For example, we can use the measure to calculate the annual GMROI of the products supplied by Vendor A and sold at Store X.
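As an illustration only (the table and column names below are assumptions, since they depend on how the dimensions are named in the demo schema), such a calculation could be expressed as:

GMROI Vendor A at Store X = 
    CALCULATE(
        [GMROI],
        'Vendor'[Vendor Name] = "Vendor A",
        'Store'[Store Name] = "Store X"
    )

with the year coming from the Date dimension placed on the visual or a slicer.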

DAX – Find the Items Ranked in Top n for Multiple Periods (with Dynamic Slicing)

One of my previous blog posts introduced how to find the items ranked in the top n for multiple periods, using the INTERSECT and TOPN functions. However, that approach requires the periods and the number of top items to be hard-coded in the DAX scripts. This blog post introduces an approach that allows users to dynamically specify the periods and the number of top items to evaluate, using interactive dashboard slicers.


In this blog post, we will again use the Eurovision dataset as the example; it contains rows of country-to-country votes for each year.


We will create four measures: “Rank”, “In Top N (This Year)”, “In Top N (All Selected Years)”, and “All Selected Years in Top N”. These measures will be evaluated in a context made up of the combination of each year and each country. To build that evaluation context, we use a Power BI table visual and add the “Year” and “ToCountry” columns from the Eurovision dataset to the table. The four measures will be added to the table later to evaluate each country’s rank, and whether it is in the top n, for each year.


A “Year” slicer will be added to the dashboard to allow users to filter the table by the selected years. Any number of years can be selected, and the selected years can be consecutive or nonconsecutive.


Measure – Rank

The first measure to create is “Rank”, which computes the rank of each country in each selected year.

Rank = RANKX(ALL(data[Country]), CALCULATE(SUM(data[Points])))

Measure – In Top N (This Year)

Based on the “Rank” measure, we will create the “In Top N (This Year)” measure, which computes whether the current country is ranked in the top n in the current year-country evaluation context. Here we need to allow users to dynamically specify the N (the number of top items) to evaluate. We can achieve that using a disconnected parameter table that defines the options for N.
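One way to create such a disconnected table is a calculated table like the sketch below (the demo file may instead use the Enter Data feature; the column name, including its trailing space, matches the 'TopN'[Top N ] reference used in the measure). The “Top N ” column is then placed in a slicer so users can pick the N value.

TopN = SELECTCOLUMNS(GENERATESERIES(5, 25, 5), "Top N ", [Value])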


In the DAX measure, we get the user-selected N value using the VALUES function and compare it with the “Rank” measure we created earlier to evaluate whether the current country is in the top N in the current year context.

In Top N (This Year) = 
    IF([Rank]<
        IF(HASONEVALUE('TopN'[Top N ]),
            VALUES('TopN'[Top N ]),
            10
        ), 1, 0)

We then filter the table on the “In Top N (This Year)” measure so that it only keeps the countries ranked in the top N in at least one of the selected years.


Measure – In Top N (All Selected Years)

After applying the filter on the “In Top N (This Year)” measure, the table only contains the rows of countries ranked in the top N in at least one of the selected years. If we count the rows in the filtered table for a country, we get the number of selected years in which that country is ranked in the top N. That is what the “In Top N (All Selected Years)” measure does.

In Top N (All Selected Years) = 
   CALCULATE(
        DISTINCTCOUNT(data[Year]),
        ALLSELECTED(data[Year])
    )


Measure – All Selected Years in Top N

Now that we have the “In Top N (All Selected Years)” measure, which tells us in how many of the selected years a country is ranked in the top N, we can calculate the total number of selected years and compare it to that measure. If the value of the “In Top N (All Selected Years)” measure equals the total number of selected years, the country is ranked in the top N in all of the selected years.

All Selected Years in Top N = 
    VAR NumberOfSelectedYears = 
        CALCULATE(
            DISTINCTCOUNT(data[Year]),
            ALLSELECTED(data[Year]),
            ALLSELECTED(data[Country])
        )
    RETURN
        [In Top N (All Selected Years)] = NumberOfSelectedYears


Please find the pbix file here.