Author: Linxiao Ma

A Tip for Installing R Packages on Azure Batch

In a project I have recently been working on, I need to execute R scripts in Azure Batch. The compute nodes of the Azure Batch pool were provisioned as Data Science Virtual Machines, which already include common R packages. However, some packages required by the R scripts, such as tidyr and rAzureBatch, are missing and need to be installed.

My first attempt to install the package by running install.packages("tidyselect") failed. The error message revealed that I didn't have write permission to install packages to the default library location. It actually makes sense that Azure Batch doesn't allow jobs submitted by consumers to change the global environment of the compute nodes. Therefore, the R packages can only be installed somewhere on the compute nodes where the submitted jobs do have write permission.

After a bit of research, it turns out that Azure Batch organises the files of submitted jobs in the following structure.

Azure Batch exposes a portion of the file system on a compute node for the submitted job to access. A directory is created for each job and for each task associated with the job. Within each task directory, a working directory, wd, is created, to which the task has read/write access. As the task has full permission to create, update and delete content within this directory, this is the place where we can install the R packages. Azure Batch provides an environment variable, AZ_BATCH_TASK_WORKING_DIR, which holds the path of the working directory of the current task.

Knowing the task working directory, we can install the R package by specifying it as the installation target directory:

install.packages("tidyselect", lib=Sys.getenv("AZ_BATCH_TASK_WORKING_DIR"))

When attaching the installed package, we need to explicitly specify the directory as well:

library(tidyselect, lib.loc=Sys.getenv("AZ_BATCH_TASK_WORKING_DIR"))

Why Bother to Use Pandas “Categorical” Type in Python

When we process data using the Pandas library in Python, we normally convert categorical variables from the string type to the Categorical data type offered by Pandas. Why do we bother to do that, considering the output results are identical whether we use the Pandas Categorical type or the string type? To answer this question, let’s first run a simple test.

In this test, we create a data frame with two columns, “Category” and “Value”, and generate 50 million rows in this data frame. The values of the “Category” column are drawn from a list of six predefined categories, [‘category1’, ‘category2’ … ‘category6’], and the values of the “Value” column are drawn from the list of integers [0 … 9].

0102
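The code behind the screenshots is not reproduced as text in this post, so here is a minimal Python sketch of the setup; the variable names and the use of NumPy's random functions are my own choices rather than a copy of the original screenshots.

import numpy as np
import pandas as pd

n_rows = 50_000_000
categories = ['category1', 'category2', 'category3',
              'category4', 'category5', 'category6']

# 50 million rows: a string "Category" column drawn from the six predefined
# categories and an integer "Value" column drawn from 0..9.
df = pd.DataFrame({
    'Category': np.random.choice(categories, size=n_rows),
    'Value': np.random.randint(0, 10, size=n_rows),
})

print(df.dtypes)  # "Category" is a plain object (string) column at this point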

We first time the execution of a group-by operation on the “Category” column while it is in the string type, and also observe the memory usage of the “Category” column.

03
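Continuing the sketch above, the first measurement on the string-typed column looks something like this; time.perf_counter is used here in place of whatever timing mechanism the original screenshot shows.

import time

# Time a group-by on the "Category" column while it is still stored as strings.
start = time.perf_counter()
df.groupby('Category')['Value'].sum()
print('group-by on object dtype:', time.perf_counter() - start, 'seconds')

# Memory used by the string-typed "Category" column.
print(df['Category'].memory_usage(deep=True), 'bytes')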

We then convert the “Category” column to the Pandas Categorical type, repeat the same group-by operation, and observe the memory usage of the “Category” column in the Categorical type.

04
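The conversion and the repeated measurement, again as a sketch continuing from the code above:

# Convert the "Category" column to the Pandas Categorical type.
df['Category'] = df['Category'].astype('category')

# Repeat the same group-by and memory measurements on the converted column.
start = time.perf_counter()
df.groupby('Category')['Value'].sum()
print('group-by on category dtype:', time.perf_counter() - start, 'seconds')

print(df['Category'].memory_usage(deep=True), 'bytes')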

The result reveals roughly a five-times improvement in running speed and about one eighth of the memory usage after converting the “Category” column to the Pandas Categorical data type.

05

This test result answers our original question: the reason to use the Pandas Categorical data type is the reduced memory usage and improved data processing speed. So why does the Categorical data type have such magic? The answer is pretty simple: dictionary encoding.
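To illustrate the idea independently of Pandas, here is a tiny hand-rolled sketch of dictionary encoding: each distinct string is stored once in a dictionary, and the column itself keeps only small integer codes. The Categorical type does essentially the same thing internally, backed by NumPy arrays.

values = ['category2', 'category5', 'category2', 'category1', 'category5']

# Build the dictionary of distinct values and encode the column as integer codes.
dictionary = sorted(set(values))                 # ['category1', 'category2', 'category5']
code_of = {value: code for code, value in enumerate(dictionary)}
codes = [code_of[value] for value in values]     # [1, 2, 1, 0, 2]

# Decoding is a simple lookup back into the dictionary.
decoded = [dictionary[code] for code in codes]
assert decoded == values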

If we open the source code of the Pandas Categorical class, we can see this class contains two properties, “categories” and “codes”.

06

If we print the two properties for the “Category” column used in our test, we can see that the “categories” property stores the dictionary of the six categories available for the column, while the actual category of each row in the data frame is stored in the “codes” property as an integer that points to the position of the corresponding category in the “categories” property.

07

08
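For reference, with the test data frame from the sketch above, the two properties can be inspected through the .cat accessor; the outputs shown in the comments are illustrative.

# The dictionary of the six distinct categories for the column.
print(df['Category'].cat.categories)
# Index(['category1', 'category2', 'category3', 'category4', 'category5', 'category6'], dtype='object')

# The per-row integer codes that point into the dictionary above.
print(df['Category'].cat.codes.head())
print(df['Category'].cat.codes.dtype)  # a small integer type such as int8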

In this way, the Pandas Categorical data type takes much less memory to store the category information as integers than as the original strings. Query operations on the category column scan less memory, and the query time is therefore shortened.

Dictionary encoding is a common technique used for data compression. For example, Azure Analysis Services and Power BI also use dictionary encoding in their VertiPaq engine to compress data, reducing memory usage and increasing query speed.

13fig04
*This figure is from the book “The Definitive Guide to DAX” by Alberto Ferrari and Marco Russo

 

Build a Power BI Knowledge Base Bot Using Microsoft Bot Framework and QnA Maker

The first question that popped into my head when I heard about the Microsoft Bot Framework was how to build some sort of bot capable of taking over some of my responsibilities at work. Part of my Power BI consulting responsibilities is to answer all sorts of Power BI related questions from clients. It would be pretty cool to have a bot take those responsibilities from me, on the basis that there is no risk to my job security :-).

After some quick research, I found it is very easy to create a Power BI knowledge base bot using the Microsoft Bot Framework and the QnA Maker service. This blog post walks through the steps to create the bot.

02

There are four major steps to build a Power BI knowledge base bot:

  • Step 1 – Create a QnA Maker service and train the knowledge base
  • Step 2 – Create a bot with the Bot Builder SDK and deploy to an Azure App Service
  • Step 3 – Register the created bot with Azure Bot Service
  • Step 4 – Connect the bot to channels

Step 1 – Create a QnA Maker service and train the knowledge base

QnA Maker is one of the Microsoft Cognitive Services. It enables developers or business users to build, train and publish a QnA type of bot based on FAQ URLs, structured documents or editorial content. You can log on to the QnA Maker website through this link, https://qnamaker.ai/, where you can create, manage, test and publish your QnA bot.

To create a QnA Maker service, you need to click the “Create new service” link on the top navigation bar of the QnA Maker website to launch the “Create a QnA service” form.

0317

You need to provide the name of the QnA Maker service and either specify the URLs of the knowledge base pages or upload the files that contain your questions and answers.

After the QnA Maker service is created, you can access it from the “My services” page. Choosing to edit your QnA Maker brings you to the maker edit page, where you can edit, retrain, test and publish it.

 

04

On the Knowledge Base tab of the maker edit page, all the questions and answers from the web pages of the knowledge base you specified, or from the documents you manually uploaded, are extracted into a table, where you can delete the imported question/answer pairs or manually add and edit new pairs.

After you have edited your QnA Maker, you can save and retrain it using the green “Save and retrain” button at the top right of the maker edit page, and then test it on the “Test” tab.

05

Once you are happy with your QnA Maker, you can publish it as a service using the blue “Publish” button at the top right of the maker edit page. After the maker is successfully published, you will be provided with a sample HTTP request for building your bot. You need to note down the QnA knowledge base id and the subscription key (highlighted with a red box in the snapshot below), which will be used later when creating the bot.

06.PNG
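If you want to sanity-check the published knowledge base before wiring up the bot, you can call the endpoint directly. The short Python sketch below is only an illustration: the host, API version and header name are assumptions based on the QnA Maker v2.0 API of the time, so use the exact URL and headers shown in your own sample HTTP request.

import requests

# Hypothetical placeholders; take the real values from the publish page.
knowledge_base_id = '<your-knowledge-base-id>'
subscription_key = '<your-subscription-key>'

# Endpoint shape assumed from the QnA Maker v2.0 API of that era.
url = ('https://westus.api.cognitive.microsoft.com/qnamaker/v2.0/'
       'knowledgebases/{0}/generateAnswer'.format(knowledge_base_id))

response = requests.post(
    url,
    headers={'Ocp-Apim-Subscription-Key': subscription_key},
    json={'question': 'How do I share a Power BI report?'},
)
print(response.json())  # the top matching answers with confidence scores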

Step 2 – Create a bot with the Bot Builder SDK and deploy to an Azure App Service

There are a couple of ways to create a bot. In this blog post, I will use the Bot Builder SDK for .NET to create the bot in Visual Studio 2017.

Before you can use Visual Studio 2017 to create a bot with the Microsoft Bot Framework, there are some prerequisite tasks you need to complete, such as installing the project and item templates. You can follow this doc for those prerequisites: https://docs.microsoft.com/en-us/bot-framework/dotnet/bot-builder-dotnet-quickstart

After you have completed all the prerequisite tasks, you will see a new “Bot Application” project template in the “New Project” dialog.

07

After you create a project using the “Bot Application” template, you will have predefined controller and dialog files, and predefined configuration settings in the web.config file.

10

Open the web.config file and, under the appSettings section, add two keys for the QnA subscription key and the knowledge base id that you noted down earlier when you published your QnA Maker service.
08

Under the Dialogs folder in your project, create a new dialog file using the dialog item template and name it PowerBIQnADialog. Add a reference to Microsoft.Bot.Builder.CognitiveServices.QnAMaker and change the dialog class to inherit from QnAMakerDialog. Call the base constructor of the class and pass the QnA Maker attributes over, including the QnA subscription key and knowledge base id (as specified in the web.config file), the message the bot returns when no answer matches the asked question, and the matching threshold for deciding whether an answer matches the question.

09

After the dialog file is created, open the existing MessagesController.cs file. In the “Post” method, change Dialogs.RootDialog() to Dialogs.PowerBIQnADialog(), as shown in the snapshot below.

11

After you have referenced the right dialog class in the MessagesController.cs file, you can publish the bot to an Azure App Service. You can create the App Service through the Azure Portal, or create it from Visual Studio when you select to publish the project.

12

Step 3 – Register the created bot with Azure Bot Service

After you have created a bot and deployed it to the Azure App Service, you can create an Azure Bot Service and register the bot with it.

Log on to the Azure Portal and create a new Bot Channels Registration service. Specify the general information such as name, subscription, location, etc. The most important attribute to specify is the “Messaging endpoint” of your created and deployed bot, which is in the format {url of the app service where your bot is deployed}/api/messages.

01

After the bot service has been created, you can change the messaging endpoint in the Settings tab if required.

14

Also, on the Settings tab, you can get the Microsoft App ID for the bot service and its password (you can click the “Manage” link to generate a new password; a previously created password is not visible if you did not note it down somewhere before). You now need to update the web.config file in your bot Visual Studio project to set the Microsoft App ID and the password in the MicrosoftAppId and MicrosoftAppPassword keys. After the change, you need to republish your bot to update the Azure App Service.

18

After your bot is updated in the Azure App Service, your bot service is ready to use. You can test it in the “Test in Web Chat” tab of your Bot Channels Registration service.

15

Step 4 – Connect the bot to channels

A bot channel is the connection between the Bot Framework and a communication app that can talk to your bot. You can create channels on the “Channels” tab of your Bot Channels Registration service page. Quite a number of apps are supported by the Bot Framework, such as Skype, Microsoft Teams, Facebook Messenger, etc.

19.PNG

In this blog post, I created a Microsoft Teams channel to communicate with our Power BI QnA bot by simply clicking the Microsoft Teams icon on the channel page, and the channel is then up and running.

20

On the Chat tab in your Microsoft Teams app, start a new chat with the Microsoft App ID of your bot service.

21

Now you can start asking your bot Power BI related questions from the Microsoft Teams app.

02

Workaround of the Power BI Guest User License Assignment Issue

I have recently been working with an insurance client to help them design a solution to distribute a Power BI app to external guest users with Azure AD B2B. I designed the solution based on Microsoft's Power BI external distribution guideline whitepaper: Distribute Power BI content to external guest users using Azure Active Directory B2B. However, I ran into an issue when assigning external guest users the Power BI Pro license.

To allow external access to the Power BI app, guest users must have a Power BI Pro license. Power BI supports three approaches to licensing external users: using Power BI Premium, assigning Power BI Pro licenses to guest users, or using guest users’ existing Power BI Pro licenses if they have them. This client decided to assign Power BI Pro licenses to those guest users who do not already have one.

Based on the guideline whitepaper, we should be able to assign the Power BI Pro license to a guest user through the Office 365 admin portal.

Capture.PNG

However, on the Office 365 admin portal, the product license management option is only available for internal users, not for Azure AD B2B guest users. After some searching online, there seems to be a known issue currently with guest user license assignment through the Office 365 admin portal. It looks like the only option for now is to assign the guest user license through the API.

Here is the script for assigning the Power BI Pro license:

# Requires the MSOnline PowerShell module
Connect-MsolService
Set-MsolUser -UserPrincipalName "{User principal name}" -UsageLocation GB
Set-MsolUserLicense -UserPrincipalName "{User principal name}" -AddLicenses "{Tenant Name}:POWER_BI_PRO"

{User principal name} is the guest account name of the invited external user. You can find it in the Username field of the guest user in the Office 365 admin portal.

The -UsageLocation parameter of the Set-MsolUser cmdlet specifies the user's country with a two-letter ISO code. You can look up the country code through this link: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2.

The -AddLicenses parameter of the Set-MsolUserLicense cmdlet specifies the license to assign to the user, which consists of two parts: {Tenant Name} (the name of the tenant) and POWER_BI_PRO (the license code for Power BI Pro).

SSIS in Azure #3 – Schedule and Monitor SSIS Package Execution using ADF V2

*The source code created for this blog post can be found here.

In the previous blog posts in the SSIS in Azure series, we created an SSIS package that periodically ingests data from an Azure SQL database to Azure Data Lake Store, and deployed the package to the Azure-SSIS Integration Runtime. Up to this point, we have achieved two goals in the series:

  • Using SSIS to move data between cloud storage services
  • Hosting and running SSIS packages in the cloud

The last goal we need to achieve is to schedule the execution of the SSIS package in the cloud. Traditionally, we can schedule SSIS package execution by creating a SQL Server Agent job. However, we can only do that through an on-premises SQL Server instance or by provisioning a SQL Server VM in Azure. Thanks to the support for the Stored Procedure activity in ADF, we can now schedule the SSIS package execution using a cloud-based ADF pipeline. This blog post will walk through the steps to achieve that.

You can find the source code created for this blog post in my GitHub. The ningago.demo.adf.ssis.adfv2 project contains the JSON and PowerShell files to create the ADF pipeline.

4.PNG

 

The key to this solution is to call the sp_executesql stored procedure from the RunSSISPackagePipeline pipeline to execute the SQL script that triggers the SSIS package execution. We can define a Stored Procedure activity in the pipeline that calls sp_executesql and passes in, as a parameter, the SQL script for triggering the SSIS package execution.

5

We don’t have to author the SQL script from scratch; instead, we can generate the script using SSMS. First, we connect to the SSISDB catalog in our Azure-SSIS Integration Runtime (please refer to SSIS in Azure #2 for how to do that), select the SSIS project or package we want to execute, and click the Execute button to open the Execute Package dialog window.

6

In the Execute Package dialog window, we select the package to execute and also set a value (any value will do) for the package parameter (the reason for setting a value is to ensure the parameter-setting statement is generated in the SQL script).

2

We then click the “Script” button at the top left, which generates the SQL script for executing the selected SSIS package.

3

We can then copy the script into the stored procedure activity defined in the RunSSISPackagePipeline pipeline.

5

As you may have noticed, the SSIS package parameter, DateScliceToLoad, has been manually set by us to an arbitrary value. The next step is to pass the date of the pipeline run into the SQL script so that the SSIS package only moves the data for the given day. Firstly, we need to define a “scheduledRunTime” parameter in the RunSSISPackagePipeline pipeline, which will receive the scheduled pipeline run time from the pipeline trigger we will create later.

6.PNG

In the SQL script we just generated to trigger the SSIS package execution, we replace the arbitrary date we manually set with the ADF expression that points to the scheduledRunTime parameter of the pipeline.

7.PNG

Next, we need to create the trigger JSON file to schedule the execution of the pipeline. After we specify the pipeline to execute in the trigger file, we need to pass the scheduledTime of the trigger, which is a system variable, to the scheduledRunTime parameter defined in the pipeline.

{
  "properties": {
    "name": "RunSSISPackageScheduler",
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Hour",
        "interval": 1,
        "startTime": "2018-01-07T00:00:00-22:00"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "type": "PipelineReference",
          "referenceName": "RunSSISPackagePipeline"
        },
        "parameters": {
          "scheduledRunTime": "@trigger().scheduledTime"
        }
      }
    ]
  }
}

After we have created the JSON files that define the linked service, the pipeline and the trigger, we need to deploy them into our data factory instance. As the V2 version of ADF does not support deployment through the UI yet, we need to create a deployment script using PowerShell.

$DataFactoryName = "ninjago3843adf"
$ResourceGroupName = "SSIS"

# Deploy the linked service and pipeline definitions
Set-AzureRmDataFactoryV2LinkedService -DataFactoryName $DataFactoryName -ResourceGroupName $ResourceGroupName -Name "SSISDBLinkedService" -File ".\SSISDBLinkedService.json"
Set-AzureRmDataFactoryV2Pipeline -DataFactoryName $DataFactoryName -ResourceGroupName $ResourceGroupName -Name "RunSSISPackagePipeline" -DefinitionFile ".\RunSSISPackagePipeline.json"

# Stop the trigger if it is already running, update its definition, then start it
Stop-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "RunSSISPackageScheduler"
Set-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "RunSSISPackageScheduler" -DefinitionFile ".\RunSSISPackageScheduler.json"
Start-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "RunSSISPackageScheduler"

Once the pipeline is deployed and scheduled by the trigger, it will execute our SSIS package on the defined schedule.

8.PNG

 

SSIS in Azure #2 – Deploy SSIS Packages to Azure-SSIS Integration Runtime in ADF V2

In the first blog post of the SSIS in Azure series, I demonstrated how to create SSIS packages to move data in the cloud, using a common use case that periodically ingests data from an Azure SQL database to Azure Data Lake Store. In the pre-ADF V2 era, we could only deploy SSIS packages to on-premises SQL Servers or SQL Servers in Azure VMs. Thanks to the Azure-SSIS Integration Runtime, which is available in public preview in Azure Data Factory V2, we can now directly deploy and execute our SSIS packages in Azure without needing to provision and manage a SQL Server VM in Azure (one thing to note: under the hood, the Azure-SSIS Integration Runtime itself is a cluster of Azure VMs dedicated to running SSIS packages).

This blog post will walk through the steps to deploy, execute and monitor the SSIS package we created in the previous blog post in the Azure-SSIS Integration Runtime.

First of all, we need to have an Azure-SSIS integration runtime created in our Azure tenant. This can be achieved using the PowerShell scripts provided by Microsoft.

I met an issue when running the scripts, so here is a hint in case you run into the same one. If the Set-AzureRmDataFactoryV2 and Set-AzureRmDataFactoryV2IntegrationRuntime cmdlets in the PowerShell scripts fail to execute and raise the “HTTP Status Code: NotFound” error even after you have installed the Azure PowerShell modules (you can see the AzureRM.DataFactoryV2 module is installed by checking the C:\Program Files\WindowsPowerShell\Modules folder), it may be caused by incompatible versions between AzureRM modules. As the “Install-Module AzureRM -AllowClobber” cmdlet only installs the Azure modules that are not already installed on your computer, the Azure modules on which the AzureRM.DataFactoryV2 module depends are not updated. The simplest solution to this issue is to remove all AzureRM modules from the C:\Program Files\WindowsPowerShell\Modules folder and then run the Install-Module AzureRM -AllowClobber cmdlet again.

After the PowerShell scripts have run successfully, an Azure Data Factory V2 instance and an SSISDB Azure SQL database are created in the resource group we specified in the scripts.

2

SSISDB is the Azure-hosted SSIS catalog into which you can deploy your SSIS packages. We can use SSMS to connect to SSISDB. In the Connect to Server dialog in SSMS, we need to set the Server name to the Azure SQL server that hosts the SSISDB database.

3

After specifying the server name and the login details, don’t click the Connect button yet; instead, click the Options button to open the Connection Properties tab. On the Connection Properties tab, set the “Connect to database” field to the SSISDB database. This setting will allow you to connect to the SSISDB SSIS catalog after clicking the Connect button.

4

5

We can deploy our SSIS package into the SSISDB catalog either by using the Deploy Project option in SSMS

6

or by using the Deploy option in your SSIS SSDT project in Visual Studio.

7

Both options will launch the Integration Services Deployment Wizard dialog to guide us through the steps for the SSIS package deployment.

8

13

We can execute the deployed SSIS packages in SSMS by right-clicking an SSIS project or package to launch the Execute Package dialog. The SSIS package we created to periodically load data from the Azure SQL database to Azure Data Lake requires a DateSclieToLoad parameter to be passed in, which specifies which day of data to move. We can manually set the value of this parameter in the Execute Package dialog.

9

We can also monitor the SSIS package executions in SSMS through the SSIS catalog reports.

14.PNG

1112

Up to this point, we have created the Azure-SSIS Integration Runtime and deployed our SSIS package into it. We can execute the package and monitor the execution results manually in SSMS. In the next blog post, I will walk through the steps to schedule the SSIS package execution using an ADF V2 pipeline. The ADF V2 pipeline will be scheduled to execute every day; the date of the current pipeline run will be passed into the SSIS package, and the SSIS package will move the data for the given date.