Execute R Scripts from Azure Data Factory (V2) through Azure Batch Service

One requirement I have been working on recently is to run R scripts for some complex calculations in an ADF (V2) data processing pipeline. My first attempt was to run the R scripts using Azure Data Lake Analytics (ADLA) with the R extension. However, two limitations of the ADLA R extension stopped me from adopting this approach. Firstly, at the time of writing, the ADLA R extension supports only one input dataframe and one output dataframe, while the requirement I am working on needs the R scripts to take multiple dataframes as input and to output multiple result dataframes. Secondly, the total size of the input and output is limited to 500 MB.

My next attempt was to create an ADF custom activity to execute the R scripts in the Azure Batch service. This option turned out to be a flexible approach that is easy to implement and manage. This blog post highlights the key steps to implement the approach and mentions the lessons I have learnt along the way.

Step 1 – Preparing Azure Batch Service

Firstly, we need to add an Azure Batch pool to an Azure Batch service instance. If an Azure Batch service instance doesn’t exist yet, a new instance needs to be provisioned. Please refer to the Microsoft official docs for the details on creating an Azure Batch service and pools.

While adding the Azure Batch pool, we need to specify the VM image used to provision the compute nodes in the pool. We can use the Data Science Virtual Machine image, which ships with most of the common R packages.


Step 2 – Creating a Container in Azure Blob Storage to Host R Source Files and Input/Output Data

Create a Blob storage container in your Azure Storage account and then create an Input folder and an Output folder within the container root folder.

The R source files will be deployed into the container root folder. The input data files are the output of the upstream processing activities in the ADF pipeline and are copied into the Input folder. The R source files and the input data files will be submitted for execution to the Azure Batch service (by the Azure Batch custom activity that will be created later in the ADF pipeline), and the output will be written into the Output folder in the Azure Blob storage.

Step 3 – Authoring R Scripts to Communicate with the Azure Blob Storage

When the Azure Batch custom activity is triggered in the ADF pipeline, the R source files and the input data files are submitted into the working directory created for the submitted task on the Azure Batch compute nodes. The R scripts load the data from the data files into dataframes, run the calculations and transformations, and finally write the results to output data files in the working directory. The output data files are then written into the Output folder in the Azure Blob storage using the blob operation functions provided by the rAzureBatch package. There is a good sample on the doAzureParallel GitHub site covering the blob operations with rAzureBatch.

Basically, we first need to create a rAzureBatch StorageServiceClient with the Azure Storage account credentials.

storageCredentials <- rAzureBatch::SharedKeyCredentials$new(
    name = "{name of the Azure Storage account}",
    key = "{access key of the Azure Storage account}"
)

storageClient <- rAzureBatch::StorageServiceClient$new(
    authentication = storageCredentials,
    url = "{url of the Azure Blob storage}"
)

Then, we need to create a SAS token with write permission on the Output folder:

writeSasToken <- storageClient$generateSasToken(permission = "w", "c", path = "{path of the Output folder}")

Lastly, we can save the output file into the Output folder in Azure Blob storage with the uploadBlob function (where containerName and storageAccountName hold the names of the blob container and the Azure Storage account created earlier).

response <- storageClient$blobOperations$uploadBlob(
    containerName,
    fileDirectory = "{output file name}",
    sasToken = writeSasToken,
    accountName = storageAccountName)
if (response[['status_code']] != 201) {
    stop('Failed to save the output file.')
}

It is important to explicitly check the response status_code and throw an error when the save action fails. Otherwise, the ADF pipeline will not be able to capture the error; instead, it will treat the custom activity as having run successfully and move on to the downstream activities.

Step 4 – Set Up the Azure Batch Custom Activity in the ADF Pipeline

After the Azure Batch service part of the work is done, we need to add and configure the Azure Batch custom activity in the ADF pipeline. This step is pretty straightforward; please refer to the Microsoft official docs for more details. The only parts to note are the “Command” and “Folder path” settings of the Azure Batch custom activity. The “Command” should be “RScript {the entry R file you want to run}”, and the “Folder path” should be the container we created earlier to host the R source files.


Those are the four main steps to set up the execution of R scripts in ADF pipelines.

Below are a few tips that might be helpful:

If you need to install additional R packages for your scripts, specify the lib path with the environment variable AZ_BATCH_TASK_WORKING_DIR in order to install the packages into the working directory of the current task. Please refer to my previous blog post for further explanation.

 install.packages("tidyselect", lib=Sys.getenv("AZ_BATCH_TASK_WORKING_DIR"))

If the Azure Batch custom activity throws a UserError with the message “Hit unexpected exception and execution failed.”, you can find the detailed error message in the stderr.txt file in the working directory of the failed task.


To access the working directory, go to the Azure Batch account –> Jobs –> select the job –> select the failed task, and then select “Files on node”.


Then you should be able to see all the files in the working directory of that task, including the stderr error output file.


 


A Tip for Installing R Packages on Azure Batch

In one project I have been working on recently, I need to execute R scripts in Azure Batch. The compute nodes of the Azure Batch pool were provisioned with the Data Science Virtual Machine image, which already includes common R packages. However, some packages required by the R scripts, such as tidyr and rAzureBatch, were missing and needed to be installed.

My first attempt to install a package by running install.packages("tidyselect") failed. The error message revealed that I didn’t have write permission to install the package. It actually makes sense that Azure Batch doesn’t allow the jobs submitted by consumers to change the global environment of the compute nodes. Therefore, R packages can only be installed somewhere on the compute nodes where the submitted jobs have write permission.

After a bit of research, it turns out that Azure Batch organises the submitted jobs in a well-defined directory structure on the compute nodes.

Azure Batch exposes a portion of the file system on a compute node for the submitted job to access. One directory is created for each job and for each task associated with the job. Within each task directory, a working directory, wd, is created to which the task has read/write access. As the task has full permission to create, update and delete content within this directory, this is the place where we can install the R packages. Azure Batch provides an environment variable, AZ_BATCH_TASK_WORKING_DIR, that holds the path of the working directory of the current task.

With the knowledge of the task working directory, we can install the R package with the installation target directory specified explicitly.

install.packages("tidyselect", lib=Sys.getenv("AZ_BATCH_TASK_WORKING_DIR"))

When attaching the installed package, we need to explicitly specify the library location as well:

library(tidyselect, lib.loc = Sys.getenv("AZ_BATCH_TASK_WORKING_DIR"))

Why Bother to Use Pandas “Categorical” Type in Python

When we process data using the Pandas library in Python, we normally convert the string type of categorical variables to the Categorical data type offered by the Pandas library. Why do we bother to do that, considering the output results are exactly the same whether we use the Pandas Categorical type or the string type? To answer this question, let’s first run a simple test.

In this test, we create a data frame with two columns, “Category” and “Value”, and generate 50 million rows in this data frame. The values of the “Category” column are generated from a list of six predefined categories, [‘category1’, ‘category2’ … ‘category6’], and the values of the “Value” column are generated from the list of integers [0 … 9].

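The original code was shown as screenshots, so here is a minimal sketch of the test setup, assuming the rows are generated randomly (the column names, the six category labels and the 50-million row count come from the description above; everything else is an assumption):

import numpy as np
import pandas as pd

# Build the test data frame: a string "Category" column drawn from six
# predefined categories and an integer "Value" column drawn from 0-9.
n_rows = 50 * 1000 * 1000
categories = ['category1', 'category2', 'category3',
              'category4', 'category5', 'category6']

df = pd.DataFrame({
    'Category': np.random.choice(categories, size=n_rows),
    'Value': np.random.randint(0, 10, size=n_rows)
})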

We first time the execution of a group-by operation against the “Category” column while it is still of the string type, and also observe the memory usage of the “Category” column.

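A sketch of that measurement, continuing from the data frame created above (using time.time() for the timing, and memory_usage(deep=True) so that the memory of the underlying Python string objects is counted):

import time

# Time a group-by aggregation on the string-typed "Category" column.
start = time.time()
df.groupby('Category')['Value'].sum()
print('group by on the string column took {:.2f}s'.format(time.time() - start))

# Memory used by the "Category" column, including the string objects.
print(df['Category'].memory_usage(deep=True))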

We then convert the “Category” column to the Pandas Categorical type, repeat the same group-by operation and observe the memory usage of the “Category” column in the Pandas Categorical type.

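The conversion and the repeated measurement, again as a sketch continuing from the snippets above:

# Convert the "Category" column to the Pandas Categorical data type.
df['Category'] = df['Category'].astype('category')

# Repeat the same group-by aggregation on the categorical column.
start = time.time()
df.groupby('Category')['Value'].sum()
print('group by on the categorical column took {:.2f}s'.format(time.time() - start))

# Memory used by the "Category" column after the conversion.
print(df['Category'].memory_usage(deep=True))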

The result reveals a five-times improvement in running speed and about one eighth of the memory usage after converting the “Category” column to the Pandas Categorical data type.


This test result answers our original question: the reason to use the Pandas Categorical data type is the optimised memory usage and the improved data processing speed. Then why does the Categorical data type have such magic? The answer is pretty simple: dictionary encoding.

If we open the source code of the Pandas Categorical class, we can see that this class contains two properties, “categories” and “codes”.


If we print the two properties for the “Category” column used in our test, we can see that the “categories” property stores the dictionary of the six categories available for the “Category” column, while the actual category of each row in the data frame is stored in the “codes” property as an integer number that points to the position of the corresponding category in the “categories” property.


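We can inspect the two properties through the .cat accessor of the converted column (a sketch; for our test data the categories are the six predefined labels and the codes are integers from 0 to 5):

# The dictionary of distinct categories of the "Category" column.
print(df['Category'].cat.categories)

# The per-row integer codes pointing into the categories above.
print(df['Category'].cat.codes.head())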

In this way, the Pandas Categorical data type takes much less memory space, as the category information is stored as integers rather than as the original strings. Query operations on the category column scan less memory and therefore the time spent on the query is shortened.

Dictionary encoding is a common technique used for data compression. For example, Azure Analysis Services and Power BI also use dictionary encoding in their VertiPaq engine to compress data, in order to reduce memory usage and to increase query speed.
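Conceptually, dictionary encoding simply replaces each repeated value with a small integer that points into a lookup table of distinct values, along these lines (a hand-rolled illustration only, not how Pandas or VertiPaq implement it internally):

values = ['red', 'blue', 'red', 'red', 'blue', 'green']

# Build the dictionary (the distinct values) and the integer codes.
dictionary = sorted(set(values))               # ['blue', 'green', 'red']
codes = [dictionary.index(v) for v in values]  # [2, 0, 2, 2, 0, 1]

# Decoding is a simple lookup into the dictionary.
decoded = [dictionary[c] for c in codes]
assert decoded == values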

[Figure: dictionary encoding in the VertiPaq engine, from the book “The Definitive Guide to DAX” by Alberto Ferrari and Marco Russo]

 

Build a Power BI Knowledge Base Bot Using Microsoft Bot Framework and QnA Maker

The first question that popped into my head when I heard about the Microsoft Bot Framework was how to build some sort of bot capable of replacing me for some of my responsibilities at work. Part of my Power BI consulting responsibilities is to answer all sorts of Power BI related questions from clients. It would be pretty cool to have a bot take those responsibilities from me, on the basis that there is no risk to my job security :-).

After a bit of quick research, I found it is very easy to create a Power BI knowledge base bot using the Microsoft Bot Framework and the QnA Maker service. This blog post walks through the steps to create the bot.


There are four major steps to build a Power BI knowledge base bot:

  • Step 1 – Create QnA maker and train the knowledge base
  • Step 2 – Create a bot with the Bot Builder SDK and deploy to an Azure App Service
  • Step 3 – Register the created bot with Azure Bot Service
  • Step 4 – Connect the bot to channels

Step 1 – Create QnA maker and train the knowledge base

QnA Maker is one of the Microsoft Cognitive Services that enables developers or business users to build, train and publish a QnA type of bot based on FAQ URLs, structured documents or editorial content. You can log on to the QnA Maker website through this link, https://qnamaker.ai/, where you can create, manage, test and publish your QnA bot.

To create a QnA Maker service, you need to click the “Create new service” link on the top navigation bar of the QnA Maker website to launch the “Create a QnA service” form.


You need to provide the name of the QnA Maker service and to specify the URLs of the knowledge base pages or to upload the files that contain your questions and answers.

After the QnA Maker service is created, you can access it from the “My services” page. Selecting to edit your QnA Maker service brings you to the edit page, where you can edit, retrain, test and publish it.

 

On the knowledge base tab of the edit page, all the questions and answers extracted from the web pages of the knowledge base you specified, or from the documents you manually uploaded, are listed in a table, where you can manually delete the imported question/answer pairs or manually add and edit new pairs.

After you have edited your QnA Maker service, you can save and retrain it using the green “Save and retrain” button at the top right of the edit page, and then test it on the “Test” tab.


Once you are happy with your QnA Maker service, you can publish it using the blue “Publish” button at the top right of the edit page. After the service is successfully published, you will be provided with a sample HTTP request for building your bot. You need to note down the QnA knowledge base id and the subscription key, which will be used later when creating the bot.


Step 2 – Create a bot with the Bot Builder SDK and deploy to an Azure App Service

There are a couple of ways to create a bot. In this blog post, I will use the Bot Builder SDK for .NET to create the bot in Visual Studio 2017.

Before you can use Visual Studio 2017 to create a bot with the Microsoft Bot Framework, there are some prerequisite tasks you need to complete, such as installing the project and item templates. You can follow this doc for those prerequisite tasks: https://docs.microsoft.com/en-us/bot-framework/dotnet/bot-builder-dotnet-quickstart

After you have completed all the prerequisite tasks, you will see a new “Bot Application” project template in the “New Project” dialog.


After you create a project using the “Bot Application” template, you will have the predefined controller and dialog files, and the predefined configuration settings in the web.config file.


Open the web.config file and, under the appSettings section, add two keys for the QnA subscription key and the knowledge base id that you noted down earlier when you published your QnA Maker service.

Under the Dialogs folder in your project, create a new dialog file using the dialog item template and name it PowerBIQnADialog. Add a reference to Microsoft.Bot.Builder.CognitiveServices.QnAMaker, and change the dialog class to inherit from QnAMakerDialog. Call the base constructor of the class and pass in the QnA Maker attributes, including the QnA subscription key and the QnA knowledge base id (specified in the web.config file), the bot message text to return when there is no answer matching the asked question, and the matching threshold for deciding whether an answer matches the question.


After the dialog file is created, open the existing MessagesController.cs file. In the “Post” method, change Dialogs.RootDialog() to Dialogs.PowerBIQnADialog().


After you have referenced the right dialog class in the MessagesController.cs file, you can publish the bot to an Azure App Service. You can create the app service through the Azure Portal, or create it in Visual Studio when you select to publish the project.


Step 3 – Register the created bot with Azure Bot Service

After you have created the bot and deployed it to the Azure app service, you can create an Azure bot service and register the bot you have created with the bot service.

Log on to the Azure Portal and create a new Bot Channels Registration service. Specify the general information such as name, subscription, location, etc. The most important attribute to specify is the “Messaging endpoint” of your created and deployed bot, which is in the format {url of the app service where your bot is deployed}/api/messages.


After the bot service has been created, you can change the messaging endpoint in the Settings tab if required.


Also on the Settings tab, you can get the Microsoft App ID for the bot service and the password (you can click the “Manage” link to generate a new password; the password created previously is not visible if you have not noted it down somewhere before). You now need to update the web.config file in your bot’s Visual Studio project to set the Microsoft App ID and the password in the MicrosoftAppId and MicrosoftAppPassword keys. After the change, you need to republish your bot to update the Azure app service.


After your bot is updated in the Azure app service, your bot service is ready to use. You can test it on the “Test in Web Chat” tab of your Bot Channels Registration service.


Step 4 – Connect the bot to channels

A bot channel is the connection between the Bot Framework and the communication apps which can talk to your bot. You can create channels on the “Channels” tab of your Bot Channels Registration service page. Quite a number of apps are supported by the Bot Framework, such as Skype, Teams, Facebook Messenger, etc.


In this blog post, I created a Microsoft Teams channel to communicate with our Power BI QnA bot by simply clicking the Microsoft Teams icon on the channel page, and then the channel is up and running.


On the Chat tab in your Microsoft Teams app, start a new chat with the Microsoft App ID of your bot service.


And now you can start to ask your bot Power BI related questions from the Microsoft Teams app.


Workaround for the Power BI Guest User License Assignment Issue

I have recently been working with an insurance client to help them design a solution for distributing a Power BI app to external guest users with Azure AD B2B. I designed the solution based on the Microsoft Power BI content external distribution guideline whitepaper, Distribute Power BI content to external guest users using Azure Active Directory B2B. However, I have hit an issue with assigning external guest users the Power BI Pro license.

To allow external access to the Power BI app, guest users must have a Power BI Pro license. Power BI supports three approaches to licensing external users: using Power BI Premium, assigning Power BI Pro licenses to the guest users, or using the guest users’ existing Power BI Pro licenses if they have them. This client decided to assign Power BI Pro licenses to those guest users who do not already have one.

Based on the guideline whitepaper, we should be able to assign the Power BI Pro license to a guest user through the Office 365 admin portal.


However, on the Office 365 admin portal, the product licenses management option is only available for internal users and not for Azure AD B2B guest users. After some research online, there seems to be an issue currently with guest user license assignment through the Office 365 admin portal. It looks like the only option for now is to assign guest user licenses through the API.

Here is the script for assigning the Power BI license:

Connect-MsolService
Set-MsolUser -UserPrincipalName "{User principal name}" -UsageLocation GB
Set-MsolUserLicense -UserPrincipalName "{User principal name}" -AddLicenses "{Tenant Name}:POWER_BI_PRO"

The {User principal name} is the guest account name of the invited external user. You can find it in the Username field of the guest user on the Office 365 admin portal.

The -UsageLocation parameter of the Set-MsolUser cmdlet specifies the location country of a user with a two-letter ISO code. You can look up the country codes through this link: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2.

The -AddLicenses parameter of the Set-MsolUserLicense cmdlet specifies the license to assign to the user, which consists of two parts: {Tenant Name} (the name of the tenant) and POWER_BI_PRO (the license code for Power BI Pro).

SSIS in Azure #3 – Schedule and Monitor SSIS Package Execution using ADF V2

*The source code created for this blog post can be found here.

In the previous blog posts in the SSIS in Azure series, we created an SSIS package that periodically ingests data from an Azure SQL database into Azure Data Lake Store, and deployed the package to the Azure-SSIS Integration Runtime. Up to this point, we have achieved two goals in the SSIS in Azure series:

  • Using SSIS to move data between cloud storage services
  • Hosting and running SSIS packages in the cloud

The last goal we need to achieve is to schedule the execution of the SSIS package in the cloud. Traditionally, we would schedule SSIS package execution by creating a SQL Server Agent job. However, we can only do that through an on-premises SQL Server instance or by provisioning a SQL Server VM in Azure. Thanks to the support of the Stored Procedure activity in ADF, we can now schedule the SSIS package execution using a cloud-based ADF pipeline. This blog post walks through the steps to achieve that.

You can find the source code created for this blog post in my GitHub. The ningago.demo.adf.ssis.adfv2 project contains the JSON and PowerShell files to create the ADF pipeline.


 

The key to this solution is to call the sp_executesql stored procedure from the RunSSISPackagePipeline pipeline, executing the SQL script that triggers the SSIS package execution. We can define a Stored Procedure activity in the pipeline to call sp_executesql and pass in the SQL script for triggering the SSIS package execution as a parameter.


We don’t have to author the SQL script from scratch; instead, we can generate it using SSMS. First, we connect to the SSISDB catalog in our Azure-SSIS integration runtime (please refer to SSIS in Azure #2 for how to do that), select the SSIS project or package we want to execute, and click the Execute button to open the Execute Package dialog window.


In the Execute Package dialog window, we select the package to execute and also set a value (it can be any value) for the package parameter (the reason for setting a value is to ensure the parameter-setting statement is generated in the SQL script).


We then click the “Script” button at the top left, which generates the SQL script for executing the selected SSIS package.


We can then copy the script into the Stored Procedure activity defined in the RunSSISPackagePipeline pipeline.


As you may have noticed, the SSIS package parameter, DateScliceToLoad, has been manually set by us with a random value. The next step we need to take is to pass the date of the pipeline run into the SQL script so that the SSIS package only moves the data for the given day. Firstly, we need to define a “scheduledRunTime” parameter on the RunSSISPackagePipeline pipeline, which will receive the scheduled pipeline run time from the pipeline trigger we will create later.


In the SQL script we just generated to trigger the SSIS package execution, we replace the random date we manually set with an ADF expression that references the scheduledRunTime parameter of the pipeline.


Next, we need to create the trigger JSON file to schedule the execution of the pipeline. After we specify the pipeline to execute in the trigger file, we pass the scheduledTime of the trigger, which is a system variable, to the scheduledRunTime parameter defined in the pipeline.

{
  "properties": {
    "name": "RunSSISPackageScheduler",
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Hour",
        "interval": 1,
        "startTime": "2018-01-07T00:00:00-22:00"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "type": "PipelineReference",
          "referenceName": "RunSSISPackagePipeline"
        },
        "parameters": {
          "scheduledRunTime": "@trigger().scheduledTime"
        }
      }
    ]
  }
}

After we have created the JSON files defining the linked service, the pipeline and the trigger, we need to deploy them into our data factory instance. As the V2 version of ADF does not support deployment through the UI yet, we need to create a deployment script using PowerShell.

$DataFactoryName = "ninjago3843adf"
$ResourceGroupName = "SSIS"

Set-AzureRmDataFactoryV2LinkedService -DataFactoryName $DataFactoryName -ResourceGroupName $ResourceGroupName -Name "SSISDBLinkedService" -File ".\SSISDBLinkedService.json"
Set-AzureRmDataFactoryV2Pipeline -DataFactoryName $DataFactoryName -ResourceGroupName $ResourceGroupName -Name "RunSSISPackagePipeline" -DefinitionFile ".\RunSSISPackagePipeline.json"

Stop-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "RunSSISPackageScheduler"
Set-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "RunSSISPackageScheduler" -DefinitionFile ".\RunSSISPackageScheduler.json"
Start-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "RunSSISPackageScheduler"

After the pipeline is deployed and scheduled by the trigger, it will execute our SSIS package based on the schedule.
