Category: Azure Data Platform

Configuration-Driven Azure Data Factory Pipelines

In this blog post, I will introduce two configuration-driven Azure Data Factory pipeline patterns I have used in previous projects: the Source-Sink pattern and the Key-Value pattern.

The Source-Sink pattern is primarily used for parameterising and configuring data movement activities, with the source and sink locations of each data movement configured in a database table. In Azure Data Factory, we can use the Copy activity to copy a data entity (a database table or a text-based file) from one location to another, and we need to create an ADF dataset for the source and another for the sink. This can easily turn into a nightmare when we have a large number of data entities to copy and have to create a source dataset and a sink dataset for each one. With the Source-Sink pattern, we only need to create one generic ADF dataset for each data source type and dynamically refer to the data entity to copy, with the source and sink settings configured in a separate configuration database.

The Key-Value pattern is used for configuring the settings of activities in Azure Data Factory, such as the notebook file path of a Databricks Notebook activity or the URL of a Logic App HTTP endpoint. Instead of hard-coding the settings in the ADF pipeline itself, the Key-Value pattern allows us to configure the activity settings in a database table, then load the settings and dynamically apply them to the ADF activities at runtime.

In the rest of this blog post, I will introduce the implementation of those two patterns in Azure Data Factory.

Source-Sink Pattern

To implement the Source-Sink pattern in Azure Data Factory, the following steps need to be followed:

  • Step 1 – create data movement source-sink configuration table
  • Step 2 – create generic source and sink datasets
  • Step 3 – use Lookup activity to fetch the source-sink settings of the data entities for copy from the configuration table
  • Step 4 – use ForEach activity to loop through the data entities for copy
  • Step 5 – add a generic Copy activity inside the ForEach activity

Step 1 – create data movement source-sink configuration table

Create a configuration database (an Azure SQL database is a good option), and create a table with columns to define the data movement category, the source location and the sink location.

ADF_04_001
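As a rough sketch, such a configuration table could look something like the following. The table and column names here are my own assumptions for illustration (only Sink_Item is referenced later in the pipeline expressions), not a prescribed schema:

CREATE TABLE dbo.DataMovementConfig (
    ConfigId         INT IDENTITY(1,1) PRIMARY KEY,
    CopyCategory     NVARCHAR(100) NOT NULL,   -- logical grouping of data entities to copy together
    Source_Container NVARCHAR(200) NULL,       -- source container or schema
    Source_Directory NVARCHAR(400) NULL,       -- source folder path
    Source_Item      NVARCHAR(200) NOT NULL,   -- source file or table name
    Sink_Container   NVARCHAR(200) NULL,       -- sink container or schema
    Sink_Directory   NVARCHAR(400) NULL,       -- sink folder path
    Sink_Item        NVARCHAR(200) NOT NULL,   -- sink file or table name
    IsEnabled        BIT NOT NULL DEFAULT 1    -- allows a copy to be switched off without deleting it
);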

Along with the configuration table, we need to create a stored procedure for the Lookup ADF activity to retrieve the source-sink settings of the data entities for a specified copy category.

ADF_04_020
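A minimal sketch of such a stored procedure, based on the hypothetical table above, could be as simple as:

CREATE PROCEDURE dbo.GetDataMovementConfig
    @CopyCategory NVARCHAR(100)
AS
BEGIN
    SET NOCOUNT ON;

    -- Return one row per data entity to copy for the requested category
    SELECT Source_Container, Source_Directory, Source_Item,
           Sink_Container, Sink_Directory, Sink_Item
    FROM   dbo.DataMovementConfig
    WHERE  CopyCategory = @CopyCategory
      AND  IsEnabled = 1;
END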

Step 2 – create generic source and sink datasets

Create generic source and sink datasets. To make the datasets generic, we need to create a number of dataset parameters to define the location of the table or file referred to by the dataset. Those parameters will take the values fetched from the configuration database table at runtime.

ADF_04_005

On the Connection tab of the generic dataset, we need to specify the table name or file container/path/name using the dataset parameters we just defined.

ADF_04_006
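For example, if the generic blob dataset defines parameters named Container, Directory and FileName (hypothetical names), the file path fields on the Connection tab simply reference them with dataset expressions:

@dataset().Container
@dataset().Directory
@dataset().FileName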

Step 3 – use Lookup activity to fetch the source-sink settings 

Add a Lookup activity and set it to use the stored procedure we created earlier to retrieve the source-sink settings of all the data entities to copy in a specified category.

ADF_04_019

ADF_04_007

The output of the Lookup activity will contain the source-sink settings of all the data entities in the specified category defined in the configuration database table.

ADF_04_017

Step 4 – use ForEach activity to loop through the data entities for copy

Add and chain a ForEach activity after the Lookup activity, and set the Items property of the ForEach activity to the output of the Lookup activity, @activity('Lookup Configuration').output.value, so that the ForEach activity loops through the source-sink settings of all the data entities fetched from the configuration database table.

ADF_04_009

Step 5 – add a generic Copy activity inside the ForEach activity

Inside the ForEach activity, add a Copy activity with its Source and Sink pointing to the generic source and sink datasets we created earlier, and pass in the corresponding source-sink settings of the current data entity item, such as @item().Sink_Item for the file name of the sink.

ADF_04_010
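For example, assuming the generic sink dataset uses the Container, Directory and FileName parameters sketched in Step 2 and the configuration table uses the Sink_* columns sketched in Step 1, the sink dataset parameters on the Copy activity can be set as:

Container : @item().Sink_Container
Directory : @item().Sink_Directory
FileName  : @item().Sink_Item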

At this point, the Source-Sink pattern is all set. When the pipeline runs, a copy operation is executed for each data entity defined in the configuration database table.

ADF_04_011

Key-Value Pattern

To implement the Key-Value pattern in Azure Data Factory, the following steps need to be followed:

  • Step 1 – create activity settings key-value configuration table
  • Step 2 – create stored procedure to transform the key-value table
  • Step 3 – use Lookup activity to retrieve activity settings from configuration table
  • Step 4 – apply settings on ADF activities

Step 1 – create activity settings key-value configuration table

Create a database table with columns defining the pipeline name, activity type, activity name, activity setting key and activity setting value.

ADF_04_012

Step 2 – create stored procedure to transform the key-value table 

The ADF Lookup activity does not work well with the key-value style of table, as downstream activities cannot easily reference an arbitrary number of setting rows. Therefore, we need to convert the key-value table into a single data row, with the activity setting keys as the column headers and the activity setting values as the column values.

ADF_04_013

We can use the T-SQL PIVOT operator to convert the key-value table into the single-row table. However, as the set of records in the key-value table is not fixed, we cannot hard-code the keys to pivot on. Instead, we need to fetch the keys from the key-value table first and author the PIVOT operation in dynamic SQL. The snapshot below shows the stored procedure I have created for converting the key-value table.

ADF_04_014
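A simplified sketch of such a stored procedure is shown below. It assumes a key-value table named dbo.ActivitySettings with PipelineName, SettingKey and SettingValue columns (my own names for illustration), and it uses STRING_AGG, which is available on Azure SQL Database; on older SQL Server versions the column list can be built with FOR XML PATH instead:

CREATE PROCEDURE dbo.GetActivitySettings
    @PipelineName NVARCHAR(200)
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @columns NVARCHAR(MAX), @sql NVARCHAR(MAX);

    -- Build the pivot column list from the keys currently present for this pipeline
    SELECT @columns = STRING_AGG(QUOTENAME(SettingKey), ',')
    FROM  (SELECT DISTINCT SettingKey
           FROM   dbo.ActivitySettings
           WHERE  PipelineName = @PipelineName) AS k;

    -- Pivot the key-value rows into a single row with one column per setting key
    SET @sql = N'
        SELECT ' + @columns + N'
        FROM  (SELECT SettingKey, SettingValue
               FROM   dbo.ActivitySettings
               WHERE  PipelineName = @PipelineName) AS src
        PIVOT (MAX(SettingValue) FOR SettingKey IN (' + @columns + N')) AS p;';

    EXEC sp_executesql @sql, N'@PipelineName NVARCHAR(200)', @PipelineName;
END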

Step 3 – use Lookup activity to retrieve activity settings from configuration table

In the Azure Data Factory pipelines you are creating, at the start of the activity chain, add a Lookup activity and use the stored procedure created earlier to retrieve the key-value table and convert it into the single-row format readable by the Lookup activity. On the Lookup activity, tick the 'First row only' option on, as the result of the Lookup activity will be a single-row dataset.

ADF_04_015

The snapshot below shows what the output of the Lookup activity looks like.

ADF_04_016

Step 4 – apply settings on ADF activities

From the output of the Lookup activity, the other ADF activities chained after it can access the value of an activity setting as @activity('{lookup activity name}').output.firstRow.{column name for the key}.

ADF_04_018

Handling Embarrassingly Parallel Workload with PySpark Pandas UDF

Introduction

In the previous post, I walked through the approach to handle embarrassingly parallel workload with Databricks notebook workflows. However, as all the parallel tasks run on a single node (the cluster driver), that approach can only scale up to a certain point, depending on the capability of the driver VM, and cannot spread the workload across multiple worker nodes. In this blog post, I will walk through another approach to handle embarrassingly parallel workload with PySpark Pandas UDFs.

Embarrassingly parallel workload refers to the type of problem that is easy to separate into parallel tasks with no need for communication between those tasks. An embarrassingly parallel workload normally calculates similar things many times with different groups of data or different sets of parameters. The calculations are independent of each other, and each calculation takes a fair amount of time. The datasets involved can be big, but in most use cases they are not at the "Big Data" scale.

The native parallelism mechanism of Apache Spark might not be an efficient way to run embarrassingly parallel workload due to the overhead of serialization and inter-process communication. In addition, many libraries commonly used in embarrassingly parallel use cases, such as numpy and scikit-learn, operate on local data and are not natively distributed by PySpark. However, Pandas UDFs in PySpark can be a feasible way to handle embarrassingly parallel workload.

Approach

Unlike PySpark UDFs, which operate row-at-a-time, grouped map Pandas UDFs follow the split-apply-combine pattern: a Spark dataframe is split into groups based on the conditions specified in the groupBy operator, a user-defined Pandas function is applied to each group, and the results from all groups are combined into a new Spark dataframe. Embarrassingly parallel workload fits this pattern well. The calculation is encapsulated in a Pandas UDF, the dataset involved is loaded into a PySpark dataframe and split into groups, and the calculation on each group of data is executed by the Pandas UDF in Spark tasks running on separate executors in parallel.
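A minimal, self-contained sketch of the pattern is shown below (Spark 2.3+ syntax; on Spark 3.x the equivalent groupBy().applyInPandas() can be used instead). The input rows and the body of the UDF are placeholders rather than the actual optimisation algorithm:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

# Hypothetical input dataframe: one row per traveller, keyed by travel group id
df = spark.createDataFrame(
    [("Group-1", "Seymour", "BOS"), ("Group-1", "Franny", "DAL"),
     ("Group-2", "Zooey", "CAK"), ("Group-2", "Walt", "MIA")],
    ["group_id", "traveller", "origin"])

# Grouped map Pandas UDF: each call receives the full pandas dataframe for one
# travel group and returns a pandas dataframe holding the per-group results
@pandas_udf("group_id string, traveller string, origin string, plan string",
            PandasUDFType.GROUPED_MAP)
def plan_group_travelling(pdf):
    pdf = pdf.copy()
    pdf["plan"] = "optimised itinerary for " + pdf["group_id"]  # placeholder calculation
    return pdf

# Split by group id, apply the UDF to each group on separate executors, combine the results
result = df.groupBy("group_id").apply(plan_group_travelling)
result.show(truncate=False)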

This blog post demonstrates the Pandas UDF approach with the same example I used in the previous blog post to explain the notebook-workflow approach to handling embarrassingly parallel workload.

The sample notebook I have created for this blog post can be found here in my Github repository. The snapshot below shows the core part of the notebook:

databricks_2_002

A Pandas UDF, "planGroupTravelling" (lines 6-12), is created to execute the group travel planning optimisation algorithm from Toby Segaran's "Collective Intelligence" book for one travel group.

The dataset of all travel groups used in this example is converted into a Spark dataframe where each row contains the travel group id, traveller name and travel origin (line 14). For this example, a utility function, "travelGroupsToDataframe", is created to convert the original dataset from Python dictionary format to a Spark dataframe (the code for the "travelGroupsToDataframe" function can be found in the sample notebook for this blog post here). The snapshot below shows the converted Spark dataframe, i.e. the input dataframe for the parallel workload.

databricks_2_004

The dataframe is split by travel group id and the "planGroupTravelling" Pandas UDF is applied to each group (line 15). The results from each UDF call, the optimised travel arrangement for each traveller, are combined into a new Spark dataframe.

databricks_2_003

Conclusion

This blog post describes another approach to handling embarrassingly parallel workload, using PySpark Pandas UDFs. Grouped map Pandas UDFs follow the split-apply-combine pattern, which distributes calculation logic written in native Python to separate executors and runs it in parallel. This pattern fits embarrassingly parallel workload well.

Handling Embarrassingly Parallel Workload with Databricks Notebook Workflows

Introduction

Embarrassingly parallel refers to the class of problems where little or no effort is needed to separate the problem into parallel tasks, and no communication is needed between those parallel tasks. Embarrassingly parallel problems are very common, with typical examples including group-by analyses, simulations, optimisations, cross-validations and feature selections. Normally, an embarrassingly parallel workload has the following characteristics:

  • Calculates similar things many times with different groups of data or different parameter sets
  • The calculations are independent of each other
  • Each calculation takes a fair amount of time
  • The datasets involved in the calculation can be big, but in most use cases they are not at the "Big Data" scale

The native parallelism mechanism of Apache Spark might not be an efficient way to run embarrassingly parallel workload due to the overhead of serialization and inter-process communication, especially when the dataset is not big but the calculation is complex. Based on benchmark tests conducted by Dorian Beganovic, Spark performed much more poorly on small or medium-sized datasets than Pandas running on a single machine. Dorian suggested >500GB as the mark at which Spark becomes the suitable choice. In addition, many libraries commonly used in embarrassingly parallel use cases, such as numpy and scikit-learn, operate on local data and are not natively distributed by PySpark.

One alternative solution is to take advantage of Databricks notebook workflows to handle embarrassingly parallel workloads. Basically, one generic Databricks notebook is created to run the calculation for one parallel task against one partition of the entire dataset or one set of parameters. The partition of the dataset, or the set of parameters, is specified by notebook parameters. In this blog post, these generic Databricks notebooks that run the calculation logic in parallel are referred to as Parallel Notebooks. The Parallel Notebooks are triggered by another Databricks notebook, referred to as the Master Notebook in this blog post. The Master Notebook encapsulates the logic for partitioning the dataset or creating the parameter sets, and launches the Parallel Notebooks to execute the calculation logic against the partitioned dataset or parameter sets in parallel.

Approach

This blog post walks through the detailed steps to handle embarrassingly parallel workloads using Databricks notebook workflows. You can find the sample Databricks notebooks I created for this blog post here.

Baseline Test

We first run a baseline test executing the calculation workload in sequence, and then create notebook workflows to run the same workload in parallel and compare the total execution time.

For the calculation task, I will use the group travel planning optimisation example from Toby Segaran's "Collective Intelligence" book. You can find the Python code encapsulating the optimisation algorithm here. For this baseline test, I have created eight travel groups and stored them in a dictionary, with the dictionary key as the group ID and the dictionary value as the travel group data. A for loop is created to loop through each travel group and run the optimisation algorithm for each group one by one. The result shows that it took 4.99 minutes to run through all eight groups.

ADB_1_001
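A stripped-down sketch of the baseline looks like the following; travel_groups and optimise_group_travel are stand-ins for the notebook's actual data and optimisation code:

import time

# Stand-ins (assumptions): eight travel groups and a dummy optimisation function
travel_groups = {"Group-{}".format(i): ["...travel group data..."] for i in range(1, 9)}
def optimise_group_travel(group):
    time.sleep(1)  # placeholder for the real optimisation run

start = time.time()
for group_id, group in travel_groups.items():  # process the groups one by one
    optimise_group_travel(group)
print("Sequential run took {:.2f} minutes".format((time.time() - start) / 60))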

After we have the execution time for the sequential workload, we can start to handle the workload in parallel with Databricks notebook workflows to see whether the total execution time can be reduced with the parallel approach.

Parallel Notebook

We first create the parallel notebook and include the optimisation algorithm code snippet from Toby Segaran's "Collective Intelligence" book, which will be used to find the optimised travel plan for a travel group.

We then create a “groupIDs” parameter which takes the ids of the travel groups to process in this parallel task.

ADB_1_009
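A sketch of how the parallel notebook could declare and consume that parameter is shown below; dbutils.widgets is the standard Databricks notebook parameter mechanism, while travel_groups and optimise_group_travel are the same stand-ins as above:

# Declare the notebook parameter and read the comma-separated list of group ids
dbutils.widgets.text("groupIDs", "")
group_ids = [g for g in dbutils.widgets.get("groupIDs").split(",") if g]

# Run the optimisation only for the travel groups assigned to this parallel task
for group_id in group_ids:
    optimise_group_travel(travel_groups[group_id])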

The value of the "groupIDs" parameter is used to retrieve the data of the specified travel groups. In this example, for simplicity, I just used a Python dictionary to store the data. In real-world scenarios, the data is more likely to be stored externally, such as in a database or a data lake. In the generic notebook, the optimisation algorithm is only run on the travel groups specified by the parameter.

ADB_1_004

Master Notebook

We then create the Master notebook to partition the travel groups and launch the parallel notebooks. I have written a GroupTravelPlanning class which encapsulates the travel group partition function and the function that launches multiple notebooks in parallel using Python's built-in multiprocessing library.

##################################################################################
# The master class that partitions the travel groups and triggers the sub-notebook
# execution for each partition in parallel
##################################################################################
class GroupTravelPlanning:
  
  def __init__(self, partitionNum, travelGroups):
    self.partitionNum = partitionNum
    self.travelGroups = travelGroups
    self.groupPartitionList = self.PartitionTravelGroups()
  
  #Partition travel groups
  def PartitionTravelGroups(self):

    #Initialise a dictionary for hosting the partitions of travel groups
    partitions={}
    for partitionID in range(0, self.partitionNum):
      partitions[partitionID]=[]
      
    #Partition the travel groups 
    for index, groupID in enumerate(self.travelGroups.keys(), start=0):
      partitionID = index % self.partitionNum
      partitions[partitionID].append(groupID)
  
    #Convert the travel groups partition dictionary into a list of travel
    #group strings, such as ['Group-1,Group-2', 'Group-3,Group-4',...]
    groupPartitionList = []
    for group in partitions.values():
      groupPartition = ','.join(group)
      groupPartitionList.append(groupPartition)
    return groupPartitionList
  
  #Run group travel planning sub-notebooks in parallel
  def RunGroupTravelPlanningNotebooks(self):
    from multiprocessing.pool import ThreadPool
    pool = ThreadPool(self.partitionNum)
    
    pool.map(
       lambda groupIDs: dbutils.notebook.run(
          '/Users/malinxiao@hotmail.com/Group Travel/GroupTravelPlanning_Parallel',
           timeout_seconds=0,
           arguments = {"groupIDs":groupIDs}),
       self.groupPartitionList)

You can find more details on using the Python multiprocessing library for concurrent Databricks notebook workflows in this doc. Basically, a thread pool is created with the number of threads equal to the number of partitions of travel groups to process, so that all the partitions can be processed at the same time with no waiting queue.

ADB_1_007

Apart from the GroupTravelPlanning class, we also need to add the code to create an instance of the GroupTravelPlanning class in the Master notebook and trigger the main method.

ADB_1_011
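The driver cell amounts to something like the following sketch, where four partitions for the eight travel groups matches the test run below and travel_groups is the dictionary of travel group data loaded earlier:

# Partition the travel groups into four partitions and launch one parallel notebook per partition
planner = GroupTravelPlanning(partitionNum=4, travelGroups=travel_groups)
planner.RunGroupTravelPlanningNotebooks()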

After the Master notebook starts to run, the sub notebooks (created from the generic notebook template, with the partitioned travel group ids passed as the parameter) are launched.

ADB_1_002

The result shows that running through the eight test travel groups, split into four partitions processed in parallel, took 1.63 minutes, compared to 4.99 minutes for running them in sequence.

ADB_1_001

Limitations

This blog post walks through an approach to handling embarrassingly parallel workload with Databricks notebook workflows. This approach is logically simple to understand, and the parallelism is easy to control with the built-in Python multiprocessing package. However, the limitation of this approach is that all the parallel tasks run on a single machine (the cluster driver) instead of being distributed to worker nodes. That means the solution can only scale up to a certain point, although the largest VM supported by Databricks at the time this blog post was written reaches 80 cores/640 GB (Standard_L80s_V2). To support embarrassingly parallel workload across multiple nodes, an alternative approach is to use Pandas UDFs with PySpark. I will introduce the Pandas UDF approach in the next blog post.

Execute R Scripts from Azure Data Factory (V2) through Azure Batch Service

Introduction

One requirement I have been working with recently is to run R scripts for some complex calculations in an ADF (V2) data processing pipeline. My first attempt was to run the R scripts using Azure Data Lake Analytics (ADLA) with the R extension. However, two limitations of the ADLA R extension stopped me from adopting this approach. Firstly, the ADLA R extension supports only one input dataframe and one output dataframe at the time of writing, whereas the requirement I am working with needs the R scripts to take multiple dataframes as input and output multiple result dataframes. Secondly, the total size of the input and output is limited to 500 MB.

Another attempt I made was to create an ADF custom activity to execute the R scripts on Azure Batch Service. This option turned out to be a flexible approach that is easy to implement and manage.

Approach

This blog post highlights the key steps to implement this approach and also mentions the lessons I have learnt.

Step 1 – Preparing Azure Batch Service

Firstly, we need to add an Azure Batch pool in an Azure Batch service instance. If the Azure Batch service instance doesn't exist yet, a new instance needs to be provisioned. Please refer to the Microsoft official docs for details on creating an Azure Batch service and pools.

While adding the Azure Batch pool, we need to specify the VM image to provision as the compute nodes in the pool. We can use the Data Science Virtual Machine image, which ships with most of the common R packages.

Capture1.PNG

Step 2 – Creating a Container in Azure Blob Storage to Host R Source Files and Input/Output Data

Create a Blob storage container in your Azure Storage account and then create an Input folder and an Output folder within the container root folder.

The R source files will be deployed into the container root folder. The input data files are the output of the upstream activities in the ADF pipeline and are pushed (copied) into the Input folder. The R source files and the input data files will be submitted for execution in the Azure Batch service (by the Azure Batch custom activity which will be created later in the ADF pipeline), and the output will be written into the Output folder in the Azure Blob storage container.

Step 3 – Authoring R Scripts to Communicate with the Azure Blob Storage

When the Azure Batch custom activity is triggered in the ADF pipeline, the R source files and the input data files are submitted into the working directory created for the submitted task on the Azure Batch compute nodes. The R scripts load the data from the data files into dataframes, run the calculations and transformations, and finally write the results to output data files stored in the working directory. The output data files are then written into the Output folder in the Azure Blob storage using the blob operation functions provided by the rAzureBatch package. Here is a good sample from the doAzureParallel Github site on the blob operations with rAzureBatch.

Basically, we first need to create an rAzureBatch StorageServiceClient with the Azure Storage account credentials.

storageCredentials <- rAzureBatch::SharedKeyCredentials$new(
    name = "{name of the Azure Storage account}",
    key = "{access key of the Azure storage account }"
)

storageClient <- rAzureBatch::StorageServiceClient$new(
   authentication = storageCredentials,
   url = "{url of the Azure Blob storage}"
)

Then, we need to create a SAS token with write permission on the Output folder.

writeSasToken <- storageClient$generateSasToken(permission = "w", "c", path = {the path of the Output folder})

Lastly, we can save the output file into the Output folder on Azure Blob storage with the uploadBlob function.

response <- storageClient$blobOperations$uploadBlob(
    containerName,
    fileDirectory = "{output file name}",
    sasToken = writeSasToken,
    accountName = storageAccountName)
if (response[['status_code']]!=201) {
    stop('Failed to save the output file.')
}

It is important to explicitly check the response status_code and throw an error when the save action fails. Otherwise, the ADF pipeline will not be able to capture the error; instead it will treat the custom activity as having run successfully and move on to the downstream activities.

Step 4 – Setup Azure Batch Custom Activity in ADF Pipeline

After the Azure Batch service part of the work is done, we need to add and configure the Azure Batch Custom Activity in the ADF pipeline. This step is pretty straightforward; please refer to the Microsoft official docs for more details. The only parts to note are the "Command" and "Folder path" settings of the Azure Batch Custom Activity. The "Command" should be "RScript {the entry R file you want to run}", and the "Folder path" should be the container we created earlier to host the R source files.


Those are the four main steps to set up the execution of R scripts in ADF pipelines.

Below are a few tips that might be helpful:

If you need to install additional R packages for your scripts, specify the lib path with the environment variable AZ_BATCH_TASK_WORKING_DIR in order to install the packages into the working directory of the current task. Please refer to my previous blog post for further explanation.

 install.packages("tidyselect", lib=Sys.getenv("AZ_BATCH_TASK_WORKING_DIR"))

If the Azure Batch Custom Activity throws a UserError with the message "Hit unexpected expection and execution failed.", you can find the detailed error message in the stderr.txt file in the working directory of the failed task.

Capture

To access the working directory, you need to go to the Azure Batch account –> Jobs –> select the job –> select the failed task, and then select "Files on node".

Capture01

Then you should be able to see all the files in the working directory of that task, including the stderr error output file.

Capture02.PNG

 

A Tip for Installing R Packages on Azure Batch

Problem

In one project I have been working on recently, I needed to execute R scripts on Azure Batch. The compute nodes of the Azure Batch pool were provisioned with Data Science Virtual Machines, which already include common R packages. However, some packages required by the R scripts, such as tidyr and rAzureBatch, were missing and needed to be installed.

My first attempt to install a package by running install.packages("tidyselect") failed. The error message revealed that I didn't have write permission to install packages. That actually makes sense: Azure Batch doesn't allow jobs submitted by consumers to change the global environment of its compute nodes. Therefore, R packages can only be installed somewhere on the compute nodes where the submitted jobs have write permission.

Solution

After a bit of research, it turns out that Azure Batch organises submitted jobs in the following directory structure.

Azure Batch exposes a portion of the file system on a compute node for the submitted job to access. One directory is created for each job and for each task associated with the job. Within each task directory, a working directory, wd, is created which gives the task read/write access. As the task has full permission to create, update and delete content within this directory, this is the place where we can install the R packages. Azure Batch provides an environment variable, AZ_BATCH_TASK_WORKING_DIR, which holds the path of the working directory of the current task.

With the knowledge of the task working directory, we can install the R package with the specified installation target directory.

install.packages("tidyselect", lib=Sys.getenv("AZ_BATCH_TASK_WORKING_DIR"))

When attaching the installed package, we need to explicitly specify the directory as well:

library(tidyselect, lib.loc=Sys.getenv("AZ_BATCH_TASK_WORKING_DIR"))
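An alternative, assuming the same environment variable, is to prepend the task working directory to the library search path once, so that subsequent install.packages and library calls pick it up without extra arguments:

.libPaths(c(Sys.getenv("AZ_BATCH_TASK_WORKING_DIR"), .libPaths()))
install.packages("tidyselect")   # installs into the first entry of .libPaths()
library(tidyselect)              # found via the amended search path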

Build a Power BI Knowledge Base Bot Using Microsoft Bot Framework and QnA Maker

The first question that popped into my head when I heard about the Microsoft Bot Framework was how to build a bot capable of taking over some of my responsibilities at work. Part of my Power BI consulting responsibilities is to answer all sorts of Power BI related questions from clients. It would be pretty cool to have a bot take over those responsibilities, on the basis that there is no risk to my job security :-).

After a quick bit of research, I found it is very easy to create a Power BI knowledge base bot using the Microsoft Bot Framework and the QnA Maker service. This blog post walks through the steps to create the bot.

02

There are four major steps to build a Power BI knowledge base bot:

  • Step 1 – Create QnA maker and train the knowledge base
  • Step 2 – Create a bot with the Bot Builder SDK and deploy to an Azure App Service
  • Step 3 – Register the created bot with Azure Bot Service
  • Step 4 – Connect the bot to channels

Step 1 – Create QnA maker and train the knowledge base

QnA Maker is one of the Microsoft Cognitive Services; it enables developers and business users to build, train and publish QnA-style bots based on FAQ URLs, structured documents or editorial content. You can log on to the QnA Maker service website at https://qnamaker.ai/, where you can create, manage, test and publish your QnA bot.

To create a QnA maker, you need to click the “Create new service” link on the top navigation bar on the QnA maker service website to launch the “Create a QnA service” form.

0317

You need to provide the name of the QnA maker and either specify the URLs of the knowledge base pages or upload the files that contain your questions and answers.

After the QnA maker is created, you can access it from the "My services" page. Selecting to edit your QnA maker brings you to the maker edit page, where you can edit, retrain, test and publish it.

 

04

On the knowledge base tab in the maker edit page, all the questions and answers from the web pages of the knowledge base you specified, or from the documents you manually uploaded, are extracted into a table, where you can delete the imported question/answer pairs or manually add and edit new ones.

After you have edited your QnA maker, you can save and retrain it using the green "Save and retrain" button at the top right of the maker edit page, and then test the QnA maker on the "Test" tab.

05

Once you are happy with your QnA maker, you can publish it as a service using the blue "Publish" button at the top right of the maker edit page. After the maker is successfully published, you will be provided with a sample HTTP request for building your bot. You need to note down the QnA knowledge base id and the subscription key (highlighted with red boxes in the snapshot below), which will be used later when creating the bot.

06.PNG

Step 2 – Create a bot with the Bot Builder SDK and deploy to an Azure App Service

There are a couple of ways to create a bot. In this blog post, I will use the Bot Builder SDK for .NET to create the bot in Visual Studio 2017.

Before you can use Visual Studio 2017 to create a bot with the Microsoft Bot Framework, there are some prerequisite tasks you need to complete, such as installing the project and item templates. You can follow this doc for those prerequisite tasks: https://docs.microsoft.com/en-us/bot-framework/dotnet/bot-builder-dotnet-quickstart

After you have completed all the prerequisite tasks, you will see a new "Bot Application" project template in the "New Project" dialog.

07

After you create a project using the "Bot Application" template, you will have the predefined controller and dialog files, and the predefined configuration settings in the web.config file.

10

Open the web.config file and, under the appSettings section, add two keys for the QnA subscription key and knowledge base id that you noted down earlier when you published your QnA maker service.

08
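The two entries could look something like the snippet below; the key names are my own and just need to match whatever the dialog code reads:

<appSettings>
  <!-- the MicrosoftAppId / MicrosoftAppPassword keys generated by the template stay as they are -->
  <add key="QnASubscriptionKey" value="{your QnA maker subscription key}" />
  <add key="QnAKnowledgebaseId" value="{your QnA knowledge base id}" />
</appSettings>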

Under the Dialogs folder in your project, create a new dialog file using the dialog item template and name it PowerBIQnADialog. Add a reference to Microsoft.Bot.Builder.CognitiveServices.QnAMaker and change the dialog class to inherit from QnAMakerDialog. Call the base constructor of the class and pass over the QnA maker attribute data, including the QnA subscription key and QnA knowledge base id (specified in the web.config file), the bot message text to return when no answer matches the asked question, and the matching threshold for deciding whether an answer matches the question.

09
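A minimal sketch of the dialog class, assuming the Bot Builder v3 QnAMaker dialog package and the web.config key names used above (the no-answer message and the 0.5 threshold are illustrative values):

using System;
using System.Configuration;
using Microsoft.Bot.Builder.CognitiveServices.QnAMaker;

namespace PowerBIQnABot.Dialogs
{
    [Serializable]
    public class PowerBIQnADialog : QnAMakerDialog
    {
        public PowerBIQnADialog() : base(new QnAMakerService(new QnAMakerAttribute(
            ConfigurationManager.AppSettings["QnASubscriptionKey"],
            ConfigurationManager.AppSettings["QnAKnowledgebaseId"],
            "Sorry, I could not find an answer for that.",  // reply when no answer matches
            0.5)))                                          // minimum matching score
        { }
    }
}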

After the dialog file is created, open the existing MessagesController.cs file. In the "Post" method, change Dialogs.RootDialog() to Dialogs.PowerBIQnADialog(), as shown in the snapshot below.

11

After you have referenced the right dialog class in the MessagesController.cs file, you can publish the bot to an Azure App Service. You can create the App Service through the Azure Portal, or create it in Visual Studio when you select to publish the project.

12

Step 3 – Register the created bot with Azure Bot Service

After you have created a bot and deployed it to the Azure App Service, you can create an Azure bot service and register the bot with it.

Log on to the Azure Portal and create a new Bot Channels Registration service. Specify the general information such as name, subscription, location, etc. The most important attribute to specify is the "Messaging endpoint" of your created and deployed bot, which is in the format {url of the app service where your bot is deployed}/api/messages.

01

After the bot service has been created, you can change the messaging endpoint in the Settings tab if required.

14

Also, on the Settings tab, you can get the Microsoft App ID for the bot service and the password (you can click the "Manage" link to generate a new password; a password created previously is not visible if you have not noted it down before). You now need to update the web.config file in your bot's Visual Studio project to set the Microsoft App ID and the password in the MicrosoftAppId and MicrosoftAppPassword keys. After the change, you need to republish your bot to update the Azure App Service.

18

After your bot is updated in the Azure App Service, your bot service is ready to use. You can test it on the "Test in Web Chat" tab of your Bot Channels Registration service.

15

Step 4 – Connect the bot to channels

A bot channel is the connection between the Bot Framework and the communication apps that can talk to your bot. You can create channels on the "Channels" tab of your Bot Channels Registration service page. Quite a number of apps are supported by the Bot Framework, such as Skype, Microsoft Teams, Facebook Messenger, etc.

19.PNG

In this blog post, I created a Microsoft Teams channel to communicate with the Power BI QnA bot by simply clicking the Microsoft Teams icon on the channels page, and the channel is then up and running.

20

On the Chat tab in your Microsoft Teams app, start a new chat with the Microsoft App ID of your bot service.

21

And now you can start asking your bot Power BI related questions from the Microsoft Teams app.

02