Category: Python/R

Execute R Scripts from Azure Data Factory (V2) through Azure Batch Service

One requirement I have been working on recently is to run R scripts for some complex calculations in an ADF (V2) data processing pipeline. My first attempt was to run the R scripts using Azure Data Lake Analytics (ADLA) with the R extension. However, two limitations of the ADLA R extension stopped me from adopting this approach. Firstly, the ADLA R extension supports only one input dataframe and one output dataframe at the time of writing, while the requirement I am working with needs the R scripts to take multiple dataframes as input and to output multiple result dataframes. Secondly, the total size of the input and output datasets is limited to 500 MB.

My second attempt was to create an ADF custom activity to execute the R scripts on Azure Batch Service. This option turned out to be a flexible approach that is easy to implement and manage. This blog post highlights the key steps of the approach and also mentions the lessons I have learnt.

Step 1 – Preparing Azure Batch Service

Firstly, we need to add an Azure Batch pool to an Azure Batch service instance. If the Azure Batch service instance doesn't exist yet, a new instance needs to be provisioned. Please refer to the Microsoft official docs for the details on creating an Azure Batch service and pools.

While adding the Azure Batch pool, we need to specify the VM image to provision as the compute nodes in the pool. We can use the Data Science Virtual Machine image, which ships with most of the common R packages.


Step 2 – Creating a Container in Azure Blob Storage to Host the R Source Files and Input/Output Data

Create a Blob storage container in your Azure Storage account and then create an Input folder and an Output folder within the container root folder.

The R source files will be deployed into the container root folder. The input data files are the outputs of the upstream activities in the ADF pipeline and are copied into the Input folder. The R source files and the input data files will be submitted for execution in the Azure Batch service (by the Azure Batch custom activity created later in the ADF pipeline), and the results will be written into the Output folder in the Azure Blob storage.
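The resulting layout of the container is simple (these folder names are just the ones used in this post; nothing here is mandated by ADF or Azure Batch):

{container root}    <- R source files deployed here
    Input/          <- input data files copied in by the upstream ADF activities
    Output/         <- result files written back by the R scripts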

Step 3 – Authoring R Scripts to Communicate with the Azure Blob Storage

When the Azure Batch custom activity is triggered in the ADF pipeline, the R source files and the input data files will be copied into the working directory created for the submitted task on an Azure Batch compute node. The R scripts will load the data from the data files into dataframes, run the calculations and transformations, and finally write the results to output data files in the working directory. The output data files will then be written into the Output folder in the Azure Blob storage using the blob operation functions provided by the rAzureBatch package. There is a good sample from the doAzureParallel GitHub site on the blob operations with rAzureBatch.

Basically, we first need to create an rAzureBatch StorageServiceClient with the Azure Storage account credentials.

storageCredentials <- rAzureBatch::SharedKeyCredentials$new(
    name = "{name of the Azure Storage account}",
    key = "{access key of the Azure Storage account}"
)

storageClient <- rAzureBatch::StorageServiceClient$new(
    authentication = storageCredentials,
    url = "{url of the Azure Blob storage}"
)

Then, we need to create a SAS token with write permission on the Output folder:

# "w" requests write permission; "c" indicates the token is scoped to a container
writeSasToken <- storageClient$generateSasToken(permission = "w", "c", path = "{path of the Output folder}")

Lastly, we can save the output file into the Output folder in the Azure Blob storage with the uploadBlob function.

# containerName and storageAccountName hold the names of the container and
# the Azure Storage account created earlier
response <- storageClient$blobOperations$uploadBlob(
    containerName,
    fileDirectory = "{output file name}",
    sasToken = writeSasToken,
    accountName = storageAccountName
)

# 201 (Created) indicates the blob has been uploaded successfully
if (response[['status_code']] != 201) {
    stop('Failed to save the output file.')
}

It is important to explicitly check the response status_code and throw an error when the save action fails. Otherwise, the ADF pipeline will not be able to capture the error; instead, it will treat the custom activity as having run successfully and move on to the downstream activities.

Step 4 – Setting up the Azure Batch Custom Activity in the ADF Pipeline

After the Azure Batch service part of the work is done, we need to add and configure the Azure Batch custom activity in the ADF pipeline. This step is pretty straightforward; please refer to the Microsoft official docs for more details. The only parts to note are the “Command” and “Folder path” settings of the Azure Batch custom activity: the “Command” should be “RScript {the entry R file you want to run}”, and the “Folder path” should be the container we created earlier to host the R source files.


Those are the four main steps to set up the execution of R scripts in ADF pipelines.

Below are a few tips that might be helpful:

If you need to install additional R packages for your scripts, specify the lib path with the environment variable AZ_BATCH_TASK_WORKING_DIR in order to install the packages into the working directory of the current task. Please refer to my previous blog post for further explanation.

 install.packages("tidyselect", lib=Sys.getenv("AZ_BATCH_TASK_WORKING_DIR"))

If the Azure Batch custom activity throws a UserError with the message “Hit unexpected exception and execution failed.”, you can find the detailed error message in the stderr.txt file in the working directory of the failed task.


To access the working directory, you need to go to the Azure Batch account –> Jobs –> select the job –> select the failed task, and select “Files on node”.


Then you should be able to see all the files in the working directory of that task, including the stderr.txt error output file.


 


A Tip for Installing R Packages on Azure Batch

In a project I have been working on recently, I needed to execute R scripts in Azure Batch. The compute nodes of the Azure Batch pool were provisioned with the Data Science Virtual Machine image, which already includes the common R packages. However, some packages required by the R scripts, such as tidyr and rAzureBatch, were missing and needed to be installed.

My first attempt, installing the package by running install.packages("tidyselect"), failed. The error message revealed that I didn't have write permission to install the package. It actually makes sense that Azure Batch doesn't allow jobs submitted by consumers to change the global environment of the compute nodes. Therefore, the R packages can only be installed somewhere on the compute nodes where the submitted jobs have write permission.

After a bit of research, it turns out that Azure Batch organises the submitted jobs in the directory structure described below.

Azure Batch exposes a portion of the file system on a compute node to the submitted job. A file directory is created for each job and for each task associated with the job. Within each task directory, a working directory, wd, is created which provides read/write access to the task. As the task has full permission to create, update and delete content within this directory, this is the place where we can install the R packages. Azure Batch provides an environment variable, AZ_BATCH_TASK_WORKING_DIR, which holds the path of the working directory of the current task.
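The relevant part of a node's file system looks roughly like this (a coarse sketch based on the Azure Batch docs; names in braces are placeholders):

{node root}/
    workitems/
        {job directory}/
            {task directory}/
                stdout.txt, stderr.txt    <- task log files
                wd/                       <- the task working directory (read/write);
                                             its path is exposed as AZ_BATCH_TASK_WORKING_DIR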

With the knowledge of the task working directory, we can install the R package by specifying it as the installation target directory.

install.packages("tidyselect", lib=Sys.getenv("AZ_BATCH_TASK_WORKING_DIR"))

When attaching the installed package, we need to explicitly specify the directory as well:

library(tidyselect, lib.loc=Sys.getenv("AZ_BATCH_TASK_WORKING_DIR"))

Why Bother to Use Pandas “Categorical” Type in Python

When we process data using the Pandas library in Python, we normally convert categorical variables from the string type to the Categorical data type offered by Pandas. Why do we bother to do that, considering the output results are exactly the same whether we use the Pandas Categorical type or the string type? To answer this question, let's first run a simple test.

In this test, we create a data frame with two columns, “Category” and “Value”, and generate 50 million rows in the data frame. The values of the “Category” column are drawn from a list of six predefined categories, [‘category1’, ‘category2’ … ‘category6’], and the values of the “Value” column are drawn from the integers 0 to 9. A sketch of the setup is shown below.
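The original post showed the code as screenshots; a minimal reconstruction of the setup looks like this (the exact way the test data was generated is an assumption, but it matches the description above):

import numpy as np
import pandas as pd

n = 50_000_000
categories = ['category%d' % i for i in range(1, 7)]

df = pd.DataFrame({
    'Category': np.random.choice(categories, n),  # plain string (object) column
    'Value': np.random.randint(0, 10, n)
})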


We first time the execution of a group by operation against the “Category” column while it is still in the string type, and also observe the memory usage of the column.
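Continuing the sketch above (the timing and memory calls are standard Python/Pandas APIs; the exact numbers will vary by machine):

import time

start = time.time()
df.groupby('Category')['Value'].sum()
print('group by on the string column: %.2fs' % (time.time() - start))

# memory footprint of the string column, in bytes
print(df['Category'].memory_usage(deep=True))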


We then convert the “Category” column to the Pandas Categorical type, repeat the same group by operation, and observe the memory usage of the “Category” column in the Categorical type.
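The conversion itself is a single astype call, after which the same measurements can be repeated:

df['Category'] = df['Category'].astype('category')

start = time.time()
df.groupby('Category')['Value'].sum()
print('group by on the categorical column: %.2fs' % (time.time() - start))

# memory footprint of the categorical column, in bytes
print(df['Category'].memory_usage(deep=True))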


The result reveals roughly a five-fold improvement in running speed and an eight-fold reduction in memory usage after converting the “Category” column to the Pandas Categorical data type.


This test result answers our original question: the reason to use the Pandas Categorical data type is the optimised memory usage and the improved data processing speed. Then why does the Categorical data type work such magic? The answer is pretty simple: dictionary encoding.

If we open the source code of the Pandas Categorical class, we can see this class contains two properties, “categories” and “codes”.


If we print the two properties for the “Category” column used in our test, we can see that the “categories” property stores the dictionary of the six categories available for the column, while the actual category of each row in the data frame is stored in the “codes” property as an integer that points to the position of the corresponding category in the “categories” property.
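On a Series, the two properties can be inspected through the .cat accessor:

print(df['Category'].cat.categories)  # the dictionary of the six category values
print(df['Category'].cat.codes)       # one small integer code per row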


In this way, the Pandas Categorical data type takes much less memory to store the category information as integers than it would to store the original strings. Query operations on the category column scan less memory, and therefore the query time is shortened.

Dictionary encoding is a common technique used for data compression. For example, Azure Analysis Services and Power BI also use dictionary encoding in their VertiPaq engine to compress data, reducing memory usage and increasing query speed.

[Figure illustrating dictionary encoding, from the book “Definitive Guide to DAX” authored by Alberto Ferrari and Marco Russo]

 

R Visual – Create Gartner Magic Quadrant-Like Charts in Power BI using ggplot2

In this blog post, I am going to create an R visual that renders Gartner magic quadrant-like charts in Power BI using the ggplot2 package.


A dummy dataset will be created with three columns: the “Company” column holding the names of the companies to be ranked in the quadrant chart, and the “ExcutionScore” and “VisionScore” columns corresponding to the “Ability to Execute” and “Completeness of Vision” metrics in the Gartner magic quadrant assessment. In the dummy dataset, “ExcutionScore” and “VisionScore” are scaled from 0 to 100.
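The original dataset was shown as a screenshot; for testing outside Power BI, a stand-in can be created along these lines (the company names and scores here are made up):

dataset <- data.frame(
  Company = c("Company A", "Company B", "Company C", "Company D"),
  ExcutionScore = c(75, 60, 35, 25),
  VisionScore = c(70, 30, 80, 20),
  stringsAsFactors = FALSE
)

Inside the Power BI R visual itself, a data frame named dataset is provided automatically from the fields added to the visual.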


We drag an R visual onto the Power BI editor canvas and add the three columns from the dummy dataset. We can bind the RStudio IDE to Power BI and use it to author and test the R scripts.

In the R script editor, we first reference the “ggplot2” library and the “grid” library. The “grid” library is used to draw custom annotations outside of the main ggplot2 panel.

library(ggplot2)
library(grid)

We then create a ggplot2 object using the dataset referenced in the R visual, assigning the “VisionScore” value to the x-axis and the “ExcutionScore” value to the y-axis.

p <- ggplot(dataset, aes(VisionScore, ExcutionScore))
p <- p + scale_x_continuous(expand = c(0, 0), limits = c(0, 100)) 
p <- p + scale_y_continuous(expand = c(0, 0), limits = c(0, 100))


We now have our base panel and we can start our journey to build the Gartner Magic Quadrant-Like chart.

First of all, we set the x-axis label to “COMPLETENESS OF VISION” and the y-axis label to “ABILITY TO EXECUTE”, and align them to the left. We then remove the axis ticks and text from the plot. We will also add a title to the top of the plot.

p <- p + labs(x="COMPLETENESS OF VISION",y="ABILITY TO EXECUTE")
p <- p + theme(axis.title.x = element_text(hjust = 0, vjust=4, colour="darkgrey",size=10,face="bold"))
p <- p + theme(axis.title.y = element_text(hjust = 0, vjust=0, colour="darkgrey",size=10,face="bold"))

p <- p + theme(
          axis.ticks.x=element_blank(), 
          axis.text.x=element_blank(),
          axis.ticks.y=element_blank(),
          axis.text.y=element_blank()
        )

p <- p + ggtitle("Gartner Magic Quadrant - Created for Power BI using ggplot2") 

Those steps give us an empty panel with the axis labels and the title in place.


We then add four rectangle annotations to fill the four quadrant areas following the Gartner magic quadrant colour scheme. We also need to create a border and split lines for the quadrant chart.

p <- p +
      annotate("rect", xmin = 50, xmax = 100, ymin = 50, ymax = 100, fill= "#F8F9F9")  + 
      annotate("rect", xmin = 0, xmax = 50, ymin = 0, ymax = 50 , fill= "#F8F9F9") + 
      annotate("rect", xmin = 50, xmax = 100, ymin = 0, ymax = 50, fill= "white") + 
      annotate("rect", xmin = 0, xmax = 50, ymin = 50, ymax = 100, fill= "white")

p <- p + theme(panel.border = element_rect(colour = "lightgrey", fill=NA, size=4))
p <- p + geom_hline(yintercept=50, color = "lightgrey", size=1.5)
p <- p + geom_vline(xintercept=50, color = "lightgrey", size=1.5)


We also need to add a label to each quadrant area:

p <- p + geom_label(aes(x = 25, y = 97, label = "CHALLENGERS"), 
                    label.padding = unit(2, "mm"),  fill = "lightgrey", color="white")
p <- p + geom_label(aes(x = 75, y = 97, label = "LEADERS"), 
                    label.padding = unit(2, "mm"), fill = "lightgrey", color="white")
p <- p + geom_label(aes(x = 25, y = 3, label = "NICHE PLAYERS"), 
                    label.padding = unit(2, "mm"),  fill = "lightgrey", color="white")
p <- p + geom_label(aes(x = 75, y = 3, label = "VISIONARIES"), 
                    label.padding = unit(2, "mm"), fill = "lightgrey", color="white")


Up to this point, our chart starts to look like the Gartner magic quadrant. Next, we need to draw the company points on the chart at the positions corresponding to their “Ability to Execute” and “Completeness of Vision” values.

p <- p + geom_point(colour = "#2896BA", size = 5) 
p <- p  + geom_text(aes(label=Company),colour="#2896BA", hjust=-0.3, vjust=0.25, size=3.2)


Our quadrant chart is nearly done; just one part is missing: the arrows next to the “Ability to Execute” and “Completeness of Vision” text labels.


As the arrows need to be located outside of the main panel, we need to create custom annotations (annotation_custom) with linesGrob to draw a straight line with an arrow at the far end of the line. To make the arrows visible outside of the main panel, we need to turn off the clip attribute of the main panel.

p <- p + annotation_custom(
            grob = linesGrob(arrow=arrow(type="open", ends="last", length=unit(2,"mm")), 
                   gp=gpar(col="lightgrey", lwd=4)), 
            xmin = -2, xmax = -2, ymin = 25, ymax = 40
          )
p <- p + annotation_custom(
  grob = linesGrob(arrow=arrow(type="open", ends="last", length=unit(2,"mm")), 
                   gp=gpar(col="lightgrey", lwd=4)), 
  xmin = 28, xmax = 43, ymin = -3, ymax = -3
)

gt = ggplot_gtable(ggplot_build(p))
gt$layout$clip[gt$layout$name=="panel"] = "off"
grid.draw(gt)

We now have our completed quadrant chart.


You can find the complete source code here.

Please find the pbix file here.

R Visual – from Grid-Facet to Geo-Facet in Power BI


In one of my previous blog posts, I used the facet_wrap function in the ggplot2 package to build a grid facet displaying the rank history of each Eurovision competition country.


The grid facet looks pretty neat as all sub-panels are perfectly aligned; however, it fails to display the geospatial information of the countries, which may reveal some useful insights. For example, in my last blog post, I built a voting network chart of the Eurovision competition that revealed the mutually high voting scores between some neighbouring countries.

There is an R package, geofacet, which comes with a list of pre-built geospatial grids for a number of geographical areas, countries and states. One of the pre-built grids covers Europe, which is perfect for our Eurovision example.

It is very straightforward to use the geofacet package. After referencing the package in our R script, all we need to do is replace the facet_wrap function in our ggplot2 code with the facet_geo function provided by the geofacet package. We need to specify the column by which the facet is divided and the name of the pre-built grid to use. In this example, we use “eu_grid1”, the grid for the Europe area.
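The original code was shown as a screenshot; the key change is a one-line swap (assuming the facet column in the dataset is named Country):

library(geofacet)

# replace facet_wrap(~Country) in the original ggplot2 code with:
p <- p + facet_geo(~Country, grid = "eu_grid1")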


That is all the work needed to convert our standard grid facet to a geospatial facet. You can download the pbix file here.


Apart from the Europe area grid, you can find a list of other pre-built grids here. Considering where I am living at the moment, another pre-built grid I am particularly interested in is the London Boroughs grid, which I have used to create a geo-facet chart visualising the unemployment rate in the London boroughs.

You can also create your own grid, which is literally a data frame with four columns: the name and code columns that map to the facet label column in the dataset, and the row and col columns that specify the grid locations.

This is a test grid I have created to demonstrate how to create a custom grid:

customGrid <- data.frame(
  name = c("Enfield", "Haringey", "Islington", "Hackney", "Camden", "Hackeny", "Redbridge", "Brent", "Ealing"),
  code = c("Enfield", "Haringey", "Islington", "Hackney", "Camden", "Hackeny", "Redbridge", "Brent", "Ealing"),
  row = c(1, 2, 3, 3, 3, 3, 3, 4, 5),
  col = c(3, 3, 5, 4, 1, 2, 3, 3, 3),
  stringsAsFactors = FALSE
)


R Visual – Build Eurovision Voting Network Chart in Power BI

I have been watching the Eurovision competition for several years. I personally think the voting results from Eurovision competitions can be a very good source for researching the relationships between European countries. In this blog post, I will create a social network R visual using the igraph package and use the visual to analyse the voting network of the Eurovision competition.


Firstly, we need to prepare our raw Eurovision dataset in the following format with three columns: “From country” (where the vote is from), “To country” (where the vote goes to), and “Avg Point” (the average points the “From country” has given to the “To country” over the years). You can prepare the data either using DAX (creating the “AvgPoint” measure) or Power Query (grouping by “From country” + “To country” and calculating the average points).


We add an R visual to the Power BI canvas and add the three columns to it. If you prefer to use another R IDE (e.g., RStudio) to edit the R scripts, you can bind your IDE to Power BI.

We will use the igraph R package to render the voting network chart. Firstly, we need to load the igraph library and then create an igraph graph from the dataset bound to the Power BI R visual. We then use the plot function to render the network chart, setting the style attributes of the vertices, edges, etc.

# use the igraph library
library(igraph)

# create a directed igraph graph from the dataset bound to the Power BI R visual
df.g <- graph.data.frame(d = dataset, directed = TRUE)

# define colors
comps <- components(df.g)$membership
colbar <- rainbow(max(comps)+1)
V(df.g)$color <- colbar[comps+1]

# render the network chart and set the style attributes of vertex, edge etc.
plot(df.g, 
     vertex.label = V(df.g)$name,
     layout=layout_with_fr, 
     vertex.size=12,
     vertex.label.dist=0, 
     vertex.label.color= "darkblue",
     vertex.shape = "circle",
     vertex.label.cex = 1,
     vertex.label.font = 2,
     edge.arrow.size=0.5,
     edge.curved=T,
     margin =-0.05
 )


After authoring and testing the R script in RStudio, we can add the script to the R visual in Power BI, where it will be able to interact with the other visuals on the same page.

Before we set any threshold on the average voting points, all voting paths between the countries are drawn on the network chart, which makes the chart unreadable.


However, when we set a higher threshold on the average voting points, so that only the voting paths over the threshold are shown, we can find some relationship patterns.


For example, we can see the mutual high votes between neighbouring countries, like Spain <-> Andorra, Romania <-> Moldova, and Greece <-> Cyprus.


Please find the pbix file here.