
Execute R Scripts from Azure Data Factory (V2) through Azure Batch Service

Introduction

One requirement I have been working with recently is to run R scripts for some complex calculations in an ADF (V2) data processing pipeline. My first attempt was to run the R scripts using Azure Data Lake Analytics (ADLA) with the R extension. However, two limitations of the ADLA R extension stopped me from adopting this approach. Firstly, the ADLA R extension supports only one input dataframe and one output dataframe at the time of writing, whereas the requirement I am working with needs the R scripts to take multiple dataframes as input and output multiple result dataframes. Secondly, the total size of the input and output is limited to 500 MB.

My second attempt was to create an ADF custom activity to execute the R scripts in the Azure Batch service. This option turned out to be a flexible approach that is easy to implement and manage.

Approach

This blog post highlights the key steps to implement this approach and also mentions the lessons I have learnt.

Step 1 – Preparing Azure Batch Service

Firstly, we need to add an Azure Batch pool in an Azure Batch service instance. If the Azure Batch service instance doesn’t exist yet, a new instance needs to be provisioned. Please refer to the Microsoft official docs for the details on creating an Azure Batch service and pools.

While adding the Azure Batch pool, we need to specify the VM image to provision as the compute nodes in the pool. We can use the Data Science Virtual Machine image, which ships with most of the common R packages.

[Screenshot: adding an Azure Batch pool with the Data Science Virtual Machine image]

Step 2 – Creating a Container in Azure Blob Storage to Host R Source Files and Input/Output Data

Create a Blob storage container in your Azure Storage account and then create an Input folder and an Output folder within the container root folder.

The R source files will be deployed into the container root folder. The input data files are the output of the upstream process activities in the ADF pipeline and are pushed (copied) into the Input folder. The R source files and the input data files will be submitted for execution in the Azure Batch service (by the Azure Batch custom activity that will be created later in the ADF pipeline), and the output will be written into the Output folder in the Azure Blob storage.
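The container then looks roughly like this (main.R is a hypothetical entry-file name):

{container root}
    main.R        <- R source files
    Input/        <- input data files copied in by the upstream activities
    Output/       <- result files written back by the R script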

Step 3 – Authoring R Scripts to Communicate with the Azure Blob Storage

When the Azure Batch custom activity is triggered in the ADF pipeline, the R source files and the input data files will be submitted into the working directory created for the submitted task on the Azure Batch compute nodes. The R scripts will load the data files into dataframes, run the calculations and transformations, and finally write the results to output data files in the working directory. The output data files will then be written into the Output folder in the Azure Blob storage using the blob operation functions provided by the rAzureBatch package. There is a good sample on the blob operations with rAzureBatch on the doAzureParallel GitHub site.
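As a minimal sketch of the local part of the script (the file names here are hypothetical), we read the inputs from the task working directory, run the calculations, and write the result file locally before uploading it:

# locate the working directory Azure Batch created for this task
wd <- Sys.getenv("AZ_BATCH_TASK_WORKING_DIR")

# load the input data files (hypothetical names) into dataframes
input1 <- read.csv(file.path(wd, "input1.csv"))
input2 <- read.csv(file.path(wd, "input2.csv"))

# run the calculations and transformations
result <- merge(input1, input2)

# write the result locally; it is uploaded to Blob storage in the next step
write.csv(result, file.path(wd, "output.csv"), row.names = FALSE)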

Basically, we first need to create an rAzureBatch StorageServiceClient with the Azure Storage account credentials.

storageCredentials <- rAzureBatch::SharedKeyCredentials$new(
    name = "{name of the Azure Storage account}",
    key = "{access key of the Azure Storage account}"
)

storageClient <- rAzureBatch::StorageServiceClient$new(
   authentication = storageCredentials,
   url = "{url of the Azure Blob storage}"
)

Then, we need to create a SAS token with write permission on the Output folder:

writeSasToken <- storageClient$generateSasToken(permission = "w", "c", path = "{the path of the Output folder}")

Lastly, we can save the output file into the Output folder on Azure Blob storage with the uploadBlob function.

response <- storageClient$blobOperations$uploadBlob(
    containerName,
    fileDirectory = "{output file name}",
    sasToken = writeSasToken,
    accountName = storageAccountName)
if (response[['status_code']] != 201) {
    stop('Failed to save the output file.')
}

It is important to explicitly check the response status_code and throw an error when the save action fails. Otherwise, the ADF pipeline will not be able to capture the error; instead, it will treat the custom activity as having run successfully and move on to the downstream activities.

Step 4 – Setting Up the Azure Batch Custom Activity in the ADF Pipeline

After the Azure Batch service part of the work is done, we need to add and configure the Azure Batch Custom Activity in the ADF pipeline. This step is pretty straightforward; please refer to the Microsoft official docs for more details. The only parts that need noting are the “Command” and “Folder path” settings of the Azure Batch Custom Activity. The “Command” should be “RScript {the entry R file you want to run}” (for example, “RScript main.R” if the entry file were named main.R), and the “Folder path” should be the container we created earlier to host the R source files.

[Screenshot: the Azure Batch Custom Activity settings in the ADF pipeline]

Those are the four main steps to set up the execution of R scripts in ADF pipelines.

Below are a few tips that might be helpful:

If you need to install additional R packages for your scripts, specify the lib path with the environment variable AZ_BATCH_TASK_WORKING_DIR in order to install the packages into the working directory of the current task. Please refer to my previous blog post for further explanation.

install.packages("tidyselect", lib = Sys.getenv("AZ_BATCH_TASK_WORKING_DIR"))

If the Azure Batch Custom Activity throws a UserError with the message “Hit unexpected expection and execution failed.”, you can find the detailed error message in the stderr.txt file in the working directory of the failed task.

[Screenshot: the UserError raised by the Azure Batch Custom Activity]

To access the working directory, you need to go to the Azure Batch account –> Jobs –> select the job –> select the failed task, and select “Files on node”.

[Screenshot: the “Files on node” option for the failed task]

Then you should be able to see all the files in the working directory of that task, including the stderr error output file.

[Screenshot: the files in the task working directory, including stderr.txt]

 

The Tip for Installing R packages on Azure Batch

Problem

In one project I have recently been working on, I needed to execute R scripts in Azure Batch. The compute nodes of the Azure Batch pool were provisioned with Data Science Virtual Machines, which already include the common R packages. However, some packages required by the R scripts, such as tidyr and rAzureBatch, were missing and needed to be installed.

My first attempt, running install.packages("tidyselect"), failed. The error message revealed that I didn’t have write permission to install the package. It actually makes sense that Azure Batch doesn’t allow the jobs submitted by consumers to change the global environment of the compute nodes. Therefore, R packages can only be installed somewhere on the compute nodes where the submitted jobs have write permission.

Solution

After a bit of research, it turns out that Azure Batch organises the submitted jobs in the directory structure below.

[Figure: the Azure Batch job/task directory structure]

Azure Batch exposes a portion of the file system on a compute node for the submitted job to access. A file directory will be created for each job and for each task associated with the job. Within each task directory, a working directory, wd, will be created that provides read/write access to the task. As the task has full permission to create, update and delete content within this directory, this is the place where we can install the R packages. Azure Batch provides an environment variable, AZ_BATCH_TASK_WORKING_DIR, that holds the path of this directory for the current task.
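Roughly (simplified, with intermediate levels elided), the relevant part of the node’s file system looks like:

workitems/
    {job name}/
        ...
            {task name}/
                stdout.txt
                stderr.txt
                wd/        <- the task working directory (read/write for the task)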

With the knowledge of the task working directory, we can install the R package with the specified installation target directory.

install.packages("tidyselect", lib = Sys.getenv("AZ_BATCH_TASK_WORKING_DIR"))

When attaching the installed package, we need to explicitly specify the directory as well:

library(tidyselect, lib.loc = Sys.getenv("AZ_BATCH_TASK_WORKING_DIR"))
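Alternatively (a small sketch, not from the original workflow), you can prepend the task working directory to R’s library search path once, so that subsequent library() calls resolve without the extra argument:

# prepend the task working directory to the library search path
.libPaths(c(Sys.getenv("AZ_BATCH_TASK_WORKING_DIR"), .libPaths()))
library(tidyselect)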

R Visual – Create Gartner Magic Quadrant-Like Charts in Power BI using ggplot2

In this blog post, I am going to create an R visual that renders Gartner magic quadrant-like charts in Power BI using the ggplot2 package.

[Figure: the finished Gartner magic quadrant-like chart]

A dummy dataset will be created with three columns: the “Company” column holding the names of the companies to be ranked in the quadrant chart, and the “ExcutionScore” and “VisionScore” columns corresponding to the “Ability to Execute” and “Completeness of Vision” metrics in the Gartner magic quadrant assessment. In the dummy dataset, “ExcutionScore” and “VisionScore” are scaled from 0 to 100.

[Figure: the dummy dataset]
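For testing outside Power BI (e.g. in RStudio), a stand-in for the dataset that Power BI binds to the R visual could look like this (the companies and scores are made up):

# dummy stand-in for the dataset Power BI passes to the R visual
dataset <- data.frame(
  Company = c("Company A", "Company B", "Company C", "Company D"),
  ExcutionScore = c(75, 60, 35, 25),
  VisionScore = c(80, 45, 70, 30)
)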

We drag an R visual onto the Power BI editor canvas and add the three columns from the dummy dataset. We can bind the RStudio IDE to Power BI and use it to author and test the R scripts.

In the R script editor, we first reference the “ggplot2” library and the “grid” library. The “grid” library is used to draw custom annotations outside of the main ggplot2 panel.

library(ggplot2)
library(grid)

We then create a ggplot2 object using the dataset referenced in the R visual, assigning the “VisionScore” value to the x-axis and the “ExcutionScore” value to the y-axis.

p <- ggplot(dataset, aes(VisionScore, ExcutionScore))
p <- p + scale_x_continuous(expand = c(0, 0), limits = c(0, 100)) 
p <- p + scale_y_continuous(expand = c(0, 0), limits = c(0, 100))

[Figure: the empty base panel]

We now have our base panel and we can start our journey to build the Gartner Magic Quadrant-Like chart.

First of all, we set the x-axis label to “COMPLETENESS OF VISION” and the y-axis label to “ABILITY TO EXECUTE”, and align them to the left side. We then remove the axis ticks and text from the plot. We will also add a title to the top of the plot.

p <- p + labs(x="COMPLETENESS OF VISION",y="ABILITY TO EXECUTE")
p <- p + theme(axis.title.x = element_text(hjust = 0, vjust=4, colour="darkgrey",size=10,face="bold"))
p <- p + theme(axis.title.y = element_text(hjust = 0, vjust=0, colour="darkgrey",size=10,face="bold"))

p <- p + theme(
          axis.ticks.x=element_blank(), 
          axis.text.x=element_blank(),
          axis.ticks.y=element_blank(),
          axis.text.y=element_blank()
        )

p <- p + ggtitle("Gartner Magic Quadrant - Created for Power BI using ggplot2")

These steps take our chart to something like:

[Figure: the chart with axis labels and title]

We then add four rectangle annotations to fill the four quadrant areas using the Gartner magic quadrant colour scheme. We also need to create a border and split lines for the quadrant chart.

p <- p +
      annotate("rect", xmin = 50, xmax = 100, ymin = 50, ymax = 100, fill= "#F8F9F9")  + 
      annotate("rect", xmin = 0, xmax = 50, ymin = 0, ymax = 50 , fill= "#F8F9F9") + 
      annotate("rect", xmin = 50, xmax = 100, ymin = 0, ymax = 50, fill= "white") + 
      annotate("rect", xmin = 0, xmax = 50, ymin = 50, ymax = 100, fill= "white")

p <- p + theme(panel.border = element_rect(colour = "lightgrey", fill=NA, size=4))
p <- p + geom_hline(yintercept=50, color = "lightgrey", size=1.5)
p <- p + geom_vline(xintercept=50, color = "lightgrey", size=1.5)

[Figure: the chart with quadrant backgrounds, border and split lines]

We also need to add a label to each quadrant area:

p <- p + geom_label(aes(x = 25, y = 97, label = "CHALLENGERS"), 
                    label.padding = unit(2, "mm"),  fill = "lightgrey", color="white")
p <- p + geom_label(aes(x = 75, y = 97, label = "LEADERS"), 
                    label.padding = unit(2, "mm"), fill = "lightgrey", color="white")
p <- p + geom_label(aes(x = 25, y = 3, label = "NICHE PLAYERS"), 
                    label.padding = unit(2, "mm"),  fill = "lightgrey", color="white")
p <- p + geom_label(aes(x = 75, y = 3, label = "VISIONARIES"), 
                    label.padding = unit(2, "mm"), fill = "lightgrey", color="white")

[Figure: the chart with the four quadrant labels]

Up to this point, our chart has started to look like the Gartner magic quadrant. Next, we need to draw the company points on the chart at the positions corresponding to their “Ability to Execute” and “Completeness of Vision” values.

p <- p + geom_point(colour = "#2896BA", size = 5) 
p <- p + geom_text(aes(label=Company), colour="#2896BA", hjust=-0.3, vjust=0.25, size=3.2)

[Figure: the chart with company points and labels]

Our quadrant chart is nearly done; just one part is missing: the arrows next to the “Ability to Execute” and “Completeness of Vision” axis labels.

[Figure: the arrows next to the axis labels]

As the arrows need to be located outside of the main panel, we need to create custom annotations (annotation_custom) with linesGrob to draw a straight line with an arrow at the far end. To make the arrows visible outside of the main panel, we need to turn off the clip attribute of the main panel.

p <- p + annotation_custom(
            grob = linesGrob(arrow=arrow(type="open", ends="last", length=unit(2,"mm")), 
                   gp=gpar(col="lightgrey", lwd=4)), 
            xmin = -2, xmax = -2, ymin = 25, ymax = 40
          )
p <- p + annotation_custom(
  grob = linesGrob(arrow=arrow(type="open", ends="last", length=unit(2,"mm")), 
                   gp=gpar(col="lightgrey", lwd=4)), 
  xmin = 28, xmax = 43, ymin = -3, ymax = -3
)

gt = ggplot_gtable(ggplot_build(p))
gt$layout$clip[gt$layout$name=="panel"] = "off"
grid.draw(gt)

We now have our completed quadrant chart.

[Figure: the completed quadrant chart]

You can find the complete source code here:

library(ggplot2)
library(grid)

p <- ggplot(dataset, aes(VisionScore, ExcutionScore))
p <- p + scale_x_continuous(expand = c(0, 0), limits = c(0, 100))
p <- p + scale_y_continuous(expand = c(0, 0), limits = c(0, 100))

p <- p + labs(x="COMPLETENESS OF VISION",y="ABILITY TO EXECUTE")
p <- p + theme(axis.title.x = element_text(hjust = 0, vjust=4, colour="darkgrey",size=10,face="bold"))
p <- p + theme(axis.title.y = element_text(hjust = 0, vjust=0, colour="darkgrey",size=10,face="bold"))

p <- p + theme(
          axis.ticks.x=element_blank(),
          axis.text.x=element_blank(),
          axis.ticks.y=element_blank(),
          axis.text.y=element_blank()
        )

p <- p + ggtitle("Gartner Magic Quadrant - Created for Power BI using ggplot2")

p <- p +
      annotate("rect", xmin = 50, xmax = 100, ymin = 50, ymax = 100, fill= "#F8F9F9") +
      annotate("rect", xmin = 0, xmax = 50, ymin = 0, ymax = 50, fill= "#F8F9F9") +
      annotate("rect", xmin = 50, xmax = 100, ymin = 0, ymax = 50, fill= "white") +
      annotate("rect", xmin = 0, xmax = 50, ymin = 50, ymax = 100, fill= "white")

p <- p + theme(panel.border = element_rect(colour = "lightgrey", fill=NA, size=4))
p <- p + geom_hline(yintercept=50, color = "lightgrey", size=1.5)
p <- p + geom_vline(xintercept=50, color = "lightgrey", size=1.5)

p <- p + geom_label(aes(x = 25, y = 97, label = "CHALLENGERS"),
                    label.padding = unit(2, "mm"), fill = "lightgrey", color="white")
p <- p + geom_label(aes(x = 75, y = 97, label = "LEADERS"),
                    label.padding = unit(2, "mm"), fill = "lightgrey", color="white")
p <- p + geom_label(aes(x = 25, y = 3, label = "NICHE PLAYERS"),
                    label.padding = unit(2, "mm"), fill = "lightgrey", color="white")
p <- p + geom_label(aes(x = 75, y = 3, label = "VISIONARIES"),
                    label.padding = unit(2, "mm"), fill = "lightgrey", color="white")

p <- p + geom_point(colour = "#2896BA", size = 5)
p <- p + geom_text(aes(label=Company), colour="#2896BA", hjust=-0.3, vjust=0.25, size=3.2)

p <- p + annotation_custom(
            grob = linesGrob(arrow=arrow(type="open", ends="last", length=unit(2,"mm")),
                             gp=gpar(col="lightgrey", lwd=4)),
            xmin = -2, xmax = -2, ymin = 25, ymax = 40
         )
p <- p + annotation_custom(
            grob = linesGrob(arrow=arrow(type="open", ends="last", length=unit(2,"mm")),
                             gp=gpar(col="lightgrey", lwd=4)),
            xmin = 28, xmax = 43, ymin = -3, ymax = -3
         )

gt = ggplot_gtable(ggplot_build(p))
gt$layout$clip[gt$layout$name=="panel"] = "off"
grid.draw(gt)

Please find the pbix file here.

R Visual – Build Eurovision Voting Network Chart in Power BI

I have been watching Eurovision competitions for several years. I personally think the voting results from Eurovision competitions can be a very good source for researching the relationships between European countries. In this blog post, I will create a social network R visual using the igraph package and use the visual to analyse the voting network of Eurovision competitions.

[Figure: the Eurovision voting network chart]

Firstly, we need to transform our raw Eurovision dataset into the following format with three columns: “From country” (where the vote is from), “To Country” (where the vote goes to), and “Avg Point” (the average points the “From country” has given to the “To Country” over the years). You can prepare the data either using DAX (creating the “AvgPoint” measure) or Power Query (grouping by “From country” + “To Country” and calculating the average points).

[Figure: the prepared dataset with the From country, To Country and Avg Point columns]
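For reference, the same aggregation could also be sketched in R (the raw file name and column names here are hypothetical):

# hypothetical raw voting records: one row per country pair per year
votes <- read.csv("eurovision_votes.csv")

# average the points each country has given to each destination over the years
avgVotes <- aggregate(Points ~ FromCountry + ToCountry, data = votes, FUN = mean)
names(avgVotes)[3] <- "AvgPoint"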

We add an R visual to the Power BI canvas and add the three columns to the R visual. If you prefer to use another R IDE (e.g., RStudio) to edit the R scripts, you can bind your IDE to Power BI.

We will use the igraph R package to render the voting network chart. Firstly, we need to load the igraph library and then create an igraph data frame from the dataset specified on the Power BI R visual. We will use the plot function to render the network chart, setting the style attributes of the vertices, edges, etc.

# use the igraph library
library(igraph)

# create an igraph data frame from the dataset specified on the Power BI R visual
df.g <- graph.data.frame(d = dataset, directed = TRUE)

# define colors
comps <- components(df.g)$membership
colbar <- rainbow(max(comps)+1)
V(df.g)$color <- colbar[comps+1]

# render the network chart and set the style attributes of vertex, edge etc.
plot(df.g, 
     vertex.label = V(df.g)$name,
     layout=layout_with_fr, 
     vertex.size=12,
     vertex.label.dist=0, 
     vertex.label.color= "darkblue",
     vertex.shape = "circle",
     vertex.label.cex = 1,
     vertex.label.font = 2,
     edge.arrow.size=0.5,
     edge.curved=T,
     margin =-0.05
 )

[Figure: the network chart rendered in RStudio]

After authoring and testing the R script in RStudio, we can add it to the R visual in Power BI, where it will be able to interact with the other visuals on the same page.

Before we set any threshold on the average voting points, all voting paths between the countries will be drawn on the network chart, which makes the chart unreadable.

[Figure: the unfiltered voting network with all paths drawn]

However, when we set a higher threshold on the average voting points, so that only the voting paths above the threshold are shown, we can find some relationship patterns.
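In Power BI, the threshold is most naturally applied with a visual-level filter or slicer, but for completeness, here is a sketch of applying it inside the R script (the threshold value is made up, and the column is assumed to arrive in R named AvgPoint):

# keep only the voting paths with an average of at least 8 points
dataset <- dataset[dataset$AvgPoint >= 8, ]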

[Figure: the voting network with a higher average-points threshold applied]

For example, we can see mutual high votes between neighbouring countries such as Spain <-> Andorra, Romania <-> Moldova, and Greece <-> Cyprus.

[Figure: mutual high votes between neighbouring countries]

Please find the pbix file here.

 

R Visual – Building Facet Grid in Power BI

  • The pbix file created for this blog post can be found in my GitHub here.

Introduction

Since Power BI started to support the R visual, it has become difficult to criticise Power BI’s visualisation capability, because we can now take full advantage of R’s powerful visualisation packages, such as ggplot2, to create Power BI reports. Unlike creating a Power BI custom visual, which is a rather time-consuming task, we can create eye-catching charts in just a few lines with an R visual.

This blog post introduces an approach to creating grid-facet and geo-facet types of visualisation in Power BI using the ggplot2 R package.

Grid-Facet

The facet grid is a popular chart type but is not yet supported by Power BI. However, we can easily build a facet grid chart with the help of the ggplot2 package.

[Figure: the facet grid chart]

Firstly, we need to get our data into the right format. In this example, we use the Eurovision competition dataset, which contains the voting records from 1975 to 2016.

[Figure: the raw Eurovision voting dataset]

We need to calculate the rank of each country for each year based on the points they received from the other countries. We can use the DAX RANKX function to calculate the rank measure and get results like:

[Figure: the rank results calculated with RANKX]

Now we are ready to create our facet grid visual. We add an R visual to the Power BI canvas and add three columns, Year, ToCountry and Rank, to the visual.

[Screenshots: the R visual with the Year, ToCountry and Rank fields]

On the R script editor, we first reference the ggplot2 library and then create a ggplot object, placing Year on the x-axis and Rank on the y-axis. The key step is to add facet_wrap(~ToCountry), which generates the facet grid by the voting destination country (the ToCountry column); a sketch follows the screenshot below.

[Screenshot: the R script producing the facet grid]
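A minimal version of the script (the line geom and its styling are assumptions; the original may differ):

library(ggplot2)

# one sub-panel per voting destination country
ggplot(dataset, aes(x = Year, y = Rank)) +
  geom_line(colour = "#2896BA") +
  scale_y_reverse() +            # rank 1 reads as the top position
  facet_wrap(~ToCountry)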

Geo-facet

The grid facet looks pretty neat, as all sub-panels are perfectly aligned; however, it fails to display the geospatial information of the countries, which may reveal some useful insights. For example, in my last blog post, I built a voting network chart of the Eurovision competition that revealed the mutual high voting scores between some neighbouring countries.

There is an R package, namely geofacet, which comes with a list of pre-built geospatial grids for a number of geographical areas, countries and states. One of the pre-built grids covers the Europe area, which is perfect for our Eurovision example.

It is very straightforward to use the geofacet package. After referencing the package in our R script, all we need to do is replace the facet_wrap function in our ggplot2 code with the facet_geo function provided by the geofacet package. We need to specify the column by which the facet is divided and the name of the pre-built grid we will use. In this example, we use “eu_grid1”, the grid for the Europe area; see the sketch below.

[Screenshot: the R script using facet_geo]
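The same sketch as above with the facet call swapped (again, the geom and styling are assumptions):

library(ggplot2)
library(geofacet)

# arrange the sub-panels according to the countries' positions in Europe
ggplot(dataset, aes(x = Year, y = Rank)) +
  geom_line(colour = "#2896BA") +
  scale_y_reverse() +
  facet_geo(~ToCountry, grid = "eu_grid1")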

Now we have done all the work to convert our standard grid facet to a geospatial facet.

[Figure: the geo-facet chart on the Europe grid]

Apart from the Europe area grid, you can find a list of other pre-built grids here. Considering where I am living at the moment, another pre-built grid I am particularly interested in is the London Borough grid. This is a geo-facet chart I have created to visualise the unemployment rate in the London boroughs.

[Figure: unemployment rate in the London boroughs on the London Borough grid]

You can also create your own grid, which is literally a data frame with four columns: the name and code columns that map to the facet label column in the dataset, and the row and col columns that specify the grid locations.

This is a test grid I have created to demonstrate how to create a custom grid:

customGrid <- data.frame(
  name = c("Enfield", "Haringey", "Islington", "Hackney", "Camden", "Hackeny", "Redbridge", "Brent", "Ealing"),
  code = c("Enfield", "Haringey", "Islington", "Hackney", "Camden", "Hackeny", "Redbridge", "Brent", "Ealing"),
  row = c(1, 2, 3, 3, 3, 3, 3, 4, 5),
  col = c(3, 3, 5, 4, 1, 2, 3, 3, 3),
  stringsAsFactors = FALSE
)

[Figure: the custom test grid rendered]

R Visual – Building Component Cycle Timeline

One common approach to detecting exceptions in a machine is to monitor the correlative status of the components in the machine. For example, in normal conditions, two or more components should be running at the same time, or some components should be running in sequential order. When the components are not running in the way expected, that indicates potential issues with the machine which need attention from engineers.

Thanks to IoT sensors, it is easy to capture component status data from a machine. To help engineers pick up the potential issues easily, a machine component timeline chart is very helpful. However, Power BI does not provide this kind of timeline chart, and it can be time-consuming to build custom JavaScript-based visuals. Fortunately, we have the R visual in Power BI. With a little help from the ggplot2 library, we can easily build a timeline chart (like the one shown below) in four lines of code.

[Figure: the component cycle timeline chart]

Firstly, we need to add an R visual onto the Power BI canvas and set the data fields required for the chart.

[Screenshot: the R visual data fields]

The minimal fields required are the name/code of the component and the start and end date/time of the component running cycle, such as:

ComponentCode   Start DateTime        End DateTime
Pump            2017-01-01 09:12:35   2017-01-01 09:18:37
Motor           2017-01-01 09:12:35   2017-01-01 09:18:37

In the R script editor, we use geom_segment to draw the component running cycles, with the x-axis for the running time and the y-axis for the component name/code; a sketch follows the screenshot below.

[Screenshot: the R script using geom_segment]
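A minimal sketch of those four lines (the dataset column names are assumptions based on the fields above, with the spaces removed as if the fields had been renamed to StartDateTime/EndDateTime):

library(ggplot2)

# one horizontal segment per component running cycle
ggplot(dataset) +
  geom_segment(aes(x = StartDateTime, xend = EndDateTime,
                   y = ComponentCode, yend = ComponentCode),
               size = 6, colour = "#2896BA")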

Apart from showing a Power BI timeline chart for engineers to detect machine issues, as introduced in this blog post, we can also send alerts to engineers using Azure Functions.