Spark dataframe provides the repartition function to partition the dataframe by a specified column and/or a specified number of partitions. However, for some use cases, the repartition function doesn't work in the way as required. For example, in the previous blog post, Handling Embarrassing Parallel Workload with PySpark Pandas UDF, we want to repartition the traveller dataframe so … Continue reading Create Custom Partitioner for Spark Dataframe
Configuration-Driven Azure Data Factory Pipelines
In this blog post, I will introduce two configuration-driven Azure Data Factory pipeline patterns I have used in my previous projects, including the Source-Sink pattern and the Key-Value pattern. The Source-Sink pattern is primarily used for parameterising and configuring the data movement activities, with the source location and sink location of the data movement configured in a … Continue reading Configuration-Driven Azure Data Factory Pipelines
Handling Embarrassing Parallel Workload with PySpark Pandas UDF
Introduction In the previous post, I walked through the approach to handle embarrassing parallel workload with Databricks notebook workflows. However, as all the parallel workloads are running on a single node (the cluster driver), that approach is only able to scale up to a certain point depending on the capability of the driver vm and … Continue reading Handling Embarrassing Parallel Workload with PySpark Pandas UDF
Handling Embarrassing Parallel Workload with Databricks Notebook Workflows
Introduction Embarrassing Parallel refers to the problem where little or no effort is needed to separate the problem into parallel tasks, and there is no dependency for communication needed between the parallel tasks. Embarrassing parallel problem is very common with some typical examples like group-by analyses, simulations, optimisations, cross-validations or feature selections. Normally, an Embarrassing … Continue reading Handling Embarrassing Parallel Workload with Databricks Notebook Workflows
Execute R Scripts from Azure Data Factory (V2) through Azure Batch Service
Introduction One requirement I have been recently working with is to run R scripts for some complex calculations in an ADF (V2) data processing pipeline. My first attempt is to run the R scripts using Azure Data Lake Analytics (ADLA) with R extension. However, two limitations of ADLA R extension stopped me from adopting this … Continue reading Execute R Scripts from Azure Data Factory (V2) through Azure Batch Service
The Tip for Installing R packages on Azure Batch
Problem In one project I have been recently working with, I need to execute R scripts in Azure Batch. The computer nodes of the Azure Batch pool were provisioned with Data Science Virtual Machines which already include common R packages. However, some packages required for the R scripts, such as tidyr and rAzureBatch, are missing … Continue reading The Tip for Installing R packages on Azure Batch
Why Bother to Use Pandas “Categorical” Type in Python
When we process data using Pandas library in Python, we normally convert the string type of categorical variables to the Categorical data type offered by the Pandas library. Why do we bother to do that, considering there is actually no difference with the output results no matter you are using the Pandas Categorical type or … Continue reading Why Bother to Use Pandas “Categorical” Type in Python
Build a Power BI Knowledge Base Bot Using Microsoft Bot Framework and QnA Maker
The first question pop out of my head when I heard Microsoft Bot Framework is how to build some sorts of bots, which is capable to replace me for any of my responsibilities in my work. Part of my Power BI consulting responsibilities is to answer all sorts of Power BI related questions from clients. … Continue reading Build a Power BI Knowledge Base Bot Using Microsoft Bot Framework and QnA Maker
Workaround of the Power BI Guest User License Assignment Issue
I have recently been working with an insurance client to help them design a solution to distribute Power BI app to external guest users with Azure AD B2B. I have designed this solution based on Microsoft Power BI content external distribution guideline whitepaper: Distribute Power BI content to external guest users using Azure Active Directory B2B. … Continue reading Workaround of the Power BI Guest User License Assignment Issue
SSIS in Azure #3 – Schedule and Monitor SSIS Package Execution using ADF V2
*The source code created for this blog post can be found here. In the previous blog posts in the SSIS in Azure series, we created a SSIS package to periodically ingests data from Azure SQL database to Azure Data Lake Store and deployed the package in the Azure-SSIS Integrated Runtime. Up to this point, we have … Continue reading SSIS in Azure #3 – Schedule and Monitor SSIS Package Execution using ADF V2
You must be logged in to post a comment.