Category: Azure Data Platform

How Azure Storage Cheats Over the CAP Theorem

Microsoft claims that Azure Storage provides both high availability and strong consistency. That sounds good, but it appears to violate the CAP theorem, as the 'P' (network partitioning) is unavoidable in the real world. In theory, a distributed storage system can achieve either high availability or strong consistency, but not both. I have done a bit of … Continue reading How Azure Storage Cheats Over the CAP Theorem →

Why I Prefer Hand-Coded Transformations over ADF Mapping Data Flow

Firstly, I need to clarify that this blog post concerns only ADF Mapping Data Flow, not the whole ADF service. I am not going to challenge ADF’s role as a superb orchestration service in the Azure data ecosystem. In fact, I love ADF. At the control flow level, … Continue reading Why I Prefer Hand-Coded Transformations over ADF Mapping Data Flow →

Create Custom Partitioner for Spark Dataframe

The Spark dataframe provides the repartition function to partition the dataframe by a specified column and/or a specified number of partitions. However, for some use cases, the repartition function doesn't work the way we need. For example, in the previous blog post, Handling Embarrassing Parallel Workload with PySpark Pandas UDF, we want to repartition the traveller dataframe so … Continue reading Create Custom Partitioner for Spark Dataframe →

Configuration-Driven Azure Data Factory Pipelines

In this blog post, I will introduce two configuration-driven Azure Data Factory pipeline patterns I have used in previous projects: the Source-Sink pattern and the Key-Value pattern. The Source-Sink pattern is primarily used for parameterising and configuring the data movement activities, with the source location and sink location of the data movement configured in a … Continue reading Configuration-Driven Azure Data Factory Pipelines →

Handling Embarrassing Parallel Workload with PySpark Pandas UDF

Introduction: In the previous post, I walked through an approach to handling embarrassingly parallel workloads with Databricks notebook workflows. However, as all the parallel workloads run on a single node (the cluster driver), that approach can only scale up to a certain point, depending on the capability of the driver VM and … Continue reading Handling Embarrassing Parallel Workload with PySpark Pandas UDF →

Handling Embarrassing Parallel Workload with Databricks Notebook Workflows

Introduction: Embarrassingly parallel refers to a problem where little or no effort is needed to separate it into parallel tasks, and where the parallel tasks have no dependencies on each other and no need to communicate. Embarrassingly parallel problems are very common, with typical examples including group-by analyses, simulations, optimisations, cross-validations and feature selections. Normally, an Embarrassing … Continue reading Handling Embarrassing Parallel Workload with Databricks Notebook Workflows →

Execute R Scripts from Azure Data Factory (V2) through Azure Batch Service

Introduction: One requirement I have recently been working on is running R scripts for some complex calculations in an ADF (V2) data processing pipeline. My first attempt was to run the R scripts using Azure Data Lake Analytics (ADLA) with the R extension. However, two limitations of the ADLA R extension stopped me from adopting this … Continue reading Execute R Scripts from Azure Data Factory (V2) through Azure Batch Service →

The Tip for Installing R packages on Azure Batch

Problem: In one project I have recently been working on, I needed to execute R scripts in Azure Batch. The compute nodes of the Azure Batch pool were provisioned with Data Science Virtual Machines, which already include common R packages. However, some packages required for the R scripts, such as tidyr and rAzureBatch, are missing … Continue reading The Tip for Installing R packages on Azure Batch →

Build a Power BI Knowledge Base Bot Using Microsoft Bot Framework and QnA Maker

The first question that popped into my head when I heard about Microsoft Bot Framework was how to build some sort of bot capable of replacing me in any of my work responsibilities. Part of my Power BI consulting work is answering all sorts of Power BI related questions from clients. … Continue reading Build a Power BI Knowledge Base Bot Using Microsoft Bot Framework and QnA Maker →

SSIS in Azure #3 – Schedule and Monitor SSIS Package Execution using ADF V2

*The source code created for this blog post can be found here. In the previous blog posts in the SSIS in Azure series, we created an SSIS package that periodically ingests data from Azure SQL Database into Azure Data Lake Store, and deployed the package in the Azure-SSIS Integration Runtime. Up to this point, we have … Continue reading SSIS in Azure #3 – Schedule and Monitor SSIS Package Execution using ADF V2 →
