In the previous post, I walked through the approach to handle embarrassing parallel workload with Databricks notebook workflows. However, as all the parallel workloads are running on a single node (the cluster driver), that approach is only able to scale up to a certain point depending on the capability of the driver vm and is not able to split workload into multiple worker nodes. In this blog post, I will walk through another approach to handle embarrassing parallel workload with Databricks Pandas UDF.
Embarrassing parallel workload refers to the type of problems that are easy to separate into parallel task with no dependency for communication between those parallel tasks. Embarrassing parallel workload normally calculates similar things many times with different groups of data or different sets of parameters. The calculation are independent of each other and each calculation takes a fair amount of time. The datasets involved in embarrassing can be big, but in most of use cases, the datasets are not at the “Big Data” scale.
The native parallelism mechanism of Apache Spark might not be an efficient way for the embarrassing parallel workload due to the overhead of serialization and inter-process communication. In addition, many libraries commonly used in the Embarrassing Parallel use cases, such as numpy and scikit-learn, are not supported by PySpark. However, Pandas UDFs supported by PySpark can be a feasible way for embarrassing parallel workload.
Unlike the PySpark UDFs which operate row-at-a-time, grouped map Pandas UDFs operate in the split-apply-combine pattern where a Spark dataframe is split into groups based on the conditions specified in the groupBy operator and a user-defined Pandas UDF is applied to each group and the results from all groups are combined and returned as a new Spark dataframe. Embarrassing parallel workload fits into this pattern well. In this way, the calculation of an embarrassing parallel workload can be encapsulated into a Pandas UDF. The dataset involved in the embarrassing parallel workload is loaded into a PySpark dataframe and split into group and the calculation on each group of data is executed in the Pandas UDF with Spark tasks running on separate executors in parallel.
This blog post demonstrates the Pandas UDF approach with the sample example I have used in the previous blog post for explaining the approach for handling embarrassing parallel workload with Databricks notebook workflows.
The sample notebook I have created for this blog post can be found here in my Github repository. The snapshot below shows the core part of the notebook:
A Pandas UDF, “planGroupTravelling” (Line 6-12) is crated to execute the group travel planning optimisation algorithm from Toby Segaran’s “Collective Intelligence” book for one travel group.
The dataset of all travel groups used in this example is converted into a Spark dataframe where each row contains fields of travel group id, traveler name, and travel origin (Line 14). For this example, an utility function, “travelGroupsToDataframe“, is created to covert the original dataset in Python dictionary format to a Spark dataframe (The code for the “travelGroupsToDataframe” function can be found in the sample notebook for this blog post here). The snapshot below shows the converted Spark dataframe, i.e. the input dataframe for the parallel workload.
The dataframe is split by travel group id and the “planGroupTraveling” Pandas UDF is applied to each group (Line 15). The results from each UDF, the optimised travelling arrangement for each traveler, are combined into a new Spark dataframe.
This blog post describes another approach for handling embarrassing parallel workload using PySpark Pandas UDFs. Grouped map Pandas UDFs operate in the split-apply-combine pattern which distributes calculation logic written in native Python to separate executors and run in parallel. This pattern fits well with embarrassing parallel workload.
One thought on “Handling Embarrassing Parallel Workload with PySpark Pandas UDF”
How much time it took in this approach? In your previous blogpost it was ~5mins (running in for loop) vs. ~2 mins (parallel execution through notebook workflow)