Tag: Parallel

Handling Embarrassing Parallel Workload with PySpark Pandas UDF

Introduction In the previous post, I walked through the approach to handle embarrassing parallel workload with Databricks notebook workflows. However, as all the parallel workloads are running on a single node (the cluster driver), that approach is only able to scale up to a certain point depending on the capability of the driver vm and … Continue reading Handling Embarrassing Parallel Workload with PySpark Pandas UDF

Handling Embarrassing Parallel Workload with Databricks Notebook Workflows

Introduction Embarrassing Parallel refers to the problem where little or no effort is needed to separate the problem into parallel tasks, and there is no dependency for communication needed between the parallel tasks. Embarrassing parallel problem is very common with some typical examples like group-by analyses, simulations, optimisations, cross-validations or feature selections. Normally, an Embarrassing … Continue reading Handling Embarrassing Parallel Workload with Databricks Notebook Workflows