With Spark + AI Summit just around the corner, the team behind the big data analytics engine pushed out Spark 3.0 late last week, bringing accelerator-aware scheduling, improvements for Python users, and a whole lot of under-the-hood changes for better performance.
In its tenth year as an open source project, the project’s developers have – amongst other things – put a particular focus on improving the Spark SQL engine. The hope is that changes in this area will benefit the majority of the user base, since, for example, a look at the numbers on the Databricks platform suggest that most Spark users pass their work through the SQL engine, independent of the programming language used.
To help all of them get their queries executed faster, Spark contributors Databricks and Intel have come up with an adaptive query execution (AQE) framework that is supposed to generate better execution plans at runtime. It does so through three optimisation techniques that can combine small shuffle partitions, automatically switch from sort-merge join to broadcast-hash join if it yields better performance, and improve skew joins. First benchmarks claim speed-ups ranging from 1.1x to more than 1.5x when using AQE.
If optimisers can’t identify which partitions can be skipped at compile time, for example when star schemas are used, Spark now applies a dynamic partition pruning approach for query speedup. Users also have new hints HUFFLE_MERGE, SHUFFLE_HASH and SHUFFLE_REPLICATE_NL available to influence the optimiser’s choice of plan.
The release also accounts for the fact that Python currently is the “most widely used language on Spark”. The project’s Python API PySpark has been reworked with simplified exceptions to improve error handling, and fitted with a new interface for user-defined functions (UDF) of analysis library pandas. It also includes new Pandas Function APIs such as map and co-grouped map, as well as UDF types “iterator of series to iterator of series” and “iterator of multiple series to iterator of series”.
Other than that Spark was able to move its Project Hydrogen forwards. The initiative was meant to help the engine succeed in a deep learning context and had its first success by getting barrier execution mode into Spark 2.4. Optimisations for data exchange and accelerator-aware scheduling were on the to-do list as well. The latter is now part of the 3.0 release. It allows users to leverage things like GPUs to speed-up the learning process, for example by letting them specify accelerators through the configuration and call those via a couple of new APIs.
The Hydrogen efforts also seem to be responsible for a new UI to inspect structured streaming jobs that aggregate information on completed streaming query jobs and streaming query statistics. Metrics currently included are the input and process rate, input rows, and the time needed to process a batch or perform certain operations.
Details on other improvements, such as additional monitoring for batch and streaming applications as well as a new catalog plug-in API to let users access and manipulate metadata of external data sources, can be found in the release notes.