Pump it up: Apache Spark 3.2 completes ANSI SQL mode, makes adaptive query execution new normal

clo21

Data analytics platform Apache Spark has recently been made available in version 3.2, featuring enhancements to improve performance for Python projects and simplify things for those looking to switch over from SQL.

Spark 3.2 is the first release that has adaptive query execution, which now also supports dynamic partition pruning, enabled by default. One of the most highlighted features of the release, though, is a pandas API which offers interactive data visualisations, and provides pandas users with a comparatively simple option to scale workloads to node clusters and reduce the number of out-of-memory occurrences on single machines. 

The addition is the result of dataframe project Koalas getting merged into PySpark and also allows things like querying data via SQL or using Spark’s data stream processing capabilities.

Speaking of SQL, Spark’s ANSI SQL mode has been marked generally available with this release. However it includes some major behavioral changes, which is why the team didn’t make the mode the new default yet. Amongst other things the mode now enforces new explicit cast syntax rules, and throws errors when SQL operators or functions use invalid inputs which was silently ignored before and might lead to unexpected outcomes for those used to earlier Spark versions.

When working with Spark on Kubernetes, users can now limit the number of pending pods to prevent scheduler overloads, and utilise remote driver/executor template files. The update also comes with a new developer API that allows the creation of custom driver and executor feature steps.

Spark’s structured streaming component has meanwhile been fitted with an implementation of a RocksDB StateStore to be able to handle large states in stateful operations such as streaming aggregation and join. It also contains a new session window that doesn’t have a static window begin and end time but depends on a defined period of time, and uses v2.8 of the Apache Kafka client.

To improve Spark’s performance, the team added a couple of query optimisations — so that v3.2 for instance supports cardinality estimation for union, sort and range operators, removes redundant aggregates in the optimizer, and is quicker to construct new query plans. Some operators of large deployments also found that the shuffle component had become a potential bottleneck, which is why it has been reworked to use a push-based shuffle approach. The latter performs shuffles at the end of mappers, pre-merges blocks and moves them towards reducers for better efficiency. 

Elements deprecated with the 3.2 release include the ps.broadcast API, the num_files argument, DataFrame.to_spark_io, spark.launcher.childConectionTimeout, as well as GROUP BY … GROUPING SETS (…). More details on bug fixes, and known issues can be found in the Spark release notes.