Apache Spark shows off Zen successes and resource reliefs

Apache Spark 3.1

After a minor hiccup earlier this year, Apache Spark 3.1.1 is now good to go and sees the big data analytics engine gaining Kubernetes and node decommissioning capabilities as well as some enhancements to make working with Python and ANSI SQL a bit easier. 

Though the name suggests otherwise, v3.1.1 is an official feature release. The Spark team had to abandon version 3.1 due to technical issues during the release preparation process, which is why the Spark team discourages using v3.1 for any purpose to prevent potential problems.

Spark 3.1.1 includes the first results of the Project Zen initiative launched last year to improve the Python interface PySpark. Thanks to this, Python devs can now leverage dependency management functionality and Python type hints. It also provides a reworked and more user friendly documentation, which not only includes API references but additional articles on stumbling blocks like debugging and testing. 

SQL users meanwhile will, once updated, be able to work with the char/varchar data types, and additional functions. The Spark team has also been busy getting rid of ambiguous DDL syntaxes for creating tables. Devs who have the dialect mode for ANSI SQL enabled, will know sooner if their input doesn’t comply with the needed format since more functions have learned to throw errors in that case instead of returning null. The mode has also been fitted with new explicit cast syntax rules.

With the release, Spark on Kubernetes has reached general availability. The enhancement allows Spark to create a driver running within a Kubernetes pod, which handles the creation of executors to run within such units, connection to them, and app code execution. Capabilities to decommission nodes are still in development, however there’s now an experimental preview available that should work on both Spark and the Kubernetes variant.

Under the hood, the project has updated dependencies and now uses Apache Hadoop 3.2 by default. In a bid to improve performance, Spark has grown to support the elimination of subexpression for interpreted predicates and expression evaluation as well as in projects with whole-stage-codegen and in conditional expressions. It also includes predicate pushdown enhancements, and functionality to preserve shuffled hash join side partitioning and hash join stream side ordering, generate code for such joins, and support full outer joins. 
Other areas of improvement are structured streaming and the Spark machine learning library MLlib. Structured streaming finally offers schema validation for the streaming state store and state schema validation to prevent indeterministic behaviour. It also includes left semi and full outer stream-stream joins, and history server support, which should help when debugging or monitoring streaming applications. MLlib provides training summaries for various models in its latest iteration, and various blockification measures, details on which are available in the project’s documentation.