Pump it up: Apache Spark 3.2 completes ANSI SQL mode, makes adaptive query execution new normal

AI/ML

By Team Devclass

October 21, 2021

Pump it up: Apache Spark 3.2 completes ANSI SQL mode, makes adaptive query execution new normal

Data analytics platform Apache Spark has recently been made available in version 3.2, featuring enhancements to improve performance for Python projects and simplify things for those looking to switch over from SQL.

Spark 3.2 is the first release that has adaptive query execution, which now also supports dynamic partition pruning, enabled by default. One of the most highlighted features of the release, though, is a pandas API which offers interactive data visualisations, and provides pandas users with a comparatively simple option to scale workloads to node clusters and reduce the number of out-of-memory occurrences on single machines.

The addition is the result of dataframe project Koalas getting merged into PySpark and also allows things like querying data via SQL or using Spark’s data stream processing capabilities.

Speaking of SQL, Spark’s ANSI SQL mode has been marked generally available with this release. However it includes some major behavioral changes, which is why the team didn’t make the mode the new default yet. Amongst other things the mode now enforces new explicit cast syntax rules, and throws errors when SQL operators or functions use invalid inputs which was silently ignored before and might lead to unexpected outcomes for those used to earlier Spark versions.

When working with Spark on Kubernetes, users can now limit the number of pending pods to prevent scheduler overloads, and utilise remote driver/executor template files. The update also comes with a new developer API that allows the creation of custom driver and executor feature steps.

Spark’s structured streaming component has meanwhile been fitted with an implementation of a RocksDB StateStore to be able to handle large states in stateful operations such as streaming aggregation and join. It also contains a new session window that doesn’t have a static window begin and end time but depends on a defined period of time, and uses v2.8 of the Apache Kafka client.

To improve Spark’s performance, the team added a couple of query optimisations — so that v3.2 for instance supports cardinality estimation for union, sort and range operators, removes redundant aggregates in the optimizer, and is quicker to construct new query plans. Some operators of large deployments also found that the shuffle component had become a potential bottleneck, which is why it has been reworked to use a push-based shuffle approach. The latter performs shuffles at the end of mappers, pre-merges blocks and moves them towards reducers for better efficiency.

Elements deprecated with the 3.2 release include the ps.broadcast API, the num_files argument, DataFrame.to_spark_io, spark.launcher.childConectionTimeout, as well as GROUP BY … GROUPING SETS (…). More details on bug fixes, and known issues can be found in the Spark release notes.

Pump it up: Apache Spark 3.2 completes ANSI SQL mode, makes adaptive query execution new normal

Microsoft SQL Server MCP tool: 'Leap in data interaction' or limited and frustrating?

Google positions itself for 'next decade' of AI as Gemini CLI arrives with generous free tier

CloudBees opens MCP server so agents can infiltrate DevOps

AI is generating code at scale – but human scale code review can’t keep up

Redefining identity security in the age of agentic AI

GitLab warms up investors for winter release of agentic AI flavoured Duo Workflow

New Relic aims to crack open MCP servers

Shadow AI in the enterprise: managing risk without slowing progress

Cursor AI editor hits 1.0 milestone, including BugBot and high-risk background agents

Node.js frustrating and inefficient? OpenAI rewrites AI coding tool in Rust

Researchers warn of prompt injection vulnerability in GitHub MCP with no obvious fix

MCP will be built into Windows to make an 'agentic OS' but security will be a key concern

ABOUT US

FOLLOW US