Apache Arrow hits 1.0 bullseye with more stable columnar format

Apache Arrow hits 1.0 bullseye with more stable columnar format

Apache Arrow has released version 1.0 of its in-memory analytics development platform, touting binary stability of the columnar format, and a reworked compute kernel layer amongst other things.

The platform is meant to enable big data systems to quickly process and move data, providing a data memory format for analytics operations. According to the project-management committee, version 1.0 is actually the 18th major release of Arrow, and the team will now switch to semantic versioning for Arrow libraries, probably to make versions that break compatibility easier to spot. 

Part of the version bump is also the stabilisation of the Arrow columnar format, which is made up of a language-agnostic in-memory data structure specification, protocols for serialisation and data transport, as well as metadata serialisation. The format’s current iteration no longer insists on dictionary indices being signed integers, and comes without a validity bitmap buffer in Union types, but includes an optional bitWidth field for decimal types. 

It also contains a feature enum to signal the use of some optional IPC stream features – though the field can’t be found in any implementation yet – and alterations in the buffer layout of Union types in the metadata. Amongst other things, the latter means Union arrays can no longer have a top-level bitmap. Datasets can now be read from CSV files and the system knows how to assemble datasets of Parquet files from a single _metadata file.

Looking into the features of Apache Arrow 1.0, the project team fitted the compute kernel layer with capabilities for generic function lookup, dispatch, and execution. It also altered the framework to handle tasks like type checking and function dispatch, which is supposed to facilitate writing new function kernels since developers no longer have to take care of those things.

While the Arrow Core sports a new sort kernel and support for large lists, DataFusion’s UX has been improved to work with named columns and ways to share an ExecutionContext between threads. Changes specific to working with certain languages seem to be comparatively minor in nature, though improvements in static linking for C++ devs, better conversion of Arrow types in R, and support for the Arrow Dataset in Ruby are well worth a try.

The Flight libraries for fast data transport introduced last October have been extended to provide a bi-directional data endpoint and support mutual TLS authentication as well the ability for clients to control the size of a data message in transit.  Flight’s C++ and Python flavours meanwhile expose options like the address of a client on the server from the gRPC project. Details can be found on the project’s blog.