Apache Arrow hits 1.0 bullseye with more stable columnar format

By Julia Schmidt

July 29, 2020

The Apache Arrow project has released version 1.0 of its in-memory analytics development platform, touting binary stability of the columnar format and a reworked compute kernel layer, amongst other things.

The platform is meant to enable big data systems to quickly process and move data, providing a data memory format for analytics operations. According to the project-management committee, version 1.0 is actually the 18th major release of Arrow, and the team will now switch to semantic versioning for Arrow libraries, probably to make versions that break compatibility easier to spot.

Part of the version bump is also the stabilisation of the Arrow columnar format, which is made up of a language-agnostic in-memory data structure specification, protocols for serialisation and data transport, as well as metadata serialisation. The format’s current iteration no longer insists on dictionary indices being signed integers, and comes without a validity bitmap buffer in Union types, but includes an optional bitWidth field for decimal types.
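To make the dictionary change a little more concrete, here is a minimal sketch in plain Python of a dictionary-encoded column whose indices are stored as unsigned 8-bit integers, alongside a validity bitmap that uses least-significant-bit numbering as the Arrow format does. The function names are hypothetical and the code does not use the Arrow libraries; it only illustrates the idea.

```python
from array import array


def dictionary_encode(values):
    """Encode strings (or None) as a dictionary, unsigned indices and a validity bitmap.

    Hypothetical helper for illustration only; not part of any Arrow library.
    """
    dictionary = []                                # unique values, first-seen order
    positions = {}                                 # value -> slot in the dictionary
    indices = array("B")                           # unsigned 8-bit indices
    validity = bytearray((len(values) + 7) // 8)   # one bit per slot, initially all null

    for i, value in enumerate(values):
        if value is None:
            indices.append(0)                      # placeholder, masked by the validity bit
            continue
        if value not in positions:
            if len(dictionary) == 256:
                raise OverflowError("too many distinct values for 8-bit indices")
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
        validity[i // 8] |= 1 << (i % 8)           # mark slot i as valid (LSB numbering)
    return dictionary, indices, bytes(validity)


def dictionary_decode(dictionary, indices, validity):
    """Rebuild the original values from the three buffers."""
    decoded = []
    for i, index in enumerate(indices):
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        decoded.append(dictionary[index] if is_valid else None)
    return decoded


dictionary, indices, validity = dictionary_encode(["a", "b", None, "a"])
print(dictionary, list(indices), validity)
print(dictionary_decode(dictionary, indices, validity))
```

Unsigned indices can address twice as many dictionary entries as signed ones of the same width, which is one plausible reason to allow them.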

It also contains a feature enum to signal the use of some optional IPC stream features – though the field can’t be found in any implementation yet – and alterations in the buffer layout of Union types in the metadata. Amongst other things, the latter means Union arrays can no longer have a top-level bitmap. Datasets can now be read from CSV files and the system knows how to assemble datasets of Parquet files from a single _metadata file.
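What a Union array without a top-level bitmap looks like can be sketched in plain Python, assuming the dense union layout with a buffer of type ids and a buffer of offsets into the child arrays. The function name and sample data below are made up for illustration, and the code does not use the Arrow libraries. The point to note is that a null slot is decided by the child it refers to, not by the union itself.

```python
from array import array


def dense_union_values(type_ids, offsets, children):
    """Resolve every slot of a dense union to the value held by its child array.

    Hypothetical helper for illustration; the union keeps no validity bitmap
    of its own, so a null only shows up if the selected child slot is null.
    """
    return [children[type_id][offset] for type_id, offset in zip(type_ids, offsets)]


# Two child arrays: child 0 holds integers (one of them null), child 1 holds strings.
children = [[1, None, 3], ["x"]]
type_ids = array("b", [0, 1, 0, 0])   # signed 8-bit: which child each slot uses
offsets = array("i", [0, 0, 1, 2])    # signed integers: position inside that child

print(dense_union_values(type_ids, offsets, children))
```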

Looking into the features of Apache Arrow 1.0, the project team fitted the compute kernel layer with capabilities for generic function lookup, dispatch, and execution. It also altered the framework to handle tasks like type checking and function dispatch, which is supposed to facilitate writing new function kernels since developers no longer have to take care of those things.
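The general idea of generic lookup and dispatch can be sketched as a small registry that picks a kernel by function name and input types, so kernel authors do not have to repeat the type checks themselves. This is a simplified illustration in plain Python under that assumption. The Column and FunctionRegistry names here are hypothetical and do not mirror Arrow's actual C++ API.

```python
from collections import namedtuple

# A toy column: a type name plus its values (hypothetical, for illustration).
Column = namedtuple("Column", ["type", "values"])


class FunctionRegistry:
    """Looks up a function by name and dispatches to a kernel by input types."""

    def __init__(self):
        self._functions = {}  # name -> {tuple of input type names -> kernel}

    def register(self, name, input_types, kernel):
        self._functions.setdefault(name, {})[tuple(input_types)] = kernel

    def call(self, name, *columns):
        kernels = self._functions.get(name)
        if kernels is None:
            raise KeyError(f"no function named {name!r}")
        signature = tuple(column.type for column in columns)
        kernel = kernels.get(signature)
        if kernel is None:
            # Type checking lives here, not inside each kernel.
            raise TypeError(f"{name!r} has no kernel for input types {signature}")
        return kernel(*(column.values for column in columns))


registry = FunctionRegistry()
registry.register("add", ("int64", "int64"),
                  lambda a, b: [x + y for x, y in zip(a, b)])

result = registry.call("add", Column("int64", [1, 2]), Column("int64", [3, 4]))
print(result)
```

A kernel registered this way only ever sees inputs of the types it declared, which is the kind of bookkeeping the reworked layer is said to take off developers' hands.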

While the Arrow Core sports a new sort kernel and support for large lists, DataFusion has become easier to use: it now works with named columns and offers ways to share an ExecutionContext between threads. Changes specific to working with certain languages seem to be comparatively minor in nature, though improvements in static linking for C++ devs, better conversion of Arrow types in R, and support for the Arrow Dataset in Ruby are well worth a try.

The Flight libraries for fast data transport introduced last October have been extended to provide a bi-directional data endpoint and to support mutual TLS authentication, as well as the ability for clients to control the size of a data message in transit. Flight’s C++ and Python flavours meanwhile expose more options from gRPC, such as the address of a client on the server. Details can be found on the project’s blog.
