Apache Arrow offers Flight to counteract data access headaches

Apache Arrow offers Flight to counteract data access headaches

The team behind in-memory data development platform Apache Arrow has introduced its new fast data transport framework Flight to the public. 

The project apparently was motivated by the “pain associated with accessing large datasets over a network”. According to pandas creator and Arrow co-founder Wes McKinney, Flight helps to ease those troubles by increasing “the efficiency of distributed data systems”.

It does so through a protocol for data services that uses the Arrow columnar format as a public API and for over-the-wire data representation. The latter means data doesn’t have to be serialised on receipt anymore, removing the serialisation cost associated with data transport.

Flight libraries are meant for implementing services to send and receive data streams and know basic requests for tasks like authorisation, sending and receiving data streams from a client, and returning data stream schemas and lists. At this stage the project relies heavily on Google’s gRPC library which for example allows Flight to let clients and servers exchange data while requests are being served, which allows for the building of better scalable service. 

The project follows the best practice of using Protocol Buffers to make gRPC work. This comes with the added benefit of offering gRPC clients not aware of the Arrow format a way of interacting with Flight services. However, since the use of “protobuf” isn’t free, the Flight engineers optimised their code to work largely without memory copying and deserialisation, and generally kept the use of the library to a minimum.

Additional Flight features should improve horizontal scalability and let developers define actions for operations like metadata discovery or parameter setting. More information can be found in the introductory blog entry.

Those interested in the project can try it via the latest Apache Arrow release. Version 0.15 was issued in early October and includes C++ (with Python bindings) and Java implementations of Flight. The libraries are still in beta, the team however only expects minor changes to API and protocol.

Moving forward, the Flight developers will keep looking into alternatives for gRPC, at least as far as data transport layers go. At the moment the project only supports TCP via gRPC, but plans to put some design and development time in making RDMA for example work as well. Other items on the team’s to-do list include creating user-facing Flight-enabled services and improving the documentation.