Flink ML 2.0 paves way for better usability with algorithm implementations and Python SDK


After putting the project through a major refactoring and giving it a new home, the team behind machine learning library Flink ML has released version 2.0, sporting a redesigned API and initial Python support.

Flink ML is part of the Apache Flink stream processing framework and is meant to provide APIs and infrastructure for building machine learning pipelines. While earlier versions largely expected users to implement machine learning algorithms themselves with the included APIs, Flink ML 2.0 is the first iteration to include some basic algorithms out of the box.

To realise this, the Flink ML developers teamed up with their counterparts at Flink-based machine learning algorithm platform Alink. Together they redesigned the project’s APIs and reworked some of Alink’s implementations for Flink ML, a first step towards the Apache project’s long-term goal of providing a library of performant algorithms. As a start, developers now have logistic regression, k-means, k-nearest neighbors, naive Bayes, and one-hot encoder implementations at their disposal.

Since a full-blown algorithm library is still some way off, Flink ML 2.0 at least offers more help to those implementing algorithms on their own. A stage of a machine learning workflow can now accept multiple inputs and return more than one output, which lets algorithm developers describe workflows as directed acyclic graphs of pre-defined stages whose internals subsequent users don’t need to know much about.
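To make the idea of multi-input, multi-output stages wired into a directed acyclic graph more concrete, here is a minimal single-process sketch in plain Python. It does not use Flink ML; the `Stage` and `Graph` classes, the stage names, and the toy data are all hypothetical and only illustrate the concept described above.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class Stage:
    """Hypothetical workflow stage (not the Flink ML API).

    A stage takes any number of inputs and returns a list of outputs.
    """

    def __init__(self, name: str, fn: Callable[..., Sequence[list]]):
        self.name = name
        self.fn = fn

    def transform(self, *inputs: list) -> List[list]:
        return list(self.fn(*inputs))


class Graph:
    """Hypothetical DAG of stages.

    Each stage input is a reference (producer_name, output_index).
    Stages must be added in topological order.
    """

    def __init__(self) -> None:
        self.nodes: List[Tuple[Stage, List[Tuple[str, int]]]] = []

    def add(self, stage: Stage, inputs: List[Tuple[str, int]]) -> "Graph":
        self.nodes.append((stage, inputs))
        return self

    def run(self, sources: Dict[str, list]) -> Dict[str, List[list]]:
        # Every source is treated as a producer with a single output.
        results: Dict[str, List[list]] = {k: [v] for k, v in sources.items()}
        for stage, refs in self.nodes:
            args = [results[name][index] for name, index in refs]
            results[stage.name] = stage.transform(*args)
        return results


# One input, two outputs: split numbers into evens and odds.
split = Stage("split", lambda xs: ([x for x in xs if x % 2 == 0],
                                   [x for x in xs if x % 2 == 1]))
# One input, one output: scale every value by ten.
scale = Stage("scale", lambda xs: ([x * 10 for x in xs],))
# Two inputs, one output: merge both branches into one sorted list.
merge = Stage("merge", lambda a, b: (sorted(a + b),))

graph = (Graph()
         .add(split, [("source", 0)])
         .add(scale, [("split", 0)])
         .add(merge, [("scale", 0), ("split", 1)]))

results = graph.run({"source": [1, 2, 3, 4, 5, 6]})
print(results["merge"][0])
```

Someone running `graph.run(...)` only needs to know the source names and which outputs to read, not how the individual stages are implemented.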

Other convenient additions expose model data in a way useful for creating online learning applications, provide ways to split workflows up and use the resulting modules across workflows, and offer a native way of processing data in an iterative manner. The last of these comes in the form of a stream-batch unified iteration library, which was needed since the DataSet API previously used for bounded iteration is set for deprecation and had some performance issues. Its features include sending records back to preceding operators and tracking the progress of rounds inside an iteration.
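The two iteration features mentioned above, feeding results back to an earlier step and tracking rounds, can be illustrated with a small single-process analogy in plain Python. This is not Flink ML's iteration library, which works on distributed streams; the `iterate` helper and the one-dimensional k-means step are hypothetical and chosen only because k-means is one of the algorithms shipped with the release.

```python
from typing import Callable, List, Tuple

Body = Callable[[List[float], List[float], int], Tuple[List[float], bool]]


def iterate(initial: List[float], data: List[float], body: Body,
            max_rounds: int) -> Tuple[List[float], int]:
    """Hypothetical bounded iteration.

    Each round calls `body` with the current variable, the data, and the
    round number. The body's first return value is fed back as the
    variable for the next round; the second signals termination.
    """
    variable = initial
    rounds = 0
    for epoch in range(max_rounds):
        feedback, converged = body(variable, data, epoch)
        variable = feedback  # feedback edge: result returns to the loop head
        rounds = epoch + 1   # round tracking
        if converged:
            break
    return variable, rounds


def kmeans_step(centroids: List[float], points: List[float],
                epoch: int) -> Tuple[List[float], bool]:
    """One round of 1-D k-means: assign points, then recompute centroids."""
    clusters: List[List[float]] = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    updated = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
    # Converged once the centroids stop moving.
    return updated, updated == centroids


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, rounds = iterate([0.0, 5.0], points, kmeans_step, max_rounds=20)
print(centroids, rounds)
```

In this toy run the centroids settle after three rounds; a distributed implementation would additionally have to coordinate when every parallel worker has finished a round.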

While planning version 2.0, the Flink ML team looked into venturing into the realm of Python programming, given that the language has become a popular choice amongst ML practitioners. The result is a dedicated Python package with APIs similar to the ones already available to Java developers, which is meant to evolve into an interoperability tool that allows the mixing of Java and Python stages down the line.

Opening Flink ML for Python might help draw more interest to the library since it would provide a way in for those already familiar with popular ML frameworks such as TensorFlow and PyTorch. However, it’s not the only thing the Flink ML team wants to try in order to boost adoption. Higher development speed is also on their agenda, and since having its own repository has proven to help Flink’s Stateful Functions in that respect, Flink ML now has a separate home as well.

Meanwhile, sub-projects dl-on-flink and clink have been moved from Flink ML into the flink-extended GitHub organisation, a move the team hopes will facilitate collaboration with other contributors looking to extend Apache Flink’s functionality.