IBM has decided to share its CodeFlare framework as an open-source project with the broader machine learning community. The aim: to simplify the scaling and integration of complex, multi-step ML pipelines and analytics workflows.
According to Raghu Ganti, Principal Research Staff Member at IBM, who spoke about the project at Ray Summit last month, CodeFlare is mainly aimed at data scientists and other practitioners in the machine learning space who aren't necessarily familiar with distributed computing techniques, but whose workloads could benefit from a bit of parallelism.
To assist them, CodeFlare provides a scalable pipeline implementation plus deployment helpers for running tasks on Red Hat OpenShift and IBM Cloud Code Engine. Users then have an easier time switching to a Kubernetes-based platform, should the need arise to jump from a local to a cluster deployment.
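For a flavour of the programming model, here is a minimal sketch of a two-node pipeline wrapping ordinary scikit-learn estimators, loosely following the example in the project's README; the node names and toy dataset are our own illustration, so check the documentation for the authoritative API.

```python
# A minimal CodeFlare pipeline sketch, loosely based on the project's README;
# node names and the toy dataset are illustrative assumptions.
import ray
import codeflare.pipelines.Datamodel as dm
import codeflare.pipelines.Runtime as rt
from codeflare.pipelines.Runtime import ExecutionType
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

ray.init()
X, y = load_iris(return_X_y=True)

# Each node wraps an ordinary scikit-learn estimator.
pipeline = dm.Pipeline()
node_scale = dm.EstimatorNode('scale', MinMaxScaler())
node_tree = dm.EstimatorNode('tree', DecisionTreeClassifier(max_depth=3))
pipeline.add_edge(node_scale, node_tree)

# Feed the input node and let the Ray-backed runtime fit the whole graph.
pipeline_input = dm.PipelineInput()
pipeline_input.add_xy_arg(node_scale, dm.Xy(X, y))
pipeline_output = rt.execute_pipeline(pipeline, ExecutionType.FIT, pipeline_input)

# Results come back as references to objects in Ray's distributed store.
fitted_refs = pipeline_output.get_xyrefs(node_tree)
```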
Of course, CodeFlare isn't the first project to offer pipelines for data-heavy ML scenarios: Kubeflow, scikit-learn, and Apache Spark are just a few examples working towards the same goal. For IBM's purposes, however, each of these fell short in certain respects, so CodeFlare tries to combine their approaches while mitigating their shortcomings.
The company is, for instance, determined to have Python (and Java) functions, rather than containers, serve as the unit of computation from the data scientist's perspective, since containers strike IBM as a bit too coarse-grained for its purposes. It also strives to provide both data and task parallelism, as well as efficient data exchange between environments.
Under the hood, CodeFlare uses a framework called Ray. The open-source project was originally developed at the University of California, Berkeley's RISELab and is designed to support the building of distributed applications. As the place of origin suggests, Ray was created with AI and machine learning use cases in mind, which fits IBM's bill quite nicely.
Ray’s designers wrote in 2018 that their aim in creating it was to “enable practitioners to turn a prototype algorithm that runs on a laptop into a high-performance distributed application that runs efficiently on a cluster (or on a single multi-core machine) with relatively few additional lines of code”.
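In practice that jump is small: decorating a Python function turns it into a remote task, calls return immediately with references, and results live in Ray's distributed object store until fetched. A minimal, generic Ray sketch, not CodeFlare-specific:

```python
import ray

ray.init()  # starts a local Ray runtime; on a cluster this would connect to it instead

@ray.remote
def square(x):
    # An ordinary Python function, now schedulable as a distributed task.
    return x * x

# Calls return object references immediately; the work runs in parallel.
refs = [square.remote(i) for i in range(8)]

# ray.get blocks until the results are materialised.
print(ray.get(refs))  # [0, 1, 4, 9, 16, 25, 36, 49]
```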
CodeFlare takes up this approach by using the primitives the Ray framework offers (computational scale-out, objects, and a distributed object store) and mapping them to the pipeline use case. It also tries to learn from the shortcomings of Spark and scikit-learn, improving pipeline scaling by using lists of object references as input and output, which should, among other things, achieve better parallelism.
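Expressed in plain Ray terms (our own illustration, not CodeFlare's internal code), the idea looks roughly like this: each stage consumes and produces lists of object references, so data flows directly between tasks in the object store instead of being funnelled through the driver.

```python
import ray

ray.init()

@ray.remote
def preprocess(batch):
    # First pipeline stage: normalise a batch.
    return [v / 10 for v in batch]

@ray.remote
def score(batch):
    # Second pipeline stage: reduce a batch to a single number.
    return sum(batch)

# Put raw batches into the distributed object store once.
batches = [ray.put(list(range(i, i + 5))) for i in range(0, 20, 5)]

# Stages exchange lists of object references, not the data itself; Ray
# resolves each reference to its value only where the task actually runs.
cleaned_refs = [preprocess.remote(b) for b in batches]
score_refs = [score.remote(c) for c in cleaned_refs]

print(ray.get(score_refs))  # [1.0, 3.5, 6.0, 8.5]
```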
CodeFlare also introduces AND/OR graphs to represent pipelines, alongside the commonly used directed acyclic graphs. This lets it work with distinct input, firing, state, and output semantics, which can be combined in different ways to speed up processing.
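As a rough intuition only (the mapping below is our simplification in plain Ray, not CodeFlare's actual node API): an OR-style node can fire once per arriving input, while an AND-style node waits until all of its inputs are available.

```python
import ray

ray.init()

@ray.remote
def transform(x):
    # OR-style behaviour: one downstream task fires per input that arrives.
    return x * 2

@ray.remote
def merge(*parts):
    # AND-style behaviour: fires only once all upstream outputs exist.
    return sum(parts)

inputs = [ray.put(v) for v in (1, 2, 3)]
or_outputs = [transform.remote(ref) for ref in inputs]  # three independent firings
and_output = merge.remote(*or_outputs)                  # a single firing over all inputs

print(ray.get(and_output))  # (1 + 2 + 3) * 2 = 12
```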
Practical CodeFlare examples can be found in the project’s documentation. To use the framework, data scientists need to have Python 3.8 or newer and a version of JupyterLab installed.
CodeFlare is still pretty new, so it may not be ripe for production use just yet. In the coming months, IBM plans to enhance the project by improving fault tolerance and consistency, and by refining integration and data management for external sources. The company is also looking into pipeline visualisation, which would surely be a useful addition, so keep your eyes peeled.