Databricks hands MLflow to Linux Foundation, speeds up Delta Lake, and pushes pandas on Spark forward

Data science conference Spark+AI Summit is still in full swing. Organised by Spark specialists Databricks, it naturally served as a stage for progress reports and a chance for the firm to show off new products.

Amongst other things, machine learning platform MLflow has found a new home at the Linux Foundation. After two years in the open, the Databricks project gets a vendor-neutral environment, in the hope that this will lead to higher adoption rates and more outside committers.

While this seems like a somewhat sensible thing to do, onlookers might wonder about the choice of foundation. After all, company co-founder and MLflow creator Matei Zaharia chose the Apache Software Foundation for Spark. A second glance, however, reveals that the Linux Foundation seems to have become a bit of a default for Databricks in recent years, since the company’s Delta Lake was also handed over to the org last autumn.

All this could be an indication of the Linux Foundation’s growth in relevance over the last few years (or of its clearly effective marketing department) and a nod to its engagement in the machine learning community. Nevertheless, its AI subdivision seems, model standardisation project ONNX aside, a bit bland, containing projects like Acumos AI and Angel that are mostly known to those deep down the machine learning rabbit hole.

It will be interesting to see whether the move gives MLflow the hoped-for push to catch up with its sparkly older brother. Newly announced features such as experiment autologging and better model management should surely help along the way.
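
For those wondering what experiment autologging looks like in practice, a minimal sketch along the following lines gives the general idea. It assumes a recent MLflow installation with scikit-learn autologging available; the model and dataset are just placeholders for illustration.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Enable autologging for scikit-learn: parameters, metrics and the
# trained model are recorded without explicit log_* calls.
mlflow.sklearn.autolog()

X, y = load_diabetes(return_X_y=True)

with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=100, max_depth=6)
    model.fit(X, y)  # fitting the model triggers the autologged run data
```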

In other tooling news, Databricks announced the first major release of Koalas, a project aiming to bring the popular dataframe implementation pandas to the Spark platform. Koalas was introduced by the company in 2019 “to provide data scientists using pandas with a way to scale their existing big data workloads by running them on Apache Spark without significantly modifying their code”.

Since then, a lot of stabilisation work has been done, so that version 1.0 now implements “almost all widely used APIs and features in pandas”. It also offers a Spark accessor to let devs leverage PySpark APIs, and comes with better support for Python type hints, amongst other things.
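
To give a rough idea of what that looks like in practice, here is a minimal sketch, assuming Koalas 1.0 on a Spark environment such as Databricks; the column names and data are made up, and the accessor call shown is meant to illustrate dropping down to the underlying Spark DataFrame.

```python
import databricks.koalas as ks

# A Koalas DataFrame behaves like a pandas DataFrame,
# but operations run on Spark under the hood.
kdf = ks.DataFrame({
    "city": ["Berlin", "London", "Paris", "Berlin"],
    "sales": [10, 25, 17, 8],
})

# Familiar pandas-style API
summary = kdf.groupby("city")["sales"].sum()
print(summary.sort_index())

# The Spark accessor exposes the underlying PySpark objects
# when the pandas-like API is not enough.
sdf = kdf.spark.frame()
sdf.printSchema()
```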

Spark aficionados looking for more cutting-edge news also got their fair share of things to try out with the newly released Delta Engine. As the name suggests, the project is bound to storage layer Delta Lake, which promised users an improvement in data quality for data lakes when it was announced last year. Delta Engine is now meant to improve the open source project’s performance by adding “an improved query optimizer, a caching layer that sits between the execution layer and the cloud object storage, and a native vectorized execution engine that’s written in C++”.
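
For context, Delta Lake itself is driven through the standard Spark DataFrame API, which is presumably where Delta Engine’s optimisations are meant to kick in transparently. A minimal read/write sketch might look like the following; the path, data and app name are placeholders, and outside Databricks the delta-core package needs to be available to Spark.

```python
from pyspark.sql import SparkSession

# Assumes the delta-core package is on the classpath
# (it ships out of the box on Databricks).
spark = SparkSession.builder.appName("delta-example").getOrCreate()

events = spark.createDataFrame(
    [("2020-06-24", "click"), ("2020-06-24", "view")],
    ["date", "action"],
)

# Write the DataFrame as a Delta table; "delta" is the storage format
# that Delta Lake (and hence Delta Engine) operates on.
events.write.format("delta").mode("overwrite").save("/tmp/events_delta")

# Reading it back goes through the same DataFrame API.
spark.read.format("delta").load("/tmp/events_delta").show()
```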

Another development could become interesting in the coming months, as Databricks plans a hosted version of newly acquired Redash. The company, which counts Cloudflare and Mozilla among its customers, is known for the collaborative data visualisation and dashboarding capabilities of its open source project of the same name. Interested parties can already register for closed previews of the integrated Databricks/Redash “experience” set to arrive later this year.