Databricks, the company founded by the minds behind the big data analytics engine Apache Spark, has open-sourced its storage layer Delta Lake in a bid to fit Spark and similar engines with ACID transactions.
Dubbing the new project “the next step in the evolution of the data journey” leading to a “paradigm shift”, Databricks CEO Ali Ghodsi feels pretty confident about his company’s newest creation. To clarify why that is, a bit of historical context is needed: “For the last decade, most enterprises have started data lakes. These data lakes are good at capturing data, so they’re storing all their data in it.”
“The promise has been that once they have all that data, all enterprises on the planet can do machine learning, they can get to real-time use cases, they can do business intelligence, they can do reporting – they can do all these things in one place: the data lake.” In reality, however, things tend to look different, with most of these projects failing, Ghodsi muses.
“Just storing lots and lots and lots of data and dumping it into a data lake, doesn’t mean that you can later actually do something useful with it. So really the problems people are having are about data quality, reliability, scalability, and performance. That’s what Delta Lake addresses.”
Delta Lake is built on Apache Spark and is designed to sit on top of an existing data lake to improve data quality. It does that by making sure data funneled through it conforms to a predefined schema – if it does, the data stays in Delta Lake; otherwise it is sent back into the data lake, where it is put into quarantine. Users can then have a look at the quarantined data, check whether the issues the system flagged are fixable, and feed it back into Delta Lake afterwards.
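A minimal sketch of that flow using the open-source PySpark API gives an idea of what schema enforcement looks like in practice; the table path, column names, and quarantine location are made up for illustration, and the exact packaging of the Delta Lake dependency has varied between releases.

```python
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

# Assumes a Spark session started with the Delta Lake package on the classpath,
# e.g. spark-submit --packages io.delta:delta-core_2.12:<version>.
spark = SparkSession.builder.appName("delta-schema-demo").getOrCreate()

table_path = "/tmp/delta/events"  # hypothetical table location

# The first write defines the table's schema.
spark.createDataFrame(
    [(1, "click"), (2, "view")], ["event_id", "event_type"]
).write.format("delta").mode("append").save(table_path)

# A later batch that does not conform is rejected rather than silently
# appended; here the offending rows are parked in a quarantine area so they
# can be inspected, fixed, and re-ingested later.
bad_batch = spark.createDataFrame([("oops", 3.14)], ["event_id", "price"])
try:
    bad_batch.write.format("delta").mode("append").save(table_path)
except AnalysisException:
    bad_batch.write.mode("append").parquet("/tmp/quarantine/events")
```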
As data flows into Delta Lake, the project is able to apply ACID (atomic, consistent, isolated, and durable) transactions to it. Each transaction either succeeds completely or the system deletes any residue and tries again, so that the Delta Lake data stays high-quality. To Ghodsi this is one of the main differences from today’s data lakes, where, for example, if an ingestion process fails halfway through, half of the data will be in the lake and the rest missing.
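The mechanism behind that guarantee is a transaction log: new data files only become visible once a commit entry referencing them has been written. The toy Python sketch below illustrates the general idea only – it is not Delta Lake’s actual on-disk protocol, and all paths are invented.

```python
import json
import os
import uuid

TABLE = "/tmp/toy_table"           # hypothetical table directory
LOG = os.path.join(TABLE, "_log")  # hypothetical commit log directory

def write_batch(rows):
    """Write a batch so that it becomes visible all at once, or not at all."""
    os.makedirs(LOG, exist_ok=True)
    # 1. Write the data file first. If the job dies here, no commit entry
    #    exists, so readers never see the half-finished batch.
    data_file = os.path.join(TABLE, f"part-{uuid.uuid4().hex}.json")
    with open(data_file, "w") as f:
        json.dump(rows, f)
    # 2. Only then record the commit. This single small write is the atomic
    #    step that makes the whole batch visible.
    version = len(os.listdir(LOG))
    with open(os.path.join(LOG, f"{version:020d}.json"), "w") as f:
        json.dump({"add": data_file}, f)

def read_table():
    """Reconstruct the table from committed log entries only."""
    rows = []
    for entry in sorted(os.listdir(LOG)):
        with open(os.path.join(LOG, entry)) as f:
            data_file = json.load(f)["add"]
        with open(data_file) as f:
            rows.extend(json.load(f))
    return rows
```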
Having transactions also means being able to use transactional operations such as update and delete on Delta Lake tables, which can be far less computationally expensive than the standard batch operations on regular data lakes. To help with performance and scalability, Delta Lake uses Spark to distribute and operate on the metadata itself, which is often a bottleneck.
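With the DeltaTable utility class that ships with the open-source project, those operations look roughly like the sketch below; the SparkSession, table path, and column names are carried over from the hypothetical example above.

```python
from delta.tables import DeltaTable
from pyspark.sql.functions import col, lit

# Reuses the SparkSession and hypothetical table path from the earlier sketch.
events = DeltaTable.forPath(spark, "/tmp/delta/events")

# Delete rows matching a predicate - a transactional operation rather than a
# full rewrite-and-reload batch job.
events.delete(col("event_type") == "debug")

# Update selected rows in place.
events.update(
    condition=col("event_id") == 2,
    set={"event_type": lit("page_view")},
)
```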
If you’re wondering about the project’s name, it is mainly down to the way metadata about incoming data is stored. Delta files keep track of the transactions that have been applied to the data, which also gives users the chance to “time travel”: queries and operations can be run against the data as it stood at any earlier point in time, without having to restore deleted data first. This can be useful for compliance reasons, for example.
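In practice, time travel surfaces as read options on the table; a brief sketch, again using the hypothetical path from above:

```python
# Read the table as it stood at an earlier version of the transaction log...
v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/tmp/delta/events")
)

# ...or as of a given timestamp (value chosen arbitrarily for illustration).
as_of_launch = (
    spark.read.format("delta")
    .option("timestampAsOf", "2019-04-24 00:00:00")
    .load("/tmp/delta/events")
)
```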
The roadmap for the year ahead was mainly informed by customer feedback collected during production trials over the past couple of months. This phase already saw an exabyte of data being processed by Delta Lake, with Apple among the companies trying out the project. According to Databricks’ Michael Armbrust, ease of ingesting data from the cloud is one of the main issues the team will tackle next.
A feature that goes along with that is the “expectation” concept. It will allow users to formulate what quality means to them – or, more precisely, what useful data should look like beyond a mere schema – so that the tool can make sure ingested data matches those criteria. In the long run, customers are meant to be able to declaratively specify what the structure of the data lake should be. This includes things like where certain tables have to be stored to ensure compliance, whether human-readable descriptions have to be applied, and where data should be discoverable.
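Since the expectations API was still on the roadmap at the time, the snippet below is only an illustration of the concept using plain DataFrame predicates: a quality rule that incoming data must satisfy, with non-conforming rows routed to quarantine. Every path and column name here is hypothetical.

```python
from pyspark.sql.functions import col

# A hand-rolled "expectation": event_id must be present and event_type non-empty.
expectation = col("event_id").isNotNull() & (col("event_type") != "")

incoming = spark.read.json("/tmp/raw/events")  # hypothetical raw source

# Rows that meet the expectation go into the Delta table; the rest are
# quarantined for later inspection, mirroring the flow described above.
good = incoming.filter(expectation)
bad = incoming.filter(~expectation)

good.write.format("delta").mode("append").save("/tmp/delta/events")
bad.write.mode("append").parquet("/tmp/quarantine/events")
```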