Databricks starts adding delete, update, merge capabilities to Delta Lake

Version 0.3 of Databricks’ open source project Delta Lake is now available to download, adding a set of programmatic APIs to the storage layer first introduced in April 2019.

The new release includes, for example, Scala/Java APIs that allow users to query a table’s commit history, so they can gain insight into who changed what, and when. A table’s history is stored for 30 days by default and contains information such as timestamps, the operations performed, and the users who prompted them.
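As a minimal Scala sketch, querying the history could look something like the following; it assumes an active SparkSession named spark, the Delta Lake library on the classpath, and a purely illustrative table path:

```scala
import io.delta.tables._

// Assumes an existing SparkSession `spark`; the path is a placeholder
val deltaTable = DeltaTable.forPath(spark, "/data/events")

// Full commit history: version, timestamp, operation, user, and more
val fullHistory = deltaTable.history()
fullHistory.show()

// Only the most recent operation
val lastOperation = deltaTable.history(1)
lastOperation.show()
```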

To modify data in a Delta table, v0.3 comes with Scala/Java APIs for delete, update, and merge operations. These can be used to deduplicate information or comply with data protection rules (delete), correct individual values (update), and upsert data from Spark DataFrames into a Delta table (merge). While the merge API is similar to SQL’s MERGE command, it also includes ways to delete data and lets users specify additional conditions when updating, inserting, or deleting.
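Roughly, the three operations look like the Scala sketch below; the table paths, column names, and the updates DataFrame are illustrative placeholders rather than anything prescribed by the project:

```scala
import io.delta.tables._
import org.apache.spark.sql.functions._

// Assumes an existing SparkSession `spark`; paths are placeholders
val deltaTable = DeltaTable.forPath(spark, "/data/events")

// Delete rows matching a predicate, e.g. to honour a data removal request
deltaTable.delete(col("userId") === "user-42")

// Update individual values of rows matching a predicate
deltaTable.update(
  col("eventType") === "clck",
  Map("eventType" -> lit("click")))

// Merge (upsert) data from a Spark DataFrame into the Delta table;
// `updatesDF` stands in for whatever DataFrame holds the incoming data
val updatesDF = spark.read.parquet("/data/incoming-events")

deltaTable.as("events")
  .merge(updatesDF.as("updates"), "events.eventId = updates.eventId")
  .whenMatched
  .updateExpr(Map("data" -> "updates.data"))
  .whenNotMatched
  .insertExpr(Map("eventId" -> "updates.eventId", "data" -> "updates.data"))
  .execute()
```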

Another addition is APIs for getting rid of old files. When running the so-called vacuum operation, files older than a set retention threshold are garbage collected to free up storage space. Be aware, though, that once a table has been vacuumed, you can no longer go back to a version older than the retention period. The default threshold is 7 days.
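A hedged sketch of how vacuuming might be invoked from Scala, again assuming an existing SparkSession and an illustrative path:

```scala
import io.delta.tables._

// Assumes an existing SparkSession `spark`; the path is a placeholder
val deltaTable = DeltaTable.forPath(spark, "/data/events")

// Remove files no longer referenced by the table and older than
// the default retention threshold of 7 days
deltaTable.vacuum()

// Alternatively, pass an explicit retention period in hours
deltaTable.vacuum(168)
```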

The next Delta Lake release is meant to land on September 15. For now, the team plans to include additional APIs so that delete, merge, and update aren’t restricted to the Scala/Java crowd anymore, but can be used by Python devs as well. The release is also supposed to come with a Java/Scala API for converting Parquet tables to Delta “without reading and writing the whole table to a new location”.

Databricks released Delta Lake earlier this year, dubbing it its “biggest innovation to date”. The storage layer is meant to improve the quality of data stored in a data lake. According to Databricks’ CEO Ali Ghodsi, companies tend to just dump data into lakes without thinking about data quality, which means doing something useful with it later on can prove tricky.

To improve that situation, Delta Lake is layered on top of Apache Spark or other big data engines and makes sure data going into a lake conforms to a schema. It can also perform ACID (short for atomicity, consistency, isolation, durability) transactions on data flowing through it. The focus on Apache Spark is down to the project stemming from the Databricks team.
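To give a rough idea of what that layering means in practice, here is a minimal Scala sketch of writing to and reading from a Delta table with Spark; the DataFrames and paths are placeholders rather than anything specific to the release:

```scala
// Assumes an existing SparkSession `spark`; paths are placeholders
val df = spark.read.json("/data/raw-events")

// Write a DataFrame out in Delta format; the schema is recorded with the table
df.write.format("delta").save("/data/events")

// Later appends are checked against that schema, so mismatched data is
// rejected instead of silently polluting the lake
val newData = spark.read.json("/data/more-raw-events")
newData.write.format("delta").mode("append").save("/data/events")

// Reads go through the same transaction log, giving a consistent snapshot
val events = spark.read.format("delta").load("/data/events")
```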

Version 0.3 is the second update since the project was introduced in April. Version 0.2, released in June, added support for cloud storage offerings such as Amazon S3 and Azure Blob Storage.