Databricks starts adding delete, update, merge capabilities to Delta Lake

Version 0.3 of Databricks’ open source project Delta Lake is now available to download, adding a set of programmatic APIs to the storage layer first introduced in April 2019.

The new release includes, for example, Scala/Java APIs that allow users to query a table’s commit history, so they can gain insight into who changed what, and when. A table’s history is stored for 30 days by default and contains information such as timestamps, the operations performed, and the users who prompted them.
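As a minimal Scala sketch, querying the history could look something like the following; it assumes an active SparkSession named spark, the Delta Lake library on the classpath, and a purely illustrative table path:

```scala
import io.delta.tables._

// Assumes an existing SparkSession `spark`; the path is a placeholder
val deltaTable = DeltaTable.forPath(spark, "/data/events")

// Full commit history: version, timestamp, operation, user, and more
val fullHistory = deltaTable.history()
fullHistory.show()

// Only the most recent operation
val lastOperation = deltaTable.history(1)
lastOperation.show()
```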

To modify data in a Delta table, v0.3 comes with Scala/Java APIs for delete, update, and merge operations. These can be used to deduplicate information or comply with data protection rules (delete), correct individual values (update), and upsert data from Spark DataFrames into a Delta table (merge). While the merge API is similar to SQL’s MERGE command, it also includes ways to delete data and lets users specify additional conditions when updating, inserting, or deleting.
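Roughly, the three operations look like the Scala sketch below; the table paths, column names, and the updates DataFrame are illustrative placeholders rather than anything prescribed by the project:

```scala
import io.delta.tables._
import org.apache.spark.sql.functions._

// Assumes an existing SparkSession `spark`; paths are placeholders
val deltaTable = DeltaTable.forPath(spark, "/data/events")

// Delete rows matching a predicate, e.g. to honour a data removal request
deltaTable.delete(col("userId") === "user-42")

// Update individual values of rows matching a predicate
deltaTable.update(
  col("eventType") === "clck",
  Map("eventType" -> lit("click")))

// Merge (upsert) data from a Spark DataFrame into the Delta table;
// `updatesDF` stands in for whatever DataFrame holds the incoming data
val updatesDF = spark.read.parquet("/data/incoming-events")

deltaTable.as("events")
  .merge(updatesDF.as("updates"), "events.eventId = updates.eventId")
  .whenMatched
  .updateExpr(Map("data" -> "updates.data"))
  .whenNotMatched
  .insertExpr(Map("eventId" -> "updates.eventId", "data" -> "updates.data"))
  .execute()
```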

Another addition is APIs for getting rid of old files. When running the so-called vacuum operation, files older than a set retention threshold are garbage collected to free up storage space. Be aware, though, that once a table has been vacuumed, you can no longer go back to a version older than the retention period. The default threshold is 7 days.
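A hedged sketch of how vacuuming might be invoked from Scala, again assuming an existing SparkSession and an illustrative path:

```scala
import io.delta.tables._

// Assumes an existing SparkSession `spark`; the path is a placeholder
val deltaTable = DeltaTable.forPath(spark, "/data/events")

// Remove files no longer referenced by the table and older than
// the default retention threshold of 7 days
deltaTable.vacuum()

// Alternatively, pass an explicit retention period in hours
deltaTable.vacuum(168)
```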

The next Delta Lake release is meant to land on September 15. For now, the team plans to include additional APIs so that delete, merge, and update aren’t restricted to the Scala/Java crowd anymore, but can be used by Python devs as well. The release is also supposed to come with a Java/Scala API for converting Parquet tables to Delta “without reading and writing the whole table to a new location”.

Databricks released Delta Lake earlier this year, dubbing it its “biggest innovation to date”. The storage layer is meant to improve the quality of data stored in a data lake. According to Databricks’ CEO Ali Ghodsi, companies tend to just dump data into lakes without thinking about data quality, which means doing something useful with it later on can prove tricky.

To improve that situation, Delta Lake is layered on top of Apache Spark or other big data engines and makes sure data going into a lake conforms to a schema. It can also perform ACID (short for atomicity, consistency, isolation, durability) transactions on data flowing through it. The focus on Apache Spark is down to the project stemming from the Databricks team.
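To give a rough idea of what that layering means in practice, here is a minimal Scala sketch of writing to and reading from a Delta table with Spark; the DataFrames and paths are placeholders rather than anything specific to the release:

```scala
// Assumes an existing SparkSession `spark`; paths are placeholders
val df = spark.read.json("/data/raw-events")

// Write a DataFrame out in Delta format; the schema is recorded with the table
df.write.format("delta").save("/data/events")

// Later appends are checked against that schema, so mismatched data is
// rejected instead of silently polluting the lake
val newData = spark.read.json("/data/more-raw-events")
newData.write.format("delta").mode("append").save("/data/events")

// Reads go through the same transaction log, giving a consistent snapshot
val events = spark.read.format("delta").load("/data/events")
```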

Version 0.3 is the second update since the project was introduced in April. Version 0.2, released in June, added support for cloud storage offerings such as Amazon S3 and Azure Blob Storage.