Databricks has introduced new features and a Data Ingestion Network in a bid to provide an “easy and automated way to populate your lakehouse from hundreds of data sources into Delta Lake”.
Delta Lake was open sourced by Databricks in April 2019. The storage layer is meant to improve data quality by admitting only schema-conforming data, while setting the rest aside so it can be fixed by a human user. Thanks to the wide interest Delta Lake generated, the project quickly found a new home at the Linux Foundation.
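As an illustration of the behaviour described above, a minimal PySpark sketch might append a batch to a Delta table and set it aside on a schema mismatch; the paths, the JSON source and the quarantine location are illustrative placeholders rather than details from Databricks:

```python
# Runs on a Databricks cluster, where `spark` comes preconfigured with Delta Lake.
from pyspark.sql.utils import AnalysisException

# Placeholder source of newly arrived records.
incoming = spark.read.json("s3://my-bucket/raw-events/")

try:
    # Schema enforcement: the append is rejected if `incoming` does not
    # match the schema of the existing Delta table at this path.
    incoming.write.format("delta").mode("append").save("/delta/events")
except AnalysisException:
    # Set the non-conforming batch aside for a human to inspect and fix.
    incoming.write.format("json").mode("overwrite").save("/quarantine/events/")
```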
Meanwhile, Lakehouse describes the company's "pattern of building a central, reliable and efficient single source of truth for data in an open format […] with decoupled storage and compute". This, however, comes with its own set of challenges, the biggest being data ingestion from various third-party sources and cloud storage.
According to Databricks’ Marketing Manager Hiral Jasani, without the right tools this process is “often hard, in many cases requiring custom development and dozens of connectors or APIs that change over time and then break the data loading process.”
This is why the company has partnered with data ingestion product providers, who have built "native integrations with Databricks to ingest and store data in Delta Lake directly in your cloud storage".
Besides the integration with Azure Data Factory, customers are now promised an easy way to use Fivetran, Qlik, Infoworks, StreamSets, and Syncsort in combination with Databricks. Additional support for Informatica, Segment, and Stitch is planned to follow soon.
Along with the release, Databricks' engineering team also set out to tackle the latencies and costs that arise when loading raw files from cloud storage into Delta tables. The solution they came up with is called Auto Loader, "an optimised file source" that is advertised as being easy to use, scalable, and free of the need for file state management.
Staff product manager Prakash Chockalingam explains the project in an introductory blog post, stating that users will only "need to provide a source directory path and start a streaming job. The new structured streaming source, called 'cloudFiles', will automatically set up file notification services that subscribe [to] file events from the input directory and process new files as they arrive, with the option of also processing existing files in that directory."
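Based on that description, a minimal PySpark sketch of an Auto Loader streaming job could look like the following; the schema, bucket path, file format and checkpoint location are placeholders rather than values from the announcement:

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Illustrative schema for the incoming JSON files.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# The "cloudFiles" source picks up new files as they land in the directory.
raw_events = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")   # format of the raw files (assumed)
    .schema(event_schema)
    .load("s3://my-bucket/raw-events/")    # placeholder source directory
)

# Continuously append the ingested records to a Delta table.
(
    raw_events.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/events/_checkpoints")  # placeholder
    .start("/delta/events")                                      # placeholder target
)
```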
While Auto Loader guarantees exactly-once data ingestion for streaming loads, those looking to work with batch loads now have a new idempotent COPY command available to them. Since it can be rerun should a failure occur, it seems like a good choice for data pipelines. Examples and more details can be found on the Databricks blog.
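For the batch route, the command can be issued from a notebook via Spark SQL; the table name, source location and file format below are illustrative placeholders, not examples from Databricks:

```python
# Idempotent batch load: files that have already been loaded into the table
# are skipped, so the command can simply be rerun after a failure.
spark.sql("""
    COPY INTO sales_events                      -- placeholder target Delta table
    FROM 's3://my-bucket/raw-events/2020/02/'   -- placeholder source location
    FILEFORMAT = JSON
""")
```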