Databricks tries making data sources more digestible

Databricks has introduced new features and a Data Ingestion Network in a bid to provide an “easy and automated way to populate your lakehouse from hundreds of data sources into Delta Lake”. 

Delta Lake was premiered by Databricks in April 2019. The storage layer is meant to improve data quality by only letting schema-conforming data through, while setting the rest aside to be potentially fixed by a human user. Thanks to the wide interest Delta Lake spurred, the project quickly found a new home at the Linux Foundation.
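
For illustration, a minimal PySpark sketch of what that schema enforcement looks like in practice could read as follows; the table path and column names are made up, and a Spark session with the Delta Lake libraries available (for example on a Databricks cluster) is assumed:

```python
# Minimal sketch of Delta Lake schema enforcement; the table path and column
# names are hypothetical. Assumes a Spark session with the Delta Lake
# libraries on the classpath, e.g. on a Databricks cluster.
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Create a Delta table with a fixed schema: (id: long, amount: double).
spark.createDataFrame([Row(id=1, amount=9.99)]) \
    .write.format("delta").mode("overwrite").save("/tmp/demo/orders")

# A batch whose schema does not match is rejected instead of silently
# corrupting the table; the offending data can be fixed and retried later.
try:
    spark.createDataFrame([Row(id=2, amount="not-a-number")]) \
        .write.format("delta").mode("append").save("/tmp/demo/orders")
except Exception as err:
    print(f"Write rejected by schema enforcement: {err}")
```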

Meanwhile, Lakehouse describes the company’s “pattern of building a central, reliable and efficient single source of truth for data in an open format […] with decoupled storage and compute”. This, however, comes with its own set of challenges, the biggest being data ingestion from different third-party sources and cloud storage.

According to Databricks’ Marketing Manager Hiral Jasani, without the right tools this process is “often hard, in many cases requiring custom development and dozens of connectors or APIs that change over time and then break the data loading process.” 

This is why the company has partnered with data ingestion product providers, who have built “native integrations with Databricks to ingest and store data in Delta Lake directly in your cloud storage”.

Besides the integration with Azure Data Factory, customers are now promised an easy way to use Fivetran, Qlik, Infoworks, StreamSets, and Syncsort in combination with Databricks. Additional support from Informatica, Segment, and Stitch is planned to follow soon.

Along with the release, Databricks’ engineering team also tried to tackle the latencies and costs that arise when loading raw files from cloud storage into Delta tables. The solution they came up with is called Auto Loader, “an optimised file source” that is advertised as easy to use, scalable, and free of the need for file state management.

Staff product manager Prakash Chockalingam explains the project in an introductory blog post, stating that users will only “need to provide a source directory path and start a streaming job. The new structured streaming source, called ‘cloudFiles’, will automatically set up file notification services that subscribe file events from the input directory and process new files as they arrive, with the option of also processing existing files in that directory.”
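
In practice, such a streaming job could look roughly like the following PySpark sketch; the directory paths and file format are placeholders, and spark refers to the SparkSession that Databricks notebooks provide:

```python
# Rough sketch of an Auto Loader streaming job; paths and the file format are
# placeholders, and `spark` is the SparkSession Databricks notebooks provide.
stream = (spark.readStream
          .format("cloudFiles")                        # the new Auto Loader source
          .option("cloudFiles.format", "json")         # format of the incoming raw files
          .option("cloudFiles.includeExistingFiles", "true")  # also process files already in the directory
          .load("/mnt/raw/events/"))                   # source directory to watch

(stream.writeStream
       .format("delta")                                # land the data in a Delta table
       .option("checkpointLocation", "/mnt/checkpoints/events/")
       .start("/mnt/delta/events/"))
```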

While Auto Loader guarantees exactly-once data ingestion for streaming loads, those looking to work with batch loads now have a new idempotent COPY command available to them. It can be rerun should a failure occur and therefore seems like a good choice for data pipelines. Examples and more details can be found on the Databricks blog.
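
A sketch of what such an idempotent batch load might look like, issued as SQL through spark.sql(); the target table, source path, and file format are placeholders:

```python
# Rough sketch of the idempotent COPY command for batch loads, issued as SQL;
# the target table, source path, and file format are placeholders. Files that
# were already loaded are skipped, so rerunning after a failure is safe.
spark.sql("""
    COPY INTO events_delta
    FROM '/mnt/raw/events/'
    FILEFORMAT = JSON
""")
```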