Databricks Open Sources Delta Lake
April 24, 2019
Databricks today announced a new open source project called Delta Lake to deliver reliability
to data lakes. Delta Lake is the first production-ready open source
technology to provide data lake reliability for both batch and streaming
data. This new open source project will enable organizations to
transform their existing messy data lakes into clean Delta Lakes with
high quality data, thereby accelerating their data and machine learning initiatives.
While attractive as an initial sink for data, data lakes suffer from
data reliability challenges. Unreliable data in data lakes prevents
organizations from deriving business insights quickly and significantly
slows down strategic machine learning initiatives. Data reliability
challenges stem from failed writes, schema mismatches, and the data
inconsistencies that arise when mixing batch and streaming data or
supporting multiple writers and readers simultaneously.
“Today, nearly every company has a data lake they are trying to gain
insights from, but data lakes have proven to lack data reliability.
Delta Lake has eliminated these challenges for hundreds of enterprises.
By making Delta Lake open source, developers will be able to easily
build reliable data lakes and turn them into ‘Delta Lakes’,” said Ali
Ghodsi, cofounder and CEO at Databricks.
Delta Lake delivers reliability by managing transactions across
streaming and batch data and across multiple simultaneous readers and
writers. Delta Lake can be easily plugged into any Apache Spark job as
a data source, enabling organizations to gain data reliability with
minimal change to their data architectures. With Delta Lake,
organizations no longer need to spend resources building complex and
fragile data pipelines to move data across systems. Instead, developers
can have hundreds of applications reliably upload and query data at scale.
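To illustrate, here is a minimal PySpark sketch of what reading and writing
through the Delta Lake data source can look like; the table path is
illustrative, and the example assumes the Delta Lake package
(io.delta:delta-core) is available on the Spark classpath:

```python
# Minimal sketch: using Delta Lake as a Spark data source.
# Assumes the Delta Lake package is on the classpath; the path is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

# Write a DataFrame as a Delta table; each write is a transaction, so
# concurrent readers never observe partial results.
events = spark.range(0, 1000).withColumnRenamed("id", "event_id")
events.write.format("delta").mode("append").save("/tmp/delta/events")

# Read it back like any other Spark data source.
spark.read.format("delta").load("/tmp/delta/events").show(5)
```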
With Delta Lake, developers will be able to undertake local development and
debugging on their laptops to quickly develop data pipelines. They will
be able to access earlier versions of their data for audits, rollbacks
or reproducing machine learning experiments. They will also be able to
convert their existing Parquet files, a commonly used data format for
storing large datasets, to Delta Lakes in place, thus avoiding the need
for substantial reading and rewriting.
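As a hedged illustration of the versioning capability, continuing the
sketch above, an earlier snapshot of a table can be read back with the
versionAsOf option of the Delta data source; exact option names and
availability depend on the Delta Lake release:

```python
# Sketch: reading an earlier version of a Delta table ("time travel").
# Reuses the `spark` session and illustrative path from the sketch above;
# versionAsOf is assumed to be supported by the installed Delta release.
previous = (
    spark.read.format("delta")
    .option("versionAsOf", 0)   # the table's first committed version
    .load("/tmp/delta/events")
)
previous.show(5)

# A timestamp-based variant, timestampAsOf, is also commonly used for
# audits and rollbacks:
# spark.read.format("delta").option("timestampAsOf", "2019-04-24").load(...)
```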
The Delta Lake project can be found at delta.io and is under the
permissive Apache 2.0 license. This technology is deployed in production
by organizations such as Viacom, Edmunds, Riot Games and McGraw Hill.
“We’ve believed right from the outset that innovation happens in
collaboration - not isolation. This belief led to the creation of the
Spark project and MLflow. Delta Lake will foster a thriving community of
developers collaborating to improve data lake reliability and accelerate
machine learning initiatives,” added Ghodsi.