Apache DataSketches is a Top-Level Project
February 4, 2021
The Apache Software Foundation (ASF), the all-volunteer developers,
stewards, and incubators of more than 350 Open Source projects and
initiatives, announced today Apache® DataSketches™ as a Top-Level
Project.
Apache DataSketches is a highly performant Big Data analysis library
for scalable approximate algorithms. The project originated at Yahoo
in 2012, was open-sourced in 2015, and entered the Apache Incubator
in March 2019.
"We are excited to be part of the ASF," said Lee Rhodes, Vice
President of Apache DataSketches. "We have learned a great deal from
the incubation process and look forward to working with new users of
our library that want to take advantage of sketching technology."
Apache DataSketches’s library of specialized streaming algorithms,
known as sketches, comprises small data structures that process data
at massive scale. Sketches are ideal for queries that cannot afford
the time or huge compute resources needed to generate exact results.
Where approximate results are acceptable, sketches are the only
viable alternative for interactive queries with real-time analysis.
Apache DataSketches is:
Fast: produces approximate results orders of magnitude faster than
traditional methods, with a user-configurable trade-off between
sketch size and accuracy;
Efficient: sketch algorithms process data in a single pass, for both
real-time and batch workloads;
Mergeable: sketches can be merged, which allows for parallelization;
Optimized for large-scale computing environments that process Big
Data, such as Apache Hadoop, Apache Spark, Apache Druid, Apache
Hive, Apache Pig, and PostgreSQL;
Binary compatible across multiple languages and platforms: available
in Java, C++, and Python;
Expanded analysis: includes count distinct with set operations,
quantiles, most frequent items (heavy hitters), matrix computations,
and more; and
Mathematically defined and proven error properties: provides a
priori and a posteriori error estimation and upper and lower bounds
with statistically derived confidence intervals.
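
To make the single-pass, mergeable, and size-versus-accuracy
properties listed above concrete, here is a minimal illustrative
sketch in Python. It is a textbook K-Minimum-Values distinct counter
written for this example only; the class name KMVSketch and its
methods are hypothetical and are not the Apache DataSketches API or
its implementation. It assumes items can be converted to strings for
hashing.

```python
import hashlib
import heapq


class KMVSketch:
    """Toy K-Minimum-Values sketch for approximate distinct counting.

    Illustrative only: NOT the Apache DataSketches implementation.
    """

    def __init__(self, k=1024):
        self.k = k              # larger k: more memory, lower error
        self._heap = []         # negated hashes, so the largest kept hash is on top
        self._members = set()   # hashes currently retained (at most k)

    @staticmethod
    def _hash(item):
        # Map an item to a pseudo-uniform value in [0, 1].
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def _offer(self, h):
        if h in self._members:
            return              # duplicates never change the sketch
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest retained hash with the smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item):
        """Process one stream item; each item is seen exactly once."""
        self._offer(self._hash(item))

    def merge(self, other):
        """Return a new sketch summarizing the union of both streams."""
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._offer(h)
        return merged

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))    # fewer than k distinct items: exact
        return (self.k - 1) / -self._heap[0]  # (k - 1) / k-th smallest hash


# Two overlapping streams summarized independently, then merged.
left, right = KMVSketch(), KMVSketch()
for user_id in range(0, 60_000):
    left.update(user_id)
for user_id in range(40_000, 100_000):
    right.update(user_id)

union = left.merge(right)
print(round(union.estimate()))  # approximates the 100,000 distinct ids
```

Each sketch retains at most k hash values regardless of stream
length, and merging two sketches gives the same result as sketching
the combined stream directly, which is what makes parallel processing
possible. For this estimator the relative standard error is roughly
1/sqrt(k), so raising k trades memory for accuracy.
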
Apache DataSketches is used in large-scale computing environments at
Nielsen Identity, Permutive, Splice Machine, and Verizon Media, among
others, as well as in Apache Druid and Apache Pinot.
"The Apache DataSketches project takes powerful algorithms for data
summarization and analysis, and makes them available to everyone,"
said Professor Graham Cormode of the University of Warwick. "While
these methods are tremendously useful in practice, their
descriptions were previously only in highly technical scientific
papers. This project has made robust, dependable and well-documented
implementations available to all. Already the library has been used
for a wide range of applications, including service quality,
monitoring, ad analytics and the sciences."
"Using Apache DataSketches has enabled Apache Druid users to perform
common tasks such as quantiles and unique counting in a highly
performant and efficient manner," said Gian Merlino, Vice President
of Apache Druid. "We have worked closely together over the years to
make the power of DataSketches accessible to Apache Druid users,
helping us provide real-time analytics at scale."
"Sketches are fundamental to calculating many of our key company
metrics," said Tom Miller, Director of Software Development
Engineering at Verizon Media. "It allows us to greatly simplify our
data processing and reduce storage costs by allowing us to calculate
non-additive metrics across user specified dimension combinations at
report time instead of having to either retain raw data or
pre-calculate for each set of dimensions."
"Combining Apache Druid and DataSketches allows us to provide our
customers real-time insights into their target audiences and
advertising campaigns," said Yakir Buskilla, Senior Vice President
of Research and Development and General Manager Israel at Nielsen
Identity. "The ability to evaluate set expressions makes the Theta
Sketch especially powerful for multi-set cardinality estimation as
well as funnel analysis."
"DataSketches has provided us with a solid theoretical foundation
upon which we are able to store and process data at scale - in a
simple, fast and cost-efficient manner," said David Cromberge,
Senior Software Engineer at Permutive. "It has been a pleasure to
engage with their creators and community who have been helpful at
every step of the way."
"We use DataSketches's Theta-Sketches for distinct-count
aggregations that are used to solve large multi-set cardinality
approximation," said Mayank Shrivastava, Committer and member of the
Apache Pinot (incubating) Podling Project Management Committee. "The
ability to evaluate set expressions makes the Theta Sketch especially
powerful for multi-set cardinality estimation as well as funnel
analysis."
"We welcome those interested in streaming algorithms to visit us,
learn about this exciting technology, and contribute to Apache
DataSketches to make our project even better," added Rhodes.