Microsoft open sources Data Accelerator for Apache Spark
By Geoff Staneff, Microsoft
Principal Program Manager
April 19, 2019
Data Accelerator for Apache Spark simplifies streaming big data using Spark. Data Accelerator has been used for two years within Microsoft for processing streamed data across many internal deployments handling data volumes at Microsoft scale. It offers an easy-to-use platform for learning about and evaluating your streaming needs and requirements, and we are excited to share this project with the wider community as open source.
A few of the ways Data Accelerator will make it easier to build a streaming pipeline on Spark:
Data Accelerator is an easy way to set up and run a streaming big data pipeline on Apache Spark. Within the Developer Tools group at Microsoft, we have used an instance of Data Accelerator to process events at Microsoft scale since the fall of 2017.
Data Accelerator isn’t just a pipe between an EventHub and a database, however. It allows us to reshape incoming events while continuing to stream, then route different parts of the same event into different data stores, while providing health monitoring and alerting over the status of the whole pipeline. Data Accelerator also provides a configuration UI and a rules/query designer experience that allow you to be up and running without needing to write any code. In addition, anyone doing stream data processing often needs to process data with sliding windows, handle late-arriving data, or accumulate data over time. Data Accelerator enables and simplifies the use of these advanced features.
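To make the windowing ideas above concrete, here is a minimal sketch in plain Python of counting events over sliding windows while discarding data that arrives too late. It is not Data Accelerator's or Spark's API: the function name `sliding_window_counts`, its parameters, and the simplified watermark rule (latest event time seen minus an allowed lateness) are all assumptions made for illustration, and event times are assumed to be integers.

```python
from collections import defaultdict


def sliding_window_counts(events, window=10, slide=5, allowed_lateness=5):
    """Count events per (window start, key), dropping events that are too late.

    events: iterable of (event_time, key) tuples, in arrival order.
    Windows start at multiples of `slide` and cover [start, start + window).
    """
    counts = defaultdict(int)   # accumulates counts over time
    dropped = []                # late events that were discarded
    max_seen = float("-inf")    # highest event time observed so far

    for event_time, key in events:
        max_seen = max(max_seen, event_time)
        watermark = max_seen - allowed_lateness
        if event_time < watermark:
            # Late arrival: older than the watermark, so it is not counted.
            dropped.append((event_time, key))
            continue
        # Earliest window start (a multiple of `slide`) that still contains the event.
        first_start = ((event_time - window) // slide + 1) * slide
        for start in range(first_start, event_time + 1, slide):
            counts[(start, key)] += 1

    return dict(counts), dropped


if __name__ == "__main__":
    sample = [(1, "a"), (12, "a"), (3, "a"), (11, "b")]
    counts, dropped = sliding_window_counts(sample)
    print(counts)
    print(dropped)
```

With a 10-unit window sliding every 5 units, each on-time event lands in two overlapping windows, and the event at time 3 is discarded because it arrives after an event at time 12 has already moved the watermark to 7.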
Lastly, the dev-test loop supports a fast validation cycle in which your query is run against locally sampled events, allowing the implementation to be finalized before the first deployment. We think these capabilities will appeal to some of you, and we hope that some of you find the project useful enough to work with and even contribute back to. We can’t wait to see what comes next!
When to use Data Accelerator
We built Data Accelerator to deal with data from many incoming data streams that needed to be combined and routed to many different output sinks in a way that promotes quick discovery of data insights. Naturally, normalization is a big deal here and anyone who has worked in a heterogeneous event environment probably recognizes the perils and potential for days spent capturing and tuning event parsers. That’s why we implicitly infer event schema from a sample of your event data.
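As a rough illustration of what inferring a schema from sampled events can look like, the sketch below reads a handful of JSON events and derives a flat field-to-type mapping. This is not Data Accelerator's actual inference logic: the helper names `flatten` and `infer_schema`, the dotted naming of nested fields, and the type-widening rules are assumptions chosen to keep the example small.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted field names, e.g. {"loc": {"city": ...}} -> "loc.city"."""
    out = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, name + "."))
        else:
            out[name] = value
    return out


def infer_schema(sample_lines):
    """Infer a {field: type name} schema from an iterable of JSON event strings."""
    schema = {}
    for line in sample_lines:
        for field, value in flatten(json.loads(line)).items():
            seen = type(value).__name__
            previous = schema.get(field)
            if previous is None or previous == "NoneType":
                # First sighting, or only nulls so far: take the observed type.
                schema[field] = seen
            elif seen not in (previous, "NoneType"):
                # Conflicting types: widen int/float to float, otherwise fall back to str.
                widened = {previous, seen} == {"int", "float"}
                schema[field] = "float" if widened else "str"
    return schema


if __name__ == "__main__":
    sample = [
        '{"deviceId": 1, "temp": 21, "loc": {"city": "Oslo"}}',
        '{"deviceId": 2, "temp": 21.5, "status": null}',
        '{"deviceId": 3, "temp": 20, "status": "ok"}',
    ]
    print(infer_schema(sample))
```

The point of working from a sample is that fields missing from some events, nulls, and mixed numeric types are resolved once up front, rather than by hand-tuning a parser per event source.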
Beyond reading from different sources, transforming events in the stream and writing them out is critically important. By combining event and schema, Data Accelerator can recognize and modify events or event parts as they continue streaming through the pipeline. Reshaped events can be split, merged with values based on reference data or an algorithm, modified, or dropped entirely. Complex queries and policies using different time window functions or accumulators can be set up easily. In our experience, the ability to instantly validate your queries using the Live Query feature, running against a sample of incoming data, saves hours of frustration when setting up processing on big data pipes. Finally, a lightweight health dashboard and alerting system rounds out the pipeline, standing up all the essential elements to evaluate a streaming big data pipeline on Spark end-to-end.
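The reshaping operations described above (drop, merge with reference data, split, and route) can be sketched in a few lines of plain Python. Everything here is hypothetical and only illustrates the idea: the `process` function, the event fields (`type`, `deviceId`, `ts`, `temp`), the reference fields (`site`, `maxTemp`), and the sink names are assumptions, not part of Data Accelerator.

```python
def process(event, reference, sinks):
    """Reshape one event, enrich it from reference data, and route its parts to sinks."""
    if event.get("type") == "heartbeat":
        return  # dropped entirely: heartbeats carry no telemetry

    # Merge with reference data keyed by device id (empty dict if the device is unknown).
    device = reference.get(event["deviceId"], {})

    # Split: the telemetry part of the event goes to one sink...
    sinks["telemetry"].append({
        "deviceId": event["deviceId"],
        "ts": event["ts"],
        "temp": event["temp"],
        "site": device.get("site", "unknown"),
    })

    # ...and a derived alert goes to another when the reading exceeds the device's limit.
    if event["temp"] > device.get("maxTemp", float("inf")):
        sinks["alerts"].append({
            "deviceId": event["deviceId"],
            "ts": event["ts"],
            "reason": "temp above maxTemp",
        })


if __name__ == "__main__":
    reference = {7: {"site": "lab-1", "maxTemp": 30}}
    sinks = {"telemetry": [], "alerts": []}
    stream = [
        {"type": "heartbeat", "deviceId": 7, "ts": 1},
        {"type": "reading", "deviceId": 7, "ts": 2, "temp": 25},
        {"type": "reading", "deviceId": 7, "ts": 3, "temp": 35},
        {"type": "reading", "deviceId": 9, "ts": 4, "temp": 90},
    ]
    for event in stream:
        process(event, reference, sinks)
    print(sinks)
```

In this sketch the lists stand in for real output stores; the same event can contribute to more than one of them, which mirrors routing different parts of an event to different destinations.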
There are three main scenarios where you may want to leverage Data Accelerator:
Data Accelerator is useful in other situations as well (we’ve had an instance in production since late 2017), but the greatest advantages of the toolset show up before the production environment has settled down into a routine of maintenance and servicing updates.
How Data Accelerator supports your pipeline needs
Data Accelerator supports three tiers of engagement with your data pipeline.
To help you learn about Data Accelerator, we’ve created dozens of tutorials, a documentation wiki, and a couple of live samples that deploy via source, Azure ARM template, or Docker container on Linux, Mac, or Windows.
We are excited to share this tool with the wider community, to help others learn and evaluate streaming options when they are facing down a big data challenge on Apache Spark. You can find all the tutorials, supporting documentation, and deployment options in our GitHub repository. Docker-deploy one of the samples for your platform of choice and start exploring today.