Benchmarking as a Service
By Jonathan Metzman, Abhishek Arya, Google OSS-Fuzz Team and László Szekeres, Google Software Analysis Team
March 4, 2020
We are excited to launch FuzzBench, a fully automated, open source, free service for evaluating fuzzers. The goal of FuzzBench is to make it painless to rigorously evaluate fuzzing research and make fuzzing research easier for the community to adopt.
Fuzzing is an important bug finding technique. At Google, we’ve found tens of thousands of bugs (1, 2) with fuzzers like libFuzzer and AFL. There are numerous research papers that either improve upon these tools (e.g. MOpt-AFL, AFLFast, etc) or introduce new techniques (e.g. Driller, QSYM, etc) for bug finding. However, it is hard to know how well these new tools and techniques generalize on a large set of real world programs. Though research normally includes evaluations, these often have shortcomings—they don't use a large and diverse set of real world benchmarks, use few trials, use short trials, or lack statistical tests to illustrate if findings are significant. This is understandable since full scale experiments can be prohibitively expensive for researchers. For example, a 24-hour, 10-trial, 10 fuzzer, 20 benchmark experiment would require 2,000 CPUs to complete in a day.
To help solve these issues the OSS-Fuzz team is launching FuzzBench, a fully automated, open source, free service. FuzzBench provides a framework for painlessly evaluating fuzzers in a reproducible way. To use FuzzBench, researchers can simply integrate a fuzzer and FuzzBench will run an experiment for 24 hours with many trials and real world benchmarks. Based on data from this experiment, FuzzBench will produce a report comparing the performance of the fuzzer to others and give insights into the strengths and weaknesses of each fuzzer. This should allow researchers to focus more of their time on perfecting techniques and less time setting up evaluations and dealing with existing fuzzers.
Integrating a fuzzer with FuzzBench is simple as most integrations are less than 50 lines of code (example). Once a fuzzer is integrated, it can fuzz almost all 250+ OSS-Fuzz projects out of the box. We have already integrated ten fuzzers, including AFL, LibFuzzer, Honggfuzz, and several academic projects such as QSYM and Eclipser.
Reports include statistical tests to give an idea how likely it is that performance differences between fuzzers are simply due to chance, as well as the raw data so researchers can do their own analysis. Performance is determined by the amount of covered program edges, though we plan on adding crashes as a performance metric. You can view a sample report here.
How to Participate
Our goal is to develop FuzzBench with community contributions and input so that it becomes the gold standard for fuzzer evaluation. We invite members of the fuzzing research community to contribute their fuzzers and techniques, even while they are in development. Better evaluations will lead to more adoption and greater impact for fuzzing research.
We also encourage contributions of better ideas and techniques for evaluating fuzzers. Though we have made some progress on this problem, we have not solved it and we need the community’s help in developing these best practices.
Please join us by contributing to the FuzzBench repo on GitHub.