When benchmarking, confidence intervals are a standard tool that gives us a reliable measure of how much run-to-run variation occurs for a given workload. For example, if I run several iterations of the Bioshock benchmark and score each iteration by recording the average FPS, I might report Bioshock’s average (or geomean) score as 74.74 FPS with a 99% confidence interval of 0.10. By reporting this result, I’m predicting that 99% of Bioshock scores on this platform configuration will fall between 74.64 and 74.84 FPS.
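As a rough sketch of what that reporting looks like in practice, here's one way to compute a score and its 99% interval from per-iteration results. The FPS values below are made up for illustration, and the use of SciPy's t-distribution helper is just one convenient option, not part of any particular benchmark harness:

```python
# Minimal sketch: report a benchmark score with a 99% confidence interval.
# The fps_samples values are hypothetical per-iteration average FPS results.
import statistics
from scipy import stats

fps_samples = [74.71, 74.80, 74.69, 74.77, 74.74, 74.73]

mean = statistics.mean(fps_samples)
sem = statistics.stdev(fps_samples) / len(fps_samples) ** 0.5  # standard error of the mean

# 99% confidence interval using the t-distribution (df passed positionally).
low, high = stats.t.interval(0.99, len(fps_samples) - 1, loc=mean, scale=sem)
print(f"score: {mean:.2f} FPS, 99% CI: [{low:.2f}, {high:.2f}]")
```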

Unless otherwise noted, confidence intervals assume the data is normally distributed:

[Animation: pebbles dropping one by one and piling up into a bell-shaped (normal) distribution]

Each pebble in the demonstration represents one benchmark result. Our normal curve may be thinner and taller (or shorter and wider), but the basic shape is the same: most of the results cluster around the mean, with a few outliers.

With normally distributed data, our 95% confidence interval will be narrower than our 99% confidence interval, because 95% of the results cluster more closely around the mean. If our 99% confidence interval is [74.64, 74.84], our 95% interval might be +/- 0.06, or [74.68, 74.80]. The 100% confidence interval is always (-infinity, +infinity); we’re 100% confident that every measured result will fall somewhere on the number line.
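To make the comparison concrete, here's a quick sketch (reusing the same hypothetical samples as above) that computes both intervals side by side, so you can see the 95% interval come out narrower:

```python
# Sketch: a 95% interval is always narrower than a 99% interval on the same data.
import statistics
from scipy import stats

fps_samples = [74.71, 74.80, 74.69, 74.77, 74.74, 74.73]  # hypothetical
mean = statistics.mean(fps_samples)
sem = statistics.stdev(fps_samples) / len(fps_samples) ** 0.5

for confidence in (0.95, 0.99):
    low, high = stats.t.interval(confidence, len(fps_samples) - 1, loc=mean, scale=sem)
    print(f"{confidence:.0%} CI: [{low:.2f}, {high:.2f}]  (half-width {high - mean:.2f})")
```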

Computing averages of averages is not always statistically sound, so it may seem suspect to take the average FPS from each iteration of a benchmark and average those together. Averaging averages is fine when each average carries equal weight, and in this case we can confidently say that they do; if they didn’t, we’d need a different benchmark!
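Here's a small sketch of what "equal weight" means here. The run times below are hypothetical, but the point is that the average of per-iteration averages only matches the true overall average FPS (total frames divided by total time) when every iteration contributes equally:

```python
# Sketch: averaging per-iteration averages is sound only with equal weights.
# FPS values and run times are hypothetical.
iterations = [
    {"avg_fps": 74.71, "seconds": 180.0},
    {"avg_fps": 74.80, "seconds": 180.0},
    {"avg_fps": 74.69, "seconds": 180.0},
]

# Average of the per-iteration averages (each iteration weighted equally).
simple = sum(it["avg_fps"] for it in iterations) / len(iterations)

# Overall average FPS: total frames rendered divided by total run time.
# avg_fps * seconds recovers the frame count for each iteration, so this
# matches the simple average only when the iterations run for equal time.
total_frames = sum(it["avg_fps"] * it["seconds"] for it in iterations)
total_time = sum(it["seconds"] for it in iterations)
overall = total_frames / total_time

print(f"average of averages: {simple:.2f} FPS, overall: {overall:.2f} FPS")
```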

Next: Comparing Benchmark Results