The problem with guessing performance
You just refactored a string parsing loop. You replaced a match expression with a direct byte comparison. You run the program and it feels snappier. You time it once with std::time::Instant and celebrate. Then you run it again and the time jumps by forty percent. You run it a third time and it drops back down. The numbers are bouncing around like a pinball.
Modern CPUs lie to you. They scale frequency up and down based on thermal load. They prefetch memory. They cache hot loops. Background processes steal cycles. A single measurement tells you nothing about your code. It tells you about the state of your machine at that exact millisecond. You need a tool that runs your code thousands of times, filters out the noise, and gives you a statistically sound answer.
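To see the noise for yourself, a minimal sketch using only the standard library is enough. The helper names collect_samples and spread are made up for this illustration, and the approach is a plain stopwatch loop, not what a benchmarking framework does internally.

```rust
use std::time::{Duration, Instant};

/// Times `routine` once per sample and returns the raw durations.
fn collect_samples<F: FnMut()>(samples: usize, mut routine: F) -> Vec<Duration> {
    (0..samples)
        .map(|_| {
            let start = Instant::now();
            routine();
            start.elapsed()
        })
        .collect()
}

/// Returns (min, median, max) of the samples, or None if there are none.
fn spread(samples: &[Duration]) -> Option<(Duration, Duration, Duration)> {
    let mut sorted = samples.to_vec();
    sorted.sort();
    let min = *sorted.first()?;
    let max = *sorted.last()?;
    // Upper median for even-length input; good enough for a quick look.
    let median = sorted[sorted.len() / 2];
    Some((min, median, max))
}
```

Collect a few dozen samples of the same closure and print the result of spread. On a typical desktop the gap between the minimum and the maximum is often large enough to swallow the improvement you were hoping to measure.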
Why criterion instead of a stopwatch
criterion is a statistical benchmarking framework for Rust. Think of it like a sports scientist timing a sprinter. They do not hit a stopwatch once and call it a day. They run the athlete through dozens of trials. They discard the false starts. They calculate the median time. They give you a confidence interval that tells you how sure they are about the result.
criterion does the same thing for your Rust code. It runs your function in tight loops. It collects a fixed number of samples, 100 by default, and each sample times a batch of iterations. It then uses bootstrap resampling to put a confidence interval around the time per iteration and flags outliers. It prints that interval, the change against the previous run, and throughput if you configured it. It also generates an HTML report with plots of the samples and the regression line.
The crate handles the heavy lifting so you can focus on the code you are measuring. You write a function that calls your target code. criterion handles the timing, the statistical analysis, and the reporting.
Setting up your first benchmark
Rust's built-in benchmarking harness relies on the #[bench] attribute, which is unstable and only available on the nightly toolchain. criterion works on stable Rust and replaces that harness entirely. You need to tell Cargo to skip the built-in harness and let criterion take over.
Add criterion as a development dependency. You only need it for benchmarking, not for your production binary. The cargo_bench_support feature is what lets the macros integrate with cargo bench. It is part of criterion's default features in 0.5, so listing it here is explicit rather than required.
# Cargo.toml
[dev-dependencies]
criterion = { version = "0.5.1", features = ["cargo_bench_support"] }

[[bench]]
name = "my_bench"
harness = false
The harness = false line is the critical switch. Without it, Cargo links the built-in libtest harness, which looks for functions marked with the nightly-only #[bench] attribute and ignores the main function that criterion generates. Setting it to false tells Cargo to compile the file as a standalone binary and execute it directly. criterion then becomes the entry point.
Create the benchmark file in the benches/ directory. The filename must match the name field in Cargo.toml.
// benches/my_bench.rs
use criterion::{criterion_group, criterion_main, Criterion};

/// Builds a vector of ten thousand integers and returns it.
/// criterion drops the returned vector inside the timed loop,
/// so each iteration covers one allocate-and-drop cycle.
fn allocate_and_drop() -> Vec<i32> {
    // Reserve the full capacity up front so the loop never reallocates
    let mut data = Vec::with_capacity(10_000);
    for i in 0..10_000 {
        data.push(i);
    }
    data
}

/// Registers the benchmark with criterion.
fn bench_allocation(c: &mut Criterion) {
    // Register the function under a human-readable name
    c.bench_function("allocate_and_drop", |b| {
        // criterion decides how many times to call the function per sample
        b.iter(allocate_and_drop);
    });
}

// Group all benchmark functions into a single benchmark group
criterion_group!(benches, bench_allocation);
// Generate the main function that runs the group
criterion_main!(benches);
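Run the suite with cargo bench. The command compiles your project with the bench profile, which inherits the release profile's optimizations by default, and executes the benchmark binary. For each benchmark you will see a time line with three values: the lower bound, the best estimate, and the upper bound of the time per iteration. From the second run onward it also reports the change against the previous run, along with a count of any outliers.
Do not skip the harness = false line. Without it the built-in harness runs instead, finds no benchmarks, and reports zero tests, so your criterion code never executes and nothing tells you why.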
What happens under the hood
When you run cargo bench, criterion does not just time a loop. It follows a strict measurement protocol.
First, it runs a warm-up phase. Modern CPUs need time to ramp up frequency and populate caches. criterion executes your function repeatedly for a fixed period, three seconds by default, without recording results. This brings the hardware closer to a steady state and gives criterion a rough per-iteration time for planning the measurement.
Next, it enters the measurement phase. It collects a fixed number of samples, 100 by default. Each sample times a batch of iterations in a tight loop, and the batch sizes are chosen so that the whole phase takes roughly the target measurement time, which defaults to five seconds. You can change these settings with sample_size, measurement_time, and warm_up_time.
Finally, it runs a statistical analysis. It estimates the time per iteration, computes the mean, median, and standard deviation, and uses bootstrap resampling to put a confidence interval around each estimate. It classifies outliers and reports them, but it does not discard them or change the sample size on its own. The result you see in the terminal is not a single guess. It is an estimate with an interval that tells you how much to trust it.
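The three phases can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not criterion's implementation: the names measure and slope_through_origin are hypothetical, the warm-up is counted in iterations rather than seconds, and the analysis is reduced to a single least-squares slope with no bootstrap and no outlier classification. It assumes batch sizes that grow linearly (d, 2d, 3d, and so on), which is how criterion's documentation describes its linear sampling mode.

```rust
use std::time::Instant;

/// Runs `routine` in batches of d, 2d, 3d, ... iterations after a short
/// warm-up and returns (iterations, elapsed_nanoseconds) for each sample.
fn measure<F: FnMut()>(
    warm_up_iters: u64,
    samples: u64,
    d: u64,
    mut routine: F,
) -> Vec<(f64, f64)> {
    // Warm-up: run the routine without recording anything.
    for _ in 0..warm_up_iters {
        routine();
    }
    (1..=samples)
        .map(|k| {
            let iters = k * d;
            let start = Instant::now();
            for _ in 0..iters {
                routine();
            }
            (iters as f64, start.elapsed().as_nanos() as f64)
        })
        .collect()
}

/// Least-squares slope of a line through the origin: sum(x*y) / sum(x*x).
/// With x = iterations and y = elapsed time, the slope is time per iteration.
fn slope_through_origin(points: &[(f64, f64)]) -> Option<f64> {
    let sum_xx: f64 = points.iter().map(|(x, _)| x * x).sum();
    if sum_xx == 0.0 {
        return None;
    }
    let sum_xy: f64 = points.iter().map(|(x, y)| x * y).sum();
    Some(sum_xy / sum_xx)
}
```

Fitting a slope across batches of different sizes, instead of dividing one total by one count, keeps fixed per-sample overhead such as the timer call from being attributed to your function.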
The crate also generates an HTML report under target/criterion/. Open target/criterion/report/index.html to see the plots. Each run is compared against the previous one, so you can spot regressions, and grouped benchmarks show how your code scales with input size.
Trust the central estimate and its confidence interval. Ignore the single fastest run. The estimate represents the stable performance of your code under normal conditions.
A realistic benchmark scenario
Microbenchmarks are useful for isolated functions, but real code usually takes arguments. You often need to measure how performance changes with different input sizes. criterion supports this through bench_with_input.
Suppose you are comparing two string splitting strategies. You want to see how they perform with short strings versus long strings.
// benches/string_split_bench.rs
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};

/// Splits a string on runs of whitespace using the standard library iterator.
fn split_by_spaces(input: &str) -> Vec<&str> {
    // Collect the split results into a vector
    input.split_whitespace().collect()
}

/// Splits a string on single ASCII spaces using manual byte scanning.
fn split_by_bytes(input: &str) -> Vec<&str> {
    // Pre-allocate for one-character tokens separated by single spaces,
    // which is the shape of the benchmark input below
    let mut result = Vec::with_capacity(input.len() / 2 + 1);
    let mut start = 0;
    let bytes = input.as_bytes();
    for (i, &byte) in bytes.iter().enumerate() {
        if byte == b' ' {
            result.push(&input[start..i]);
            start = i + 1;
        }
    }
    result.push(&input[start..]);
    result
}

/// Configures criterion to run both strategies across multiple input sizes.
fn bench_string_splitting(c: &mut Criterion) {
    let mut group = c.benchmark_group("string_splitting");
    // Test with three different token counts to observe scaling behavior
    let sizes = [10, 100, 1000];
    for size in sizes {
        let input = "a ".repeat(size);
        let input = input.trim();
        // Tell criterion how many bytes are being processed per iteration
        group.throughput(Throughput::Bytes(input.len() as u64));
        // Benchmark the standard library approach with a labeled ID
        group.bench_with_input(BenchmarkId::new("split_whitespace", size), input, |b, i| {
            // Wrap input in black_box so the optimizer cannot specialize on it
            b.iter(|| split_by_spaces(black_box(i)));
        });
        // Benchmark the manual byte scanner with a labeled ID
        group.bench_with_input(BenchmarkId::new("manual_bytes", size), input, |b, i| {
            b.iter(|| split_by_bytes(black_box(i)));
        });
    }
    group.finish();
}

criterion_group!(benches, bench_string_splitting);
criterion_main!(benches);
The BenchmarkId labels let you compare multiple configurations in the same report. The Throughput setting tells criterion to report bytes per second alongside the time per iteration. This makes the numbers comparable when input sizes change.
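The throughput figure is simple arithmetic: bytes processed per iteration divided by seconds per iteration. A minimal sketch of that conversion follows. The function name bytes_per_second is hypothetical and is not part of criterion's API.

```rust
use std::time::Duration;

/// Converts a per-iteration time into bytes per second.
/// Returns None for a zero duration to avoid dividing by zero.
fn bytes_per_second(bytes_per_iter: u64, time_per_iter: Duration) -> Option<f64> {
    let secs = time_per_iter.as_secs_f64();
    if secs == 0.0 {
        return None;
    }
    Some(bytes_per_iter as f64 / secs)
}
```

For the largest input above, "a ".repeat(1000) trimmed is 1999 bytes. If an iteration took 2 microseconds, a made-up figure for illustration, the rate would come out at roughly 999.5 million bytes per second.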
Run this and you will see a separate result for each size and each strategy. The HTML report will plot both strategies on the same graph. You can see where one approach crosses the other and becomes faster.
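Before comparing speed, it is worth checking that the two strategies produce the same output on the benchmark input, because they are not equivalent in general: split_whitespace treats any run of whitespace as one separator, while the byte scanner splits on every single space. The sketch below repeats both functions so it stands alone. The helper strategies_agree is a hypothetical name added for this check.

```rust
/// Standard library strategy: splits on any run of Unicode whitespace.
fn split_by_spaces(input: &str) -> Vec<&str> {
    input.split_whitespace().collect()
}

/// Manual strategy: splits on every single ASCII space byte.
fn split_by_bytes(input: &str) -> Vec<&str> {
    let mut result = Vec::new();
    let mut start = 0;
    for (i, &byte) in input.as_bytes().iter().enumerate() {
        if byte == b' ' {
            result.push(&input[start..i]);
            start = i + 1;
        }
    }
    result.push(&input[start..]);
    result
}

/// Returns true when both strategies produce identical tokens for `input`.
fn strategies_agree(input: &str) -> bool {
    split_by_spaces(input) == split_by_bytes(input)
}
```

The two agree on single-space input like the benchmark's, but they diverge on double spaces, tabs, and the empty string. A speed comparison only means something for inputs where the outputs match.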
Keep your benchmark inputs deterministic. Random data introduces variance that masks the actual performance difference you are trying to measure.
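If you want varied input without giving up repeatability, one option is a fixed-seed generator. The sketch below is an assumption-laden illustration: XorShift64 and deterministic_input are hypothetical names, and the generator is a bare xorshift that is only suitable for producing repeatable test data.

```rust
/// Tiny xorshift64 generator; deterministic for a given non-zero seed.
/// Not suitable for anything beyond repeatable test data.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Builds `words` lowercase words of 1 to 8 letters separated by single spaces.
fn deterministic_input(seed: u64, words: usize) -> String {
    // xorshift gets stuck at zero, so substitute a fixed non-zero seed.
    let mut rng = XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed });
    let mut out = String::new();
    for w in 0..words {
        if w > 0 {
            out.push(' ');
        }
        let len = 1 + (rng.next_u64() % 8) as usize;
        for _ in 0..len {
            out.push((b'a' + (rng.next_u64() % 26) as u8) as char);
        }
    }
    out
}
```

Build the input once, outside the timed closure, so every run and every strategy sees exactly the same bytes.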
Common traps and how to avoid them
The most common mistake is letting the compiler optimize your benchmark away. If your function does not produce a visible side effect, LLVM might decide the entire loop is dead code. It will replace your function with a no-op. Your benchmark will report nanosecond times and you will think you just built a quantum computer.
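Use criterion::black_box to guard against this. black_box is an identity function that asks the compiler to treat the value as opaque. The optimizer cannot assume anything about the value's contents or how it is used afterwards, so it has to compute whatever you pass through it. b.iter already passes the closure's return value through black_box, so the usual place to add it yourself is around the inputs.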
// Prevents the optimizer from deleting the work
let result = split_by_spaces(black_box(input));
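The same idea works outside criterion. The sketch below uses std::hint::black_box, the standard library's version of the function, and assumes a toolchain where it is stable (Rust 1.66 or later). The functions sum_below and opaque_sum are hypothetical examples.

```rust
use std::hint::black_box;

/// Sums the integers below `n`; simple enough for the optimizer to fold
/// into a constant when `n` is known at compile time.
fn sum_below(n: u64) -> u64 {
    (0..n).sum()
}

/// Calls `sum_below` with the input and the result hidden from the optimizer.
fn opaque_sum(n: u64) -> u64 {
    // black_box on the input discourages constant folding of the argument;
    // black_box on the result discourages deleting the call as dead code.
    black_box(sum_below(black_box(n)))
}
```

black_box does not change the value. It only limits what the optimizer may assume, and the standard library documents it as a best-effort hint rather than a guarantee.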
Another trap is measuring allocation instead of logic. If your function creates a Vec or allocates a String, half your timing might be spent in the global allocator. If you only care about the algorithm, reuse a pre-allocated buffer or measure the core loop separately.
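One way to keep the allocator out of the measurement is to write into a caller-supplied buffer. This is a minimal sketch under that assumption. The function split_into is a hypothetical variant, not code from the benchmark above.

```rust
/// Splits `input` on single spaces into `out`, reusing its existing capacity.
fn split_into<'a>(input: &'a str, out: &mut Vec<&'a str>) {
    // clear() drops the old contents but keeps the allocation.
    out.clear();
    out.extend(input.split(' '));
}
```

Create the buffer once before the timed closure and call split_into inside it. After the first call has grown the buffer, later iterations on the same input reuse the allocation, so the timing reflects the scan rather than the allocator.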
A third trap is running benchmarks in debug mode. cargo bench uses the bench profile, which is optimized like release by default, but if you time code with cargo run or cargo test instead, it is easy to forget the --release flag. Debug builds turn off optimizations and add integer overflow checks and debug assertions. The numbers will be meaningless.
Check the target/criterion/ directory after every run. If the plots show massive spikes or the confidence interval is wide relative to the estimate, your benchmark is noisy. Raise sample_size or measurement_time, stabilize your input, close background programs, or isolate the hot path.
Treat black_box as a mandatory safety net. If you skip it, you may be benchmarking the compiler's optimizer, not your code.
When to reach for criterion
Use criterion when you need statistically valid microbenchmarks for isolated functions or algorithms. Use criterion when you want to track performance regressions across commits with automated HTML reports. Use criterion when you need to measure throughput, latency, or scaling behavior across different input sizes. Reach for std::time::Instant when you need a quick, rough measurement inside a running application and do not care about statistical rigor. Reach for perf or flamegraph when you need to profile an entire application, find hot paths, or analyze cache misses and branch prediction failures. Pick divan when you want a lighter dependency with a simpler API and do not need criterion's statistical analysis and reports.