How to Use cargo bench for Benchmarking in Rust

cargo bench builds and runs benchmarks with the optimized bench profile, which inherits from release. The built-in #[bench] is nightly-only, so use the criterion crate on stable Rust for proper statistical timing and HTML reports.

You wrote it, now you want to know how fast it is

You just refactored a parsing function. You replaced a long match chain with a lookup table. The code looks cleaner. You also suspect it runs faster. The only way to know is to measure. Guessing based on intuition or squinting at assembly leads to wrong conclusions. Rust gives you a built-in measurement command, but the ecosystem has standardized on a better tool.

Benchmarking versus profiling

Benchmarking answers a specific question. How long does this exact piece of code take to run? It is not profiling. Profiling tells you where a whole program spends its time. Benchmarking isolates a function and times it in a controlled environment. Think of it like a track coach timing a single sprinter on an empty track versus a sports analyst watching a full marathon to find where runners slow down. Both are useful, but they answer different questions.

The built-in cargo bench command relies on the #[bench] attribute. That attribute has lived on nightly Rust for years. The stable ecosystem moved to criterion, a third-party crate that handles statistical analysis, regression detection, and HTML reporting out of the box. We will focus on criterion because it is the standard for stable Rust and produces numbers you can actually trust.

Measure the thing you actually care about. Isolated timing beats guesswork every time.

The smallest working setup

Here is the minimal configuration. You add criterion as a development dependency and tell Cargo to treat a specific file as a benchmark binary.

[package]
name = "speedy"
version = "0.1.0"
edition = "2021"

[dev-dependencies]
# criterion handles statistical analysis, plotting, and result regression.
# 0.5 is the version this article targets; check crates.io for newer releases.
# The html_reports feature enables the browser UI.
criterion = { version = "0.5", features = ["html_reports"] }

# This block tells Cargo to compile benches/my_bench.rs as a benchmark binary.
# harness = false disables the built-in #[bench] runner so criterion can take over.
[[bench]]
name = "my_bench"
harness = false

Now create benches/my_bench.rs:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// The function we are measuring. Imagine this is part of your real crate.
fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

// Each bench function takes a Criterion handle and registers one or more measured operations.
fn bench_fib(c: &mut Criterion) {
    // c.bench_function gives you a Bencher. The closure inside b.iter is what gets timed.
    // criterion runs it many times until it has enough samples to be statistically confident.
    c.bench_function("fib 20", |b| {
        // black_box is a hint to the compiler. Treat this value as opaque.
        // Do not constant-fold the call away. Without it, the optimizer might compute fib(20) at compile time.
        b.iter(|| fibonacci(black_box(20)))
    });
}

// These two macros wire your bench function into a runnable binary.
// criterion_group collects bench functions. criterion_main creates the entry point.
criterion_group!(benches, bench_fib);
criterion_main!(benches);

Run it with cargo bench. The terminal prints output like this:

fib 20                  time:   [22.418 µs 22.503 µs 22.610 µs]
                        change: [-1.2456% -0.6701% -0.0741%] (p = 0.03 < 0.05)
                        Change within noise threshold.

The three numbers in the time: line are the lower bound, the best estimate, and the upper bound of a 95% confidence interval. You are being told exactly how uncertain the measurement is. The change: line appears on subsequent runs. Criterion stores results from the previous run and tells you whether the new run is faster, slower, or within statistical noise. This is gold for comparing optimization attempts.
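The verdict under the change: line follows from two checks: is the difference statistically significant, and does the whole interval clear the noise band? The sketch below is a simplified reading of that logic under criterion's default settings (significance level 0.05, noise threshold 0.01). It is not the crate's actual code, and the names Verdict and classify_change are hypothetical.

```rust
/// Verdict labels, loosely worded after the messages criterion prints.
#[derive(Debug, PartialEq)]
enum Verdict {
    NoChange,    // p-value is not below the significance level
    WithinNoise, // significant, but the interval overlaps the noise band
    Improved,    // the whole interval lies below -noise_threshold
    Regressed,   // the whole interval lies above +noise_threshold
}

/// Hypothetical helper: classify a relative change interval.
/// `lower` and `upper` are fractions (0.01 means 1%), with lower <= upper.
fn classify_change(
    lower: f64,
    upper: f64,
    p_value: f64,
    significance_level: f64,
    noise_threshold: f64,
) -> Verdict {
    if p_value >= significance_level {
        return Verdict::NoChange;
    }
    if upper < -noise_threshold {
        Verdict::Improved
    } else if lower > noise_threshold {
        Verdict::Regressed
    } else {
        Verdict::WithinNoise
    }
}
```

Feeding in the sample output above (interval -1.2456% to -0.0741%, p = 0.03) yields WithinNoise. The change is statistically significant, but the upper bound sits inside the 1% noise band, so it is not reported as an improvement.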

How criterion actually measures time

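Criterion does not run your function once and report that single time. It first runs a warmup phase (three seconds by default) so caches, branch predictors, and CPU frequency settle into a steady state. It then collects a set of samples (100 by default), each of which times many iterations of your closure. From those samples it estimates the time per iteration and uses bootstrap resampling to compute a confidence interval, flagging outliers along the way. This does not remove scheduler jitter, background processes, or thermal throttling, but it shows how much they disturbed the run. You get a distribution, not a single fragile number.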
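To make the warmup, sampling, and confidence-interval steps concrete, here is a std-only sketch. It is an illustration under simplifying assumptions, not criterion's implementation: it uses a fixed iteration count per sample and a plain mean, where criterion by default varies the iteration count and fits a regression. The function names and the small xorshift generator are hypothetical.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Run `routine` untimed until `warm_up` has elapsed, then collect `samples`
/// measurements of `iters_per_sample` calls each. Returns ns per iteration.
fn sample_ns_per_iter<T>(
    mut routine: impl FnMut() -> T,
    warm_up: Duration,
    samples: usize,
    iters_per_sample: u32,
) -> Vec<f64> {
    let start = Instant::now();
    while start.elapsed() < warm_up {
        black_box(routine());
    }
    (0..samples)
        .map(|_| {
            let t = Instant::now();
            for _ in 0..iters_per_sample {
                black_box(routine());
            }
            t.elapsed().as_nanos() as f64 / f64::from(iters_per_sample)
        })
        .collect()
}

/// Percentile-bootstrap confidence interval for the mean.
/// Returns (lower bound, sample mean, upper bound).
fn bootstrap_mean_ci(data: &[f64], resamples: usize, confidence: f64, seed: u64) -> (f64, f64, f64) {
    assert!(!data.is_empty() && resamples > 0);
    let n = data.len();
    let mean = data.iter().sum::<f64>() / n as f64;

    // Tiny xorshift generator so the sketch needs no external crates.
    let mut state = seed.max(1);
    let mut next = move || {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        state
    };

    // Resample with replacement and record the mean of each resample.
    let mut means: Vec<f64> = (0..resamples)
        .map(|_| (0..n).map(|_| data[(next() % n as u64) as usize]).sum::<f64>() / n as f64)
        .collect();
    means.sort_by(|a, b| a.total_cmp(b));

    // Cut off equal tails on both sides of the sorted resample means.
    let tail = (1.0 - confidence) / 2.0;
    let lo = means[(resamples as f64 * tail) as usize];
    let hi = means[((resamples as f64 * (1.0 - tail)) as usize).min(resamples - 1)];
    (lo, mean, hi)
}
```

A call such as bootstrap_mean_ci(&sample_ns_per_iter(|| work(), Duration::from_millis(100), 50, 1_000), 10_000, 0.95, 42), where work is whatever you want to time, gives three numbers with the same shape as criterion's time: line.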

The black_box function is the most important convention in Rust benchmarking. The compiler optimizes aggressively. If it can prove that an input is a constant, it may compute the result at compile time and remove the runtime call entirely. black_box is an identity function that the optimizer is asked to treat as opaque. It is a best-effort hint rather than a hard guarantee, but in practice it is enough to keep the work in place. Pass your inputs through black_box to prevent constant folding. Pass your outputs through it to prevent dead code elimination. Inside criterion, b.iter already does this for the closure's return value.

Treat black_box as a contract with the optimizer. Without it, you are benchmarking the compiler's imagination.
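The same convention works outside criterion. The sketch below uses std::hint::black_box (stable since Rust 1.66) in a hand-rolled timing loop, guarding both the input and the output. The helper name time_fib is hypothetical, and a single loop like this is only a crude measurement with none of criterion's statistics.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

/// Hypothetical helper: call fibonacci(n) `iters` times and return the last
/// result together with the total elapsed time.
fn time_fib(n: u64, iters: u32) -> (u64, Duration) {
    let start = Instant::now();
    let mut last = 0;
    for _ in 0..iters {
        // Input through black_box: the compiler cannot assume n is a known constant.
        // Output through black_box: the result counts as used, so the call is not dead code.
        last = black_box(fibonacci(black_box(n)));
    }
    (last, start.elapsed())
}
```

Dividing the returned duration by iters gives a rough per-call time. Removing either black_box call and comparing the result in a release build is a quick way to see whether the optimizer was eliminating work.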

Comparing implementations side by side

The real value of benchmarking appears when you compare two implementations. Criterion provides a BenchmarkGroup for this exact purpose. Groups let you run multiple variants with identical setup and produce side-by-side plots.

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn slow_sum(v: &[u64]) -> u64 {
    // Naive accumulator with no fancy tricks.
    let mut s = 0;
    for x in v { s += x; }
    s
}

fn fast_sum(v: &[u64]) -> u64 {
    // Iterator chain. LLVM can often auto-vectorise this, and may do the same for the manual loop.
    v.iter().sum()
}

fn bench_sums(c: &mut Criterion) {
    // Pre-allocate the test data once outside the group.
    // Reusing the same slice ensures both variants measure identical memory layouts.
    let data: Vec<u64> = (0..10_000).collect();

    // A group lets criterion plot two competing implementations side by side.
    // The group name becomes the folder name in the HTML report.
    let mut group = c.benchmark_group("sum 10k");
    group.bench_function("manual loop", |b| b.iter(|| slow_sum(black_box(&data))));
    group.bench_function("iter().sum",  |b| b.iter(|| fast_sum(black_box(&data))));
    // Each bench_function call above has already run its measurement.
    // finish() closes the group and writes the summary report that compares them.
    group.finish();
}

criterion_group!(benches, bench_sums);
criterion_main!(benches);

After running, criterion writes an HTML report to target/criterion/report/index.html. The report contains violin plots, regression analysis, and side-by-side timings for every group.

Open the HTML report. Visual data catches regressions that terminal numbers hide.
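A timing comparison only means something if both variants compute the same answer. A small check like the one below, run as a unit test before benchmarking, catches the case where the faster version is fast because it is wrong. The helper name variants_agree is hypothetical; the two sum functions mirror the ones in the benchmark above.

```rust
fn slow_sum(v: &[u64]) -> u64 {
    // Manual accumulator loop.
    let mut s = 0u64;
    for x in v {
        s += *x;
    }
    s
}

fn fast_sum(v: &[u64]) -> u64 {
    // Iterator-based sum.
    v.iter().sum()
}

/// Hypothetical helper: true when both variants agree on every given input.
fn variants_agree(inputs: &[Vec<u64>]) -> bool {
    inputs.iter().all(|v| slow_sum(v) == fast_sum(v))
}
```

Including edge cases such as an empty vector and a single element alongside the 10,000-element benchmark input keeps the check cheap while covering the inputs most likely to differ.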

The traps that ruin measurements

Measurement introduces its own traps. Debug builds run without optimizations. If you accidentally time a debug binary, your numbers can be ten to one hundred times slower than production reality. Always verify you are running cargo bench, which builds with the optimized bench profile automatically. The community convention is to never trust a benchmark that did not run through the release pipeline.
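For hand-rolled timing code that does not go through cargo bench, a guard can refuse to report numbers from an unoptimized build. The sketch below relies on cfg!(debug_assertions), which is on by default in the dev profile and off in release and bench. That makes it a heuristic, since a profile can override the setting. Both function names are hypothetical.

```rust
/// True when debug assertions are compiled in. This is the default for dev
/// builds and is off for release and bench builds unless a profile overrides it.
fn built_with_debug_assertions() -> bool {
    cfg!(debug_assertions)
}

/// Hypothetical guard: refuse to produce timings from what looks like a debug build.
fn check_build_for_timing() -> Result<(), String> {
    if built_with_debug_assertions() {
        Err("debug assertions are on; timings from this build are not representative".to_string())
    } else {
        Ok(())
    }
}
```

Calling check_build_for_timing() at the top of a timing binary and exiting on Err turns an easy-to-miss mistake into an immediate failure.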

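Tiny inputs create noise. Modern processors execute simple arithmetic in single-digit nanoseconds. If your routine finishes that quickly, loop overhead and scheduler jitter make up a large share of what you measure. Increase the input size so the real work dominates. Setup cost is a separate trap. If every iteration needs a fresh input, such as a vector that gets sorted in place, building it inside b.iter means the construction is timed along with your algorithm. Use iter_batched for that case. It calls your setup closure once per iteration to build a batch of inputs, then times only the routine that consumes them. It separates initialization cost from execution cost.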
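The idea behind batched iteration can be shown without criterion. The std-only sketch below builds a batch of fresh inputs untimed, then times only the routine. It is an illustration of the technique, not criterion's implementation, and time_batched is a hypothetical name. In criterion itself the corresponding call is b.iter_batched(setup, routine, BatchSize::SmallInput).

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: run `setup` once per routine call outside the timed
/// region, then time `routine` consuming each prepared input.
fn time_batched<I, O>(
    mut setup: impl FnMut() -> I,
    mut routine: impl FnMut(I) -> O,
    batch_size: usize,
) -> (Vec<O>, Duration) {
    // Untimed: one fresh input per routine call.
    let inputs: Vec<I> = (0..batch_size).map(|_| setup()).collect();
    let mut outputs = Vec::with_capacity(batch_size);

    // Timed: only the routine runs between start and elapsed.
    let start = Instant::now();
    for input in inputs {
        outputs.push(black_box(routine(black_box(input))));
    }
    let elapsed = start.elapsed();

    // Outputs are handed back to the caller, so dropping them is untimed too.
    (outputs, elapsed)
}
```

For example, time_batched(|| vec![3u32, 1, 2], |mut v| { v.sort(); v }, 1_000) times a thousand sorts of a fresh unsorted vector without charging the allocations to the sort.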

Cache behavior skews results. Repeated calls in a tight loop over the same small input keep that data in the L1 cache. Real workloads often fetch data from slower memory tiers. A benchmark that keeps reusing one hot buffer can therefore report optimistic timings. Vary your input addresses or benchmark with production-scale data to see the true cost. For data-heavy code, the memory hierarchy can matter as much as algorithmic complexity.
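One way to vary input addresses is to cycle through a pool of separate buffers instead of reusing a single one. The sketch below shows the shape of that approach with hypothetical helper names. Whether it actually pushes data out of cache depends on the pool's total size relative to your CPU's caches, so treat the sizes as something to tune.

```rust
use std::hint::black_box;

/// Hypothetical helper: build `count` separately allocated buffers of `len`
/// elements so successive calls touch different memory.
fn make_pool(count: usize, len: usize) -> Vec<Vec<u64>> {
    (0..count)
        .map(|i| (0..len as u64).map(|x| x + i as u64).collect())
        .collect()
}

/// Sum one buffer per call, cycling through the pool round-robin.
/// Returns the wrapping total across all calls.
fn sum_round_robin(pool: &[Vec<u64>], calls: usize) -> u64 {
    let mut total = 0u64;
    for call in 0..calls {
        let buf = &pool[call % pool.len()];
        total = total.wrapping_add(black_box(buf).iter().sum::<u64>());
    }
    total
}
```

Timing sum_round_robin over a pool much larger than the last-level cache, and again over a pool of one buffer, gives a rough sense of how much of the single-buffer speed came from cache residency.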

Stale results are a silent killer. cargo bench rebuilds the benchmark whenever its sources change, so a stale binary is rarely the problem. A stale baseline is more likely: criterion compares each run against results saved under target/criterion, and those may come from a different branch, machine, or input size. If you see an error like E0599 (no method named iter_batched found for struct Bencher), your criterion version is outdated. Update the dependency and rebuild. Run cargo clean if you suspect cached artifacts are lying to you. It also deletes the saved results under target/criterion, so the next run starts from a fresh baseline.

Trust the numbers only when the setup matches reality. Garbage inputs produce garbage benchmarks.

When to reach for what

Use criterion for any optimization work where you need statistically sound numbers and regression tracking. Use the nightly #[bench] attribute only when you are experimenting on a nightly toolchain and want zero external dependencies. Use the time command with a release binary for crude end-to-end measurements of a complete CLI application. Use perf or flamegraph when you need to find hot spots in a running program rather than measuring an isolated function.

Benchmarks isolate code. Profilers observe behavior. Pick the tool that matches your question.

Where to go next