How to Benchmark Code in Rust

The compiler will optimize your benchmark away

You just refactored a loop. You replaced a for loop with iter().map().collect(). You feel good. You run the tests; they pass. But is it actually faster? Or did you just make the code uglier for no gain? In Rust, "fast" is a claim you have to prove. The compiler is aggressive. It sees patterns and eliminates work. If you measure code without understanding how the compiler thinks, you will measure the speed of doing nothing. You need a tool that forces the compiler to do the work and gives you statistical confidence that the result is real.

Why criterion is the standard

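By default, cargo bench runs benchmarks through libtest's built-in harness, which only recognizes functions marked #[bench]. That attribute is still unstable, so it requires nightly Rust. Its output is a single ns/iter figure with a rough +/- spread. There are no confidence intervals and no comparison with the previous run, so it is hard to separate a real change from CPU frequency scaling, cache effects, or ordinary variance. The community moved to criterion, which runs on stable Rust and plugs into cargo bench through its own harness.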

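Think of criterion as a statistical lab for your code. It doesn't just run your function once. It first warms the code up (for 3 seconds by default), then collects many samples (100 by default), each covering many iterations. From those samples it estimates the distribution of per-iteration times and computes confidence intervals by bootstrap resampling. It also compares every run with the previous one and tells you whether the difference is statistically significant or just random noise. With its html_reports feature enabled, it generates an HTML report you can share. It is the de facto standard for performance work in Rust.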
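To make the confidence-interval idea concrete, here is a minimal percentile-bootstrap sketch that uses only the standard library. It illustrates the statistics under simplifying assumptions and is not criterion's implementation, whose analysis is more involved. The timings are made-up numbers, and bootstrap_mean_ci and the XorShift64 generator are hypothetical helpers written for this example.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
/// Hypothetical helper, not part of criterion.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Returns an index in 0..n (modulo bias is acceptable for an illustration).
    fn index(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Arithmetic mean of a non-empty slice.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Percentile bootstrap confidence interval for the mean of `samples`.
/// `resamples` is how many resampled means to draw; `confidence` is e.g. 0.95.
fn bootstrap_mean_ci(samples: &[f64], resamples: usize, confidence: f64, seed: u64) -> (f64, f64) {
    assert!(!samples.is_empty() && resamples > 0);
    // xorshift gets stuck at zero, so never seed it with 0.
    let mut rng = XorShift64(seed.max(1));
    let mut scratch = vec![0.0; samples.len()];
    let mut means = Vec::with_capacity(resamples);
    for _ in 0..resamples {
        // Draw samples.len() values with replacement and record their mean.
        for slot in scratch.iter_mut() {
            *slot = samples[rng.index(samples.len())];
        }
        means.push(mean(&scratch));
    }
    means.sort_by(f64::total_cmp);
    // Take the lower and upper percentiles of the resampled means.
    let tail = (1.0 - confidence) / 2.0;
    let last = resamples - 1;
    let lo = ((resamples as f64 * tail) as usize).min(last);
    let hi = ((resamples as f64 * (1.0 - tail)) as usize).min(last);
    (means[lo], means[hi])
}

fn main() {
    // Made-up per-iteration timings in nanoseconds, for illustration only.
    let timings = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 12.5, 10.3, 9.7, 10.1];
    let (lo, hi) = bootstrap_mean_ci(&timings, 10_000, 0.95, 42);
    println!("mean {:.2} ns, 95% CI [{:.2}, {:.2}] ns", mean(&timings), lo, hi);
}
```

Resampling with replacement many times shows how far the mean would move if you reran the measurement. A narrow interval means the estimate is stable; when the intervals for two versions overlap heavily, treat the claimed difference with suspicion.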

Setup and minimal example

Add criterion to your Cargo.toml under [dev-dependencies] and create a benches directory. Then add a [[bench]] entry whose name matches the benchmark file (my_benchmark for benches/my_benchmark.rs). The entry must also disable Cargo's default test harness, because criterion provides its own entry point.

[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }

[[bench]]
name = "my_benchmark"
harness = false

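The harness = false line is critical. By default, Cargo builds every bench target with libtest's harness, which generates its own main function and looks for #[test] and #[bench] functions. criterion_main! defines a main function of its own. If you leave the harness enabled, libtest's generated main wins, criterion's main is never called, and the run typically reports zero benchmarks without measuring anything. The opposite mistake, setting harness = false but forgetting criterion_main!, fails to compile with error[E0601] because the crate has no main function.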

Create benches/my_benchmark.rs. This file contains the benchmark logic.

use criterion::{black_box, criterion_group, criterion_main, Criterion};

/// Adds two numbers. A simple pure function to benchmark.
fn add(left: u64, right: u64) -> u64 {
    left + right
}

/// Configures the benchmark group and runs the function.
fn criterion_benchmark(c: &mut Criterion) {
    // black_box prevents the compiler from optimizing away the inputs or outputs.
    // Without it, the compiler sees constant inputs and replaces the call with the result.
    c.bench_function("add", |b| b.iter(|| add(black_box(2), black_box(2))));
}

// criterion_group! collects benchmark functions into a group.
// criterion_main! creates the main function that runs the group.
criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Run the benchmark with cargo bench --bench my_benchmark. cargo bench always builds with the bench profile, which inherits its settings from the release profile, so you do not need a --release flag. This ensures you are measuring optimized code, not an unoptimized debug build with overflow checks enabled.

How black_box stops the compiler from cheating

The most common mistake in Rust benchmarking is forgetting black_box. The Rust compiler performs constant folding and dead code elimination. If you benchmark add(2, 2) with literal arguments, the compiler evaluates 2 + 2 at compile time and replaces the work inside the loop with the constant 4. Your benchmark measures the speed of returning a constant, and the reported time is close to zero. That is not the speed of addition. That is the speed of doing nothing.

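black_box acts like a wall. It is an identity function that the optimizer treats as opaque: it must assume the value passed in may be read or changed in ways it cannot see, so it cannot fold a black-boxed input into a constant or discard a black-boxed result as unused. The standard library documents this as a best-effort hint rather than a hard guarantee, but in practice it keeps the measured work in place. criterion's b.iter already passes your closure's return value through black_box, so inputs are where you most often need to add it yourself. The same function is available on stable Rust as std::hint::black_box (since Rust 1.66).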

If you omit black_box, you will see suspiciously fast results. criterion still runs for its full measurement time, but the reported time per iteration will be near zero. That is the signal that the compiler optimized your work away. Trust black_box. If your benchmark takes zero time, the compiler won.
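You can observe the same effect without criterion. The sketch below times a hypothetical sum function with std::time::Instant, once with the input visible to the optimizer and once with it passed through std::hint::black_box. The helper names are invented for this example, and the numbers depend on your machine and compiler version. In a release build the first loop may report much less time per call, because the optimizer is free to compute the loop-invariant sum once and reuse it.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical workload: a plain sum over a slice.
fn sum(values: &[u64]) -> u64 {
    values.iter().sum()
}

/// Calls `f` `iters` times and returns the average wall-clock time per call.
fn time_per_call<F: FnMut() -> u64>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // The result goes through black_box so it counts as used.
        black_box(f());
    }
    start.elapsed() / iters
}

fn main() {
    let data: Vec<u64> = (0..10_000).collect();
    let iters = 1_000;

    // Input visible to the optimizer: it may hoist the sum out of the loop.
    let unguarded = time_per_call(iters, || sum(&data));

    // Input behind black_box: the optimizer should treat it as possibly
    // different on every call, so the sum is recomputed each time.
    let guarded = time_per_call(iters, || sum(black_box(&data)));

    println!("without black_box on the input: {unguarded:?} per call");
    println!("with black_box on the input:    {guarded:?} per call");
}
```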

Realistic benchmarking with setup costs

Real code often has setup costs. You might need to allocate a vector, parse a string, or initialize a structure before the work you care about. If you include setup inside the measured loop, you skew the results. You measure setup plus work. You want to measure only the work.

Use iter_batched to separate setup from measurement. It calls a setup closure to build a batch of inputs, starts the timer, runs the benchmark closure on each input in the batch, and stops the timer before the outputs are dropped. Each call to the benchmark closure gets its own fresh input, and the setup cost stays out of the reported time.

use criterion::{black_box, criterion_group, criterion_main, BatchSize, Criterion};

/// Parses a CSV line into fields.
/// This allocates a Vec and splits the string.
fn parse_csv(line: &str) -> Vec<&str> {
    line.split(',').collect()
}

/// Benchmarks parsing with realistic input and batched setup.
fn criterion_benchmark(c: &mut Criterion) {
    // Create a realistic input string.
    // Benchmarking with tiny input hides allocation costs.
    let input = "id,name,age,city,country,zip,phone,email".to_string();

    c.bench_function("parse_csv", |b| {
        // iter_batched runs the setup closure to create input.
        // It then measures the benchmark closure separately.
        // BatchSize::SmallInput is used for small data to reduce overhead.
        b.iter_batched(
            || input.clone(),
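            // Return only the field count: the Vec<&str> borrows from `s`,
            // which is dropped when this closure returns.
            |s| parse_csv(black_box(&s)).len(),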
            BatchSize::SmallInput,
        )
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

The setup closure || input.clone() runs once per iteration to produce a fresh string, and criterion builds a whole batch of these before it starts the timer. The benchmark closure then parses each string, and criterion measures only that closure. The clone cost is excluded from the reported time. This gives you a clean measurement of the parsing logic.
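To see why setup drops out of the numbers, here is a simplified sketch of the batching idea using only the standard library. time_batched is a hypothetical helper written for this example, not criterion's API or implementation, and it reports a single average rather than criterion's statistics. Inputs for the whole batch are built first, the clock runs only around the routine calls, and outputs are dropped after the clock stops.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Parses a CSV line into fields (the same function as in the benchmark above).
fn parse_csv(line: &str) -> Vec<&str> {
    line.split(',').collect()
}

/// Hypothetical helper (not criterion's API): builds `batch_size` inputs with
/// `setup`, times only the `routine` calls, and returns the average per call.
fn time_batched<I, O, S, R>(mut setup: S, mut routine: R, batch_size: u32) -> Duration
where
    S: FnMut() -> I,
    R: FnMut(I) -> O,
{
    assert!(batch_size > 0);
    // 1. Build every input before the clock starts, so setup is never timed.
    let inputs: Vec<I> = (0..batch_size).map(|_| setup()).collect();
    let mut outputs: Vec<O> = Vec::with_capacity(inputs.len());

    // 2. Time only the routine calls, each on its own fresh input.
    let start = Instant::now();
    for input in inputs {
        outputs.push(routine(input));
    }
    let elapsed = start.elapsed();

    // 3. Outputs are dropped after the clock stops, so freeing them is not timed.
    black_box(outputs);
    elapsed / batch_size
}

fn main() {
    let input = "id,name,age,city,country,zip,phone,email".to_string();
    // The clone happens in setup and is not timed; the parse is.
    let per_call = time_batched(
        || input.clone(),
        |s: String| parse_csv(black_box(&s)).len(),
        1_000,
    );
    println!("about {per_call:?} per parse");
}
```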

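A note on batch sizes: BatchSize::SmallInput is the right default. It tells criterion that the inputs are small enough to hold many of them in memory at once, so it can use large batches with little timing overhead. Switch to BatchSize::LargeInput when each input is big enough that holding a large batch would use too much memory; criterion then uses smaller batches, at the cost of slightly more measurement overhead. Pick the batch size that matches your workload.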

Measure the work, not the setup. Use iter_batched when initialization matters.

Comparing implementations

You often want to compare two versions of code. criterion supports this via BenchmarkId and benchmark groups. You can run multiple benchmarks in one group and compare their results in the HTML report.

use criterion::{criterion_group, criterion_main, BatchSize, BenchmarkId, Criterion};

/// Sorts a slice using the standard library sort.
fn sort_vec(data: &mut [i32]) {
    data.sort();
}

/// Sorts a slice using a custom insertion sort.
fn insertion_sort(data: &mut [i32]) {
    for i in 1..data.len() {
        let key = data[i];
        let mut j = i;
        // Shift larger elements one slot right until key's position is found.
        while j > 0 && data[j - 1] > key {
            data[j] = data[j - 1];
            j -= 1;
        }
        data[j] = key;
    }
}

/// Benchmarks two sorting algorithms with labeled IDs.
fn criterion_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("sort");

    // Generate reverse-sorted test data once (the worst case for insertion sort).
    let data: Vec<i32> = (0..1000).rev().collect();

    // Benchmark standard sort with a specific ID.
    // iter_batched keeps the clone out of the measurement; each run sorts a fresh copy.
    group.bench_with_input(BenchmarkId::new("vec_sort", data.len()), &data, |b, d| {
        b.iter_batched(
            || d.clone(),
            |mut v| {
                sort_vec(&mut v);
                v // returned values are passed through black_box by criterion
            },
            BatchSize::SmallInput,
        );
    });

    // Benchmark insertion sort with a different ID.
    group.bench_with_input(BenchmarkId::new("insertion_sort", data.len()), &data, |b, d| {
        b.iter_batched(
            || d.clone(),
            |mut v| {
                insertion_sort(&mut v);
                v
            },
            BatchSize::SmallInput,
        );
    });

    // finish() consumes the group and writes its summary report.
    group.finish();
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

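BenchmarkId labels each result with a function name and a parameter, here the input size. The HTML report shows both benchmarks side by side, so you can see their relative performance. Call group.finish() when the group is done: it consumes the group and generates its summary report. criterion's documentation recommends calling it explicitly. If you forget, the same work happens when the group is dropped, but until then the group holds a mutable borrow of the Criterion instance, so an explicit finish() also makes it clear where the next group can start.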

Label your results. BenchmarkId keeps your data organized and comparable.

Pitfalls and compiler errors

Benchmarks can mislead if you ignore the details. Watch for these common issues.

Forgetting harness = false silently skips your benchmarks. With the default libtest harness, libtest's generated main is used instead of the one criterion_main! defines, so the run reports zero benchmarks and measures nothing. The build usually succeeds, so nothing warns you. Conversely, harness = false without criterion_main! fails with error[E0601] because the crate has no main function. Add harness = false to the [[bench]] section and end the file with criterion_main!.

Forgetting black_box causes optimized-away benchmarks. The result is near-zero time. This is not an error. It is a silent failure. The benchmark runs, but it measures nothing. Check the reported time. If it is suspiciously fast, add black_box.

Timing a debug build measures the wrong thing. cargo bench always uses the bench profile, which inherits from release, so criterion benchmarks are optimized without any extra flag. The trap is ad-hoc timing elsewhere: code timed under cargo run or cargo test uses an unoptimized profile unless you pass --release, and debug builds are often many times slower than optimized ones.

CPU frequency scaling can skew results on laptops. Turbo boost changes clock speed during runs. criterion tries to handle this by running many iterations and using statistical methods. For consistent results, run benchmarks on a stable system or disable turbo boost.
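Warm-up is one way criterion limits this noise: before it collects samples, it runs the routine for a fixed period (3 seconds by default, adjustable with Criterion::warm_up_time) so caches, branch predictors, and clock speed have a chance to settle. The hypothetical warm_up helper below sketches that idea with the standard library. The checksum workload and the 200 ms duration are arbitrary choices for illustration, not criterion's logic.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical workload: a rolling checksum over a slice.
fn checksum(values: &[u64]) -> u64 {
    values
        .iter()
        .fold(0u64, |acc, &x| acc.wrapping_mul(31).wrapping_add(x))
}

/// Hypothetical helper: calls `f` until `duration` has elapsed and returns the
/// number of calls made. No timing from this phase is kept.
fn warm_up<F: FnMut() -> u64>(duration: Duration, mut f: F) -> u64 {
    let start = Instant::now();
    let mut calls = 0;
    while start.elapsed() < duration {
        black_box(f());
        calls += 1;
    }
    calls
}

fn main() {
    let data: Vec<u64> = (0..1_000).collect();

    // Warm-up phase: run the work and discard the timings.
    let calls = warm_up(Duration::from_millis(200), || checksum(black_box(&data)));

    // Measurement phase: the clock only starts after the warm-up.
    let iters: u32 = 10_000;
    let start = Instant::now();
    for _ in 0..iters {
        black_box(checksum(black_box(&data)));
    }
    println!("{calls} warm-up calls, then {:?} per call", start.elapsed() / iters);
}
```

Warm-up helps with caches and short-lived frequency ramps, but it cannot fix thermal throttling or background load, which is why a quiet, stable machine still matters.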

Benchmarks lie if you let the compiler cheat. black_box is your shield.

When to use criterion versus alternatives

Rust has a few benchmarking tools. Pick the one that matches your needs.

Use criterion when you need statistical rigor, confidence intervals, and detailed HTML reports. It is the community standard for libraries and serious performance work. It handles variance, warm-up, and comparison groups automatically.

Use divan when you prefer an attribute-based style with less setup. You mark functions with #[divan::bench], much like the nightly #[bench] attribute, and call divan::main() from the benchmark file's main function (the [[bench]] entry still needs harness = false). divan runs on stable Rust and is a good choice for small projects or when you want minimal configuration.

Reach for std::time::Instant only when you cannot add dev-dependencies, such as in a constrained embedded environment or a quick ad-hoc check inside a binary. Manual timing lacks statistical analysis and is prone to optimization errors. Use it sparingly.
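If you do fall back to manual timing, a few habits recover some of what criterion gives you: route inputs and outputs through std::hint::black_box, time many calls per sample so timer resolution does not dominate, and report the median of several samples rather than a single run. The sketch below shows those habits; median_time and count_even are hypothetical names, and the sample and iteration counts are arbitrary.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical workload: counts even values in a slice.
fn count_even(values: &[u32]) -> usize {
    values.iter().filter(|&&v| v % 2 == 0).count()
}

/// Takes `samples` measurements, each averaging `iters` calls of `f`, and returns
/// the median per-call time (the upper median when `samples` is even).
fn median_time<T, F: FnMut() -> T>(samples: usize, iters: u32, mut f: F) -> Duration {
    assert!(samples > 0 && iters > 0);
    let mut times: Vec<Duration> = (0..samples)
        .map(|_| {
            let start = Instant::now();
            for _ in 0..iters {
                black_box(f());
            }
            start.elapsed() / iters
        })
        .collect();
    times.sort();
    times[times.len() / 2]
}

fn main() {
    let data: Vec<u32> = (0..10_000).collect();
    let median = median_time(31, 100, || count_even(black_box(&data)));
    println!("median: {median:?} per call");
}
```

The median is less sensitive than the mean to one-off interruptions such as a context switch, but it is still no substitute for criterion's confidence intervals and run-to-run comparison.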

Pick the tool that matches your rigor. criterion for libraries, divan for simplicity.

Where to go next

Benchmarking answers one question: did a change make this code faster or slower? Think of criterion as a stopwatch with statistics attached. Point it at the specific functions you plan to optimize, and keep those benchmarks around so you can rerun them after every refactor; criterion compares each run with the previous one automatically.