How to use perf with Rust

When the code looks fast but isn't

You wrote a Rust function. It feels fast. Then you run it on a million items and it takes four seconds. You stare at the code. Everything looks optimal. The loop is tight. The allocations are gone. Where is the time going? You need a tool that doesn't lie about performance and doesn't slow your program down by a factor of ten. perf is that tool. It's part of the Linux kernel, it samples your hardware directly, and it works beautifully with Rust once you set up the symbols right.

Sampling without the slowdown

perf doesn't wrap your code. It doesn't inject counters. It asks the CPU to interrupt your program at a regular rate, a few thousand times per second by default, and records which instruction is running. This is sampling. The overhead is small, often around one percent at the default rate, though recording call graphs adds to it. You get a statistically accurate picture of where your time actually goes.

Rust produces DWARF debug information. perf reads that information to map raw addresses back to function names and source lines. If the symbols are missing, perf sees only hex addresses. If the symbols are there, perf shows you your Rust code.

Build with debug symbols, or perf is blind.

Minimal example

Start with a function that does real work. A loop that runs long enough for the sampler to catch it.

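use std::hint::black_box;

/// Calculates a heavy workload to demonstrate perf sampling.
fn heavy_work(n: usize) -> u64 {
    let mut sum = 0u64;
    // Simulate work that perf can sample.
    // The loop must run long enough to generate samples.
    for i in 0..n {
        // black_box hides the value from the optimizer. Without it, a release
        // build may replace the whole loop with a closed-form sum.
        sum = sum.wrapping_add(black_box(i as u64));
    }
    sum
}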

fn main() {
    // Run long enough for perf to catch samples.
    // Short runs produce no data.
    let result = heavy_work(100_000_000);
    println!("Result: {}", result);
}

Compile this in release mode. perf needs the optimized binary to show real performance characteristics. Debug builds have no optimizations, so the samples will point to unoptimized code that doesn't represent your production behavior.

cargo build --release

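Convention aside: the release profile sets debug = false by default, so the binary carries no DWARF debug info. perf can still read function names from the symbol table, but it needs debug info to map samples to source lines and to see inlined functions. Tell Cargo to keep debug info in the release binary by adding this to your Cargo.toml.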

[profile.release]
debug = true

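Now record the execution. Use the -g flag to capture the call graph. Without -g, you see hot functions but not who called them. The call graph is essential for understanding context. By default -g unwinds the stack through frame pointers, which Rust release builds normally omit. If the call chains look truncated, record with --call-graph dwarf instead, or rebuild with RUSTFLAGS="-C force-frame-pointers=yes".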

perf record -g ./target/release/heavy_work

perf runs your program. The kernel samples the CPU. By the time the program exits, the samples are in perf.data in the current directory. Generate the report.

perf report

The report shows a sorted list of functions. The top entry is where your CPU spent the most time.

Trust the samples, not the line numbers.

What happens under the hood

When you run perf record, the tool sets up a performance counter in the kernel. It tells the CPU to count cycles. Every time the counter overflows, the CPU triggers a hardware interrupt. The kernel saves the instruction pointer and the stack trace. Your program keeps running.

The interrupt handler is fast. It saves the state and returns immediately. This is why the overhead is low. You aren't executing extra code in your program. You are letting the hardware do the work.

perf record streams those samples to perf.data while the program runs. perf report reads the file. It loads the symbol table and the DWARF debug information from your binary. It maps each sample's instruction pointer to a function name and a source line. It aggregates the counts and sorts them.

The result is a histogram of where your time went. If heavy_work appears at 80 percent, the CPU spent 80 percent of the sampled cycles inside that function.
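To make the mapping and aggregation concrete, here is a toy sketch of what a report does with raw samples. It is not perf's implementation: the Symbol type, the addresses, and the resolve and histogram functions are all made up for illustration, and it assumes each function runs from its start address to the next symbol's start.

```rust
use std::collections::HashMap;

/// A hypothetical symbol: the address where a function starts, and its name.
struct Symbol {
    start: u64,
    name: &'static str,
}

/// Finds the function containing `addr`, assuming `symbols` is sorted by
/// start address and each function extends up to the next symbol's start.
fn resolve(symbols: &[Symbol], addr: u64) -> Option<&'static str> {
    // Index of the first symbol that starts after `addr`.
    let idx = symbols.partition_point(|s| s.start <= addr);
    // The containing symbol is the one just before it, if any.
    idx.checked_sub(1).map(|i| symbols[i].name)
}

/// Aggregates sampled addresses into (name, percent) rows, hottest first.
fn histogram(symbols: &[Symbol], samples: &[u64]) -> Vec<(&'static str, f64)> {
    let mut counts: HashMap<&'static str, usize> = HashMap::new();
    for &addr in samples {
        let name = resolve(symbols, addr).unwrap_or("[unknown]");
        *counts.entry(name).or_insert(0) += 1;
    }
    let total = samples.len() as f64;
    let mut rows: Vec<(&'static str, f64)> = counts
        .into_iter()
        .map(|(name, n)| (name, 100.0 * n as f64 / total))
        .collect();
    // Sort by descending share; break ties by name for stable output.
    rows.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(b.0)));
    rows
}

fn main() {
    // Made-up addresses for illustration only.
    let symbols = [
        Symbol { start: 0x1000, name: "main" },
        Symbol { start: 0x2000, name: "heavy_work" },
    ];
    // Ten samples: eight land in heavy_work, two in main.
    let samples = [
        0x2004, 0x2010, 0x2004, 0x2018, 0x2004,
        0x2010, 0x2004, 0x2018, 0x1008, 0x1010,
    ];
    for (name, pct) in histogram(&symbols, &samples) {
        println!("{pct:6.2}%  {name}");
    }
}
```

The real tool does far more (multiple binaries, inlined frames, kernel symbols), but the core idea is the same: look up each address, count, divide, sort.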

The sampling interval matters. perf defaults to a frequency of 4000 Hz. This means it samples 4000 times per second. For a program that keeps one core busy for one second, you get about 4000 samples. This is enough for a good histogram. If your program runs for 10 milliseconds, you get 40 samples. The statistics are noisy. Run your program longer or raise the frequency, for example with -F 10000; the kernel caps the rate at the value in /proc/sys/kernel/perf_event_max_sample_rate. For long runs you can go the other way and lower it with -F 1000. Lower frequency means less overhead but less precision. Find the balance.
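The arithmetic is easy to check. The sketch below computes the expected sample count for one busy thread and a rough error bar on a function's share. The function names are hypothetical, and the error estimate assumes samples behave like independent draws, which is a simplification.

```rust
/// Expected number of samples for one busy thread, assuming perf manages
/// to hold the requested frequency for the whole run.
fn expected_samples(freq_hz: u64, runtime_ms: u64) -> u64 {
    freq_hz * runtime_ms / 1000
}

/// Rough relative error (one standard deviation) on a function's share,
/// treating samples as independent draws (a simplifying assumption).
fn relative_error(share: f64, samples: u64) -> f64 {
    // Binomial standard error sqrt(p(1-p)/n), divided by p.
    ((1.0 - share) / (share * samples as f64)).sqrt()
}

fn main() {
    for runtime_ms in [10, 1000] {
        let n = expected_samples(4000, runtime_ms);
        let err = relative_error(0.10, n);
        println!(
            "{runtime_ms} ms -> {n} samples, ~{:.0}% relative error on a 10% function",
            err * 100.0
        );
    }
}
```

Under these assumptions, a function that takes 10 percent of the time is measured to within roughly half its value from 40 samples, and to within about a twentieth from 4000.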

Realistic example: finding the hidden cost

Real code has structure. Functions call functions. Traits get monomorphized. perf handles all of this. Here is a more realistic scenario with a hot loop and a cold path.

/// Processes data with a hot loop and a cold path.
fn process_data(data: &[u32]) -> u64 {
    let mut count = 0u64;
    // Hot path: iterates over all data.
    // This loop dominates the runtime.
    for &val in data {
        if val > 100 {
            count += 1;
        }
    }
    count
}

/// Validates the result.
/// This function is called once and takes negligible time.
fn validate(count: u64) -> bool {
    count > 0
}

fn main() {
    let data: Vec<u32> = (0..10_000_000).collect();
    let count = process_data(&data);
    let is_valid = validate(count);
    println!("Valid: {}", is_valid);
}

Record and report as before. process_data should appear near the top of the report, unless the compiler inlined it into main, in which case main carries its samples. validate will be near the bottom or invisible.

Select process_data in the perf report interface and press a, or press Enter and choose Annotate from the menu. This opens perf annotate. perf annotate shows the assembly code for the function. It highlights the hot instructions in red. You can see exactly which assembly instructions are consuming cycles.

Convention aside: perf annotate is the bridge between high-level Rust and low-level assembly. Use it to verify that the compiler generated efficient code. If you see unexpected branches or memory accesses, the assembly tells you why.

Sometimes the hot function is inlined. Rust inlines aggressively. perf might show the caller instead of the callee. This is correct. The code is inside the caller. If you want to see the callee, use #[inline(never)] temporarily, or look at the assembly in perf annotate to find the inlined code.
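A minimal sketch of the attribute in use, reusing process_data from the example above. The black_box call is an addition, there to keep the optimizer from discarding the result. Note that #[inline(never)] is a request to the compiler, and that it can change the performance you are trying to measure, so remove it once you have your numbers.

```rust
use std::hint::black_box;

/// Counts values above a threshold.
/// `#[inline(never)]` asks the compiler to keep this as a separate function,
/// so its samples are reported under its own name rather than the caller's.
#[inline(never)]
fn process_data(data: &[u32]) -> u64 {
    let mut count = 0u64;
    for &val in data {
        if val > 100 {
            count += 1;
        }
    }
    count
}

fn main() {
    let data: Vec<u32> = (0..10_000_000).collect();
    // black_box keeps the optimizer from discarding the result.
    let count = black_box(process_data(&data));
    println!("Count: {count}");
}
```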

Interpreting the report

With call graphs recorded, the report shows two percentages for each function, in columns labeled Children and Self. Self is the time spent in the function itself. Children is the total: it includes time spent in functions called by this function. If process_data has 80 percent self and 80 percent total, the work is inside process_data. If process_data has 10 percent self and 80 percent total, the work is in functions called by process_data. Look at the children in the call graph to find the real hotspot.
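The two columns fall out of the sampled stacks. Here is a toy sketch of the bookkeeping, not perf's actual code. The stacks are invented, parse is a hypothetical helper that does not appear in the example program, and self_and_total is a name chosen for this illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Per-function sample counts: (self, total including callees).
fn self_and_total<'a>(stacks: &[Vec<&'a str>]) -> BTreeMap<&'a str, (usize, usize)> {
    let mut counts: BTreeMap<&'a str, (usize, usize)> = BTreeMap::new();
    for stack in stacks {
        // Self: only the innermost frame was actually executing.
        if let Some(leaf) = stack.last() {
            counts.entry(*leaf).or_default().0 += 1;
        }
        // Total: every distinct function on the stack gets credit once,
        // so recursion does not count a sample twice.
        let unique: BTreeSet<&str> = stack.iter().copied().collect();
        for name in unique {
            counts.entry(name).or_default().1 += 1;
        }
    }
    counts
}

fn main() {
    // Hypothetical samples; each stack is listed outermost frame first.
    let stacks = vec![
        vec!["main", "process_data"],
        vec!["main", "process_data", "parse"],
        vec!["main", "process_data", "parse"],
        vec!["main", "process_data", "parse"],
        vec!["main", "validate"],
    ];
    let n = stacks.len() as f64;
    for (name, (self_n, total_n)) in self_and_total(&stacks) {
        println!(
            "{name:<14} self {:5.1}%  total {:5.1}%",
            100.0 * self_n as f64 / n,
            100.0 * total_n as f64 / n
        );
    }
}
```

In this made-up data, process_data is on the stack in four of five samples but is the innermost frame in only one. Its total is high and its self time is low, which points you at its callee.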

Convention aside: perf report sorts by overhead by default. You can change the sort order. Use perf report --sort=dso,symbol to group by binary. This helps when multiple binaries are involved. Use perf report --sort=symbol,dso to group by function. This helps when the same function appears in multiple binaries.

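Dynamic libraries can complicate things. If your Rust binary links to shared libraries, perf needs to find the symbols for those libraries too. Sometimes perf can't find them. Install the libraries' debug info if your distribution packages it. If call chains through those libraries look broken, perf record --call-graph dwarf unwinds the stack from debug info instead of frame pointers. Rust's linking model limits the problem. Cargo links Rust crates statically by default, so your code and its Rust dependencies resolve from the one binary. The binary is not fully static, though: on the default Linux target it still links dynamically to the system C library, so frames inside libc may show up without names.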

Pitfalls and permissions

perf is powerful, but it has quirks.

Debug builds lie. Profiling a debug build gives you data about unoptimized code. The compiler hasn't unrolled loops. It hasn't eliminated bounds checks. The samples will point to different lines than in release mode. Always profile release mode with debug = true.
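One way to avoid profiling a debug build by accident is to have the program warn you. This sketch assumes Cargo's default profiles, where debug assertions are on in dev builds and off in release builds; profile_warning is a hypothetical helper.

```rust
/// Returns a warning when the binary was built with debug assertions,
/// which Cargo enables in the dev profile and disables in release by default.
fn profile_warning(debug_assertions: bool) -> Option<&'static str> {
    if debug_assertions {
        Some("warning: this looks like a debug build; profile a release build instead")
    } else {
        None
    }
}

fn main() {
    // cfg!(debug_assertions) is evaluated at compile time.
    if let Some(msg) = profile_warning(cfg!(debug_assertions)) {
        eprintln!("{msg}");
    }
}
```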

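Missing symbols break the report. If you see [unknown] or raw hex addresses like 0x4f2a10, perf could not resolve the address to a name. Check that [profile.release] has debug = true and that nothing strips the binary, such as strip = true in the profile or a strip step in your build. Rebuild.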

Permissions can block recording. perf needs access to hardware counters. Linux restricts this for security. If perf record fails with a permission error that mentions perf_event_paranoid, that kernel parameter is too strict. You can adjust it by running echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid. A value of -1 lifts nearly all restrictions for unprivileged users; a value of 1 is usually enough to profile your own processes. This is a system-wide change. It lasts until the next reboot unless you revert it, so revert it when you are done.
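You can check the current setting before touching it. The sketch below reads the file and explains the level. The level descriptions paraphrase the kernel's sysctl documentation, parse_paranoid and describe are hypothetical helper names, and some distributions define stricter levels above 2 that this sketch lumps together.

```rust
use std::fs;

/// Parses the contents of /proc/sys/kernel/perf_event_paranoid.
fn parse_paranoid(contents: &str) -> Option<i32> {
    contents.trim().parse().ok()
}

/// Describes what a level means for unprivileged users.
fn describe(level: i32) -> &'static str {
    match level {
        l if l < 0 => "no restrictions",
        0 => "raw tracepoint access restricted",
        1 => "CPU-wide event access also restricted",
        _ => "kernel profiling also restricted; user-space only at best",
    }
}

fn main() {
    let path = "/proc/sys/kernel/perf_event_paranoid";
    match fs::read_to_string(path).ok().as_deref().and_then(parse_paranoid) {
        Some(level) => println!("perf_event_paranoid = {level}: {}", describe(level)),
        None => println!("could not read {path} (not Linux, or /proc unavailable)"),
    }
}
```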

Inlining hides functions. If a function is inlined, perf attributes the time to the caller. This is not a bug. The code exists in the caller. Use perf annotate to see the inlined assembly. Use #[inline(never)] if you need to isolate a function for profiling.

The kernel knows what your CPU is doing. Ask it.

When to use perf

Use perf when you need low-overhead sampling on Linux to find hotspots in a running application. Use criterion when you need statistically rigorous microbenchmarks to compare algorithm changes over time. Use flamegraph when you want a visual overview of call stacks and want to share a colorful image with your team. Use valgrind callgrind when you need exact call counts and precise line attribution and can tolerate a slowdown of 20x or more. Reach for perf when the bottleneck is in a complex system with many threads and you need to see the real-world behavior without instrumentation noise.

Where to go next

perf acts like a stopwatch for your processor, showing which parts of your code consume the most time. It runs against the binary you already built, so you can find bottlenecks without changing your code. Treat the report as a heat map of your application, and start optimizing where it is hottest.