How to use flamegraph

When timers fail you

You suspect a function is slow. You wrap it in std::time::Instant. You print the duration. The number jumps around. You add more timers. Your code is cluttered with logging. You still can't see the full picture. The bottleneck might be a helper function you didn't think to instrument. Or it might be memory allocation hiding inside a loop.

You need a view of where your CPU actually spends its time. You need a flamegraph.

What a flamegraph shows

A flamegraph is a visual stack trace. It maps function calls to blocks of colored rectangles. The width of a block represents how much CPU time the function consumed. The height represents the call depth.

Think of a busy kitchen. The wide blocks are the stations where chefs spend most of their time. If the chopping station is wide, you spend too much time cutting vegetables. If the plating station is narrow, that step is fast. The stack shows who called whom, reading upward. The chef at the bottom called the sous-chef above, who called the intern at the top.

Wide blocks are hot. They are where your time goes. Your goal is to find the widest blocks and make them narrower.
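
To make the width rule concrete, here is a minimal sketch that computes how wide each function would be from folded-stack text, the intermediate format flamegraph tools consume. The sample data and the inclusive_samples helper are hypothetical, and the sketch aggregates by function name only, whereas a real flamegraph draws one block per function per call path.

```rust
use std::collections::BTreeMap;

/// Sums inclusive sample counts per function from folded-stack lines.
/// Each line looks like `main;parse;alloc 30`: frames from root to leaf,
/// then the number of samples that ended in that exact stack.
fn inclusive_samples(folded: &str) -> (u64, BTreeMap<String, u64>) {
    let mut total = 0;
    let mut per_frame = BTreeMap::new();
    for line in folded.lines() {
        // Split off the trailing count; skip lines that do not parse.
        let Some((stack, count)) = line.rsplit_once(' ') else { continue };
        let Ok(count) = count.trim().parse::<u64>() else { continue };
        total += count;
        // Count each function once per stack, even if it recurses.
        let mut seen = Vec::new();
        for frame in stack.split(';') {
            if !seen.contains(&frame) {
                seen.push(frame);
                *per_frame.entry(frame.to_string()).or_insert(0) += count;
            }
        }
    }
    (total, per_frame)
}

fn main() {
    // Hypothetical sample data, not output from a real profile.
    let folded = "main;parse;alloc 30\nmain;parse 20\nmain;render 50\n";
    let (total, per_frame) = inclusive_samples(folded);
    for (frame, samples) in &per_frame {
        let width = 100.0 * *samples as f64 / total as f64;
        println!("{frame:<8} {samples:>4} samples  {width:5.1}% of graph width");
    }
}
```

In this made-up data main spans the whole graph, parse and render take half each, and alloc accounts for 30 of the 100 samples. Width is inclusive: a function's block covers its own time plus everything it calls.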

The toolchain

Rust does not include a profiler in the standard library. The ecosystem relies on system tools and community wrappers. The standard workflow on Linux uses perf to sample the stack and inferno to draw the graph.

perf is the Linux performance analysis tool. It interrupts your program at a fixed, configurable rate, up to thousands of times a second, and records the current call stack each time. inferno is a Rust crate that takes those samples and renders an interactive SVG.

There is also cargo-flamegraph, a wrapper that handles the perf invocation and inferno rendering for you. It is the fastest way to get a graph.

Convention aside: the Rust community prefers inferno over the Perl-based flamegraph.pl. inferno is a Rust port of those scripts, installs with cargo, and works both as a command-line tool and as a library. Stick with inferno unless you have a specific reason to use the Perl script.

Minimal example

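Install cargo-flamegraph. The crate is published under the name flamegraph and provides the cargo flamegraph subcommand. On Linux it needs perf to be installed; inferno is compiled in as a library, so you do not install it separately.

cargo install flamegraph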

Create a simple program with a known bottleneck.

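/// Simulates a workload with a hidden allocation cost.
fn main() {
    let mut data = Vec::new();
    // `push` does not allocate on every call. The vector grows its buffer
    // only when capacity runs out, and a growth may copy the old contents.
    // The flamegraph attributes that time to the allocator, not to your loop.
    for i in 0..100_000_000 {
        data.push(i);
    }
    // Use the result so the optimizer cannot remove the loop.
    println!("{}", data.len());
}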

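Run the profiler. You must profile an optimized build. Debug builds skip optimization and keep extra runtime checks such as integer overflow checks, so their hotspots rarely match those of the code you ship. Recent versions of cargo flamegraph already build with the release profile by default, so the flag here mostly states the intent.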

cargo flamegraph --release

The command runs your binary, captures samples, and generates flamegraph.svg in the current directory. Open the file in a browser. You will see a stack of blocks with main near the bottom. The widest blocks near the top tell you where the CPU spent the most time.

Profile in release mode. Debug builds lie about performance.

Reading the graph

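Open the SVG in a browser. The graph is interactive. You can click any block to zoom in on it and hover over a block to see its sample count and percentage. The search box, or Ctrl+F, highlights every block whose name matches a pattern.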

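Look for the widest blocks. Those are your hotspots. main is always wide, because everything runs underneath it, so its width alone tells you little. Look at how that width splits among the blocks above it. If the children are many and narrow, the work is spread out. If a single helper function is wide, that function is the bottleneck.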

Check the top of each stack. Allocator functions like malloc and system calls like futex appear there as leaves. If you see wide blocks in malloc, your code is allocating too much. If you see futex, you might be contending on locks.

The graph shows CPU time. It does not show I/O wait. If your program spends time waiting for a disk or network, that time is simply missing from the graph. The x-axis is not a timeline, so there is no gap to notice. You need a different tool, such as an off-CPU profiler or a tracer, for I/O bottlenecks.

Wide blocks are your enemy. Find them and shrink them.
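
As one illustration of shrinking a wide allocator block, the sketch below compares growing a vector on demand with reserving its capacity up front. The function names and the element count are hypothetical. Whether this particular change helps depends on the workload, so profile again after any fix.

```rust
/// Builds the vector with repeated growth: `push` reallocates whenever
/// the current capacity runs out.
fn build_growing(n: u32) -> Vec<u32> {
    let mut data = Vec::new();
    for i in 0..n {
        data.push(i);
    }
    data
}

/// Builds the same vector after reserving all the space up front,
/// so the loop itself never has to grow the buffer.
fn build_preallocated(n: u32) -> Vec<u32> {
    let mut data = Vec::with_capacity(n as usize);
    for i in 0..n {
        data.push(i);
    }
    data
}

fn main() {
    let n = 10_000_000; // hypothetical size, chosen only for illustration
    let a = build_growing(n);
    let b = build_preallocated(n);
    // Use the results so the optimizer cannot discard the work.
    println!("{} {}", a.len(), b.len());
}
```

Both functions return the same data. The difference you would look for in the graph is a narrower reallocation block under the second one.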

Bottom-up vs top-down

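Some developers prefer a bottom-up flamegraph. The default layout is top-down in call order: the root function sits at the base and each callee stacks on its caller. A bottom-up view reverses every stack before merging, so each leaf function sits at the base and its callers stack on it. This layout makes it easier to see which leaf functions contribute to the total time, even when they are reached through many different callers. You can generate a bottom-up graph by passing --reverse to inferno-flamegraph.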

The choice is personal. Top-down matches the call stack order. Bottom-up highlights the leaf work. Pick the orientation that helps you read the graph faster.
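
The effect of reversing stacks is easier to see on data. The following is a conceptual sketch, not inferno's implementation: the reverse_folded helper and the sample stacks are hypothetical. It flips the frame order of each folded line and merges identical results.

```rust
use std::collections::BTreeMap;

/// Reverses the frame order of each folded-stack line and merges
/// identical reversed stacks, so leaf functions become the base frames.
fn reverse_folded(folded: &str) -> BTreeMap<String, u64> {
    let mut merged = BTreeMap::new();
    for line in folded.lines() {
        // Split off the trailing count; skip lines that do not parse.
        let Some((stack, count)) = line.rsplit_once(' ') else { continue };
        let Ok(count) = count.trim().parse::<u64>() else { continue };
        let reversed: Vec<&str> = stack.split(';').rev().collect();
        *merged.entry(reversed.join(";")).or_insert(0) += count;
    }
    merged
}

fn main() {
    // Hypothetical samples: `alloc` is reached through two different callers.
    let folded = "main;parse;alloc 30\nmain;render;alloc 25\nmain;render 45\n";
    for (stack, count) in reverse_folded(folded) {
        println!("{stack} {count}");
    }
}
```

In the call-order view, alloc shows up as two separate blocks of 30 and 25 samples. After reversal both stacks start with alloc, so a renderer would draw one base block 55 samples wide, which is the point of the bottom-up layout.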

Realistic workflow with perf and inferno

cargo-flamegraph is convenient, but sometimes you need control. You might want to profile a specific command with arguments, or capture only a subset of the execution. In that case, use perf directly and pipe the output to inferno.

First, ensure your binary has debug symbols. Profiling without symbols gives you raw addresses instead of function names. Set the debug info level in your build.

RUSTFLAGS="-C debuginfo=2" cargo build --release

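Run perf to record samples. The -g flag enables call graph recording. The --freq flag (short form -F) sets the sampling frequency in samples per second. Higher frequency gives more detail but adds overhead. If the recorded stacks look truncated, record with --call-graph dwarf instead of -g so that perf unwinds using the debug info.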

perf record -g --freq 99 ./target/release/your-binary --arg1 --arg2

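Convert the data to a format inferno understands. perf script prints the raw stack traces, and inferno-collapse-perf folds identical stacks into one line each with a count. Both inferno binaries are installed by cargo install inferno.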

perf script | inferno-collapse-perf > out.folded

Generate the SVG.

inferno-flamegraph < out.folded > flamegraph.svg

This manual workflow gives you access to every perf flag. You can filter by PID, use different sampling methods, or capture hardware events like cache misses.
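
The collapse step in the pipeline above is conceptually small. The sketch below shows the idea on hypothetical in-memory samples. It is not how inferno-collapse-perf is implemented, and it skips the parsing of perf script output entirely.

```rust
use std::collections::BTreeMap;

/// Collapses raw stack samples (each a root-to-leaf list of frames)
/// into folded lines of the form `frame;frame;frame count`.
fn collapse(samples: &[Vec<&str>]) -> Vec<String> {
    let mut counts: BTreeMap<String, u64> = BTreeMap::new();
    for stack in samples {
        // Identical stacks share one key, so their samples add up.
        *counts.entry(stack.join(";")).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .map(|(stack, n)| format!("{stack} {n}"))
        .collect()
}

fn main() {
    // Hypothetical samples; a real run would have thousands of these.
    let samples = vec![
        vec!["main", "parse", "alloc"],
        vec!["main", "render"],
        vec!["main", "parse", "alloc"],
        vec!["main", "render"],
        vec!["main", "render"],
    ];
    for line in collapse(&samples) {
        println!("{line}");
    }
}
```

Five samples fold into two lines, one per distinct stack. The folded file is plain text, so you can grep it or edit it before rendering, for example to drop stacks you do not care about.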

Trust the sampling. 99 samples per second is enough to find the bottleneck in any run that lasts more than a few seconds.
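
A quick back-of-the-envelope calculation shows why a low frequency works for long runs and when it stops working. The run lengths and CPU shares below are hypothetical, and expected_samples is an illustrative helper, not part of any tool.

```rust
/// Expected number of samples that land in a function, given the run's
/// on-CPU duration, the sampling frequency, and the function's share of CPU time.
fn expected_samples(cpu_seconds: f64, freq_hz: f64, share: f64) -> f64 {
    cpu_seconds * freq_hz * share
}

fn main() {
    let freq_hz = 99.0;
    // Hypothetical runs: a 10-second and a half-second workload.
    for cpu_seconds in [10.0, 0.5] {
        for share in [0.40, 0.01] {
            let n = expected_samples(cpu_seconds, freq_hz, share);
            println!(
                "{cpu_seconds:>4}s run, {:>2.0}% function: about {n:.1} samples",
                share * 100.0
            );
        }
    }
}
```

A function that takes 40% of a 10-second run collects roughly 396 samples, which is plenty. A function that takes 1% of a half-second run is expected to collect less than one sample, so for short programs either raise the frequency or make the workload run longer.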

Digging deeper with perf annotate

When a block is wide, you need to know why. perf annotate lets you drill down into the assembly code. Run perf annotate after recording. It shows the disassembly annotated with the share of samples that hit each instruction. You can see which instructions are hot, give or take a few instructions of sampling skid.

This is useful when the flamegraph points to a function, but you don't know which part of the function is slow. You might find a loop unrolling issue or a bad branch prediction. perf annotate bridges the gap between the high-level graph and the low-level machine code.

Use perf annotate when the flamegraph identifies a hot function and you need to understand the assembly-level cause.

Sampling vs tracing

Flamegraphs use sampling. The profiler interrupts the program at regular intervals and records the stack. This has low overhead. It estimates CPU time well once enough samples accumulate. It does not capture every function call. If a function runs for a few microseconds, the sampler might miss it entirely.

Tracing tools record every event. They have higher overhead. They capture timing for short functions. Use sampling for finding hotspots. Use tracing for understanding latency of individual operations or for debugging race conditions. Flamegraphs are the first step. If you need microsecond precision, switch to a tracer.
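
The blind spot of sampling can be shown with a small deterministic simulation. The timeline, the function names, and the Span type are all hypothetical. The sketch models a sampler as a probe that fires at a fixed period and asks which function is running.

```rust
/// One on-CPU interval of a function: [start_us, end_us) in microseconds.
struct Span {
    name: &'static str,
    start_us: u64,
    end_us: u64,
}

/// Samples a timeline every `period_us` microseconds and returns the
/// name of the function running at each sample point.
fn sample(timeline: &[Span], period_us: u64, total_us: u64) -> Vec<&'static str> {
    let mut hits = Vec::new();
    let mut t = period_us;
    while t < total_us {
        if let Some(span) = timeline.iter().find(|s| s.start_us <= t && t < s.end_us) {
            hits.push(span.name);
        }
        t += period_us;
    }
    hits
}

fn main() {
    // Hypothetical timeline: a 5-microsecond helper between two long phases.
    let timeline = [
        Span { name: "parse", start_us: 0, end_us: 40_000 },
        Span { name: "tiny_helper", start_us: 40_000, end_us: 40_005 },
        Span { name: "render", start_us: 40_005, end_us: 100_000 },
    ];
    // 99 Hz is roughly one sample every 10_101 microseconds.
    let hits = sample(&timeline, 10_101, 100_000);
    println!("{hits:?}");
}
```

The nine sample points land in parse three times and in render six times. tiny_helper never appears, even though it ran. A tracer would have recorded its entry and exit, at the cost of instrumenting every call.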

Pitfalls

Profiling Rust code has specific traps.

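Inlining hides functions. The compiler inlines small functions to improve performance. When a function is inlined, it no longer has a stack frame of its own, so a frame-based unwinder sees the caller, not the callee. The time spent in the inlined code adds to the caller's block. With debug info available, perf script can often reconstruct inlined frames, but do not count on every one being recovered.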

If you see a wide block in a function that looks simple, check if it inlined heavy work. You can disable inlining for testing with #[inline(never)], or use perf annotate to see the assembly and identify the hot instructions.
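
A minimal sketch of the #[inline(never)] technique follows. The checksum and process functions are hypothetical stand-ins for a small hot helper and its caller. The attribute is meant as a temporary diagnostic, since blocking inlining can itself change performance.

```rust
/// Hypothetical hot helper. `#[inline(never)]` keeps it as a separate
/// function so a sampling profiler can attribute time to it by name.
#[inline(never)]
fn checksum(data: &[u8]) -> u32 {
    data.iter()
        .fold(0u32, |acc, &b| acc.wrapping_mul(31).wrapping_add(u32::from(b)))
}

/// Caller that would otherwise absorb `checksum` after inlining.
fn process(chunks: &[Vec<u8>]) -> u32 {
    chunks
        .iter()
        .fold(0u32, |acc, chunk| acc ^ checksum(chunk))
}

fn main() {
    let chunks: Vec<Vec<u8>> = (0..1_000u32)
        .map(|i| vec![(i % 251) as u8; 4096])
        .collect();
    // Print the result so the work is not optimized away.
    println!("{}", process(&chunks));
}
```

With the attribute in place, checksum should appear as its own block above process. Remove the attribute once you have the answer, then profile again to confirm the final numbers.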

perf_event_paranoid blocks sampling. Linux restricts access to performance counters for security. If perf fails with "Permission denied", you need to adjust the kernel setting or run with elevated privileges.

sudo sysctl -w kernel.perf_event_paranoid=1

Setting the value to 1 allows users to sample their own processes. Setting it to -1 removes restrictions. Do not leave it at -1 on a shared machine.
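
Before changing the setting, it helps to check what it currently is. The sketch below reads the value from /proc and describes it. The parse_level and describe helpers are hypothetical, and the descriptions paraphrase the upstream kernel documentation. Some distributions add stricter levels above 2, which the sketch only flags.

```rust
use std::fs;

/// Parses the contents of /proc/sys/kernel/perf_event_paranoid.
fn parse_level(text: &str) -> Option<i32> {
    text.trim().parse().ok()
}

/// Describes a level following the upstream kernel documentation.
/// Some distributions patch in stricter values above 2.
fn describe(level: i32) -> &'static str {
    match level {
        i32::MIN..=-1 => "no restrictions",
        0 => "user and kernel profiling allowed, raw tracepoints restricted",
        1 => "user and kernel profiling allowed, CPU-wide events restricted",
        2 => "user-space profiling only",
        _ => "stricter than upstream levels; unprivileged perf is likely blocked",
    }
}

fn main() {
    let path = "/proc/sys/kernel/perf_event_paranoid";
    match fs::read_to_string(path).ok().as_deref().and_then(parse_level) {
        Some(level) => println!("perf_event_paranoid = {level}: {}", describe(level)),
        None => println!("could not read {path} (not Linux, or no access)"),
    }
}
```

Reading the file needs no special privileges on a typical Linux system. Only writing a new value through sysctl requires root.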

Async runtimes distort stacks. If you use tokio or async-std, the stack traces show the runtime's task scheduler, not your async functions. The real work is hidden behind block_on or poll.

To get better stacks in async code, either build with frame pointers (for example with -C force-frame-pointers=yes in RUSTFLAGS) or run perf with --call-graph dwarf. Some runtimes provide profiling features to expose the async call chain.

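Self-profiling with measureme. The measureme project supports profiling the Rust compiler itself via rustc -Zself-profile, which requires a nightly toolchain. The compiler writes its profile data to a file named after the crate and the compiler's process ID (.mm_profdata in current toolchains; older ones wrote .events files). measureme's flamegraph tool, which uses inferno internally, renders that data as an SVG. This workflow is for compiler developers. It is not for application profiling.

rustc +nightly -Zself-profile your_code.rs
../measureme/target/release/flamegraph your_code-<pid>.mm_profdata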

Use measureme only when you are profiling the compiler or a tool that uses the self-profile infrastructure. For application code, stick to perf.

Treat the flamegraph as a map. If the map is wrong, you will optimize the wrong place.

Decision matrix

Choose the right tool for your profiling goal.

Use cargo flamegraph when you need a quick overview of a binary's hotspots. It handles the setup and rendering with a single command. Use cargo flamegraph for iterative optimization where you run the profiler, fix code, and run again.

Use perf directly when you need control over sampling frequency, filtering, or hardware events. Use perf when cargo-flamegraph lacks a flag you need, such as profiling a specific thread or capturing cache misses.

Use inferno when you want to render folded stacks yourself. It replaces the Perl-based flamegraph.pl, installs with cargo, and keeps the whole pipeline inside the Rust ecosystem.

Use measureme when you are profiling the Rust compiler or a project that uses the self-profile infrastructure. Its tools read the profile data that rustc -Zself-profile writes.

Use tracy or puffin when you need frame-based profiling for games or real-time applications. Use these tools when you want to follow custom instrumented scopes frame by frame. Tracy in particular can also track allocations and GPU work alongside CPU time.

Reach for perf and inferno for general application profiling. The combination is standard, reliable, and covers most CPU-bound use cases.

Where to go next

A flamegraph shows where a program spends its CPU time, so you can find the slow code before you try to fix it. Start with cargo flamegraph for a quick overview. Drop down to perf and inferno when you need more control over what gets recorded. When the graph points at a function but does not explain the cost, follow up with perf annotate, and when the problem is latency or waiting instead of CPU time, switch to a tracer.