How to Profile and Debug Async Performance Issues

The async black box

You launch your async service. It handles a thousand requests smoothly. Then you add a database query, and latency spikes. The CPU usage stays low, but requests pile up. You stare at the code and see nothing wrong. The async runtime is doing its job, but something is blocking the event loop, or tasks are waiting on each other in a deadlock. You need to see inside the black box.

Profiling synchronous code is straightforward. You trace the call stack, find the hot function, and optimize it. Async code breaks this model. A single thread runs many tasks, pausing and resuming them at .await points. The call stack is fragmented across time. A standard profiler attributes time to the thread and its physical stack, so it shows runtime frames wrapped around fragments of your code and says nothing about the time a task spent suspended. You need tools that understand the async state machine and can attribute time to the logical task, not just the thread.

Why standard profilers fail

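When you write an async fn, the compiler transforms it into a state machine. Each .await point becomes a state transition. Local variables that live across an .await are stored inside the state machine value rather than on the thread's stack; that value only ends up on the heap once the future is boxed or spawned as a task. At each suspension the function yields control back to the executor. When the task resumes, the state machine picks up from its saved state and continues.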
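To make the transformation concrete, here is a minimal hand-written sketch of what such a state machine could look like for a function with a single await point. The Double type, the NoopWake waker, and the run_double helper are illustrative names, not compiler output; the real generated type is anonymous and more involved.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical hand-written stand-in for an `async fn` with one await point.
/// The compiler's real output differs; only the overall shape is the same.
enum Double {
    /// Not polled yet; holds the argument.
    Start { x: u32 },
    /// Suspended at the await point; `x` is stored in the enum itself.
    Suspended { x: u32 },
    /// Completed; polling again is a bug.
    Done,
}

impl Future for Double {
    type Output = u32;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        // `Double` holds only plain data, so it is `Unpin` and `get_mut` is allowed.
        let this = self.get_mut();
        match *this {
            Double::Start { x } => {
                // State transition: keep `x`, ask to be polled again, then yield.
                *this = Double::Suspended { x };
                cx.waker().wake_by_ref();
                Poll::Pending
            }
            Double::Suspended { x } => {
                // Resumed: the saved `x` is read back out of the state machine.
                *this = Double::Done;
                Poll::Ready(x * 2)
            }
            Double::Done => panic!("polled after completion"),
        }
    }
}

/// Waker that does nothing; enough for polling a future by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls `Double` until it finishes and returns (number of polls, result).
fn run_double(x: u32) -> (u32, u32) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Double::Start { x };
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(value) = Pin::new(&mut fut).poll(&mut cx) {
            return (polls, value);
        }
    }
}

fn main() {
    let (polls, value) = run_double(21);
    println!("finished after {polls} polls with {value}");
}
```

Between the first and second poll nothing of Double is on the thread's stack. Its whole state is the enum value, which is why a stack sampler has nothing to attribute to a suspended task.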

This transformation hides the work from standard profilers. Tools like perf sample the CPU stack. When a task is suspended, its stack frame is gone. The profiler sees the runtime's scheduler, not your code. When the task runs, the profiler sees fragments of the state machine, often buried under waker and poll frames. You get a flame graph full of tokio::runtime::... and core::future::... with no clear path to your business logic.
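The fragmentation is easier to see with a toy executor. The following sketch is an assumption-laden miniature, not how Tokio schedules tasks: run_round_robin, worker, and YieldOnce are hypothetical names, and the executor simply polls each task in turn on one thread.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Future that returns `Pending` exactly once, standing in for a real wait.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            return Poll::Ready(());
        }
        self.0 = true;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Hypothetical task: records a step, yields, and repeats.
async fn worker(name: &'static str, log: Rc<RefCell<Vec<String>>>) {
    for step in 0..2 {
        log.borrow_mut().push(format!("{name}{step}"));
        YieldOnce(false).await;
    }
}

/// Waker that does nothing; the toy executor re-polls everything anyway.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls every task in turn on the current thread until all are done.
fn run_round_robin(tasks: Vec<Pin<Box<dyn Future<Output = ()>>>>) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut queue: VecDeque<_> = tasks.into();
    while let Some(mut task) = queue.pop_front() {
        // During this call the thread's stack is: executor -> poll -> task body.
        // As soon as the task yields, those frames are gone again.
        if task.as_mut().poll(&mut cx).is_pending() {
            queue.push_back(task);
        }
    }
}

/// Runs two workers and returns the order in which their steps executed.
fn interleaving() -> Vec<String> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let mut tasks: Vec<Pin<Box<dyn Future<Output = ()>>>> = Vec::new();
    tasks.push(Box::pin(worker("a", log.clone())));
    tasks.push(Box::pin(worker("b", log.clone())));
    run_round_robin(tasks);
    let order = log.borrow().clone();
    order
}

fn main() {
    println!("{:?}", interleaving());
}
```

The two workers alternate step by step even though each is written as a plain loop. A sampler that fires during any one poll sees only the executor and the task currently being polled, never the logical sequence of either task.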

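The solution is to pair tools that cover both sides. cargo flamegraph still captures on-CPU time accurately: whenever a task is being polled, its code sits on the thread's stack nested inside the runtime's poll frames, so the work is visible once you know to look past those frames. tokio-console instruments the runtime to track task states directly, which covers the time tasks spend suspended. Together they bridge the gap between the fragmented stack and the logical flow of your application.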

Flamegraphs for async

cargo flamegraph is the standard tool for CPU profiling in Rust. On Linux it wraps perf and renders the sampled stacks as a flame graph. It profiles an async binary the same way as any other binary.

cargo install flamegraph
cargo flamegraph --bin your_async_app

There is no async-specific switch to enable. What the tool does need is debug symbols in your binary. If you strip symbols, the flame graph will show raw addresses or nothing useful. You must build with debug info enabled, even in release mode.

[profile.release]
# Debug info is required for flamegraphs to show function names.
debug = true
# Stripping symbols destroys profiling data.
strip = false

Run the flamegraph command against your binary. perf samples the stack at a fixed frequency, on the order of a thousand times per second. Each sample captures the call stack of whatever the thread is executing at that instant. When a task is being polled, that stack runs from the runtime's worker loop through the task's poll call into your code. Each nested .await adds another poll frame, and async fn bodies typically appear as {{closure}} frames under the function's name. Time a task spends suspended produces no samples at all.

Look for wide bars in the graph. These represent functions that consume the most CPU time. In async code, you'll often see waker frames or poll frames. These are the runtime overhead. If your code dominates the graph, you have a CPU bottleneck. If the runtime dominates, you might have too many tasks or excessive polling.

A flame graph only counts on-CPU time, so it misleads if you ignore the async context. Check how much of the width belongs to runtime frames, not just to your code.

Real-time task inspection

Flamegraphs show CPU usage. They don't show tasks that are waiting. If your application hangs because a task is blocked on a lock or a channel, the CPU is idle, and the flame graph is empty. You need a tool that inspects task states.

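tokio-console provides a real-time view of your async tasks. It lists each task with its current state (running, scheduled, or idle), how often it has been polled, and how much time it has spent busy versus waiting. A separate resources view lists the timers and synchronization primitives the runtime is tracking, which helps you work out what a stuck task is waiting on.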

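Add the console-subscriber crate to your dependencies. This crate provides a layer for the tracing ecosystem that captures task events. The console relies on Tokio's tracing feature, which is not part of full; listing it explicitly documents the requirement.

[dependencies]
tokio = { version = "1", features = ["full", "tracing"] }
console-subscriber = "0.2"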

Instrument your application with the console layer. This must happen before you spawn any tasks. The layer captures task creation, suspension, and resumption.

use tokio::time::{sleep, Duration};

/// Sets up the Tokio console subscriber.
/// This layer captures task events for the console UI.
#[tokio::main]
async fn main() {
    // Install the console layer with its defaults
    // (the server listens on 127.0.0.1:6669).
    // This must happen before spawning tasks.
    console_subscriber::init();

    // Now tasks are tracked.
    tokio::spawn(async {
        sleep(Duration::from_secs(5)).await;
    });

    // Keep the app running.
    sleep(Duration::from_secs(10)).await;
}

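Build and run your application with the tokio_unstable cfg flag set. Tokio only emits the task instrumentation the console relies on when this flag is present. Then connect with the tokio-console CLI tool.

RUSTFLAGS="--cfg tokio_unstable" cargo run --bin your_async_app

In a separate terminal, run tokio-console (install it with cargo install tokio-console if you don't have it yet). You'll see a list of tasks. It is a terminal UI, so select a task with the arrow keys and press Enter to open its details. A task that has been idle for a long time with no recent wakeups is your first suspect for a hang.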

The console has overhead. It captures every task event. Use it for debugging, not for production performance measurement. The community convention is to guard console instrumentation behind a feature flag.

#[cfg(feature = "console")]
use console_subscriber::ConsoleLayer;

#[tokio::main]
async fn main() {
    #[cfg(feature = "console")]
    ConsoleLayer::builder().init();

    // ...
}

Timing and build analysis

Sometimes the bottleneck isn't runtime performance. It's build time. Long compile times slow down iteration. Cargo can profile the build itself with the --timings flag, which records how long each crate takes to compile.

cargo build --release --bin your_async_app --timings

The build writes an HTML report under target/cargo-timings/. The report shows a timeline of the build, so you can see which crates are the slowest and where parallelism stalls. If a single crate dominates the build time, you might need to optimize its code or split it into smaller units.

For runtime timing of specific functions, use tracing spans. The tracing crate allows you to instrument your code with spans. You can collect timing data from these spans and analyze it later.

use tracing::{info, instrument};

/// Processes a request.
/// The span captures the duration of the function.
/// `Database` stands for your application's own database handle type.
#[instrument(skip(db))]
async fn handle_request(db: &Database) {
    // The span starts here.
    let result = db.query().await;
    // The span ends here.
    info!("Result: {:?}", result);
}

The #[instrument] attribute adds a span automatically. You can configure a subscriber to record span durations. This gives you precise timing for your async functions without the overhead of full profiling.
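Span timing for async code usually separates time spent inside poll (busy) from time spent suspended (idle). The following std-only sketch shows that split without any tracing machinery. Timed, Sleep, block_on, and measure_sleep are hypothetical helpers written for this illustration; they are not part of tracing or Tokio.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};
use std::time::{Duration, Instant};

/// Wrapper that records time spent inside `poll` (busy) and total elapsed time.
struct Timed<F> {
    inner: Pin<Box<F>>,
    first_poll: Option<Instant>,
    busy: Duration,
}

impl<F: Future> Future for Timed<F> {
    /// (inner output, busy time, total time from first poll to completion)
    type Output = (F::Output, Duration, Duration);

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // All fields are `Unpin` because the inner future is boxed.
        let this = self.get_mut();
        let poll_start = Instant::now();
        let first_poll = *this.first_poll.get_or_insert(poll_start);
        let result = this.inner.as_mut().poll(cx);
        this.busy += poll_start.elapsed();
        result.map(|out| (out, this.busy, first_poll.elapsed()))
    }
}

/// Hypothetical timer: a helper thread wakes the task once the delay has passed.
struct Sleep {
    deadline: Instant,
    timer_started: bool,
}

impl Future for Sleep {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let now = Instant::now();
        if now >= self.deadline {
            return Poll::Ready(());
        }
        if !self.timer_started {
            self.timer_started = true;
            let (waker, wait) = (cx.waker().clone(), self.deadline - now);
            thread::spawn(move || {
                thread::sleep(wait);
                waker.wake();
            });
        }
        Poll::Pending
    }
}

/// Waker that unparks the thread running `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal single-future executor: poll, park until woken, repeat.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => thread::park(),
        }
    }
}

/// Waits roughly 50 ms and reports (busy, total) as measured by `Timed`.
fn measure_sleep() -> (Duration, Duration) {
    let deadline = Instant::now() + Duration::from_millis(50);
    let sleep = Sleep { deadline, timer_started: false };
    let timed = Timed { inner: Box::pin(sleep), first_poll: None, busy: Duration::ZERO };
    let ((), busy, total) = block_on(timed);
    (busy, total)
}

fn main() {
    let (busy, total) = measure_sleep();
    println!("busy: {busy:?}, total: {total:?}");
}
```

A large gap between total and busy means the function is mostly waiting, which a CPU profiler cannot see. A busy value close to total points at CPU-bound or blocking work inside the function.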

Pitfalls and configuration

Profiling async code introduces several pitfalls. The first is permissions. Many Linux distributions restrict perf for unprivileged users through the kernel.perf_event_paranoid setting. If you run cargo flamegraph and get a permission error, lower the setting.

sudo sysctl -w kernel.perf_event_paranoid=1

A value of 1 lets unprivileged users profile their own processes. The change lasts until the next reboot. Without it, perf typically refuses to record, and you get an error or an empty graph instead of a profile.

The second pitfall is missing instrumentation. tokio-console needs three things: the console-subscriber crate initialized in your binary, Tokio's tracing feature, and a build with --cfg tokio_unstable. If any of them is missing, the console either fails to connect or connects and shows an empty task list. Check your Cargo.toml and your RUSTFLAGS.

The third pitfall is overhead. Profiling tools add overhead. cargo flamegraph uses sampling, which has low overhead. tokio-console captures every event, which can slow down your application significantly. Use tokio-console only when debugging hangs or deadlocks. Don't use it for benchmarking.

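The fourth pitfall is blocking code. If you call a blocking function inside an async task, you stall the worker thread and every other task scheduled on it. What the profiler shows depends on the kind of blocking: a CPU-heavy call appears as a wide bar, while a call that blocks on I/O or a lock leaves the thread off-CPU and barely appears at all. Either way, the real issue is that you're blocking the async thread. Use spawn_blocking for blocking operations. This moves the work to a separate thread pool.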

use tokio::task;

/// Calls a blocking function safely.
/// This prevents blocking the async executor.
async fn safe_blocking_call() {
    let result = task::spawn_blocking(|| {
        // Blocking code here.
        heavy_computation()
    })
    .await
    .unwrap();
}

If you see a thread blocked in the profiler, check for blocking calls. Isolate them with spawn_blocking. The profiler will thank you.
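One way to confirm a suspected blocking call is to measure how long a single poll takes, since a well-behaved task returns from poll quickly. The sketch below does this with the standard library only. PollWatch, blocking_handler, and longest_poll_of_blocking_handler are hypothetical names, and thread::sleep stands in for whatever blocking call you suspect.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread;
use std::time::{Duration, Instant};

/// Wrapper that remembers the longest single `poll` call of the inner future.
struct PollWatch<F> {
    inner: Pin<Box<F>>,
    longest_poll: Duration,
}

impl<F: Future> Future for PollWatch<F> {
    /// (inner output, longest single poll observed)
    type Output = (F::Output, Duration);

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // All fields are `Unpin` because the inner future is boxed.
        let this = self.get_mut();
        let start = Instant::now();
        let result = this.inner.as_mut().poll(cx);
        this.longest_poll = this.longest_poll.max(start.elapsed());
        result.map(|out| (out, this.longest_poll))
    }
}

/// Hypothetical handler that blocks the thread instead of awaiting.
async fn blocking_handler() -> u32 {
    // Stand-in for a blocking call such as synchronous I/O.
    thread::sleep(Duration::from_millis(30));
    7
}

/// Waker that does nothing; the handler completes in a single poll.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls the watched handler once and returns (output, longest poll).
fn longest_poll_of_blocking_handler() -> (u32, Duration) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut watched = PollWatch {
        inner: Box::pin(blocking_handler()),
        longest_poll: Duration::ZERO,
    };
    match Pin::new(&mut watched).poll(&mut cx) {
        Poll::Ready(out) => out,
        Poll::Pending => unreachable!("the handler never awaits"),
    }
}

fn main() {
    let (value, longest) = longest_poll_of_blocking_handler();
    println!("value {value}, longest poll {longest:?}");
}
```

A poll that lasts tens of milliseconds is a strong hint that something inside it blocks. The threshold that counts as too long is a judgment call for your workload.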

Decision matrix

Use cargo flamegraph when you need to find CPU hotspots in your async application. It generates a visual stack trace showing where cycles are burned. Use cargo build --timings when you want a high-level overview of build times and crate compilation durations. It needs no extra tooling and the report is easy to scan. Use tokio-console when you suspect deadlocks, tasks that never get woken, or excessive task spawning. It shows task states in real time. For production environments where perf is unavailable or too heavy, look for an in-process sampling profiler; async-profiler, despite the name, targets the JVM and does not profile Rust binaries. Use tracing with a subscriber when you need structured logs correlated with async spans for debugging logic flow rather than raw performance metrics.

Pick the tool that matches the symptom. CPU spike? Flamegraph. Hang? Console. Slow build? Timing.

Where to go next

Start from a symptom you can reproduce. Capture a flame graph while the service is under load to see where CPU time goes, attach tokio-console when tasks appear stuck, and add tracing spans around the calls you suspect so their durations show up in your logs. Once one tool points at a slow path, fix it and measure again before reaching for the next tool.