The myth of free concurrency
You rewrite a synchronous HTTP server to handle ten thousand concurrent connections. You sprinkle async and await across your handlers. You expect the server to magically scale. Instead, your CPU usage spikes, latency jumps, and your memory footprint doubles. You assume async is broken. It is not. You just confused a language feature with a runtime strategy.
Rust promises zero-cost abstractions. That promise holds for async and await in a specific sense: the keywords themselves do not allocate memory, spawn threads, or start a scheduler. They are compile-time syntax that the compiler lowers into ordinary types and functions. The actual cost comes from what the compiler generates and how the runtime executes it. Understanding that boundary separates developers who ship fast systems from developers who fight invisible bottlenecks.
How the compiler actually builds it
When you write an async fn, the compiler does not generate a normal function. It generates a state machine. Think of a paused video game. The game records exactly where you were, hands the machine back to you, and waits for you to press play. When you press play, it restores that state and continues from the same point.
Rust does the same thing with await. The compiler rewrites your function into an anonymous type that implements the Future trait. That type holds the local variables that must survive a pause, the current execution point, and any pending operations. When you reach an await and the awaited operation is not ready, the future's poll method returns Poll::Pending. The runtime keeps the future, moves on to other work, and polls it again when the underlying operation signals progress.
/// Fetches data from a remote endpoint without blocking the thread.
/// Assumes the reqwest crate; other async HTTP clients have the same shape.
async fn fetch_data(url: &str) -> Result<String, reqwest::Error> {
    // The compiler turns this into a state machine holding `url`
    // and a field tracking which await point we are at
    let response = reqwest::get(url).await?;
    // Each await point becomes a state transition
    // The compiler generates match arms for each state
    response.text().await
}
The generated type is only as large as it needs to be. It stores the variables that stay alive across an await, plus a small tag recording the current state, so a large buffer held across an await makes a large future. The optimizer can inline simple states and remove dead branches, and a function with a single await often compiles down to very little code. That is the zero-cost claim: you could not hand-write a meaningfully cheaper state machine.
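To make the transformation concrete, here is a hand-written approximation of the kind of state machine the compiler might produce for a function with one await. It is a sketch, not real compiler output: the names ReadyOnSecondPoll, DoubleLater, NoopWaker, and poll_once are invented for illustration, and the real generated type is anonymous and laid out differently.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Stand-in for an I/O operation: pending on the first poll, ready on the second.
struct ReadyOnSecondPoll {
    polled: bool,
    value: u32,
}

impl Future for ReadyOnSecondPoll {
    type Output = u32;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        if self.polled {
            return Poll::Ready(self.value);
        }
        self.polled = true;
        cx.waker().wake_by_ref(); // ask to be polled again
        Poll::Pending
    }
}

/// Roughly what an `async fn double_later(x: u32) -> u32` could become when
/// its body is `ReadyOnSecondPoll { polled: false, value: x }.await * 2`.
enum DoubleLater {
    Start { x: u32 },
    Waiting { inner: ReadyOnSecondPoll },
    Done,
}

impl Future for DoubleLater {
    type Output = u32;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        // Every field is `Unpin`, so taking a plain `&mut` is allowed here.
        let this = self.get_mut();
        // Code before the first await: build the inner future.
        if let DoubleLater::Start { x } = *this {
            *this = DoubleLater::Waiting {
                inner: ReadyOnSecondPoll { polled: false, value: x },
            };
        }
        // The await itself: poll the inner future and propagate `Pending`.
        let result = match &mut *this {
            DoubleLater::Waiting { inner } => Pin::new(inner).poll(cx),
            _ => panic!("future polled after completion"),
        };
        if result.is_ready() {
            *this = DoubleLater::Done;
        }
        // Code after the await.
        result.map(|v| v * 2)
    }
}

/// Waker that does nothing; enough to drive a future by hand.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Polls an `Unpin` future exactly once.
fn poll_once<F: Future + Unpin>(fut: &mut F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    Pin::new(fut).poll(&mut cx)
}
```

Polling DoubleLater::Start { x: 21 } with poll_once yields Poll::Pending the first time and Poll::Ready(42) the second. Each enum variant corresponds to one resume point, which is the state tracking described above.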
The Future trait itself is minimal. It defines a single method: poll. The runtime calls poll when a task is first scheduled and again each time the task is woken, not in a busy loop. If the work is done, poll returns Poll::Ready(value). If the work is still in progress, the future stores the Waker from its Context and returns Poll::Pending. The Waker is a handle that the I/O driver or timer uses to tell the executor the task is worth polling again. No thread blocks at the await point, and no OS context switch is needed there. The worker thread moves on to other tasks.
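The poll and Waker handshake fits in a few dozen lines of standard-library code. The sketch below is a toy executor, not how Tokio or any production runtime is built: ThreadWaker, YieldOnce, and block_on are hypothetical names, and the executor drives a single future by parking the current thread between polls.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the executor by unparking the thread blocked in `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// A future that returns `Pending` exactly once, then `Ready`.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            return Poll::Ready(());
        }
        self.yielded = true;
        // A real I/O source would store this waker and call it when the
        // socket or timer is ready. Here we wake immediately.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Drives one future to completion and reports how many polls it took.
fn block_on<F: Future>(fut: F) -> (F::Output, usize) {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        polls += 1;
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return (value, polls),
            // Sleep until some waker calls `unpark`.
            Poll::Pending => thread::park(),
        }
    }
}
```

Running block_on on an async block that awaits YieldOnce and then returns a value takes two polls: one that returns Pending and one, after the wake, that returns Ready. Between the two the thread is parked rather than spinning.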
Treat the state machine as a blueprint. The compiler builds it exactly once. The runtime executes it as many times as you spawn.
Where the real cost lives
The keywords are free. The ecosystem is not. Every async program runs on top of an executor. Tokio, async-std, and smol all provide schedulers, I/O drivers, and task managers. Those components have measurable costs.
Task allocation is the first expense. When you call tokio::spawn, the runtime moves the future to the heap together with its task bookkeeping. That allocation is cheap, on the order of hundreds of nanoseconds on typical hardware, but the exact figure depends on the allocator and the size of the future. If you spawn ten thousand tasks in a tight loop, the cost adds up and shows in a profile. You pay for every independent unit of work you hand to the scheduler.
Scheduling overhead is the second expense. A multi-threaded executor such as Tokio's keeps a run queue per worker thread and lets idle workers steal from busy ones. When a task is woken, the scheduler decides which thread picks it up next. That decision involves atomic counters, lock-free queues, and cache lines bouncing between cores. On a single core, the overhead is negligible. On a thirty-two core machine running a million micro-tasks, the scheduler can become the bottleneck, and CPU usage climbs without useful work getting done.
Memory layout is the third expense. Synchronous code keeps its locals on the stack, in contiguous memory the CPU prefetches efficiently. A future is an ordinary value: the compiler never boxes it, and awaiting a nested future embeds it in the parent's state machine. The heap enters when you spawn, call Box::pin, or use a boxed trait object, and each spawned task then lives in its own allocation. Polling many small tasks means more pointer indirection and more cache misses. A tight loop that processes ten thousand items synchronously will often beat an async version that spawns ten thousand tasks, simply because the synchronous version stays in L1 cache.
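A future is a plain value with a size known at compile time, and it reaches the heap only when something boxes it. The sketch below measures that with std::mem::size_of_val. The function names are hypothetical, and exact sizes depend on the compiler version, so the code reports them instead of assuming specific numbers.

```rust
use std::future::ready;
use std::mem::size_of_val;

/// `buf` is read after the await, so it must be stored inside the future.
async fn holds_buffer_across_await(seed: u8) -> u8 {
    let buf = [seed; 1024];
    ready(()).await;
    buf[usize::from(seed)]
}

/// `buf` goes out of scope before the await, so only `first` is carried over.
async fn drops_buffer_before_await(seed: u8) -> u8 {
    let first = {
        let buf = [seed; 1024];
        buf[usize::from(seed)]
    };
    ready(()).await;
    first
}

/// Returns (size of the large future, size of the small future,
/// size of the handle left on the stack after boxing the large one).
fn future_sizes() -> (usize, usize, usize) {
    // Calling an async fn only builds the state machine; nothing runs yet.
    let big = holds_buffer_across_await(7);
    let small = drops_buffer_before_await(7);
    let big_size = size_of_val(&big);
    let small_size = size_of_val(&small);
    // `Box::pin` moves the state machine to the heap; the handle is one pointer.
    let boxed = Box::pin(big);
    (big_size, small_size, size_of_val(&boxed))
}
```

The first future is at least 1024 bytes because the buffer lives across the await, the second is far smaller, and the boxed handle is a single pointer. Scoping large locals so they end before an await is a cheap way to shrink a task.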
Convention aside: the Rust community treats spawn as a commitment to concurrency, not a free function call. If you do not need independent scheduling, do not spawn. Combine futures with join! or select! instead; both poll their branches concurrently inside the current task. Keep the work on the same logical task until you actually need to split it.
When async slows you down
Async shines when you wait. It struggles when you compute. The executor expects tasks to yield frequently. If a task runs for fifty milliseconds without hitting an await, it monopolizes its worker thread. Other tasks starve. Latency spikes. The system appears to hang.
/// Processes a batch of items with heavy CPU work
async fn process_batch(items: Vec<String>) -> Vec<String> {
    // This loop blocks the executor thread
    // No await points mean no yielding
    let mut results = Vec::with_capacity(items.len());
    for item in items {
        let heavy = compute_heavy(&item);
        results.push(heavy);
    }
    results
}
The compiler will not stop you. The code compiles cleanly. The runtime runs the whole loop on one worker thread, and every other task queued on that thread waits until it finishes. On a single-threaded runtime, your HTTP server stops responding entirely. The fix is not to add fake await points. The fix is to move CPU work to a dedicated thread pool, for example with tokio::task::spawn_blocking.
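The idea behind offloading can be sketched with the standard library alone: run the closure on another thread and hand back a future that completes when the result arrives. This is a simplified stand-in for what tokio::task::spawn_blocking provides, not its implementation. Tokio reuses a managed pool of threads, while this sketch starts a fresh thread per call. The names Shared, Offloaded, offload, ThreadWaker, and block_on are hypothetical.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// State shared between the worker thread and the awaiting task.
struct Shared<T> {
    result: Option<T>,
    waker: Option<Waker>,
}

/// Future returned by `offload`; resolves once the worker has finished.
struct Offloaded<T> {
    shared: Arc<Mutex<Shared<T>>>,
}

impl<T> Future for Offloaded<T> {
    type Output = T;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<T> {
        let mut shared = self.shared.lock().unwrap();
        if let Some(value) = shared.result.take() {
            return Poll::Ready(value);
        }
        // Not done yet: leave a waker so the worker can notify the executor.
        shared.waker = Some(cx.waker().clone());
        Poll::Pending
    }
}

/// Runs `work` on a new OS thread so the calling task can yield meanwhile.
fn offload<T, F>(work: F) -> Offloaded<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let shared = Arc::new(Mutex::new(Shared { result: None, waker: None }));
    let worker_shared = Arc::clone(&shared);
    thread::spawn(move || {
        let value = work(); // the CPU-heavy part runs off the executor thread
        let mut guard = worker_shared.lock().unwrap();
        guard.result = Some(value);
        if let Some(waker) = guard.waker.take() {
            waker.wake();
        }
    });
    Offloaded { shared }
}

/// Minimal executor for the demo: parks the thread between polls.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
        thread::park();
    }
}
```

Awaiting offload(|| expensive()) inside an async function keeps the executor thread free: the task returns Pending while the worker computes and is polled again only after the worker calls wake.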
Another common trap is ignoring trait bounds. An async block captures the variables it uses, and every value that stays alive across an await becomes part of the future. If one of them is a non-Send type like Rc<T> or a raw pointer, the future itself becomes non-Send. The compiler rejects it when you pass it to a multi-threaded spawn, usually with E0277 or the message "future cannot be sent between threads safely". The primary error points at the spawn call, and the attached notes identify the offending value and the await it is held across. Replace Rc<T> with Arc<T>, drop the value before the await, or keep the task on the current thread using spawn_local.
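The Send rule can be checked without any runtime. In the sketch below, require_send is a hypothetical helper carrying the same Send bound a multi-threaded spawn imposes, and poll_now and NoopWaker exist only to run the futures. The three async functions differ only in what stays alive across the await.

```rust
use std::future::{ready, Future};
use std::pin::pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Accepts only futures that may move between threads.
fn require_send<F: Future + Send>(fut: F) -> F {
    fut
}

/// `Arc` is `Send`, so holding it across the await is fine.
async fn with_arc() -> u32 {
    let shared = Arc::new(7u32);
    ready(()).await;
    *shared
}

/// `Rc` is alive across the await, which makes this future non-`Send`.
async fn with_rc() -> u32 {
    let local = Rc::new(7u32);
    ready(()).await;
    *local
}

/// The `Rc` ends before the await, so the future is `Send` again.
async fn with_rc_scoped() -> u32 {
    let value = {
        let local = Rc::new(7u32);
        *local
    };
    ready(()).await;
    value
}

/// Waker that does nothing; these futures finish on the first poll.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Polls a future once on the current thread.
fn poll_now<F: Future>(fut: F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    fut.as_mut().poll(&mut cx)
}

// Does not compile if uncommented, because the future holds an `Rc`:
// fn rejected() { let _ = require_send(with_rc()); }
```

with_rc still runs on the current thread through poll_now, which mirrors the spawn_local option: a non-Send future is valid as long as it never crosses a thread boundary.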
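Convention aside: name your spawned tasks. tokio::spawn(async { ... }) creates an anonymous task that is hard to tell apart in tokio-console or a profiler. tokio::task::Builder::new().name("worker-1") attaches a name, but at the time of writing it requires building Tokio with the tokio_unstable cfg flag and the tracing feature. On stable builds, wrap the future in a span using the tracing crate's Instrument::instrument, which labels its log and trace output. Name your tasks before you need to debug them.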
Pinning is the silent gotcha. A future must be pinned before it is polled because its state machine may hold references into its own fields, and moving it would invalidate them. The await keyword handles this for you. If you implement Future by hand, poll a future manually, or store futures in a collection, you have to pin them yourself. There is no dedicated error code for pinning mistakes; they usually surface as E0277 reporting that a type does not implement Unpin. Respect the pinning contract. Use Box::pin for heap pinning or the std::pin::pin! macro for stack pinning when you step outside the standard async fn path.
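The two common ways to pin by hand look like this. It is a minimal sketch: BoxedJob, make_jobs, poll_all, poll_on_stack, and NoopWaker are hypothetical names, and the futures are trivial so that a single poll completes them.

```rust
use std::future::{ready, Future};
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough for futures that finish immediately.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Heap-pinned, type-erased future: the usual shape for storing
/// different async blocks in one collection.
type BoxedJob = Pin<Box<dyn Future<Output = u32>>>;

fn make_jobs() -> Vec<BoxedJob> {
    let mut jobs: Vec<BoxedJob> = Vec::new();
    // Each async block has its own anonymous type; `Box::pin` erases it.
    jobs.push(Box::pin(async { 1u32 }));
    jobs.push(Box::pin(async { ready(2u32).await }));
    jobs
}

/// Polls every job once and sums the outputs of those that are ready.
fn poll_all(jobs: &mut [BoxedJob]) -> u32 {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut total = 0;
    for job in jobs.iter_mut() {
        // `as_mut` reborrows the pin; the future itself never moves.
        if let Poll::Ready(value) = job.as_mut().poll(&mut cx) {
            total += value;
        }
    }
    total
}

/// Stack pinning with `pin!`: no allocation, valid for the current scope.
fn poll_on_stack() -> Poll<u32> {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(async { ready(40u32).await + 2 });
    fut.as_mut().poll(&mut cx)
}
```

Box::pin costs an allocation but yields a value you can store and move around freely. pin! is free but ties the future to the enclosing stack frame.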
Measure before you optimize. Profile with perf, tokio-console, or flamegraph. Async performance is not a theoretical exercise. It is a resource allocation problem.
Choosing the right tool
Use synchronous code when your workload is CPU-bound and runs to completion without waiting. Use synchronous code when you process a fixed batch of data and do not need to interleave I/O. Use synchronous code when you want predictable stack allocation and maximum cache locality.
Use async when your program spends most of its time waiting for I/O, timers, or external services. Use async when you need to handle thousands of concurrent connections with a small thread pool. Use async when you want to compose non-blocking operations cleanly without callback hell.
Use threads when you need to run independent CPU-heavy workloads that do not share state. Use threads when you interface with blocking C libraries that cannot be wrapped in async. Use threads when you want OS-level isolation and do not care about task scheduling overhead.
Reach for spawn_blocking when you must call a synchronous function from an async context. Reach for Arc<Mutex<T>> when multiple async tasks need mutable shared state across threads. Reach for Rc<RefCell<T>> when you need shared mutable state but stay on a single thread.
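The difference between the two shared-state choices shows up even without an async runtime. In this sketch OS threads stand in for tasks on a multi-threaded executor, and a plain loop stands in for tasks on a single-threaded one. The function names are hypothetical.

```rust
use std::cell::RefCell;
use std::rc::Rc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Shared mutable counter across threads: `Arc` for shared ownership,
/// `Mutex` for exclusive access.
fn count_across_threads(workers: usize, per_worker: u32) -> u32 {
    let counter = Arc::new(Mutex::new(0u32));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_worker {
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}

/// Same idea on one thread: `Rc` and `RefCell` need no atomics or locks,
/// but the compiler will not let them cross a thread boundary.
fn count_on_one_thread(owners: usize, per_owner: u32) -> u32 {
    let counter = Rc::new(RefCell::new(0u32));
    let clones: Vec<Rc<RefCell<u32>>> =
        (0..owners).map(|_| Rc::clone(&counter)).collect();
    for owner in &clones {
        for _ in 0..per_owner {
            *owner.borrow_mut() += 1;
        }
    }
    let total = *counter.borrow();
    total
}
```

Both functions return the product of their two arguments. The first pays for atomic reference counts and a lock on every increment, which is the price of crossing threads.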
Trust the executor. It is designed to yield, not to compute.