How to implement rate limiting in async Rust

You hit the wall at 429

You are writing a scraper that hits a public API. You spin up fifty concurrent tasks to fetch data fast. The API responds with 429 Too Many Requests. Your IP gets banned. The problem isn't your logic. You are sending requests faster than the server allows. You need a mechanism that tracks how many requests you send over time and pauses when you hit the limit.

Rate limiting isn't a feature you add. It's a constraint you respect.

The token bucket pattern

Rate limiting controls how fast actions happen over time. The standard pattern is the token bucket. Imagine a bucket filling with water at a steady drip. Every time you want to make a request, you scoop out a cup of water. If the bucket has enough water, you proceed. If it's empty, you wait until the drip refills it.

The "water" is permission to act. The "drip" is the allowed rate, like ten requests per second. The bucket also has a maximum size. If you don't use your tokens, they don't accumulate forever. Once the bucket is full, extra drips spill over. This allows bursts: if you've been quiet, you can suddenly send a batch of requests, up to the bucket capacity.
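To make the mechanics concrete, here is a minimal single-threaded sketch in plain Rust. It is an illustration only, not how governor is implemented. The TokenBucket type, its fields, and try_acquire are hypothetical names, and the caller passes in the current time so the behavior is deterministic.

```rust
use std::time::Duration;

/// Hypothetical token bucket driven by caller-supplied timestamps.
/// Tokens are tracked as "credit": the refill time accumulated but not yet spent.
struct TokenBucket {
    interval: Duration,   // time the drip needs to produce one token
    max_credit: Duration, // capacity expressed as capacity * interval
    credit: Duration,     // current fill level
    last: Duration,       // timestamp of the previous call
}

impl TokenBucket {
    /// Assumes per_second is nonzero; the bucket starts full.
    fn new(capacity: u32, per_second: u32) -> Self {
        let interval = Duration::from_secs(1) / per_second;
        let max_credit = interval * capacity;
        Self { interval, max_credit, credit: max_credit, last: Duration::ZERO }
    }

    /// Tries to take one token at time `now`.
    /// Returns Err(wait) with the time until the next token if the bucket is empty.
    fn try_acquire(&mut self, now: Duration) -> Result<(), Duration> {
        // Refill for the time that passed, but never above capacity (the spill).
        let elapsed = now.saturating_sub(self.last);
        self.last = self.last.max(now);
        self.credit = self.credit.saturating_add(elapsed).min(self.max_credit);

        if self.credit >= self.interval {
            self.credit -= self.interval;
            Ok(())
        } else {
            Err(self.interval - self.credit)
        }
    }
}
```

With a capacity of 2 and a rate of 10 per second, two calls at time zero succeed, a third is told to wait 100 ms, and a long idle period refills the bucket to 2 tokens and no further.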

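The governor crate implements the Generic Cell Rate Algorithm (GCRA), a variant that behaves like the bucket described above but stores only a single timestamp, kept in an atomic integer. No background thread drips tokens; the refill is computed when you check. You check the limiter, and it tells you whether you can proceed or when the next token will be available.

The bucket refills whether you use the tokens or not, but only up to its capacity. Idle time buys you one full burst, never more.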

Minimal setup with governor

You need two pieces to make rate limiting work in async Rust. You need the governor crate for the algorithm. You need Arc to share the limiter across tasks. You do not need a Mutex: the limiter keeps its state in an atomic, and check takes &self.

use governor::{DefaultDirectRateLimiter, Quota, RateLimiter};
use std::num::NonZeroU32;
use std::sync::Arc;

/// Creates a shared rate limiter allowing 10 requests per second.
fn make_limiter() -> Arc<DefaultDirectRateLimiter> {
    // Quota takes a NonZeroU32, so a rate of zero cannot be expressed.
    let rate = NonZeroU32::new(10).expect("rate must be nonzero");

    // Quota defines the refill rate: 10 tokens per second, with a burst of 10.
    let quota = Quota::per_second(rate);

    // RateLimiter::direct builds an unkeyed limiter: one bucket for all callers.
    let limiter = RateLimiter::direct(quota);

    // Arc allows sharing ownership across async tasks.
    Arc::new(limiter)
}

The RateLimiter type holds state that changes on every check, but it updates that state with atomic compare-and-swap rather than through a mutable reference. Because check takes &self, the limiter can be shared across threads and async tasks by reference. The Arc lets multiple tasks hold a handle to the same limiter. Wrapping the limiter in a Mutex adds contention without adding safety.

Convention aside: The community prefers Arc::clone(&limiter) over limiter.clone(). Both compile and do the same thing. The explicit form signals that you are cloning the reference, not the underlying data. It prevents readers from assuming a deep copy is happening.

Convention aside: RateLimiter types are verbose. The full type behind RateLimiter::direct has four generic parameters: state key, state store, clock, and middleware. Recent governor versions export DefaultDirectRateLimiter as a shorthand. Define your own alias at the top of your module to keep signatures readable. type Limiter = Arc<DefaultDirectRateLimiter>;

Share the limiter, not a lock. Arc is the only wrapper it needs.

Checking the limit in async code

The check method tries to consume a token. It returns Ok(()) if a token is available. It returns Err(NotUntil) if the bucket is empty. The error records the earliest instant at which a token becomes available, and its wait_time_from method converts that instant into a Duration relative to a clock reading. check never waits. It answers immediately.

In async Rust, you must never block the executor thread while waiting. A call to std::thread::sleep parks the worker thread, and every task scheduled on that thread stalls with it. Holding a std::sync::Mutex guard across an await is nearly as bad: every other task that needs the lock blocks its own thread while the holder sleeps. That is not a deadlock, but it starves the runtime just the same.

Wait asynchronously, and hold no lock while you wait. The until_ready method does both: it suspends the task until a token is available and consumes that token before it returns.

use governor::DefaultDirectRateLimiter;
use std::error::Error;
use std::sync::Arc;

/// Fetches data while respecting the rate limit.
/// The error type is Send + Sync so the future can be passed to tokio::spawn.
async fn fetch_with_limit(
    limiter: Arc<DefaultDirectRateLimiter>,
    url: &str,
) -> Result<String, Box<dyn Error + Send + Sync>> {
    // Suspend this task until a token is available, then consume it.
    // No lock is held, and the executor thread stays free for other tasks.
    limiter.until_ready().await;

    // Simulate the actual request.
    // In real code, use reqwest or a similar HTTP client.
    Ok(format!("Fetched {}", url))
}

There is no guard in this function, so there is nothing to drop before the await. Other tasks can consult the limiter while this one waits.

If you call check and sleep yourself, loop back and call check again after the sleep. Waking up does not consume a token, and another task may have taken it while you slept. Skipping the second check compiles and runs, and it quietly sends more requests than the quota allows.

Await the limiter, never sleep the thread. The executor has no patience for blocked threads.

Pitfalls and compiler errors

Rate limiting introduces shared mutable state. That state is a source of bugs if you ignore the rules.

If you move the limiter itself into one tokio::spawn and then try to use it in a second, the compiler rejects this with E0382 (use of moved value). A spawned task must own everything it captures, and a bare RateLimiter has only one owner. Wrap it in Arc and give each task its own Arc::clone. No Mutex is required: the limiter is Sync, and check takes &self.

If you pass a plain integer to the quota, as in Quota::per_second(10), the compiler rejects this with E0308 (mismatched types). The constructor takes a NonZeroU32, which rules out a zero rate at compile time. Build the value with NonZeroU32::new(10) and handle the Option it returns.

The limiter does not need a std::sync::Mutex, but you may still guard other shared state with one, such as a result buffer. The Mutex::lock method returns a Result. If a thread panicked while holding the lock, the mutex becomes poisoned, and lock().unwrap() panics on every later call. In many applications, a panic while holding the lock is a fatal error anyway, and unwrapping is acceptable. Note that lock().expect("state mutex poisoned") still panics; it only improves the message. If you need resilience, recover the guard with unwrap_or_else(|e| e.into_inner()) and repair or replace the state behind it.

A poisoned mutex turns one panic into many. Handle the lock result deliberately, or one failed task takes the rest down with it.
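Poison recovery can be sketched with the standard library alone. The example below uses a plain counter as a stand-in for whatever state sits behind the lock. The function names lock_recovering and poison_then_read are hypothetical, and whether the recovered data is still trustworthy depends on your application.

```rust
use std::sync::{Arc, Mutex, MutexGuard};
use std::thread;

/// Locks the mutex, taking the guard even if a previous holder panicked.
fn lock_recovering(state: &Mutex<u32>) -> MutexGuard<'_, u32> {
    state.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Poisons a mutex by panicking while the lock is held, then reads the value anyway.
/// Returns whether the mutex was poisoned and the value found inside.
fn poison_then_read() -> (bool, u32) {
    let state = Arc::new(Mutex::new(7u32));
    let worker_state = Arc::clone(&state);

    // The worker panics with the guard alive, which marks the mutex as poisoned.
    let outcome = thread::spawn(move || {
        let _guard = worker_state.lock().unwrap();
        panic!("simulated failure while holding the lock");
    })
    .join();
    assert!(outcome.is_err());

    let was_poisoned = state.is_poisoned();
    let value = *lock_recovering(&state);
    (was_poisoned, value)
}
```

The worker's panic message still appears on stderr; recovery only keeps the other threads from panicking in turn.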

High contention can become a bottleneck. If thousands of tasks check the limiter every millisecond, a Mutex around it would serialize all those checks. governor avoids this on its own: a direct limiter keeps its state in a single atomic integer and updates it with compare-and-swap, so concurrent checks do not queue behind a lock. If you still need a lock for other shared state, parking_lot::Mutex is a commonly used alternative to the standard one. Measure before switching.

Burst allowance and quotas

The Quota struct lets you configure burst behavior. By default, a quota of ten per second creates a bucket that holds ten tokens. You can send ten requests instantly, then one more every 100 milliseconds as the bucket refills.

You can increase the burst capacity with allow_burst. This sets the maximum bucket size.

use governor::Quota;
use std::num::NonZeroU32;

// Allows 10 requests per second sustained, but permits bursts of up to 20.
fn bursty_quota() -> Quota {
    Quota::per_second(NonZeroU32::new(10).unwrap())
        .allow_burst(NonZeroU32::new(20).unwrap())
}

This is useful when you want to handle sudden spikes without throttling immediately. If your app has been idle, the bucket fills to 20. You can send 20 requests back-to-back. After that, the rate drops to 10 per second as the bucket drains and refills.

Burst settings should match the server's tolerance. If the server bans you after 15 rapid requests, setting allow_burst(20) invites a ban the first time the bucket is full. Always check the API documentation for burst limits, not just sustained rates.
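A back-of-the-envelope check helps when comparing a quota against a server's documented limits. Assuming the bucket starts full and the client sends as fast as the limiter allows, the number of requests admitted in a window is at most the burst plus the refill over that window. The helper name below is hypothetical, and the result is an upper bound, not a measurement.

```rust
/// Upper bound on requests admitted during `window_secs`,
/// assuming a full bucket at the start of the window.
fn max_requests_in_window(rate_per_sec: u64, burst: u64, window_secs: u64) -> u64 {
    // The burst drains immediately; the refill adds rate * window on top of it.
    burst.saturating_add(rate_per_sec.saturating_mul(window_secs))
}
```

With a rate of 10 and a burst of 20, the first second can carry up to 30 requests, three times the sustained rate. If the server counts requests per one-second window, that peak is the number to compare against its limit.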

Decision matrix

Pick the tool that matches the constraint. Different problems need different primitives.

Use governor with Arc<RateLimiter> when you need precise control over request rates, including burst allowances and refill intervals, and you are coordinating multiple async tasks. This is the standard choice for API clients, scrapers, and any logic that must respect a time-based quota.

Use tokio::sync::Semaphore when you want to limit the number of concurrent operations rather than the rate over time. A semaphore caps parallelism, which throttles throughput indirectly but does not enforce a strict time-based quota. Use this when the bottleneck is database connections or thread pool slots, not server request limits.

Use tokio::time::interval when a single task must execute at a fixed frequency, such as a periodic health check or data sync. The interval handles timing and drift correction automatically. It is simpler than a rate limiter and avoids shared state entirely.

Use HTTP middleware or client interceptors when rate limiting applies to the entire application boundary. Centralizing the limit in middleware keeps business logic clean and ensures every request passes through the gate. This is the right choice for web servers protecting downstream services.

Concurrency limits need semaphores; rate limits need buckets.

Where to go next

Rate limiting restricts how often a specific action can happen, like a bouncer checking IDs at a club door. In async Rust, that means a shared limiter that many tasks can consult safely and without blocking the executor. Respecting the limit keeps the server you call from being overwhelmed and keeps your client from being banned.