How to Implement Circuit Breakers in Rust

A circuit breaker automatically stops requests to a failing service after a threshold of errors. You can build one with nothing but the standard library: wrap your service call in a small state machine that tracks failures and enforces a recovery timeout.

When the downstream service starts screaming

You are building a microservice that fetches user profiles. The profile service is having a bad day. It is returning 500 errors and taking ten seconds to respond. Your service keeps hammering it. Threads pile up waiting for responses. Memory fills with pending requests. The profile service crashes completely under the load. Your service times out. The whole system cascades into failure.

You need a way to stop the bleeding. You need a circuit breaker.

A circuit breaker monitors failures to a downstream service. When failures cross a threshold, the breaker "trips." It stops sending requests entirely. This gives the downstream service time to recover. After a timeout, the breaker allows a test request through. If that succeeds, the circuit closes and traffic resumes. If it fails, the circuit opens again.

The three states of a circuit breaker

A circuit breaker is a state machine with three modes.

Closed is the normal state. Requests flow through. The breaker counts errors. If the error rate stays low, the breaker stays closed.

Open is the tripped state. Requests are rejected immediately without reaching the downstream service. This protects the service from further load. The breaker waits for a recovery timeout.

Half-Open is the testing state. After the timeout, the breaker moves to half-open. It allows a limited number of requests through as probes. If a probe succeeds, the breaker assumes the service has recovered and moves back to closed. If a probe fails, the breaker trips back to open and resets the timeout.
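
The transitions described above can be written down as a pure function before any timing or counting is involved. This is a minimal sketch: the Event enum and the transition function are hypothetical names introduced here for illustration, and real breakers derive these events from counters and clocks.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Closed,
    Open,
    HalfOpen,
}

/// Hypothetical events that drive the breaker; not part of any library.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Event {
    ThresholdReached,
    TimeoutElapsed,
    ProbeSucceeded,
    ProbeFailed,
}

/// Returns the next state. Any pair not listed leaves the state unchanged.
fn transition(state: State, event: Event) -> State {
    match (state, event) {
        (State::Closed, Event::ThresholdReached) => State::Open,
        (State::Open, Event::TimeoutElapsed) => State::HalfOpen,
        (State::HalfOpen, Event::ProbeSucceeded) => State::Closed,
        (State::HalfOpen, Event::ProbeFailed) => State::Open,
        (unchanged, _) => unchanged,
    }
}

fn main() {
    // Trip, wait out the timeout, then recover with a successful probe.
    let events = [
        Event::ThresholdReached,
        Event::TimeoutElapsed,
        Event::ProbeSucceeded,
    ];
    let final_state = events
        .iter()
        .fold(State::Closed, |state, &event| transition(state, event));
    println!("{:?}", final_state);
}
```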

The key insight is that failing fast is better than failing slow. Rejecting a request in zero milliseconds is better than waiting ten seconds for a timeout. The circuit breaker trades availability for stability. It sacrifices some requests to save the system.

A minimal circuit breaker from scratch

Building a circuit breaker from scratch teaches the mechanics. Start with a simple synchronous version.

use std::time::{Duration, Instant};

/// Represents the three states of a circuit breaker.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Closed,
    Open,
    HalfOpen,
}

/// A simple circuit breaker that tracks errors and trips after a threshold.
struct CircuitBreaker {
    state: State,
    error_count: u32,
    threshold: u32,
    timeout: Duration,
    last_failure_time: Option<Instant>,
}

impl CircuitBreaker {
    /// Create a new breaker with the given error threshold and recovery timeout.
    fn new(threshold: u32, timeout: Duration) -> Self {
        Self {
            state: State::Closed,
            error_count: 0,
            threshold,
            timeout,
            last_failure_time: None,
        }
    }

    /// Check if a request is allowed. Returns true if the circuit is closed or half-open,
    /// or if it is open and the timeout has elapsed (which moves it to half-open).
    fn allow_request(&mut self) -> bool {
        match self.state {
            State::Closed => true,
            State::Open => {
                // Check if the timeout has elapsed.
                if let Some(last_time) = self.last_failure_time {
                    if last_time.elapsed() >= self.timeout {
                        // Timeout passed. Move to half-open to test recovery.
                        self.state = State::HalfOpen;
                        return true;
                    }
                }
                false
            }
            State::HalfOpen => true,
        }
    }

    /// Record a successful call. Resets the error count and closes the circuit.
    fn record_success(&mut self) {
        self.error_count = 0;
        self.state = State::Closed;
    }

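    /// Record a failure. Increments the error count and trips the circuit if threshold is reached.
    /// In half-open, the count is still at or above the threshold, so one failed probe re-opens the circuit.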
    fn record_failure(&mut self) {
        self.error_count += 1;
        self.last_failure_time = Some(Instant::now());

        if self.error_count >= self.threshold {
            self.state = State::Open;
        }
    }
}

The struct holds the state, the error count, the threshold, and the timeout. The allow_request method checks the state. If open, it checks the clock. If the timeout has passed, it transitions to half-open. The record_failure method increments the count and trips the breaker when the threshold is hit.

Convention aside: In production code, you will wrap this in Arc<Mutex<CircuitBreaker>> to share it across threads. RwLock pays off only when reads vastly outnumber writes. Every method on this breaker takes &mut self, including allow_request, so each call needs exclusive access and an RwLock would gain nothing. Mutex is the simpler choice.
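
The sharing pattern looks roughly like this. To keep the sketch short, it uses a hypothetical FailureCounter as a stand-in for CircuitBreaker; the report_failures helper is also an assumption made for illustration. The Arc and Mutex handling is the part that carries over.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for CircuitBreaker: just enough state to show the sharing pattern.
#[derive(Debug, Default)]
struct FailureCounter {
    error_count: u32,
    threshold: u32,
    open: bool,
}

impl FailureCounter {
    fn record_failure(&mut self) {
        self.error_count += 1;
        if self.error_count >= self.threshold {
            self.open = true;
        }
    }
}

/// Spawns `workers` threads that each report one failure to the shared counter.
/// Returns the final error count and whether the counter tripped.
fn report_failures(workers: u32, threshold: u32) -> (u32, bool) {
    let shared = Arc::new(Mutex::new(FailureCounter {
        threshold,
        ..Default::default()
    }));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            // Each thread gets its own handle to the same underlying state.
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                // The guard is dropped at the end of this statement, so the lock is held briefly.
                shared.lock().unwrap().record_failure();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    let guard = shared.lock().unwrap();
    (guard.error_count, guard.open)
}

fn main() {
    let (errors, open) = report_failures(4, 3);
    println!("errors = {}, open = {}", errors, open);
}
```

Holding the guard only for the duration of one method call keeps contention low, which matters because every request touches the breaker.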

How the state machine moves

Trace a sequence of calls to see the logic in action.

Create a breaker with a threshold of three errors and a timeout of five seconds.

  1. Call allow_request. State is Closed. Returns true.
  2. Service fails. Call record_failure. Error count becomes 1. State stays Closed.
  3. Service fails. Call record_failure. Error count becomes 2. State stays Closed.
  4. Service fails. Call record_failure. Error count becomes 3. Threshold reached. State moves to Open. last_failure_time now holds the time of this third failure.
  5. Call allow_request. State is Open. Timeout not elapsed. Returns false. Request rejected immediately.
  6. Wait six seconds.
  7. Call allow_request. State is Open. Timeout elapsed. State moves to HalfOpen. Returns true.
  8. Service succeeds. Call record_success. Error count resets to 0. State moves to Closed.
  9. Traffic resumes normally.
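
The trace above can be replayed as a small program. This sketch repeats a compact copy of the CircuitBreaker from the previous section so that it compiles on its own. The replay_trace helper is a name introduced here, and the example assumes a 200-millisecond timeout instead of five seconds so the run finishes quickly.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Closed,
    Open,
    HalfOpen,
}

// Compact copy of the breaker defined earlier, repeated so this example stands alone.
struct CircuitBreaker {
    state: State,
    error_count: u32,
    threshold: u32,
    timeout: Duration,
    last_failure_time: Option<Instant>,
}

impl CircuitBreaker {
    fn new(threshold: u32, timeout: Duration) -> Self {
        Self { state: State::Closed, error_count: 0, threshold, timeout, last_failure_time: None }
    }

    fn allow_request(&mut self) -> bool {
        if self.state == State::Open {
            let elapsed = self.last_failure_time.map_or(false, |t| t.elapsed() >= self.timeout);
            if elapsed {
                self.state = State::HalfOpen;
            }
        }
        self.state != State::Open
    }

    fn record_success(&mut self) {
        self.error_count = 0;
        self.state = State::Closed;
    }

    fn record_failure(&mut self) {
        self.error_count += 1;
        self.last_failure_time = Some(Instant::now());
        if self.error_count >= self.threshold {
            self.state = State::Open;
        }
    }
}

/// Replays the nine-step trace and returns the state observed after each phase.
fn replay_trace(timeout: Duration) -> Vec<State> {
    let mut breaker = CircuitBreaker::new(3, timeout);
    let mut seen = Vec::new();

    assert!(breaker.allow_request()); // Step 1: closed, request allowed.
    for _ in 0..3 {
        breaker.record_failure(); // Steps 2-4: the third failure trips the breaker.
    }
    seen.push(breaker.state);

    assert!(!breaker.allow_request()); // Step 5: rejected while open.
    sleep(timeout + Duration::from_millis(20)); // Step 6: wait past the timeout.

    assert!(breaker.allow_request()); // Step 7: probe allowed, now half-open.
    seen.push(breaker.state);

    breaker.record_success(); // Step 8: probe succeeded, circuit closes.
    seen.push(breaker.state);
    seen
}

fn main() {
    let states = replay_trace(Duration::from_millis(200));
    println!("{:?}", states);
}
```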

The breaker protects the service by rejecting requests during the open phase. It recovers by probing in the half-open phase. The timeout is a tuning parameter. Too short, and you hammer a service that is still recovering. Too long, and you delay recovery unnecessarily.

Trust the state machine. It knows when to stop.

Real-world async implementation

Real services are asynchronous. You need thread-safe shared state and async-friendly locks.

use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::Mutex;

/// Async-safe circuit breaker state.
#[derive(Debug)]
struct BreakerState {
    state: State,
    error_count: u32,
    threshold: u32,
    timeout: Duration,
    last_failure_time: Option<Instant>,
}

/// A thread-safe circuit breaker for async code.
#[derive(Clone)]
struct AsyncCircuitBreaker {
    inner: Arc<Mutex<BreakerState>>,
}

impl AsyncCircuitBreaker {
    /// Create a new async circuit breaker.
    fn new(threshold: u32, timeout: Duration) -> Self {
        Self {
            inner: Arc::new(Mutex::new(BreakerState {
                state: State::Closed,
                error_count: 0,
                threshold,
                timeout,
                last_failure_time: None,
            })),
        }
    }

    /// Check if a request is allowed.
    async fn allow_request(&self) -> bool {
        let mut state = self.inner.lock().await;
        match state.state {
            State::Closed => true,
            State::Open => {
                if let Some(last_time) = state.last_failure_time {
                    if last_time.elapsed() >= state.timeout {
                        state.state = State::HalfOpen;
                        return true;
                    }
                }
                false
            }
            State::HalfOpen => true,
        }
    }

    /// Record a success.
    async fn record_success(&self) {
        let mut state = self.inner.lock().await;
        state.error_count = 0;
        state.state = State::Closed;
    }

    /// Record a failure.
    async fn record_failure(&self) {
        let mut state = self.inner.lock().await;
        state.error_count += 1;
        state.last_failure_time = Some(Instant::now());

        if state.error_count >= state.threshold {
            state.state = State::Open;
        }
    }
}

/// Wrapper function that executes a closure only if the breaker allows it.
async fn execute_with_breaker<F, Fut, T>(breaker: &AsyncCircuitBreaker, func: F) -> Result<T, String>
where
    F: FnOnce() -> Fut,
    Fut: std::future::Future<Output = Result<T, String>>,
{
    if !breaker.allow_request().await {
        return Err("Circuit breaker is open".to_string());
    }

    match func().await {
        Ok(value) => {
            breaker.record_success().await;
            Ok(value)
        }
        Err(e) => {
            breaker.record_failure().await;
            Err(e)
        }
    }
}

The AsyncCircuitBreaker wraps the state in Arc<Mutex<BreakerState>>. The Arc allows cloning the breaker to share it across tasks. The Mutex protects concurrent access. The execute_with_breaker function checks the breaker, runs the closure, and records the result.

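Convention aside: tokio::sync::Mutex is the conservative default when a guard might be held across an .await point. The standard library mutex blocks the OS thread when contended, and in an async runtime a blocked thread can starve other tasks. tokio::sync::Mutex yields the task instead, keeping the runtime responsive. The breaker above never awaits while holding the guard, so std::sync::Mutex would also work here and is usually cheaper. The Tokio documentation itself describes the standard mutex as acceptable, and often preferable, for short critical sections like these.

If you do hold a std::sync::MutexGuard across an .await, the future stops being Send, and tokio::spawn rejects it with an error along the lines of "future cannot be sent between threads safely". On a single-threaded runtime the same mistake can deadlock instead. Keep the guard in a scope that ends before the next .await, or use tokio::sync::Mutex.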

Pitfalls and compiler traps

Circuit breakers introduce complexity. Watch for these issues.

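Poisoned locks. If a thread panics while holding a std::sync::Mutex guard, the lock is poisoned, and every later lock() call returns a PoisonError. tokio::sync::Mutex has no poisoning: a panic drops the guard and releases the lock, and the next caller simply acquires it. If you use std::sync::Mutex, you must handle PoisonError. With either mutex, the state a panicking task left behind may be half-updated, so keep critical sections small.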
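
One common way to handle PoisonError is to take the guard back out of the error and carry on. The sketch below uses a plain counter as hypothetical stand-in state; read_count and poison are helper names invented for this example. Whether recovering is safe depends on whether your state can be left inconsistent by a panic.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Locks the mutex and reads the value even if an earlier holder panicked.
fn read_count(shared: &Mutex<u32>) -> u32 {
    // `PoisonError::into_inner` hands back the guard, so the data stays reachable.
    let guard = shared.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    *guard
}

/// Poisons the mutex on purpose by panicking in a thread that holds the guard.
fn poison(shared: &Arc<Mutex<u32>>) {
    let clone = Arc::clone(shared);
    let result = thread::spawn(move || {
        let mut guard = clone.lock().unwrap();
        *guard += 1;
        panic!("simulated panic while holding the lock");
    })
    .join();
    // The spawned thread panicked, so join reports an error.
    assert!(result.is_err());
}

fn main() {
    let shared = Arc::new(Mutex::new(0_u32));
    poison(&shared);
    println!(
        "poisoned = {}, count = {}",
        shared.is_poisoned(),
        read_count(&shared)
    );
}
```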

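Borrow checker conflicts. The synchronous CircuitBreaker takes &mut self in every method. If you hold a shared reference into the breaker while calling one of them, the compiler rejects you with E0502 (cannot borrow as mutable because it is also borrowed as immutable). Copy the value out, or finish with the reference before the mutating call. The async version avoids this because its methods take &self and mutate through the Mutex.

// BAD: A shared borrow is still live across the mutable call.
let state_ref = &breaker.state;
breaker.record_failure(); // E0502, because state_ref is used below.
println!("{:?}", state_ref);

// GOOD: State is Copy, so copy it out instead of borrowing.
let state_before = breaker.state;
breaker.record_failure();
println!("{:?}", state_before);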

False positives. A brief network blip can trip the breaker if the threshold is too low. Tune the threshold based on your service's error budget. A threshold of three errors might be too aggressive for a high-throughput service. Use a sliding window or a failure rate percentage instead of a raw count for production systems.
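
A count-based sliding window is one way to trip on failure rate rather than a raw count. This is a minimal sketch, not a production design: SlidingWindow, min_calls, and max_failure_rate are names and parameters assumed here for illustration.

```rust
use std::collections::VecDeque;

/// Tracks the outcome of the most recent `capacity` calls.
struct SlidingWindow {
    outcomes: VecDeque<bool>, // true means the call failed
    capacity: usize,
    min_calls: usize,
    max_failure_rate: f64,
}

impl SlidingWindow {
    fn new(capacity: usize, min_calls: usize, max_failure_rate: f64) -> Self {
        Self {
            outcomes: VecDeque::with_capacity(capacity),
            capacity,
            min_calls,
            max_failure_rate,
        }
    }

    /// Records one call, evicting the oldest outcome once the window is full.
    fn record(&mut self, failed: bool) {
        if self.outcomes.len() == self.capacity {
            self.outcomes.pop_front();
        }
        self.outcomes.push_back(failed);
    }

    /// Fraction of calls in the window that failed, or 0.0 for an empty window.
    fn failure_rate(&self) -> f64 {
        if self.outcomes.is_empty() {
            return 0.0;
        }
        let failures = self.outcomes.iter().filter(|&&failed| failed).count();
        failures as f64 / self.outcomes.len() as f64
    }

    /// Trips only when enough calls were seen and the failure rate is at or above the limit.
    fn should_trip(&self) -> bool {
        self.outcomes.len() >= self.min_calls && self.failure_rate() >= self.max_failure_rate
    }
}

fn main() {
    // Window of 10 calls, at least 5 calls before judging, trip at a 50% failure rate.
    let mut window = SlidingWindow::new(10, 5, 0.5);
    for _ in 0..8 {
        window.record(false);
    }
    for _ in 0..3 {
        window.record(true);
    }
    println!("after a short blip: trip = {}", window.should_trip());
    for _ in 0..2 {
        window.record(true);
    }
    println!("after sustained failures: trip = {}", window.should_trip());
}
```

The min_calls guard keeps a handful of early failures from tripping the breaker before there is enough traffic to judge the rate.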

Half-open storms. Both implementations above let every request through once the state is HalfOpen. If many tasks are waiting for the timeout, they all proceed at once and probe the service together, which can overwhelm a service that is still recovering. Limit the number of probes in half-open state. Only allow one request through at a time, or use a semaphore to cap concurrent probes.
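
A single flag is enough to admit one probe at a time. The ProbeGate type below is a hypothetical helper, sketched on the assumption that it lives inside the state already protected by the breaker's Mutex, so it needs no synchronization of its own.

```rust
/// Hypothetical gate that admits one probe at a time while the breaker is half-open.
#[derive(Debug, Default)]
struct ProbeGate {
    probe_in_flight: bool,
}

impl ProbeGate {
    /// Returns true for the first caller and false for everyone else until the probe finishes.
    fn try_acquire(&mut self) -> bool {
        if self.probe_in_flight {
            false
        } else {
            self.probe_in_flight = true;
            true
        }
    }

    /// Call when the probe completes, whether it succeeded or failed.
    fn release(&mut self) {
        self.probe_in_flight = false;
    }
}

fn main() {
    let mut gate = ProbeGate::default();
    let first = gate.try_acquire(); // The first caller becomes the probe.
    let second = gate.try_acquire(); // Everyone else is rejected while it runs.
    gate.release();
    let third = gate.try_acquire(); // A new probe is allowed after release.
    println!("{} {} {}", first, second, third);
}
```

In the breakers above, allow_request would call try_acquire in the HalfOpen arm, and record_success and record_failure would call release.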

Distinguishing errors. Not all errors should trip the breaker. A 404 "Not Found" is a client error, not a service failure. A 503 "Service Unavailable" is a service failure. Filter errors before recording them. Only count 5xx errors or timeouts as failures.
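
The filter can be a small classification function that runs before record_failure. In this sketch, CallOutcome and counts_as_failure are hypothetical names, and the rule follows the paragraph above: only 5xx responses and timeouts count.

```rust
/// Hypothetical outcome of a downstream HTTP call.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CallOutcome {
    Status(u16),
    Timeout,
}

/// Returns true only for outcomes that indicate the downstream service itself is unhealthy.
fn counts_as_failure(outcome: CallOutcome) -> bool {
    match outcome {
        CallOutcome::Timeout => true,
        // 5xx means the server failed; 4xx is the caller's problem and is ignored.
        CallOutcome::Status(code) => (500..=599).contains(&code),
    }
}

fn main() {
    let outcomes = [
        CallOutcome::Status(200),
        CallOutcome::Status(404),
        CallOutcome::Status(503),
        CallOutcome::Timeout,
    ];
    for outcome in outcomes {
        println!("{:?} -> failure: {}", outcome, counts_as_failure(outcome));
    }
}
```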

Treat the threshold as a safety valve, not a precision instrument. Tune it with real traffic data.

Decision: circuit breaker vs alternatives

Resilience patterns overlap. Pick the right tool for the failure mode.

Use a circuit breaker when you need to protect a downstream service from cascading failures by stopping requests after a threshold of errors. Use a circuit breaker when the downstream service is shared and you want to prevent your service from contributing to its overload. Use a circuit breaker when failures are correlated and likely to persist for a duration.

Use rate limiting when you need to control the volume of requests regardless of success or failure. Use rate limiting when you are enforcing API quotas or preventing abuse from a single client. Use rate limiting when the downstream service has a known capacity limit you must respect.

Use retries with exponential backoff when failures are transient and random. Use retries when you want to give the service a chance to recover without stopping all traffic. Use retries when the cost of a failed request is high and you can afford the latency of a retry.

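Use the governor crate when you need battle-tested, in-process rate limiting and quota management. It implements the generic cell rate algorithm (GCRA) and offers both a single shared limiter and per-key limiters. It is a rate limiter, not a circuit breaker: it does not track call failures, and it does not coordinate limits across processes. Distributed rate limiting needs a shared store such as Redis and a different library.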

Use a dedicated circuit breaker crate, such as failsafe or recloser, when you need advanced features like sliding windows, failure rate calculations, or metrics integration without implementing them yourself. Check each crate's documentation for the features it actually ships.

Don't layer patterns blindly. A circuit breaker combined with retries can mask problems if the retry logic ignores the breaker state. Check the breaker before retrying.
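
A retry loop that respects the breaker might look like the sketch below. The retry_with_breaker function and the RetryError type are assumptions made for this example. The breaker is passed in as a closure so the sketch stays independent of any particular breaker type.

```rust
use std::cell::Cell;
use std::thread::sleep;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum RetryError<E> {
    /// The breaker refused the attempt, so no call was made.
    BreakerOpen,
    /// Every attempt failed; carries the last error.
    Exhausted(E),
}

/// Calls `call` at least once and at most `max_attempts` times,
/// consulting `allow` before every attempt, including retries.
fn retry_with_breaker<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut allow: impl FnMut() -> bool,
    mut call: impl FnMut() -> Result<T, E>,
) -> Result<T, RetryError<E>> {
    let mut attempt = 0;
    loop {
        if !allow() {
            return Err(RetryError::BreakerOpen);
        }
        match call() {
            Ok(value) => return Ok(value),
            Err(e) => {
                attempt += 1;
                if attempt >= max_attempts {
                    return Err(RetryError::Exhausted(e));
                }
                // Exponential backoff: base, 2x base, 4x base, and so on.
                sleep(base_delay * 2u32.pow(attempt - 1));
            }
        }
    }
}

fn main() {
    // Stand-in breaker: refuses requests once two failures have been seen.
    let failures = Cell::new(0_u32);
    let result: Result<(), RetryError<&str>> = retry_with_breaker(
        5,
        Duration::from_millis(1),
        || failures.get() < 2,
        || {
            failures.set(failures.get() + 1);
            Err("503")
        },
    );
    println!("{:?} after {} calls", result, failures.get());
}
```

Because the breaker is checked inside the loop, the retries stop as soon as it opens instead of burning through the remaining attempts.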

Where to go next