What Compiler Optimizations Does Rust Apply?

Rust applies optimizations based on the `opt-level` setting in your `Cargo.toml` profile, with level 0 for development and level 3 for release builds.

The compiler rewrites your code

You write a loop that sums a million numbers. You run it. It takes three seconds. You check the code. It looks perfect. You add --release to the cargo command. It takes four milliseconds. What changed? The compiler rewrote your code.

Rust does not just translate your source text into machine instructions. It analyzes what the program does and transforms the code to run faster, use less memory, or shrink the binary, all without changing observable behavior. The intensity of this transformation depends on the build profile. In development, the compiler prioritizes fast compilation and accurate debugging. In release, it prioritizes execution speed. The difference is not a magic switch. It is a collection of optimization passes that run over your code, each one looking for patterns to improve.

Debug versus release: two different goals

Rust uses LLVM as its code generation backend. LLVM is a library of compiler infrastructure used by many languages. It contains hundreds of optimization passes. A pass is a transformation that scans the code, finds a pattern, and rewrites it. Some passes make the code faster. Some make it smaller. Some enable other passes to work better.

In the default dev profile, Rust disables most optimization passes. The compiler generates code that matches your source structure closely. This makes compilation fast. It also makes debugging reliable. When the program crashes, the stack trace points to the exact line in your file. The variables in the debugger match the variables in your code.

In the release profile, Rust enables aggressive optimization. The compiler runs many passes. It inlines functions, removes dead code, unrolls loops, and vectorizes operations. The resulting machine code often bears little resemblance to your source. Compilation takes longer. Debugging becomes harder because variables may be optimized away or reordered. The code runs significantly faster.

fn main() {
    // u64 is required here: the total (499_999_500_000) does not fit in i32.
    let mut sum: u64 = 0;
    // In debug, every iteration runs unoptimized, with an overflow check on each add.
    // In release, the optimizer may calculate the sum at compile time.
    for i in 0..1_000_000u64 {
        sum += i;
    }
    println!("{}", sum);
}

The community convention is strict: never benchmark debug builds. Debug performance is not representative of how your code will run in production. Always use cargo run --release or cargo bench when measuring speed. If you see a performance issue, profile the release build. Optimizations can change the hot path entirely.
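
As a rough sketch of measuring the right thing, the program below times the summing loop with `std::time::Instant` and passes the input through `std::hint::black_box` so the optimizer cannot treat it as a compile-time constant. The function name `sum_to` is illustrative only, and a single timed run is noisy; a real measurement would use `cargo bench` or a benchmarking crate.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Sums the integers in 0..n. The name is hypothetical, chosen for this sketch.
fn sum_to(n: u64) -> u64 {
    let mut sum = 0u64;
    for i in 0..n {
        sum += i;
    }
    sum
}

fn main() {
    // black_box hides the value from the optimizer, so the call below
    // cannot be folded into a constant before the timer starts.
    let n = black_box(1_000_000u64);

    let start = Instant::now();
    // Wrapping the result in black_box keeps the computation from being
    // discarded as dead code.
    let total = black_box(sum_to(n));
    let elapsed = start.elapsed();

    println!("sum = {total}, elapsed = {elapsed:?}");
}
```

Running this once with `cargo run` and once with `cargo run --release` should show the gap between the two profiles. Even with `black_box`, the release build may replace the loop with a closed-form expression, which is itself a good reminder of how far the optimizer can go.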

Benchmark release builds. Debug performance is a lie.

Monomorphization: generics get concrete

Rust handles generics through monomorphization. When you write a generic function, the compiler does not generate one function that works for all types. It generates a separate copy of the function for each type you use. This happens at compile time. The result is that the optimizer sees concrete types. It knows exactly how big the data is, what operations are available, and how to lay out memory.

/// Sums a slice of any numeric type.
/// The compiler creates a separate version for i32, f64, etc.
fn sum<T: std::ops::Add<Output = T> + Copy + Default>(slice: &[T]) -> T {
    let mut acc = T::default();
    // The optimizer sees the concrete type. For integers it can often
    // vectorize this loop with SIMD. Float sums usually stay scalar,
    // because reordering float additions can change the result.
    for &val in slice {
        acc = acc + val;
    }
    acc
}

fn main() {
    let ints = vec![1, 2, 3, 4, 5];
    // Generates sum::<i32>. The optimizer knows i32 is 4 bytes.
    let int_sum = sum(&ints);

    let floats = vec![1.0, 2.0, 3.0];
    // Generates sum::<f64>. The optimizer knows f64 is 8 bytes.
    let float_sum = sum(&floats);

    println!("{} {}", int_sum, float_sum);
}

Monomorphization helps the optimizer because it removes abstraction overhead. The compiler can inline the function body and apply type-specific optimizations. If you use sum with i32, the optimizer might use SIMD instructions to add four integers at once. If you use it with a custom type, it generates code tailored to that type. The price is more generated code, which lengthens compilation and can grow the binary. This is part of Rust's zero-cost abstraction philosophy. High-level code can compile to the same efficient machine code as hand-written low-level code.
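
To make the idea concrete, here is a small sketch that places a generic function next to hand-written versions of the copies the compiler conceptually generates for it. The names `largest`, `largest_i32`, and `largest_f64` are invented for illustration; the real generated symbols are mangled and never appear in your source.

```rust
/// Generic version: one definition in the source, one compiled copy per type used.
fn largest<T: PartialOrd + Copy>(a: T, b: T) -> T {
    if a > b { a } else { b }
}

/// Hand-written equivalent of the copy generated for i32.
fn largest_i32(a: i32, b: i32) -> i32 {
    if a > b { a } else { b }
}

/// Hand-written equivalent of the copy generated for f64.
fn largest_f64(a: f64, b: f64) -> f64 {
    if a > b { a } else { b }
}

fn main() {
    // Each call site selects a concrete instantiation at compile time,
    // so the generic calls behave exactly like the hand-written ones.
    assert_eq!(largest(3, 7), largest_i32(3, 7));
    assert_eq!(largest(2.5, 1.5), largest_f64(2.5, 1.5));
    println!("{} {}", largest(3, 7), largest(2.5, 1.5));
}
```

Nothing about the type is left to discover at run time, which is why the optimizer can treat the generic call like the specialized one.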

Write idiomatic code. The optimizer loves concrete types and exposed structure.

LLVM passes: what actually happens

The optimizer applies a sequence of passes. The exact set depends on the opt-level configuration. Common transformations include:

  • Inlining: The compiler copies the body of a small function into the caller. This removes function call overhead and exposes more code for further optimization. It allows the compiler to see across function boundaries.
  • Dead code elimination: Variables that are computed but never used are removed. Functions that are never called are stripped. This reduces binary size and removes unnecessary work.
  • Constant folding: Expressions that can be evaluated at compile time are replaced with their result. 2 + 2 becomes 4. Complex calculations with constant inputs may vanish entirely.
  • Loop unrolling: The compiler duplicates the loop body to reduce branch overhead. Instead of looping four times, it runs the body four times in a row. This helps the CPU pipeline.
  • Vectorization: The compiler detects loops that operate on arrays and replaces scalar operations with SIMD instructions. SIMD processes multiple data elements in parallel using wide registers.
  • Bounds check elimination: Rust checks array indices at runtime to prevent out-of-bounds access. The optimizer analyzes the loop bounds and removes checks when it can prove the index is always valid.

The two functions below show bounds check elimination in practice:
fn sum_slice(slice: &[i32]) -> i32 {
    let mut acc = 0;
    // In debug, slice[i] checks bounds every iteration.
    // In release, the optimizer sees that i ranges over 0..slice.len(),
    // can prove the check always passes, and typically removes it.
    for i in 0..slice.len() {
        acc += slice[i];
    }
    acc
}

fn sum_iter(slice: &[i32]) -> i32 {
    // Iterators often help the optimizer more than index loops.
    // The iterator exposes the structure of the traversal.
    // The compiler can vectorize this more reliably.
    slice.iter().sum()
}

The community convention favors iterators over index loops. Iterators expose the traversal pattern clearly. The optimizer can reason about the data flow and apply vectorization more aggressively. Index loops require the compiler to track the index variable and prove bounds. Iterators handle this internally and often generate better code. Use slice.iter().sum() instead of a manual loop.
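
The same preference carries over to loops that walk two slices at once. The sketch below, with illustrative function names, compares an index-based dot product against one built on `zip`. In the indexed version the optimizer has to prove the index is in range for both slices. In the zipped version no index exists to check.

```rust
/// Index-based dot product. Each a[i] and b[i] is a bounds-checked access
/// unless the optimizer can prove i is in range for both slices.
fn dot_indexed(a: &[i32], b: &[i32]) -> i32 {
    let n = a.len().min(b.len());
    let mut acc = 0;
    for i in 0..n {
        acc += a[i] * b[i];
    }
    acc
}

/// Iterator-based dot product. zip stops at the shorter slice,
/// so the traversal needs no explicit index at all.
fn dot_zipped(a: &[i32], b: &[i32]) -> i32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = [1, 2, 3];
    let b = [4, 5, 6];
    assert_eq!(dot_indexed(&a, &b), dot_zipped(&a, &b));
    println!("{}", dot_zipped(&a, &b));
}
```

Whether the two versions end up as identical machine code depends on the compiler version and target, so inspect the generated assembly before relying on it.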

Trust the optimizer, but write code that gives it something to work with.
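
As one small sketch of giving the optimizer something to work with, the code below chains two of the passes listed above: inlining and constant folding. The functions are hypothetical, and `sum_of_squares_folded` is only a hand-written picture of what an optimized build is expected to reduce `sum_of_squares` to, not output captured from the compiler.

```rust
/// Small enough that the optimizer will typically inline it at each call site.
#[inline]
fn square(x: u32) -> u32 {
    x * x
}

/// After inlining, every operand here is a constant, so constant folding
/// can evaluate the whole expression at compile time.
fn sum_of_squares() -> u32 {
    square(1) + square(2) + square(3)
}

/// Hand-written illustration of the folded result (an assumption about
/// typical optimizer behavior, not a guarantee).
fn sum_of_squares_folded() -> u32 {
    14
}

fn main() {
    assert_eq!(sum_of_squares(), sum_of_squares_folded());
    println!("{}", sum_of_squares());
}
```

The `#[inline]` attribute is only a hint. Within a single crate the optimizer usually inlines small functions like this on its own.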

When the optimizer gets blocked

Optimizations are not magic. The compiler can only transform code when it can prove the transformation is safe. If your code introduces uncertainty, the optimizer may stop. Common blockers include:

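  • Excessive indirection: Raw pointers carry no aliasing guarantees. If the optimizer cannot prove two pointers point to different memory, it must be conservative and cannot reorder loads and stores freely. Rust references help here, because a &mut T is guaranteed to be the only live reference to its data. Long chains of heap pointers still hide the data layout from the compiler and cost cache misses at runtime.
  • Complex control flow: Deeply nested conditionals and jumps make it hard for the compiler to analyze data dependencies. The optimizer may give up on vectorization or inlining.
  • Dynamic dispatch: Using dyn Trait involves a vtable lookup. The compiler usually cannot inline the method call because the target is determined at runtime. This prevents many optimizations that depend on seeing the callee.
  • Undefined behavior: The optimizer assumes undefined behavior never happens. If your code contains it anyway, the compiler may delete checks, reorder memory accesses, or generate incorrect results. This is the most dangerous item on the list, because it does not merely slow the code down. It breaks it.

The function below reads through a raw pointer and documents what the caller must guarantee: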
/// Reads an i32 through a raw pointer.
///
/// # Safety
///
/// The caller must guarantee that:
/// 1. ptr is non-null.
/// 2. ptr is properly aligned.
/// 3. ptr points to a valid, initialized i32.
/// 4. The referenced i32 is not mutated concurrently.
unsafe fn read_ptr(ptr: *const i32) -> i32 {
    // SAFETY: the caller upholds the invariants listed above.
    // If any of them is violated, this read is undefined behavior and
    // the optimizer may use the lie to delete code or corrupt memory.
    unsafe { ptr.read() }
}

The optimizer trusts unsafe blocks blindly. If you claim a pointer is valid and aligned, the compiler assumes it is. It may optimize away null checks or assume no aliasing. If you lie, the compiler will happily generate code that crashes or corrupts data. Treat the SAFETY comment as a proof. If you cannot write the invariants, you do not have a proof.
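
One way to keep that proof short is to wrap the unsafe read in a safe function whose signature already guarantees every invariant. The sketch below assumes a helper named `read_i32` (a stand-in for any raw-pointer read) and a wrapper named `read_via_reference`. Both names are illustrative.

```rust
/// Reads an i32 through a raw pointer.
///
/// # Safety
///
/// ptr must be non-null, properly aligned, and point to an initialized
/// i32 that is not being written concurrently.
unsafe fn read_i32(ptr: *const i32) -> i32 {
    // SAFETY: the caller upholds the contract documented above.
    unsafe { ptr.read() }
}

/// Safe wrapper: the shared borrow supplies every invariant read_i32 needs.
fn read_via_reference(value: &i32) -> i32 {
    let ptr: *const i32 = value;
    // SAFETY: ptr was just derived from a live shared reference, so it is
    // non-null, aligned, and points to an initialized i32 that cannot be
    // mutated while the borrow is alive.
    unsafe { read_i32(ptr) }
}

fn main() {
    let n = 42;
    println!("{}", read_via_reference(&n));
}
```

Here the SAFETY comment can be checked line by line against the function signature, which is what makes it a proof rather than a hope.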

The optimizer trusts your unsafe blocks. If you lie, the compiler will use that lie to delete the safety checks you thought were there.
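
Of the blockers listed above, dynamic dispatch is the easiest to see in code. The sketch below uses a made-up `Shape` trait to compare a function that takes trait objects with a generic one. The generic version is monomorphized, so the `area` call has a known target that the optimizer can inline.

```rust
/// Hypothetical trait used only for this comparison.
trait Shape {
    fn area(&self) -> f64;
}

struct Square(f64);

impl Shape for Square {
    fn area(&self) -> f64 {
        self.0 * self.0
    }
}

/// Dynamic dispatch: each area call goes through a vtable,
/// so the callee is generally unknown at compile time.
fn total_area_dyn(shapes: &[Box<dyn Shape>]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}

/// Static dispatch: one copy is generated per concrete S,
/// so the area call can be inlined into the loop.
fn total_area_static<S: Shape>(shapes: &[S]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}

fn main() {
    let boxed: Vec<Box<dyn Shape>> = vec![Box::new(Square(2.0)), Box::new(Square(3.0))];
    let plain = vec![Square(2.0), Square(3.0)];
    println!("{} {}", total_area_dyn(&boxed), total_area_static(&plain));
}
```

Trait objects remain the right tool when a collection must hold mixed types. The trade-off only matters on hot paths.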

Configuration knobs

You control optimization intensity through the Cargo.toml profile settings. The main knob is opt-level. It accepts integers or strings.

  • 0: No optimization. Fast compilation. Accurate debugging.
  • 1: Basic optimization. Good balance for development.
  • 2: Standard optimization. Most of the speed of level 3 with somewhat shorter compile times.
  • 3: Aggressive optimization. Default for release. Enables more inlining and vectorization. May increase compile time and binary size.
  • "s": Optimize for size. Reduces binary size at the cost of some speed.
  • "z": Optimize for size aggressively. Like "s", but also turns off loop vectorization. Usually the smallest binary and the slowest code. It does not enable link-time optimization; lto is a separate setting.

Other settings affect optimization quality:

  • lto: Link-time optimization. Allows the compiler to optimize across crate boundaries. Functions from dependencies can be inlined. Increases compile time significantly.
  • codegen-units: Number of parallel compilation units. More units speed up compilation but reduce optimization quality. The optimizer works on smaller chunks of code. Setting this to 1 maximizes optimization but serializes compilation.
  • debug: Include debug symbols. Useful for profiling release builds. Does not affect runtime performance.

A release profile tuned for maximum speed might look like this:
[profile.release]
# Maximum speed optimization.
opt-level = 3
# Optimize across crate boundaries.
lto = true
# Single compilation unit for best optimization quality.
codegen-units = 1
# Keep debug info for profiling.
debug = true

The community convention for production builds is opt-level = 3 with lto = true and codegen-units = 1 when binary size or performance is critical. For larger projects, codegen-units = 1 can make compilation very slow. You may need to balance compile time and optimization quality. Use lto = "thin" for a middle ground that provides some cross-crate optimization without the full cost.
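
As a sketch of that middle ground, a profile along these lines keeps cross-crate optimization while leaving parallel code generation alone. The values are a starting point rather than a recommendation, and the compile-time savings depend on the project.

```toml
[profile.release]
opt-level = 3
# Thin LTO: some cross-crate optimization at a lower compile-time cost than lto = true.
lto = "thin"
# codegen-units is left at Cargo's default so crates still compile in parallel.
# Keep debug info for profiling.
debug = true
```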

Configuration is a trade-off. Pick the knobs that match your constraints.

Decision matrix

  • Use opt-level = 0 when you are debugging and need stack traces that match your source code line-for-line.
  • Use opt-level = 3 when you are building for production and want maximum execution speed.
  • Use opt-level = "s" when binary size matters more than raw speed, as on embedded targets or WebAssembly.
  • Use opt-level = "z" when you are hitting size limits and can tolerate slower runtime for the smallest possible binary.
  • Reach for lto = true when you want the compiler to optimize across crate boundaries, inlining functions from dependencies.
  • Reach for codegen-units = 1 when you want the best possible optimization quality and can tolerate significantly longer compile times.
  • Reach for debug = true in release builds when you need to profile or debug a release binary without losing symbol information.

Where to go next