When crate boundaries hide inefficiencies
You just finished building a command-line tool in Rust. It passes all tests. You run cargo build --release and wait for the compiler to finish. The binary lands at eleven megabytes. You run it, and it feels sluggish compared to a C equivalent that does the same work. You read the Cargo documentation and spot a flag called lto. You flip it on, recompile, and suddenly the binary drops to four megabytes and executes noticeably faster. You did not touch a single line of code. The compiler simply changed when it does its work.
The house contractor analogy
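Rust compiles your code in isolated units called crates. Each crate gets its own compilation pass, and the optimizer mostly works inside those walls. Generic functions and functions marked #[inline] can be instantiated in the calling crate, but ordinary non-generic code stays behind the boundary. That isolation lets Cargo build independent crates in parallel and reuse them between builds. It also creates blind spots.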
Imagine you are building a house with three separate contractors. One builds the kitchen, one builds the living room, one builds the hallway. Each contractor optimizes their own space. They might leave a door in the wall between the kitchen and living room, even though you never planned to use it. They might install duplicate plumbing lines because they never coordinated. The house works, but it is inefficient.
Link-Time Optimization is the final inspector who walks through the entire house before you move in. They tear down unnecessary walls, merge duplicate pipes, and reroute traffic for the shortest path. In Rust, that inspector is LLVM, invoked by rustc while it builds the final binary. LTO lets it look at the intermediate representation of every crate, optimize across the boundaries, and emit the final machine code before the system linker stitches the pieces together.
A minimal cross-crate example
Start with a simple two-crate setup. A library crate exports a tiny helper function. A binary crate calls it.
// lib.rs in math_utils crate
/// Returns the square of a number.
pub fn square(n: i32) -> i32 {
    // Simple multiplication, but lives in a separate crate.
    n * n
}
// main.rs in your binary crate
use math_utils::square;
fn main() {
    // Call the function across the crate boundary.
    let result = square(5);
    // Print the outcome to verify correctness.
    println!("{}", result);
}
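Without LTO, and without an #[inline] hint on square, the compiler typically generates a standard function call. The CPU jumps to the square address, executes the multiplication, and jumps back. That jump costs cycles, and it keeps the optimizer from folding square(5) into a constant. Recent compilers may make very small functions like this one inlinable across crates on their own, so treat the example as a stand-in for any non-generic function too large for that shortcut. The function also tends to stay in the binary as a separate symbol even if you only call it once.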
Enable LTO in your Cargo.toml:
[profile.release]
# Asks Cargo to build with LTO; true selects full ("fat") LTO.
lto = true
Rebuild. The compiler now treats the entire project as a single optimization unit. It sees that square is only called once with a constant value. It inlines the multiplication directly into main, removes the function call overhead, and drops the standalone square symbol entirely. The binary shrinks. The execution path shortens.
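One way to get a feel for the difference without setting up two crates is to simulate it in a single file. In the sketch below, #[inline(never)] plays the role of a boundary the optimizer cannot cross, and #[inline(always)] plays the role of a body the optimizer can see. Both function names are made up for illustration, and the attributes are only a stand-in for what LTO decides on its own.

```rust
// Single-file sketch: the inline attributes stand in for what the optimizer
// can or cannot do across a real crate boundary.

/// Hypothetical stand-in for a call the optimizer cannot see through.
#[inline(never)]
fn square_opaque(n: i32) -> i32 {
    n * n
}

/// Hypothetical stand-in for a function whose body is visible at the call site.
#[inline(always)]
fn square_visible(n: i32) -> i32 {
    n * n
}

fn main() {
    // Both calls compute the same value; only the generated code differs.
    let a = square_opaque(5);
    let b = square_visible(5);
    assert_eq!(a, b);
    println!("{} {}", a, b); // prints "25 25"
}
```

The observable behavior is identical either way. What changes is whether a call instruction survives in the optimized build, which you can check by inspecting the assembly of a release build.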
Stop treating crate boundaries as optimization walls. Let the optimizer see the whole picture.
What actually happens under the hood
Here is what happens during a normal build. rustc translates each crate into object files. Those files contain machine code and a table of symbols. The linker stitches them together, resolves addresses, and produces the executable. The optimization phase is already over. The linker only knows how to glue things together.
With LTO enabled, the pipeline changes. Alongside or instead of machine code, rustc stores LLVM bitcode, the serialized form of LLVM's intermediate representation, for each crate. When it builds the final binary, rustc loads the bitcode of every crate and runs LLVM's optimizer across the entire graph. The optimizer sees the full call tree. It inlines functions, eliminates dead code, merges identical constants, and reorders instructions for better CPU pipeline usage. After the cross-crate optimization finishes, LLVM compiles everything to machine code, and only then does the system linker run. Despite the name, the linker itself performs the optimization only in the separate linker-plugin LTO mode (-C linker-plugin-lto).
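The dead-code part of that process can be pictured as a reachability walk over the whole program's call graph. The sketch below is a toy model under that assumption, not how LLVM implements it. The graph, the app::main root, and the math_utils::cube function are hypothetical and exist only for illustration.

```rust
use std::collections::{BTreeSet, HashMap};

/// Returns every function reachable from `root` in a call graph.
fn reachable<'a>(calls: &HashMap<&'a str, Vec<&'a str>>, root: &'a str) -> BTreeSet<&'a str> {
    let mut seen = BTreeSet::new();
    let mut stack = vec![root];
    while let Some(func) = stack.pop() {
        // Only expand a function the first time we meet it.
        if seen.insert(func) {
            if let Some(callees) = calls.get(func) {
                stack.extend(callees.iter().copied());
            }
        }
    }
    seen
}

/// Hypothetical call graph spanning a binary crate and a library crate.
fn toy_graph() -> HashMap<&'static str, Vec<&'static str>> {
    HashMap::from([
        ("app::main", vec!["math_utils::square"]),
        ("math_utils::square", vec![]),
        // Exported by the library but never called by the binary.
        ("math_utils::cube", vec!["math_utils::square"]),
    ])
}

fn main() {
    let calls = toy_graph();
    let live = reachable(&calls, "app::main");
    let dead: Vec<_> = calls.keys().filter(|f| !live.contains(*f)).collect();
    println!("live: {:?}", live);
    println!("dead: {:?}", dead);
}
```

A per-crate build has to keep every public function of a library because it cannot know who will call it. A whole-program view can see that nothing reaches math_utils::cube and drop it.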
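Be careful with the values. In Cargo, lto = true is the same as lto = "fat": every crate is merged into one LLVM module and optimized as a whole, which gives the optimizer the most freedom but runs largely single-threaded and can stall on large projects. lto = "thin" requests thin LTO, which keeps modules separate, shares summaries between them, and spreads the work across multiple threads, usually getting close to fat LTO at a fraction of the build time. Leaving the setting at its default, lto = false, does not switch LTO off completely: rustc still runs "thin local" LTO across the codegen units of each individual crate, and lto = "off" disables even that. A common choice is to start with "thin" and move to "fat" only if measurements justify it. Another convention worth knowing: LTO is decided by the profile of the package you are building, normally the workspace root that produces the final binary. Profile settings in a dependency's Cargo.toml are ignored, so a library crate gains nothing by setting lto for its users. Downstream users apply their own optimization settings in their final build.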
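As a quick reference, the values Cargo accepts for this setting can be summarized in one fragment. The comments paraphrase the Cargo documentation; the choice of "thin" on the last line is only an example, not a recommendation for every project.

```toml
[profile.release]
# lto = false    default: "thin local" LTO across the codegen units of each crate only
# lto = "thin"   thin LTO across the whole dependency graph
# lto = true     same as "fat": whole-program LTO in a single module
# lto = "fat"    explicit spelling of the above
# lto = "off"    no LTO at all, not even within a crate
lto = "thin"
```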
Trust the optimizer to do the heavy lifting. Keep your library crates free of profile tweaks and let the final binary's profile decide the optimization strategy.
Real-world impact on larger projects
Real projects rarely consist of two files. They pull in dozens of dependencies: serialization libraries, HTTP clients, logging frameworks. Each dependency is compiled and optimized on its own. Without LTO, you can end up with duplicate copies of common utility functions, library code that your program never reaches but that still gets compiled, and function call overhead across every boundary that was not already inlinable.
Consider a web server that uses serde and serde_json for JSON parsing and tokio for its async runtime. These crates contain many small helper functions, and many of those helpers are called from only one or two places in your codebase. LTO has no runtime profile to tell hot paths from cold ones, but with the whole program in view its inlining heuristics can pull small helpers directly into your request handlers and discard library functions that nothing calls. The result is often a tighter binary with fewer calls and less instruction fetch overhead, though the size of the gain varies from project to project.
LTO also interacts with codegen-units. By default, rustc splits each crate into multiple codegen units to speed up builds. Each unit gets optimized independently. LTO lets the optimizer work across those units again at the end of the build, undoing much of the fragmentation: fat LTO merges them into one module, while thin LTO keeps them separate and shares information between them. You get the fast parallel compilation of multiple units, followed by a more thorough whole-program optimization.
The trade-off is compile time. LTO moves work from the per-crate compilation phase to the final build step, where everything has to be optimized together. Memory usage spikes. That final step can take longer than the rest of the compilation combined. You pay for the smaller binary and faster runtime with slower builds.
Measure before you optimize. If your binary is already fast enough and your CI pipeline is tight, skip LTO. If size and speed matter more than build minutes, turn it on.
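Measuring the size side of that trade-off needs very little tooling. The helper below is a minimal sketch: it takes the paths of two builds of the same binary, one without LTO and one with, and reports the reduction. The function names are hypothetical, and when no paths are given it falls back to the illustrative eleven and four megabyte figures from the opening anecdote.

```rust
use std::{env, fs, io};

/// Percentage by which `after` is smaller than `before` (negative if it grew).
fn reduction_percent(before: u64, after: u64) -> f64 {
    if before == 0 {
        return 0.0;
    }
    (before as f64 - after as f64) / before as f64 * 100.0
}

/// Size of a file on disk, in bytes.
fn file_size(path: &str) -> io::Result<u64> {
    Ok(fs::metadata(path)?.len())
}

fn main() -> io::Result<()> {
    let args: Vec<String> = env::args().skip(1).collect();
    let (before, after) = match args.as_slice() {
        // Two paths given: baseline binary first, LTO binary second.
        [baseline, lto] => (file_size(baseline)?, file_size(lto)?),
        // Otherwise use the illustrative sizes from the opening anecdote.
        _ => (11_000_000, 4_000_000),
    };
    println!(
        "before: {} bytes, after: {} bytes, reduction: {:.1}%",
        before,
        after,
        reduction_percent(before, after)
    );
    Ok(())
}
```

Size is only half of the picture. Runtime still has to be measured with a benchmark that reflects your real workload, and build time with your actual CI configuration.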
Where LTO bites back
LTO is not a magic bullet. It introduces friction in specific scenarios.
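Cross-compilation and mixed-language builds can break when the linker itself has to handle LLVM bitcode, which is the case with linker-plugin LTO (-C linker-plugin-lto). Some embedded targets or older toolchains ship linkers that expect traditional object files or a different LLVM version, and the link fails with a format mismatch error. Switching to lto = "thin" does not help there. Drop the linker plugin so that rustc performs LTO itself, or disable LTO for those targets.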
Debugging becomes harder. When the optimizer inlines functions across crate boundaries, the inlined functions no longer have stack frames of their own, and symbols get merged or dropped. If you rely on debuggers to step through third-party code, LTO will make the experience frustrating. Keep debug info with debug = true in your release profile so that tools can often reconstruct inlined frames, but expect flattened call stacks and execution that jumps around when you single-step.
Memory limits are real. With fat LTO in particular, the whole program's IR is loaded into RAM at once. Large monorepos or projects with heavy dependency trees can trigger out-of-memory kills during the final build step. If your CI runner dies while building the last crate, LTO is a likely culprit.
Optimization settings can also work against each other. Aggressive inlining combined with LTO can sometimes bloat the binary if the optimizer decides to duplicate large functions instead of calling them. You can rein this in by choosing a size-oriented opt-level such as "s" or "z", and you can trim the result further with strip = "symbols", which removes symbols and debug info from the final binary.
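A size-focused release profile along those lines might look like the fragment below. Treat it as a starting point under the assumption that binary size matters more than peak speed; whether it actually comes out smaller, and how much speed it costs, depends on the project.

```toml
[profile.release]
lto = true          # full ("fat") LTO
opt-level = "s"     # optimize for size; "z" also turns off loop vectorization
strip = "symbols"   # remove symbols and debug info from the final binary
```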
Watch your memory usage during the final build step. If it runs out of RAM, drop back to thin LTO or increase your build runner limits.
Choosing your optimization strategy
Use lto = "thin" when you ship a final binary and want the best balance of size reduction, runtime speed, and acceptable compile times; its multi-threaded optimization keeps build times predictable on large dependency graphs. Use lto = "fat", or its synonym lto = true, when you are squeezing every last cycle out of a performance-critical binary and can afford a largely single-threaded final step that might take several minutes. Do not bother setting lto in a library crate, because Cargo ignores a dependency's profile and your downstream users will apply their own optimization settings in their final build. Skip LTO when a target's toolchain cannot cope with it or when debugging third-party code requires intact stack frames. Skip LTO when your CI pipeline has strict time limits and the binary size savings do not justify the extra build minutes.
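The same decision can be written down as a small function, which makes the priority order explicit. This is a sketch of one reasonable ordering, assuming Cargo's documented mapping in which lto = true means "fat". The Constraints fields and the suggest_lto name are hypothetical.

```rust
/// The three outcomes discussed in this section.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lto {
    Default,
    Thin,
    Fat,
}

impl Lto {
    /// The matching line for a Cargo profile.
    fn toml_line(self) -> &'static str {
        match self {
            Lto::Default => "lto = false",
            Lto::Thin => "lto = \"thin\"",
            Lto::Fat => "lto = \"fat\"",
        }
    }
}

/// Hypothetical summary of a project's deployment constraints.
#[derive(Debug, Clone, Copy, Default)]
struct Constraints {
    /// Debugging third-party code requires intact stack frames.
    needs_intact_frames: bool,
    /// CI has strict time limits that extra build minutes would break.
    strict_build_time_limit: bool,
    /// A final build step of several minutes is acceptable.
    can_afford_long_link: bool,
}

fn suggest_lto(c: Constraints) -> Lto {
    if c.needs_intact_frames || c.strict_build_time_limit {
        // Reasons to skip LTO take priority over performance wishes.
        Lto::Default
    } else if c.can_afford_long_link {
        Lto::Fat
    } else {
        Lto::Thin
    }
}

fn main() {
    let c = Constraints { can_afford_long_link: true, ..Constraints::default() };
    println!("{}", suggest_lto(c).toml_line());
}
```

Putting the "skip" conditions first reflects the idea that a build you cannot debug or cannot finish in time is a harder constraint than a few percent of runtime.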
Pick the setting that matches your deployment constraints. The compiler will do the rest.