Cache-Friendly Data Structures in Rust

Use contiguous structures like Vec<T> instead of pointer-heavy types to improve CPU cache performance.

Cache-Friendly Data Structures in Rust

You spent three hours building a custom event log for your application. You used a doubly-linked list because your computer science textbook said linked lists are great for insertions. The code compiles. The logic is sound. You run it on a million records, and the stopwatch hits four seconds. You switch the storage to a Vec, keep the exact same iteration logic, and the time drops to two hundred milliseconds. The algorithm didn't change. The data layout did.

Rust gives you control over memory layout. That control is a superpower, but it comes with a responsibility. If you scatter your data across the heap, the CPU spends more time waiting for memory than doing work. Cache-friendly structures pack data so the hardware can fetch it in big, efficient chunks. Cache-unfriendly structures force the CPU to make thousands of tiny, expensive trips to RAM.

The CPU cache and spatial locality

Your processor has a tiny, super-fast scratchpad called the cache. It sits right next to the cores. Main memory (RAM) is huge but slow by comparison. When the CPU needs a value, it checks the cache first. If the value is there, access takes a few nanoseconds. If it is not, the CPU fetches data from RAM, which can take hundreds of nanoseconds. That gap is where performance lives or dies.

The cache does not fetch single values. It fetches blocks called cache lines, typically 64 bytes. When the CPU asks for byte 100, the memory controller loads bytes 96 through 159 into the cache. This works beautifully if your data is packed together. It fails if your data is scattered.

Spatial locality is the principle that if you access one piece of data, you are likely to access nearby data soon. A Vec<T> has perfect spatial locality. The elements sit next to each other in memory. When the CPU loads one element, it loads fifteen others for free. A linked list has terrible spatial locality. Each node is a separate allocation. The nodes can end up anywhere on the heap. Loading one node tells the CPU nothing about where the next node lives.

Think of a chef cooking a stew. All ingredients are in one bowl on the counter. The chef grabs a spoon, stirs, and moves on. Now imagine the ingredients are in separate jars scattered across three different pantries. The chef runs to pantry A for salt, pantry B for pepper, pantry C for onions. The cooking takes the same time. The running kills the throughput. The CPU is the chef. The cache is the counter. RAM is the pantry.

Contiguous memory in action

The standard library gives you tools for both patterns. The difference shows up immediately in iteration speed.

use std::collections::LinkedList;

fn main() {
    // Vec stores elements in a single contiguous block.
    // The CPU prefetcher sees the sequential pattern and loads the next cache line automatically.
    let vec_data: Vec<i32> = (0..100_000).collect();
    let mut vec_sum = 0;
    for item in &vec_data {
        vec_sum += item;
    }
    println!("Vec sum: {}", vec_sum);

    // LinkedList scatters nodes across the heap.
    // Each node is a separate allocation. The CPU cannot predict where the next node lives.
    let mut list_data = LinkedList::new();
    for i in 0..100_000 {
        list_data.push_back(i);
    }
    let mut list_sum = 0;
    for item in &list_data {
        list_sum += item;
    }
    println!("List sum: {}", list_sum);
}

The Vec version streams through memory. The hardware prefetcher detects the stride and keeps the cache fed ahead of the loop. The LinkedList version chases pointers. Each node contains a value and a pointer to the next node. To get the next value, the CPU must load the node, read the pointer, and jump to a random address. That address is likely not in the cache. The CPU stalls while RAM delivers the data.

Pointer chasing is the enemy of performance. Every dereference is a gamble. If the data is hot, you win. If it is cold, you pay the penalty. In a linked list, the data is almost always cold.

Structure of Arrays vs Array of Structures

Contiguous memory helps, but packing data intelligently helps more. Consider a physics simulation with thousands of particles. You need to update positions every frame.

The naive approach is an Array of Structures. You define a struct with all fields and put them in a vector.

struct ParticleAoS {
    x: f32,
    y: f32,
    z: f32,
    mass: f32,
    color: [u8; 4],
    // ... other fields
}

fn update_positions_aos(particles: &mut [ParticleAoS]) {
    for p in particles.iter_mut() {
        // We only need x, y, z for this step.
        p.x += 1.0;
        p.y += 1.0;
        p.z += 1.0;
    }
}

This works. The struct is contiguous. But look at the cache line. A cache line holds 64 bytes. ParticleAoS is roughly 24 bytes. One cache line holds two particles. When you update x, y, z, you load the cache line. That line also contains mass and color. You loaded data you do not need. You wasted bandwidth. If the struct grows, you might fit only one particle per line. The waste doubles.

The Structure of Arrays pattern separates the fields. You store each field in its own vector.

struct ParticleSoA {
    xs: Vec<f32>,
    ys: Vec<f32>,
    zs: Vec<f32>,
    masses: Vec<f32>,
    colors: Vec<[u8; 4]>,
}

fn update_positions_soa(particles: &mut ParticleSoA) {
    // Access only the hot data.
    // The cache line fills with x coordinates, not mass or color.
    for x in particles.xs.iter_mut() {
        *x += 1.0;
    }
    for y in particles.ys.iter_mut() {
        *y += 1.0;
    }
    for z in particles.zs.iter_mut() {
        *z += 1.0;
    }
}

Now the cache line is pure signal. Every byte loaded is used. You get four times more work per cache fetch for the position data. The mass and color vectors stay in RAM until you need them. This is how high-performance engines process millions of entities.

Rust does not force you to choose. You can wrap the SoA in a clean API. The caller sees a Particle interface. The implementation uses separate vectors. The convention is to keep the SoA internal. Expose methods that operate on the hot fields. Hide the layout details.

Pitfalls and compiler hints

Cache issues are silent. The compiler does not warn you about cache misses. You have to measure. But some patterns trigger compiler errors that hint at layout problems.

You want a list of shapes. You define a trait and try to store them in a vector.

trait Shape {
    fn area(&self) -> f64;
}

struct Circle { radius: f64 }
impl Shape for Circle {
    fn area(&self) -> f64 { 3.14 * self.radius * self.radius }
}

fn main() {
    // This fails. The compiler does not know the size of a trait object.
    // let shapes: Vec<Shape> = vec![Circle { radius: 1.0 }];
}

The compiler rejects this with E0277 (the trait Shape has no size known at compile-time). You fix it by boxing.

fn main() {
    let shapes: Vec<Box<dyn Shape>> = vec![Box::new(Circle { radius: 1.0 })];
    for shape in &shapes {
        let _ = shape.area();
    }
}

This compiles. You lost cache locality. The vector holds pointers. The actual shapes are scattered on the heap. Iterating the list chases pointers. You also paid for heap allocations and virtual dispatch.

Use an enum instead. Enums pack variants into a single contiguous value.

enum Shape {
    Circle { radius: f64 },
    Rect { width: f64, height: f64 },
}

impl Shape {
    fn area(&self) -> f64 {
        match self {
            Shape::Circle { radius } => 3.14 * radius * radius,
            Shape::Rect { width, height } => width * height,
        }
    }
}

fn main() {
    // Contiguous storage. No pointers. No heap allocations for the shapes.
    let shapes: Vec<Shape> = vec![
        Shape::Circle { radius: 1.0 },
        Shape::Rect { width: 2.0, height: 3.0 },
    ];
    for shape in &shapes {
        let _ = shape.area();
    }
}

The vector stores the shapes directly. The layout is compact. The CPU streams through the data. The compiler can even inline the match. This is faster and more cache-friendly than trait objects.

Clippy also watches for layout mistakes. The clippy::vec_box lint fires when you put boxed items in a vector.

fn main() {
    // Clippy warns: Vec<Box<T>> is usually inefficient.
    // It suggests Vec<T> or Box<[T]>.
    let items: Vec<Box<i32>> = vec![Box::new(1), Box::new(2)];
}

The lint exists because Vec<Box<T>> defeats the purpose of the vector. You get the overhead of boxing plus the overhead of indirection. Reach for Vec<T> unless you have a specific reason to box.

Convention aside: Rust aligns fields automatically to optimize access. You rarely need #[repr(packed)]. If you pack a struct, you risk unaligned access penalties on some architectures. Trust the compiler's alignment. If you need to interface with a specific binary format, use #[repr(C)] or #[repr(packed)] explicitly, but measure the impact.

When to use what

Choose data structures based on access patterns, not textbook definitions. The decision matrix follows.

Use Vec<T> when you iterate over data, append frequently, or need random access by index. The contiguous layout feeds the CPU prefetcher and minimizes cache misses.

Use LinkedList<T> only when you must splice segments of the list in constant time and never iterate the whole list. In Rust, LinkedList is rarely the right choice; Vec with splice or a crate like circular-queue usually wins.

Use Structure of Arrays when you process specific fields of many records in a tight loop. Separate the hot data from the cold data to maximize cache line utilization.

Use enums for polymorphism when performance matters. An enum packs variants into a single contiguous value, avoiding the pointer indirection of Box<dyn Trait> that scatters data across the heap.

Use Box<T> when you need to move a large value across a function boundary without copying, or when the size is dynamic. Accept the cache penalty if the allocation happens once and the data is accessed rarely.

Use HashMap<K, V> when you need fast lookups by key and iteration order does not matter. Accept that iteration is random access. If you iterate the map often, copy the keys or values to a Vec first.

Use arrays ([T; N]) when the size is known at compile time and small. Arrays live on the stack or inline in structs. They are the ultimate cache-friendly structure.

Where to go next

Data layout is an algorithm. Measure your hot loops. Pack your data. Trust the prefetcher.