How to use the rayon crate for parallelism in Rust

When one core isn't enough

You have a vector of a million integers. You need to square them and sum them up. In Python, you write a loop and hope the GIL doesn't slow you down, or you spin up a process pool and manage serialization. In JavaScript, you block the event loop until the work finishes. You want the CPU cores to scream, but you don't want to write thread management code. You just want to say "do this in parallel" and get the result.

Rayon gives you data parallelism with the ergonomics of iterators. You write code that looks sequential, and Rayon splits the work across threads automatically. It handles thread pooling, load balancing, and result merging behind the scenes. You focus on the logic. Rayon focuses on the cores.

How Rayon works

Rayon uses a work-stealing scheduler. Think of a busy kitchen with multiple chefs. The head chef has a huge pile of dishes to prep. Instead of doing them all, the chef splits the pile and hands half to a sous-chef. The sous-chef splits their half and hands a chunk to another cook. This continues until the chunks are small enough to handle quickly.

If one cook finishes their stack early, they don't sit idle. They look at the other stations. If they see a neighbor with a huge backlog, they steal a few dishes from the bottom of that neighbor's stack. The busy cook keeps working on the top, the idle cook grabs from the bottom. Work gets balanced automatically without a central manager shouting orders.

Rayon does this with tasks. par_iter() itself is lazy: it only describes the work. When you call a consuming method such as sum() or collect(), Rayon splits the collection into chunks, and each chunk becomes a sub-task. Threads pick up tasks from their own queues and steal from other queues when idle. While a chunk is still large, Rayon splits it further. This recursive splitting continues until chunks are small enough that further splitting would cost more than it saves. Once chunks are done, Rayon combines the partial results into the final value.

The key insight is that a parallel reduction needs an associative operation. sum() works because addition is associative: you can sum chunks independently and add the partial sums at the end, and the grouping doesn't change the answer. Rayon relies on this property to parallelize safely.
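To see why associativity matters, here is a small single-threaded sketch that mimics the shape of a parallel reduction: it reduces fixed-size chunks independently, then combines the partial results. The helper name chunked_reduce and the chunk size of 4 are made up for illustration and are not part of Rayon.

```rust
// Reduce each chunk on its own, then combine the partial results.
// This mirrors the shape of a parallel reduction but runs on one thread.
// Assumes chunk > 0. Returns None for empty input.
fn chunked_reduce(data: &[i64], chunk: usize, op: fn(i64, i64) -> i64) -> Option<i64> {
    data.chunks(chunk)
        .filter_map(|c| c.iter().copied().reduce(op))
        .reduce(op)
}

fn add(a: i64, b: i64) -> i64 {
    a + b
}

fn sub(a: i64, b: i64) -> i64 {
    a - b
}

fn main() {
    let data: Vec<i64> = (1..=8).collect();

    // Addition is associative: chunking does not change the answer.
    let seq_add = data.iter().copied().reduce(add);
    println!("add: sequential {:?}, chunked {:?}", seq_add, chunked_reduce(&data, 4, add));

    // Subtraction is not: regrouping produces a different answer.
    let seq_sub = data.iter().copied().reduce(sub);
    println!("sub: sequential {:?}, chunked {:?}", seq_sub, chunked_reduce(&data, 4, sub));
}
```

With subtraction, the sequential and chunked results disagree, which is the kind of silent error a non-associative operation would cause in a parallel reduction.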

Minimal example

Add Rayon to your dependencies and swap iter for par_iter. The code structure stays the same. The execution changes.

[dependencies]
rayon = "1.10"

use rayon::prelude::*;

fn main() {
    // Create a large vector to justify parallel overhead.
    // Small collections run faster sequentially.
    // i64 because the total (4_999_950_000) does not fit in an i32.
    let data: Vec<i64> = (0..100_000).collect();

    // par_iter() replaces iter(). Rayon splits the slice
    // and distributes chunks to the thread pool.
    let sum: i64 = data.par_iter().sum();

    println!("Sum: {}", sum);
}

The rayon::prelude::* import is the community standard. It brings the parallel iterator traits into scope, which is what makes par_iter, into_par_iter, and the other parallel methods available. You'll see this import in almost every Rayon example. Stick with it.

Swap iter for par_iter and let Rayon handle the rest.

What happens under the hood

Rayon keeps a global thread pool. It is created lazily, the first time you run a parallel operation, and by default it has one thread per logical CPU core (the RAYON_NUM_THREADS environment variable overrides this). Once created, the pool lives for the rest of the program. Rayon reuses these threads for every parallel operation. You don't pay thread creation costs on every call.

Calling par_iter() only builds a description of your iterator chain. The work starts when you call a consuming method such as sum() or collect(). At that point the job is handed to the thread pool. A thread picks up the job and starts processing. If the job represents a large collection, Rayon splits it. The split creates two sub-jobs. One stays with the current thread, the other gets pushed to a queue where other threads can steal it.

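This splitting happens recursively. Rayon decides adaptively when to stop: once chunks are small enough, or enough of them exist to keep every thread busy, splitting stops and threads process the data directly. The heuristic is tuned to balance parallelism against overhead, and for indexed iterators you can adjust it with with_min_len() and with_max_len().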

Once chunks are processed, Rayon reduces the results. For sum(), each chunk yields a partial sum. Rayon combines these partial sums into a final result. Because the grouping of partial results is not fixed, the combining operation has to be associative, a constraint sequential iterators don't have.

If you use into_par_iter() instead of par_iter(), Rayon takes ownership of the collection. This allows parallel operations that consume the data. The choice between borrow and move follows the same rules as sequential iterators.

Trust the borrow checker here too. If your data can't be safely shared or moved across threads, the code won't compile.

Realistic example

Parallel iterators compose just like sequential ones. You can chain filter, map, fold, and other adapters. Rayon parallelizes the entire chain.

use rayon::prelude::*;

struct Image {
    id: u32,
    brightness: f32,
}

fn process_images(images: &[Image]) -> Vec<u32> {
    // Filter bright images, extract IDs, collect results.
    // Rayon parallelizes filter and map automatically.
    images.par_iter()
        .filter(|img| img.brightness > 0.7)
        .map(|img| img.id)
        .collect()
}

Rayon's collect preserves order when collecting into a Vec. The resulting vector has elements in the same order as the original collection, even though threads finish their chunks at different times. For an indexed chain, such as a map over a slice, each thread writes straight into its own region of the output. After a filter, the output length isn't known up front, so Rayon gathers intermediate buffers and joins them in order.

Rayon has no collect_unchecked, and there is no collect variant that trades ordering for speed. collect_into_vec does exist, but it also preserves order. It works on indexed iterators (so no filter in the chain) and writes into a Vec you already own, which lets you reuse the allocation across calls.

// Reuse an existing buffer. Needs an indexed iterator, so no filter here.
let mut ids: Vec<u32> = Vec::new();
images.par_iter()
    .map(|img| img.id)
    .collect_into_vec(&mut ids);

The convention is simple. Use collect by default. Reach for collect_into_vec when the chain is indexed and you want to reuse a buffer. Profile to confirm the gain.

Ordering comes with collect, so don't go hunting for an unordered variant.

Pitfalls and gotchas

Parallelism introduces new failure modes. Rayon makes parallelism easy, but it doesn't make it free.

Overhead kills small workloads. Splitting work, coordinating threads, and merging results takes time. If your operation is fast or your dataset is small, sequential code wins. Rayon adapts how finely it splits, but it can't make the overhead disappear. A loop over ten cheap elements will almost certainly be slower in parallel. Measure before you parallelize.

Types must be Send or Sync. Rayon moves and shares data between threads. into_par_iter() hands items to other threads by value, so the item type must be Send. par_iter() hands out shared references, so the item type must be Sync. If you try to parallelize a collection containing Rc<T> or raw pointers, which are neither, the compiler rejects the code with an unsatisfied trait bound error. Rc is not thread-safe. Use Arc instead. The same rule applies to any data you capture in closures.

use rayon::prelude::*;
use std::rc::Rc;

fn main() {
    let data: Vec<Rc<String>> = vec![Rc::new("hello".into())];

    // This fails to compile. Rc<String> is neither Send nor Sync,
    // so the trait bounds that par_iter() needs are not satisfied.
    let _ = data.par_iter().count();
}

Replace Rc with Arc to fix the error. Arc provides atomic reference counting safe for multi-threaded use.

Floating point precision changes. Addition is not strictly associative for floating point numbers due to rounding errors. Parallel sums may produce slightly different results than sequential sums. The difference is usually negligible, but it exists. If you need bitwise reproducibility, stick to sequential code or use a deterministic reduction strategy.
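The effect is easy to reproduce without any threads. The sketch below adds the same three f64 values with two different groupings. The function names are made up for the example.

```rust
// Group the additions from the left: (a + b) + c.
fn left_grouping(a: f64, b: f64, c: f64) -> f64 {
    (a + b) + c
}

// Group the additions from the right: a + (b + c).
fn right_grouping(a: f64, b: f64, c: f64) -> f64 {
    a + (b + c)
}

fn main() {
    let (a, b, c) = (0.1, 0.2, 0.3);
    let left = left_grouping(a, b, c);
    let right = right_grouping(a, b, c);
    println!("(a + b) + c = {:?}", left);
    println!("a + (b + c) = {:?}", right);
    println!("bitwise equal: {}", left == right);
}
```

A parallel sum regroups additions in the same way, and the grouping can change from run to run, so the last bits of the result can change too.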

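Non-associative operations need care. You can't parallelize arbitrary operations. The operation passed to reduce must be associative, because Rayon decides how partial results are grouped. fold does not lift that requirement. What it gives you is an accumulator whose type can differ from the item type: each chunk of work is folded into its own local accumulator, and you then merge the accumulators with reduce, which must itself be associative. Pushing into a Vec and concatenating the Vecs, as below, meets that rule.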

use rayon::prelude::*;

fn main() {
    let data = vec![1, 2, 3, 4, 5];

    // fold builds a local Vec for each chunk of work
    // (not necessarily one per thread).
    // reduce concatenates the Vecs into one.
    let result: Vec<i32> = data.par_iter()
        .fold(Vec::new, |mut acc, x| {
            acc.push(x * 2);
            acc
        })
        .reduce(Vec::new, |mut acc, mut other| {
            acc.append(&mut other);
            acc
        });
}

The fold/reduce pattern is the escape hatch for complex aggregations. It gives you control over local accumulation and final merging.

Measure first. Parallel code that runs slower than sequential code is a debug session waiting to happen.

Mixing parallel and sequential

Sometimes you need a sequential step after a parallel pipeline. Rayon has no adapter that turns a ParallelIterator back into a sequential Iterator. The usual approach is to collect the parallel results into a Vec and continue with standard iterators from there. par_bridge() goes the other way: it wraps a sequential Iterator so that Rayon can hand its items to the thread pool.

use rayon::prelude::*;

fn main() {
    let data = vec![1, 2, 3, 4, 5];

    // Parallel filter, collected in order.
    let evens: Vec<i32> = data.par_iter()
        .copied()
        .filter(|x| x % 2 == 0)
        .collect();

    // Now we're sequential. Rayon's zip needs an indexed iterator,
    // and the output of filter isn't indexed.
    let result: Vec<i32> = evens.iter()
        .zip(data.iter().skip(1))
        .map(|(a, b)| a + b)
        .collect();

    println!("{:?}", result);
}

Use par_bridge() sparingly too. Items are pulled from the underlying iterator one at a time under synchronization, and the original order is not preserved. Prefer par_iter() on a real collection when you can.

Keep sequential stretches short. Serialization is the enemy of throughput.

When to use Rayon

Rayon shines for data-parallel workloads. It's not a general-purpose threading library. Choose the right tool for the job.

Use Rayon for data-parallel workloads where each item in a collection can be processed independently of the others. Use Rayon when you want to parallelize existing iterator chains by swapping iter for par_iter. Use Rayon when your operation is computationally expensive and the dataset is large enough to amortize overhead. Use rayon::scope when you need to spawn parallel tasks dynamically based on runtime conditions rather than iterating over a collection. Reach for std::thread or an async runtime when you need I/O concurrency or fine-grained control over thread creation and joining. Stick to sequential iterators when the dataset is small or the per-item cost is low enough that thread overhead dominates.

Pick the tool that matches your workload. Rayon handles collections. Threads handle coordination. Async handles I/O.

Where to go next

Rayon is a tool that lets your Rust code run multiple tasks at the same time using all your computer's processor cores. Instead of processing a list of items one by one, it splits the work up so many items are handled simultaneously. Think of it like having a team of workers sort a pile of mail together instead of one person doing it alone.