Rust vs Python for Data Processing

When to Choose Rust

Choose Rust for high-performance, memory-safe data processing and Python for rapid development with rich libraries.

The 20-gigabyte log file

You have a log file that grows by 50 gigabytes a day. Your Python script reads it, filters errors, and aggregates metrics. It works fine until the file hits 20 gigabytes. Then memory usage climbs until the operating system kills the process. You try chunking. You try generators. The code gets messy, and the runtime is still sluggish because the Global Interpreter Lock keeps threads from executing Python bytecode in parallel. You need a tool that handles the data without the overhead, or you need to accept that Python is the wrong hammer for this specific nail.

Python is the fastest way to write code. Rust is the fastest way to run it.

The abstraction tax

Every language charges a tax for convenience. Python's tax is runtime overhead. Rust's tax is compile-time reasoning.

In CPython, the reference implementation, every value is a heap-allocated object. An integer is not a 32-bit number in a register. It is a struct containing a reference count, a type pointer, and the value itself. A list of one million integers is an array of one million pointers, each pointing to an integer object elsewhere on the heap. The interpreter must dereference every pointer, check the type, and update reference counts during iteration. Most objects are freed as soon as their reference count drops to zero. A separate cycle collector runs periodically, triggered by allocation counts rather than memory pressure, and pauses the interpreter while it searches for unreachable reference cycles.

Rust compiles to native machine code. A Vec<i32> stores integers contiguously in memory. There are no pointers to objects. There are no reference counts. The CPU fetches data in cache lines, and the compiler can optimize the loop based on the exact type. Memory is freed the instant the value goes out of scope. There is no garbage collector. There are no pauses.
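To make "freed the instant the value goes out of scope" concrete, here is a minimal sketch built on the Drop trait. The Buffer type and the drop_order function are hypothetical names invented for illustration. The only point is that cleanup runs at a known place, the closing brace of the scope.

```rust
use std::cell::RefCell;

/// Hypothetical resource that records when it is dropped.
struct Buffer<'a> {
    name: &'static str,
    log: &'a RefCell<Vec<String>>,
}

impl Drop for Buffer<'_> {
    fn drop(&mut self) {
        // Runs automatically when the value goes out of scope.
        self.log.borrow_mut().push(format!("dropped {}", self.name));
    }
}

/// Returns the sequence of events, showing that drops happen at scope ends.
fn drop_order() -> Vec<String> {
    let log = RefCell::new(Vec::new());
    {
        let _outer = Buffer { name: "outer", log: &log };
        {
            let _inner = Buffer { name: "inner", log: &log };
            log.borrow_mut().push("inner scope ends".to_string());
        } // _inner is dropped here
        log.borrow_mut().push("outer scope ends".to_string());
    } // _outer is dropped here
    log.into_inner()
}

fn main() {
    for line in drop_order() {
        println!("{}", line);
    }
}
```

Each "dropped" entry appears immediately after its scope's last statement, with no collector deciding when to run.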

The trade-off is control. Python hides memory management. Rust forces you to declare how data flows. You specify ownership, borrowing, and lifetimes. The compiler enforces these rules. If you violate them, the code does not compile.

Rust shifts the complexity from runtime debugging to compile-time reasoning.

Minimal example: Counting items

Compare a simple counting task. Python uses Counter from the standard library. Rust uses a HashMap.

use std::collections::HashMap;

/// Count occurrences of items in a slice.
/// Returns a map from item to count.
fn count_items(data: &[i32]) -> HashMap<i32, usize> {
    let mut counts = HashMap::new();
    // Iterate by reference to avoid moving data.
    for item in data {
        // Entry API avoids double lookup.
        *counts.entry(*item).or_insert(0) += 1;
    }
    counts
}

from collections import Counter

def count_items(data: list[int]) -> dict[int, int]:
    # Counter handles iteration and counting internally.
    return dict(Counter(data))

The Python version is three lines of code. The Rust version is eight. Python wins on brevity. Counter's counting loop is implemented in C, so it is fast by Python standards. Rust wins on memory efficiency and predictability. The Rust function takes a slice &[i32], which is a pointer and a length. It does not allocate a new collection for the input. The HashMap grows as needed, and the integer keys and their counts are stored inline in the map's table rather than as separate heap objects.
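Because count_items borrows a slice, it can also be applied to disjoint chunks of the same data on separate threads. The following is a sketch under stated assumptions, not a tuned implementation: count_items_parallel and its workers parameter are hypothetical names, and it relies on std::thread::scope from the standard library (Rust 1.63 or later).

```rust
use std::collections::HashMap;
use std::thread;

/// Same counting function as in the text.
fn count_items(data: &[i32]) -> HashMap<i32, usize> {
    let mut counts = HashMap::new();
    for item in data {
        *counts.entry(*item).or_insert(0) += 1;
    }
    counts
}

/// Count each chunk on its own thread, then merge the partial maps.
/// `workers` is a hypothetical tuning knob for this sketch.
fn count_items_parallel(data: &[i32], workers: usize) -> HashMap<i32, usize> {
    let workers = workers.max(1);
    // Ceiling division so every element lands in a chunk; chunks(0) would panic.
    let chunk_len = ((data.len() + workers - 1) / workers).max(1);

    let partials: Vec<HashMap<i32, usize>> = thread::scope(|s| {
        // Spawn all workers first, then join them.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || count_items(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    });

    let mut total = HashMap::new();
    for partial in partials {
        for (item, n) in partial {
            *total.entry(item).or_insert(0) += n;
        }
    }
    total
}

fn main() {
    let data = vec![1, 2, 2, 3, 3, 3, 4, 4, 4, 4];
    println!("{:?}", count_items_parallel(&data, 3));
}
```

Scoped threads may borrow the slice directly, so no data is copied into the workers. For small inputs the thread startup cost will outweigh any gain.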

Write Python to explore. Write Rust to execute.

Under the hood: Memory and speed

The performance gap widens as data volume increases. Consider a dataset of one million records.

In Python, a list of one million integers consumes roughly 8 megabytes for the list's array of pointers, plus the memory for each integer object. On a 64-bit CPython build, a small integer object takes 28 bytes, so the objects add about 28 megabytes and the total is around 36 megabytes. (CPython shares one cached object for each integer from -5 to 256, so lists full of tiny values cost less.) The data is scattered. The CPU cache cannot hold the entire dataset. Every access might trigger a cache miss, forcing the CPU to wait for RAM.

In Rust, a Vec<i32> of one million elements needs 4 megabytes of element storage: 4 bytes times 1,000,000, plus whatever spare capacity the vector holds. The data is contiguous. The CPU prefetches cache lines automatically. Iteration is a linear scan of memory. The compiler can unroll the loop and use SIMD instructions if the target architecture supports it. The same loop can run one to two orders of magnitude faster in Rust than in pure Python, though the exact gap depends on the workload.
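The arithmetic for the Rust side can be checked directly. This is a small sketch using std::mem::size_of; the helper names element_bytes, is_contiguous, and sum_as_i64 are made up for the example.

```rust
use std::mem::size_of;

/// Bytes occupied by the elements of a slice. This excludes the Vec's own
/// pointer/length/capacity header and any spare capacity.
fn element_bytes<T>(data: &[T]) -> usize {
    data.len() * size_of::<T>()
}

/// Check that neighbouring elements sit exactly size_of::<i32>() bytes apart.
fn is_contiguous(data: &[i32]) -> bool {
    data.windows(2).all(|w| {
        let a = &w[0] as *const i32 as usize;
        let b = &w[1] as *const i32 as usize;
        b - a == size_of::<i32>()
    })
}

/// Linear scan over the data, widening to i64 so the sum cannot overflow.
fn sum_as_i64(data: &[i32]) -> i64 {
    data.iter().map(|&x| i64::from(x)).sum()
}

fn main() {
    let data: Vec<i32> = (0..1_000_000).collect();
    println!("element storage: {} bytes", element_bytes(&data));
    println!("contiguous: {}", is_contiguous(&data));
    println!("sum: {}", sum_as_i64(&data));
}
```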

Cache misses are the silent killer. Rust keeps the cache happy.

Realistic example: Aggregating logs

Data processing often involves parsing structured data. Rust uses serde for serialization. Python uses json or pandas.

use serde::Deserialize;
use std::collections::HashMap;

/// Represents a log event from the system.
#[derive(Deserialize, Debug)]
struct LogEvent {
    level: String,
    message: String,
    timestamp: u64,
}

/// Count the events whose level is "ERROR".
/// Returns a map of level to count; only the "ERROR" key is ever inserted.
fn aggregate_errors(events: Vec<LogEvent>) -> HashMap<String, usize> {
    let mut errors = HashMap::new();
    // Filter and count in a single pass.
    for event in events {
        if event.level == "ERROR" {
            *errors.entry(event.level).or_insert(0) += 1;
        }
    }
    errors
}

The #[derive(Deserialize)] attribute generates code that deserializes input into the struct, field by field, with the expected types fixed at compile time. The input itself can only be checked at runtime: a call such as serde_json::from_str returns a Result, and if the JSON has a missing field or a wrong type, the Err value says which one. Handle that Result instead of unwrapping it.
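One caveat about aggregate_errors: it takes a Vec<LogEvent>, so every event must be in memory before counting starts. A possible variant, sketched here with the standard library only, accepts any iterator so events can be consumed one at a time. The name aggregate_errors_streaming is hypothetical, and the struct omits the serde derive purely so the sketch compiles without external crates.

```rust
use std::collections::HashMap;

/// Same shape as the LogEvent in the text, minus the serde derive.
#[allow(dead_code)]
#[derive(Debug)]
struct LogEvent {
    level: String,
    message: String,
    timestamp: u64,
}

/// Count "ERROR" events from any source of events, one event at a time.
/// Only the current event and the result map are held in memory.
fn aggregate_errors_streaming<I>(events: I) -> HashMap<String, usize>
where
    I: IntoIterator<Item = LogEvent>,
{
    let mut errors = HashMap::new();
    for event in events {
        if event.level == "ERROR" {
            *errors.entry(event.level).or_insert(0) += 1;
        }
    }
    errors
}

fn main() {
    // A lazily generated stream stands in for a parser reading a large file.
    let events = (0..5u64).map(|i| LogEvent {
        level: if i % 2 == 0 { "ERROR".to_string() } else { "INFO".to_string() },
        message: format!("event {}", i),
        timestamp: i,
    });
    println!("{:?}", aggregate_errors_streaming(events));
}
```

A Vec<LogEvent> still works as an argument, since Vec implements IntoIterator, so callers that already hold the data lose nothing.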

Convention aside: In the Rust ecosystem, serde is the universal serialization library. You will see #[derive(Serialize, Deserialize, Debug, Clone)] stacked on data structs. This is idiomatic. It signals that the struct is a data carrier. Keep fields private unless they are part of the public API. Expose getters if you need to control access. For data processing pipelines, public fields are common because the struct is often just a row in a table.

Define the shape. Let the compiler enforce it.

Rust's data ecosystem

Rust is not starting from zero. The data ecosystem is maturing rapidly.

polars is a dataframe library written in Rust. It uses the Apache Arrow memory layout, which stores data in columns rather than rows. Columnar storage is efficient for analytics because operations like sum or filter access contiguous memory. polars supports lazy evaluation, parallel execution, and query optimization. It is often faster than pandas on large datasets. You can call polars from Python, or use it directly in Rust.

ndarray provides n-dimensional arrays similar to numpy. It supports broadcasting, slicing, and linear algebra operations. The API is different from numpy, but the performance is comparable to compiled C code.

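arrow2 is a Rust implementation of the Apache Arrow memory format, alongside the official arrow crate from the Apache Arrow project. The format allows zero-copy data exchange between processes and languages. Check a crate's maintenance status before adopting it; polars, for example, now maintains its own Arrow implementation derived from arrow2.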

Convention aside: When working with large datasets in Rust, prefer columnar formats like arrow over row-based Vec structures. Row-based layouts are fine for small data or transactional workloads. Columnar layouts win for analytics because of cache efficiency and compression. Libraries like polars handle this automatically.
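To show the difference in layout without pulling in a crate, here is a hand-rolled sketch of the two shapes. EventRow, EventColumns, and their fields are hypothetical; a real pipeline would more likely let polars or an Arrow crate manage the columns.

```rust
/// Row layout: one struct per event (hypothetical fields for illustration).
#[derive(Debug, Clone)]
struct EventRow {
    timestamp: u64,
    status: u16,
    latency_ms: f64,
}

/// Columnar layout: one contiguous Vec per field.
#[derive(Debug, Default)]
struct EventColumns {
    timestamp: Vec<u64>,
    status: Vec<u16>,
    latency_ms: Vec<f64>,
}

impl EventColumns {
    /// Transpose rows into columns.
    fn from_rows(rows: &[EventRow]) -> Self {
        let mut cols = EventColumns::default();
        for row in rows {
            cols.timestamp.push(row.timestamp);
            cols.status.push(row.status);
            cols.latency_ms.push(row.latency_ms);
        }
        cols
    }

    /// Scans only the latency column; the other fields are never touched.
    fn mean_latency(&self) -> Option<f64> {
        if self.latency_ms.is_empty() {
            return None;
        }
        Some(self.latency_ms.iter().sum::<f64>() / self.latency_ms.len() as f64)
    }
}

/// Row-based equivalent: every struct is loaded just to read one field.
fn mean_latency_rows(rows: &[EventRow]) -> Option<f64> {
    if rows.is_empty() {
        return None;
    }
    Some(rows.iter().map(|r| r.latency_ms).sum::<f64>() / rows.len() as f64)
}

fn main() {
    let rows = vec![
        EventRow { timestamp: 1, status: 200, latency_ms: 10.0 },
        EventRow { timestamp: 2, status: 500, latency_ms: 20.0 },
        EventRow { timestamp: 3, status: 200, latency_ms: 30.0 },
    ];
    let cols = EventColumns::from_rows(&rows);
    println!("{:?} {:?}", mean_latency_rows(&rows), cols.mean_latency());
}
```

Both functions return the same answer. The columnar one reads 8 bytes per event, while the row-based one pulls each whole struct through the cache.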

You don't have to reinvent the wheel. Use the crates.

Pitfalls and compiler errors

Rust forces you to think about data flow. If you try to mutate a collection while iterating, the compiler stops you.

let mut data = vec![1, 2, 3];
// This fails to compile.
for item in &data {
    data.push(*item);
}

The compiler rejects this with E0502 (cannot borrow as mutable because it is also borrowed as immutable). The iterator holds an immutable borrow of data. Calling push requires a mutable borrow. Rust prevents this because push may reallocate the vector's buffer, which would leave the iterator pointing at freed memory. In Python, the same loop over a list never terminates, because the list keeps growing ahead of the iterator, and changing a dict's size during iteration raises a RuntimeError. Rust catches the problem at compile time.
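If the intent of the rejected loop was to append a copy of every existing element, two ways to express it that the borrow checker accepts are sketched below. The function names are hypothetical; Vec::extend_from_within is a standard library method (Rust 1.53 or later).

```rust
/// Append a copy of every existing element in one call.
fn double_up(data: &mut Vec<i32>) {
    // The source range is fixed up front, so there is no live iterator
    // over `data` while it grows.
    data.extend_from_within(..);
}

/// Same result with an explicit loop over the original length.
fn double_up_by_index(data: &mut Vec<i32>) {
    let original_len = data.len();
    for i in 0..original_len {
        // i32 is Copy, so the immutable borrow ends before push runs.
        let item = data[i];
        data.push(item);
    }
}

fn main() {
    let mut data = vec![1, 2, 3];
    double_up(&mut data);
    println!("{:?}", data);
}
```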

Another common trap is moving data. If you pass a Vec by value, the function takes ownership.

fn process(data: Vec<i32>) {
    // ...
}

fn main() {
    let data = vec![1, 2, 3];
    process(data);
    // This fails to compile.
    println!("{:?}", data);
}

The compiler rejects this with E0382 (use or borrow of a moved value). The process function owns data. After the call, data is gone. You fix this by having process borrow instead: declare the parameter as a slice, &[i32], and pass &data. If the function really needs its own copy, pass data.clone(). The compiler errors are verbose, but they point exactly to the problem. Read the error. It usually suggests a fix.
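A minimal sketch of both fixes follows. The function bodies are placeholders chosen for illustration (the original process body is elided), and process_owned is a hypothetical name for the variant that keeps taking ownership.

```rust
/// Borrowing version: the caller keeps ownership of the vector.
/// The body (a sum) is a placeholder for illustration.
fn process(data: &[i32]) -> i32 {
    data.iter().sum()
}

/// Owning version: the function consumes whatever it is given.
fn process_owned(data: Vec<i32>) -> usize {
    data.len()
}

fn main() {
    let data = vec![1, 2, 3];

    // Pass a reference; &Vec<i32> coerces to &[i32].
    let total = process(&data);
    println!("{:?} sums to {}", data, total);

    // Pass a clone when the callee must own its input.
    let n = process_owned(data.clone());
    println!("{:?} still usable, callee saw {} items", data, n);
}
```

Taking &[i32] rather than &Vec<i32> also lets callers pass arrays and sub-slices without allocating.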

The compiler is your pair programmer. Listen to it.

Decision matrix

Use Rust when your pipeline is CPU-bound and processes millions of records per second. Use Rust when memory usage is constrained, such as on edge devices or in high-density container deployments. Use Rust when you need strict latency guarantees and cannot tolerate garbage collection pauses. Use Rust when you are building a library that Python will call via pyo3, giving you the best of both worlds. Use Rust when the dataset exceeds available RAM and requires streaming or chunking with minimal overhead.

Use Python when you need to prototype a data analysis workflow in minutes. Use Python when your team relies on the rich ecosystem of pandas, numpy, scikit-learn, and pytorch. Use Python when development speed and developer availability outweigh runtime efficiency. Use Python when the data volume is small enough that the interpreter overhead is negligible. Use Python when you need to integrate with existing Python services or APIs without a bridge.

Pick the tool that matches the bottleneck. If the bottleneck is your time, pick Python. If the bottleneck is the CPU, pick Rust.

Where to go next