Rust vs Python for ML: Performance Comparison

The notebook works. The server doesn't

You trained a model in a Jupyter notebook. It runs fast enough. You wrap it in a FastAPI endpoint and deploy it to production. Suddenly, latency spikes. The server is eating RAM like it's going out of style. You're paying for cloud instances you don't need, and the garbage collector is pausing your requests every few seconds.

The bottleneck isn't the algorithm. The bottleneck is the runtime.

Python is an incredible tool for research. It lets you prototype models in minutes and access a massive ecosystem of libraries. But Python carries overhead on every single operation. That overhead is invisible during development. It becomes expensive at scale.

Rust compiles to machine code. It knows the size and type of every value at compile time. It manages memory without a garbage collector. The result is code that runs close to the metal, with predictable latency and minimal memory footprint.

Python pays a tax on every operation

Python is an interpreted language with dynamic typing. When you write x = 3.14, Python doesn't just store the number. It creates a heap-allocated object. That object contains a reference count, a pointer to the type structure, and the actual value. In 64-bit CPython, a float takes 24 bytes. A small integer takes 28 bytes.

Rust's f32 is 4 bytes. Rust's f64 is 8 bytes. No metadata. No reference count. Just the bits.
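You can check the Rust side of that claim yourself. Here's a minimal sketch using std::mem::size_of. The helper name scalar_sizes is made up for this example.

```rust
use std::mem::size_of;

/// Returns the size in bytes of an `f32`, an `f64`, and an array of
/// one thousand `f32` values. (`scalar_sizes` is a hypothetical helper.)
fn scalar_sizes() -> (usize, usize, usize) {
    (
        size_of::<f32>(),
        size_of::<f64>(),
        size_of::<[f32; 1000]>(),
    )
}
```

Calling scalar_sizes() returns (4, 8, 4000). A thousand f32 values occupy exactly 4,000 bytes, with nothing else attached.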

This difference compounds. A Python list is an array of pointers. Each pointer points to a separate object scattered across the heap. When you iterate over a list, the CPU has to chase pointers. This causes cache misses. The CPU spends more time waiting for memory than doing math.

A Rust slice is a pointer to contiguous data. The values sit side-by-side in memory. The hardware prefetcher pulls in the next cache line before the loop asks for it. Iteration is a straight line through memory. Cache misses are rare. Throughput is high.
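To see the contiguity for yourself, measure the distance between neighbouring elements. This is a small sketch; element_strides is a hypothetical helper, not a standard library function.

```rust
/// Returns the distance in bytes between each pair of neighbouring
/// elements of `data`. (`element_strides` is a hypothetical helper.)
fn element_strides(data: &[f32]) -> Vec<usize> {
    data.windows(2)
        .map(|pair| {
            // Turn the two element addresses into integers and subtract.
            let first = &pair[0] as *const f32 as usize;
            let second = &pair[1] as *const f32 as usize;
            second - first
        })
        .collect()
}
```

For any f32 slice, every stride is 4 bytes. There are no gaps and no per-element headers between the values.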

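Python also has the Global Interpreter Lock. In standard CPython builds, the GIL prevents multiple threads from executing Python bytecode simultaneously. If you spawn threads to parallelize pure-Python inference code, they take turns on one core. Native extensions such as NumPy can release the GIL while they run C code, but your own Python-level logic cannot. Rust has no GIL. You can use all available cores for CPU-bound work.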
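Here's a rough sketch of what using every core can look like with nothing but the standard library. It assumes scoped threads (std::thread::scope, stable since Rust 1.63). The function name parallel_sum and the chunking strategy are illustrative choices, not a recommendation over a crate like rayon.

```rust
use std::thread;

/// Sums `data` by splitting it into at most `workers` chunks and summing
/// each chunk on its own OS thread. (`parallel_sum` is a hypothetical name.)
fn parallel_sum(data: &[f32], workers: usize) -> f32 {
    let workers = workers.max(1);
    // Ceiling division so every element lands in exactly one chunk.
    let chunk_len = ((data.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        // Each thread borrows its chunk; nothing is copied.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().sum::<f32>()))
            .collect();

        // Combine the partial sums on the calling thread.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .sum::<f32>()
    })
}
```

The threads run truly in parallel because nothing forces them to take turns. In real code you would likely size workers from std::thread::available_parallelism and only bother for inputs large enough to amortize thread startup.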

Memory management adds another tax. CPython counts references on every object and updates those counts on every assignment and call. A separate cyclic garbage collector periodically scans container objects to find reference cycles. That scan pauses your program. In a latency-sensitive service, a long enough pause looks like a timeout to the user. Rust drops values deterministically when they go out of scope. No pause. No scan.
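Deterministic drops are easy to observe. The sketch below uses a made-up Tensor type that records its name in a log when it is freed. The type, its fields, and the function drop_order exist only for this illustration.

```rust
use std::cell::RefCell;

/// A stand-in for a large buffer that logs when it is freed.
/// (`Tensor` is a hypothetical type for this example.)
struct Tensor<'a> {
    name: &'static str,
    log: &'a RefCell<Vec<&'static str>>,
}

impl Drop for Tensor<'_> {
    fn drop(&mut self) {
        // Runs at the exact point the value goes out of scope.
        self.log.borrow_mut().push(self.name);
    }
}

/// Returns the sequence of events recorded while two nested scopes run.
fn drop_order() -> Vec<&'static str> {
    let log = RefCell::new(Vec::new());
    {
        let _weights = Tensor { name: "weights", log: &log };
        {
            let _scratch = Tensor { name: "scratch", log: &log };
            log.borrow_mut().push("inner scope running");
        } // `_scratch` is dropped here
        log.borrow_mut().push("inner scope closed");
    } // `_weights` is dropped here
    log.into_inner()
}
```

drop_order() returns "inner scope running", "scratch", "inner scope closed", "weights", in that order, on every run. The free happens at the closing brace, not at some later moment chosen by a collector.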

Rust compiles to the metal

Rust is a compiled language with static typing. The compiler checks types before the code runs. It lays out memory structures at compile time. It generates machine code that runs directly on the CPU.

Rust offers zero-cost abstractions. High-level constructs like iterators and closures typically compile to the same machine code as a hand-written loop in C. You write readable code. The optimizer does the rest.
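As a small illustration, here are two ways to write a sum of squares. Both function names are invented for this example. Whether they compile to identical machine code depends on the compiler version and optimization level, so treat that as something to verify (for instance with rustc --emit asm), not a guarantee.

```rust
/// Sum of squares written as an iterator chain.
fn sum_of_squares_iter(values: &[f32]) -> f32 {
    values.iter().map(|v| v * v).sum()
}

/// The same computation written as an explicit indexed loop.
fn sum_of_squares_loop(values: &[f32]) -> f32 {
    let mut total = 0.0;
    for i in 0..values.len() {
        total += values[i] * values[i];
    }
    total
}
```

Both versions add the squares in the same order, so they return the same value for the same input. The iterator version is the one you'd normally write.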

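The compiler also auto-vectorizes many loops. If you write a loop that sums a slice of integers or applies the same arithmetic to every element, the compiler can replace it with SIMD instructions. SIMD processes multiple values in one instruction. With AVX2 enabled, a CPU can add eight 32-bit floats at once. One caveat: floating-point addition is not associative, so the compiler will not reorder a plain float sum on its own. Pure-Python loops never get SIMD at all. Python reaches it only through native libraries like NumPy.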

Rust's ownership system isn't just about safety. It's a performance feature. Ownership guarantees that data has a single owner. This allows the compiler to move values without copying. It allows the compiler to prove that references don't alias. These proofs unlock optimizations that dynamic languages can't touch.
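The "move without copying" part can be checked directly. In this sketch (function names are hypothetical), a 1,024-element vector is handed to another function and back, and the address of its heap buffer is compared before and after.

```rust
/// Takes ownership of a buffer and hands it straight back.
fn take_and_return(buffer: Vec<f32>) -> Vec<f32> {
    buffer
}

/// Returns true if the heap allocation is the same before and after
/// the vector is moved through `take_and_return`.
fn move_keeps_allocation() -> bool {
    let weights = vec![0.5f32; 1024];
    let before = weights.as_ptr();

    // Ownership moves into the function and back out again.
    // Only the (pointer, length, capacity) triple travels.
    let moved = take_and_return(weights);

    before == moved.as_ptr()
}
```

move_keeps_allocation() returns true. The 4 KB of float data never moves; only the small handle that describes it does.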

Minimal example: Summing a vector

Compare a simple prediction function in both languages. The function sums a slice of input values.

/// Sums a slice of f32 values.
/// Compiles to a tight loop with no calls and no allocations.
/// Float additions stay in source order, so this reduction is not auto-vectorized.
fn predict(input: &[f32]) -> f32 {
    // No heap allocation. The slice is a pointer and a length.
    // The compiler knows the size of f32 at compile time.
    input.iter().sum()
}

# Python list is a dynamic array of pointers to objects.
# Each float is a separate heap object with type info.
# The interpreter checks types on every addition.
def predict(input):
    return sum(input)

The Rust function takes a slice. A slice is a fat pointer: a memory address and a length. It points to contiguous f32 values. The iter() method creates a lightweight iterator. The sum() method reduces the iterator to a single value.

The compiler sees through the abstractions. It generates a tight loop. It keeps the accumulator in a register and streams the values through it from contiguous memory. No function call overhead. No type checks. No allocations.

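The Python function takes a list. The sum() builtin iterates over the list. For each item, it retrieves the pointer. It checks the object type. It unboxes the value. It updates reference counts. CPython gives sum() a C fast path for floats, so the builtin skips the worst of the overhead. A hand-written for loop does not: every total += x dispatches bytecode, calls __add__, and allocates a new float object for the result.

The Python version does several operations per element, and dozens in a hand-written loop. The Rust version does one.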

What happens under the hood

When the Rust compiler processes input.iter().sum(), it performs monomorphization. It generates a specialized version of the code for f32. It knows the exact size and alignment. It can use optimized instructions for that type.
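Monomorphization is easier to picture with a generic function of your own. The sketch below defines a hypothetical total function that works for any element type implementing the standard Sum trait.

```rust
use std::iter::Sum;

/// Sums any slice whose element type can be summed from references.
/// (`total` is a hypothetical helper written for this example.)
fn total<'a, T>(values: &'a [T]) -> T
where
    T: Sum<&'a T>,
{
    values.iter().sum()
}
```

If you call total on an f32 slice and on an i64 slice, the compiler emits two separate functions, one specialized for each type. There is one generic definition in the source and no dynamic dispatch at run time.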

The compiler also performs loop unrolling. If the slice length is known at compile time, it duplicates the loop body to reduce branch overhead. Even with dynamic lengths, the compiler unrolls partially to improve instruction-level parallelism.

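Auto-vectorization kicks in when the compiler can prove it is safe. For integer reductions and element-wise float math, it emits SIMD instructions such as vpaddd or vaddps. These instructions operate on four or eight values at once, so throughput can rise by a similar factor when the loop is compute-bound. A plain f32 sum is the exception. Reordering the additions would change the rounding, so the compiler keeps that loop sequential.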
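One caveat on float reductions: because floating-point addition is not associative, rustc does not regroup a plain f32 sum into SIMD lanes by default. A common workaround is to keep several independent accumulators so the reordering is explicit in the source. The sketch below is one way to do that. The name sum_lanes and the lane count of eight are assumptions for illustration, and whether the loop actually vectorizes depends on your target and optimization flags.

```rust
/// Sums `input` using eight independent accumulators, which gives the
/// optimizer a loop shape it can map onto SIMD lanes.
/// (`sum_lanes` and `LANES` are illustrative choices.)
fn sum_lanes(input: &[f32]) -> f32 {
    const LANES: usize = 8;
    let mut acc = [0.0f32; LANES];

    let chunks = input.chunks_exact(LANES);
    // Elements left over when the length is not a multiple of LANES.
    let tail = chunks.remainder();

    for chunk in chunks {
        // Each accumulator only ever sees its own lane.
        for (a, x) in acc.iter_mut().zip(chunk) {
            *a += x;
        }
    }

    acc.iter().sum::<f32>() + tail.iter().sum::<f32>()
}
```

Because the additions happen in a different order than in a sequential sum, the result can differ in the last few bits for general inputs. That difference is exactly why the compiler won't make this change for you.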

In Python, the interpreter executes bytecode. Each bytecode instruction is a small operation. The interpreter fetches the instruction, decodes it, and executes it. This dispatch overhead adds up. The dynamic type system means every operation must check types at runtime. The cyclic garbage collector can kick in whenever allocation counts cross its thresholds.

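The performance gap widens with data size. For small inputs, fixed costs such as call overhead dominate and the gap barely matters. For large inputs, Python pays its per-element overhead again on every item. Both languages scale linearly. Rust's cost per element is a few CPU cycles. Python's is those cycles plus interpreter and object overhead.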

Realistic example: Inference batch

Machine learning inference often involves matrix operations. You apply weights to input features and add a bias. Here's how you'd write that in Rust for a batch of predictions.

/// Applies a linear transformation to a batch of inputs.
/// Uses zero-copy slices to avoid allocating intermediate arrays.
fn apply_weights(input: &[f32], weights: &[f32], bias: f32) -> Vec<f32> {
    // Pre-allocate output once. Avoids repeated heap growth.
    // The capacity matches the input length exactly.
    let mut output = vec![0.0; input.len()];

    // Zip avoids index bounds checks inside the loop.
    // The compiler can prove no out-of-bounds access occurs.
    // This enables aggressive vectorization.
    for (out, (inp, w)) in output.iter_mut().zip(input.iter().zip(weights.iter())) {
        *out = inp * w + bias;
    }

    output
}

The function takes slices for input and weights. Slices are zero-copy views. They don't allocate memory. They just point to existing data. This is crucial for performance. Copying data to and from temporary buffers wastes time and memory.

The output vector is pre-allocated with vec![0.0; input.len()]. This allocates the exact capacity needed upfront. If you used push in a loop, the vector would grow incrementally. Each growth might trigger a reallocation and copy. Pre-allocation eliminates that cost.
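You can watch the reallocation cost appear and disappear by tracking a vector's capacity while you push into it. This is a small sketch; count_reallocations is a hypothetical helper, and the exact growth pattern of Vec is an implementation detail, so the code only counts changes.

```rust
/// Pushes `n` values into a vector and reports how many times the
/// vector's capacity changed along the way.
/// (`count_reallocations` is a hypothetical helper.)
fn count_reallocations(n: usize, reserve_upfront: bool) -> usize {
    let mut values: Vec<f32> = if reserve_upfront {
        Vec::with_capacity(n) // one allocation, sized for all n values
    } else {
        Vec::new() // starts empty and grows on demand
    };

    let mut changes = 0;
    let mut last_capacity = values.capacity();
    for i in 0..n {
        values.push(i as f32);
        if values.capacity() != last_capacity {
            // The buffer was reallocated to make room.
            changes += 1;
            last_capacity = values.capacity();
        }
    }
    changes
}
```

With reserve_upfront set to true, the count is zero. Without it, the vector grows several times on the way to n elements, and each growth may copy everything stored so far.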

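The loop uses zip to iterate over multiple slices simultaneously. zip creates an iterator that yields tuples and stops as soon as the shortest input is exhausted. No index can run past the end of any slice, so the loop body needs no bounds checks and the compiler can vectorize it safely. The flip side: if weights is shorter than input, the remaining outputs silently stay at 0.0. Check the lengths before you call the function.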

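The body computes inp * w + bias. Modern CPUs can do this in a single fused multiply-add instruction such as vfmadd231ps. Rust will not fuse the two operations on its own, because fusing changes the rounding. Written this way, the compiler emits a multiply followed by an add, both of which vectorize well. To request the fused form, call mul_add and build for a target with FMA enabled.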
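If you do want a fused multiply-add, the standard library exposes it as f32::mul_add. Here's a hedged sketch of the same transformation using it. The function name apply_weights_fused is made up for this example.

```rust
/// Same transformation as `apply_weights`, but each element is computed
/// with an explicit fused multiply-add (one rounding step instead of two).
/// (`apply_weights_fused` is a hypothetical name.)
fn apply_weights_fused(input: &[f32], weights: &[f32], bias: f32) -> Vec<f32> {
    input
        .iter()
        .zip(weights)
        // x.mul_add(w, bias) computes (x * w) + bias with a single rounding.
        .map(|(x, w)| x.mul_add(*w, bias))
        .collect()
}
```

Measure before you adopt this. mul_add is only fast when the target actually has FMA hardware enabled, for example by building with -C target-cpu=native. On other targets it can fall back to a slower software routine.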

Convention aside: The Rust ML community favors ndarray for tensor operations. ndarray provides views that are zero-copy and support broadcasting. It integrates with BLAS libraries for heavy lifting. For simple operations, raw slices and iterators are often sufficient and compile to equally fast code.

Pitfalls and compiler errors

Rust gives you performance, but you have to use the right types. Wrapping everything in smart pointers kills the advantage. If you box every value, you reintroduce the pointer-chasing that makes Python lists slow. If you use Rc or Arc unnecessarily, you add reference counting overhead.

The compiler helps you avoid mistakes. If you try to borrow data mutably through a shared reference, the compiler rejects you with E0596 (cannot borrow as mutable). This forces you to structure data access correctly. It can also nudge you toward flatter, more cache-friendly layouts, because you're forced to think about who owns what.
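In practice this means mutation shows up in the function signature. The sketch below uses two hypothetical helpers, one that modifies a buffer in place and one that only reads it.

```rust
/// Adds `bias` to every element in place. The `&mut` in the signature
/// tells the caller that the buffer will be modified.
/// Changing the parameter to `&[f32]` would make `iter_mut()` fail to
/// compile with E0596.
fn add_bias_in_place(values: &mut [f32], bias: f32) {
    for value in values.iter_mut() {
        *value += bias;
    }
}

/// Read-only access needs only a shared reference.
fn largest(values: &[f32]) -> Option<f32> {
    values.iter().copied().reduce(f32::max)
}
```

Updating in place avoids allocating a second buffer. The caller has to write &mut at the call site, so the mutation is never a surprise.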

If you try to use a value after moving it, the compiler rejects you with E0382 (use of moved value). In Python, assigning a variable copies the reference. In Rust, assigning a non-Copy value such as a Vec moves it. The heap data stays put and the old name becomes unusable. If you want a second copy, you have to call clone explicitly. This discipline prevents hidden performance costs.
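The three options (borrow, clone, move) look like this side by side. All names here are invented for the example.

```rust
/// Consumes the batch. The caller gives up the buffer.
fn consume(batch: Vec<f32>) -> usize {
    batch.len()
}

/// Borrows the batch. The caller keeps ownership.
fn inspect(batch: &[f32]) -> usize {
    batch.len()
}

/// Shows a borrow, an explicit clone, and a move of the same vector.
fn move_versus_clone() -> (usize, usize, usize) {
    let batch = vec![1.0f32, 2.0, 3.0];

    let borrowed = inspect(&batch); // no copy, `batch` is still usable
    let copied = consume(batch.clone()); // explicit, visible copy of the heap data
    let moved = consume(batch); // ownership transferred, nothing copied

    // consume(batch); // would not compile: E0382, use of moved value

    (borrowed, copied, moved)
}
```

The only line that duplicates the heap data is the one that says clone. Everything else either lends the buffer out or hands it over.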

If you use a type that doesn't implement a required trait, the compiler rejects you with E0277 (trait bound not satisfied). This catches errors early. It also ensures that generic code works efficiently. Trait bounds allow the compiler to monomorphize and optimize.

Pitfall: Don't use String when &str will do. String owns heap memory. &str borrows data. If you're processing text in a model, borrow the input. Don't allocate a new string for every token.
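As a minimal sketch of borrowing tokens instead of allocating them, here's a whitespace tokenizer. It is a toy: a real model tokenizer is more involved, and the name tokenize is just for illustration.

```rust
/// Splits `text` on whitespace. Every token is a `&str` that points
/// into the original buffer, so no new strings are allocated.
/// (`tokenize` is a hypothetical helper, not a real tokenizer.)
fn tokenize(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}
```

The only allocation is the Vec that holds the token slices. The returned tokens cannot outlive the input text, and the compiler enforces that for you.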

Pitfall: Don't use Vec when a slice is enough. Functions should take slices as arguments. Slices are flexible. They work with arrays, vectors, and stack buffers. They don't force allocation.
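Here's what that flexibility looks like. The sketch defines a hypothetical mean function that takes a slice, then calls it on an array, a vector, and a sub-range of that vector.

```rust
/// Mean of a slice. Returns `None` for empty input.
/// (`mean` is a hypothetical helper.)
fn mean(values: &[f32]) -> Option<f32> {
    if values.is_empty() {
        return None;
    }
    Some(values.iter().sum::<f32>() / values.len() as f32)
}

/// Calls `mean` on three differently stored buffers without copying any of them.
fn means_from_three_sources() -> [Option<f32>; 3] {
    let on_stack = [1.0f32, 2.0, 3.0]; // fixed-size array
    let on_heap = vec![2.0f32, 4.0, 6.0, 8.0]; // growable vector

    [
        mean(&on_stack),      // whole array
        mean(&on_heap),       // whole vector
        mean(&on_heap[1..3]), // just the middle two elements
    ]
}
```

One signature covers all three callers. Had mean taken a Vec<f32>, the array and the sub-range would each need a fresh allocation just to make the call.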

Convention aside: A widely followed guideline is to minimize the unsafe surface. Keep unsafe blocks small and isolated. Most performance-critical code in Rust is safe. The compiler optimizes safe code aggressively. You rarely need unsafe for speed. You need it for FFI or low-level hardware access.

When to use Rust vs Python

Pick the tool for the job. Python excels at development speed. Rust excels at execution speed.

Use Python for research and prototyping when you need to iterate on model architecture quickly and rely on the massive ecosystem of libraries like PyTorch and TensorFlow.

Use Python for data pipelines when the workload is I/O bound and the overhead of the GIL doesn't block your progress.

Use Rust for production inference when latency requirements are strict and you need predictable response times without garbage collection pauses.

Use Rust for memory-constrained environments when you're deploying to edge devices, embedded systems, or serverless functions where cold starts and memory footprint matter.

Use Rust for custom operators when the standard Python libraries don't provide the specific optimization you need and you're willing to write C-level code with safety guarantees.

Reach for Python bindings when you want to use a Rust core for performance but keep the Python API for user convenience. Libraries like PyO3 let you expose Rust functions to Python with little glue code.

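Trust the metrics. Benchmark with cargo bench in Rust and timeit in Python. On stable Rust, cargo bench needs a harness crate such as Criterion, because the built-in #[bench] attribute is nightly-only. Measure latency, throughput, and memory. Don't guess. The numbers tell the truth.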
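For a quick first look before you set up a proper benchmark harness, a hand-rolled timer can be enough. The sketch below is deliberately crude: time_it is a hypothetical helper, it does no warm-up or statistical analysis, and it is not a substitute for a real benchmarking tool.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` `iterations` times and returns the last result together with
/// the mean wall-clock time per call. (`time_it` is a hypothetical helper.)
fn time_it<T>(iterations: u32, mut f: impl FnMut() -> T) -> (T, Duration) {
    assert!(iterations > 0, "need at least one iteration");

    let start = Instant::now();
    // black_box keeps the optimizer from discarding the computation.
    let mut last = black_box(f());
    for _ in 1..iterations {
        last = black_box(f());
    }

    (last, start.elapsed() / iterations)
}
```

Build with --release before reading anything into the numbers. Debug builds skip most optimizations and can be slower by a large factor.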

Where to go next

Rust is like a custom-built sports car: it requires more effort to build and tune, but it runs faster and uses less fuel once on the road. Python is like a rental car: it is easy to drive immediately and comes with many built-in accessories, but it is slower and less efficient for long-distance racing. You choose Rust when you need maximum speed and control in a final product, and Python when you need to build a prototype quickly.