How to Use Rust for ETL/Data Pipelines

When data flows faster than memory

You have a 50GB CSV file. You write a script to read it, transform the rows, and write the results to a database. The script works on a sample of 1,000 rows. You run it on the full file. Memory usage spikes. The system starts swapping to disk. The script crawls to a halt and crashes with an out-of-memory error. Or you have 10,000 API endpoints to poll every minute. Doing them sequentially takes hours. Spawning a thread for each one burns your CPU on context switching.

Rust handles this by treating data as a stream and using async runtimes to juggle I/O without burning threads. You process one chunk at a time. Memory usage stays flat regardless of file size. The type system guarantees that data flows correctly between stages. If stage A outputs a User, stage B must accept a User. The compiler rejects mismatches before the pipeline runs.

Pipelines as composable streams

ETL stands for Extract, Transform, Load. In Rust, this maps to reading data, applying functions, and writing results. The power comes from composition. You chain operations together. Each operation takes a stream of items, does something, and produces a new stream.

Think of a pipeline like a factory assembly line. Raw materials enter one end. Each station performs a specific task. Finished products exit the other end. Rust's type system acts as the quality control inspector. It checks that the output of station A matches the input of station B. If a station expects a bolt but receives a nut, the build fails.

Async runtimes like tokio allow you to run many pipelines concurrently on a few threads. When a pipeline waits for disk or network, the runtime pauses that task and runs another one. This keeps the CPU busy. You get the throughput of thousands of threads with the memory footprint of a handful.

Minimal async pipeline

Start with a simple file processing loop. This example reads a CSV file, uppercases each line, and prints the result. It uses tokio for async I/O.

use tokio::fs::File;
use tokio::io::{AsyncBufReadExt, AsyncWriteExt, BufReader};

/// Reads a file line by line, transforms, and prints.
async fn simple_pipeline() -> std::io::Result<()> {
    // Open the file asynchronously to avoid blocking the runtime.
    let file = File::open("input.csv").await?;
    // Buffer reads for efficiency; raw reads are too slow.
    let reader = BufReader::new(file);
    // Get a stream of lines from the buffered reader.
    let mut lines = reader.lines();
    
    // Loop until the stream is exhausted.
    while let Some(line) = lines.next_line().await? {
        // Transform the data in memory; this is CPU work.
        let transformed = line.to_uppercase();
        // Write result; println is sync but fast enough for this demo.
        println!("{}", transformed);
    }
    Ok(())
}

The File::open call returns a future. The .await keyword yields control to tokio. tokio runs other tasks while waiting for the OS to open the file. When the file is ready, tokio resumes this task. BufReader batches reads from the disk. Reading byte by byte is inefficient. BufReader fetches a large chunk and serves lines from memory. lines() splits the buffer on newlines. The loop processes one line at a time. Memory usage stays constant. The file can be terabytes in size. The pipeline uses only a few kilobytes of RAM.

Don't load the whole file into memory. Stream it.

Multi-stage pipelines with channels

Real pipelines have multiple stages. You might read from a file, parse JSON, filter invalid records, enrich data from a database, and write to a queue. A single loop gets messy. Channels decouple the stages.

tokio::sync::mpsc provides a multi-producer, single-consumer channel. One task sends items into the channel. Another task receives them. The channel has a buffer. If the buffer fills up, the sender blocks until the receiver consumes an item. This is backpressure. It prevents the producer from overwhelming the consumer.

use tokio::sync::mpsc;
use tokio::fs::File;
use tokio::io::{AsyncBufReadExt, BufReader};

/// Represents a parsed row in our pipeline.
struct Row {
    id: u64,
    value: String,
}

/// Reads lines and sends parsed rows to the channel.
async fn producer(mut tx: mpsc::Sender<Row>) -> std::io::Result<()> {
    let file = File::open("data.csv").await?;
    let reader = BufReader::new(file);
    let mut lines = reader.lines();
    
    while let Some(line) = lines.next_line().await? {
        // Parse the line; skip malformed rows for simplicity.
        if let Some(row) = parse_line(&line) {
            // Send to the next stage; backpressure applies here.
            tx.send(row).await.expect("Pipeline broken");
        }
    }
    Ok(())
}

/// Worker that transforms rows.
async fn worker(mut rx: mpsc::Receiver<Row>) {
    while let Some(row) = rx.recv().await {
        // Do heavy transformation here.
        let result = process_row(row);
        println!("Processed: {:?}", result);
    }
}

fn parse_line(line: &str) -> Option<Row> {
    let parts: Vec<&str> = line.split(',').collect();
    if parts.len() == 2 {
        Some(Row {
            id: parts[0].parse().ok()?,
            value: parts[1].to_string(),
        })
    } else {
        None
    }
}

fn process_row(row: Row) -> String {
    format!("ID: {}, Val: {}", row.id, row.value.to_uppercase())
}

The producer task reads the file and sends Row items into the channel. The worker task receives items and processes them. You can spawn multiple workers to parallelize processing. Each worker gets a clone of the receiver? No, mpsc has one receiver. To parallelize, you need a different pattern. Use tokio::sync::broadcast or a work-stealing queue, or split the work differently. For simple parallelism, spawn multiple producer tasks sending to one worker, or use a thread pool for CPU work.

Convention aside: The community prefers explicit channel sizes. mpsc::channel(100) creates a buffer of 100 items. A buffer that is too small causes the producer to block frequently, reducing throughput. A buffer that is too large wastes memory and hides latency. Start with 100 to 1000 and tune based on profiling.

Tune the buffer size. A well-sized channel is the difference between a smooth flow and a memory leak.

Backpressure and buffer tuning

Backpressure is a feature, not a bug. When the consumer is slower than the producer, the channel buffer fills up. Once full, the producer blocks on send. This slows the producer down to match the consumer. The pipeline self-regulates. You don't need complex flow control logic. The channel handles it.

If you use try_send instead of send, the producer doesn't block. It gets an error if the buffer is full. You can drop items or retry later. This is useful for real-time streams where old data is worthless. For ETL, you usually want to process everything. Use send and let backpressure do its job.

Monitor the buffer fill level. If the buffer stays full, the consumer is the bottleneck. Optimize the consumer or add more workers. If the buffer stays empty, the producer is the bottleneck. Optimize the producer or reduce buffer size to save memory.

Pitfalls and compiler errors

Async pipelines have specific traps. Blocking the runtime is the most common. If you call a synchronous function that takes a long time inside an async task, you freeze the entire runtime. Other tasks can't run. The pipeline stalls.

If you accidentally block, the compiler won't catch it. Rust treats sync functions as valid inside async blocks. You must discipline yourself. Wrap blocking code in tokio::task::spawn_blocking. This runs the code on a separate thread pool.

// BAD: Blocks the runtime thread.
async fn bad_worker() {
    let result = heavy_cpu_work(); // Freezes tokio.
}

// GOOD: Offloads to blocking thread pool.
async fn good_worker() {
    let result = tokio::task::spawn_blocking(heavy_cpu_work).await.unwrap();
}

Trait bound errors appear when you mix sync and async types. If you try to await a value that isn't a future, the compiler rejects it with E0277 (trait bound not satisfied). This usually means you forgot .await or passed the wrong type to an async function.

// E0277: `String` is not a `Future`.
let x = "hello".await;

Move errors happen when closures capture variables. If a closure moves a value, you can't use it again. The compiler rejects this with E0382 (use of moved value). Use Arc or Rc to share ownership, or clone the value before moving it.

Error propagation requires care. Async functions return Result<T, E>. Use the ? operator to propagate errors. If a stage fails, the pipeline should stop. Don't swallow errors. Log them and abort.

Wrap blocking code in spawn_blocking. The runtime will thank you.

Decision: choosing your pipeline architecture

Pick the right tool for your data flow. Different patterns fit different workloads.

Use tokio when your pipeline is bound by disk or network latency and you need to juggle thousands of streams without spawning threads. Use std::thread when your pipeline is pure CPU crunching and you want to saturate cores with heavy computation. Use mpsc channels when you need to decouple pipeline stages or run multiple workers in parallel. Use Stream combinators when you prefer a functional chain of operations and single-threaded execution is sufficient. Use spawn_blocking when you must call a synchronous library that doesn't support async. Use broadcast channels when multiple consumers need to receive the same data independently.

Trust the type system. If the pipeline compiles, the data shapes match.

Where to go next

Rust is a programming language that lets you build data pipelines that are incredibly fast and safe from crashes. It handles multiple data streams at once without slowing down, making it ideal for processing large files or real-time data. Think of it as a highly efficient assembly line that never jams or drops parts.