How to Use Serde with Streams for Large Data

When memory is the bottleneck

You have a 500MB JSON file of sensor readings. You write serde_json::from_str(&content). Your laptop fans spin up like a jet engine, the disk light stays solid, and then the operating system kills your process for using too much memory. You just tried to load half a gigabyte of text into RAM, parse it all at once, and hold the entire object tree in memory just to calculate an average. That is a memory bomb waiting to happen.

Rust's ownership model makes memory management explicit, but it does not automatically prevent you from allocating more memory than you have. If you read a file into a String and deserialize it, you hold the raw text and the parsed structs simultaneously. For large data, peak memory is roughly the size of the file plus the size of everything parsed from it.

Streaming deserialization solves this by processing the data as a flow of events. You never hold the whole file. You read a chunk, parse an object, use it, drop it, and repeat. Your memory usage stays flat, no matter how large the file grows.

Streaming deserialization

Streaming works like an assembly line instead of a warehouse. Instead of dumping the entire shipment into a storage unit and then walking through it, you process each item as it rolls off the truck. You inspect the box, extract what you need, and toss the cardboard. The next box arrives. You never hold more than one box at a time.

In Rust, serde_json::Deserializer::from_reader creates a parser that reads from any Read source. The deserializer maintains internal state: it tracks where it is in the byte stream, what JSON tokens it has seen, and what it expects next. When you ask it to deserialize a struct, it pulls just enough bytes to fill the fields, constructs the value, and hands it to you. The deserializer remembers its position so the next call continues exactly where the last one left off.

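This pattern requires the input to be a stream of values. The deserializer expects a sequence of top-level JSON values, one after another. A single JSON array is just one such value; reaching its elements one at a time takes a visitor rather than the stream iterator. In neither case does the parser load the structure into a tree. It walks the bytes linearly.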

Minimal example: processing a stream

The most common case is a file containing a stream of JSON objects. This might be JSON Lines format, where each line is a valid JSON object, or concatenated JSON where objects follow each other without separators.

use serde::Deserialize;
use std::io::BufReader;

/// A single reading from a temperature sensor.
#[derive(Deserialize)]
struct SensorReading {
    id: u32,
    temp: f64,
}

fn process_stream(file: &std::fs::File) -> Result<(), Box<dyn std::error::Error>> {
    // Buffer the file reads to avoid syscall overhead per byte.
    let reader = BufReader::new(file);

    // Create a deserializer that reads from the buffer incrementally.
    let deserializer = serde_json::Deserializer::from_reader(reader);

    // into_iter yields one Result per top-level value until the input ends.
    for result in deserializer.into_iter::<SensorReading>() {
        let reading = result?;
        println!("Sensor {}: {}°C", reading.id, reading.temp);
    }

    Ok(())
}

The BufReader wraps the file handle. serde_json pulls input from a reader in very small pieces, so reading from an unbuffered File can trigger a system call for nearly every byte. Each call switches into the kernel and back, and that overhead dominates the run time. BufReader fetches large chunks into memory (currently 8 KiB by default) and serves bytes from that buffer. The convention is to always wrap file handles in BufReader unless you have a specific reason not to. The default buffer size is usually sufficient.

Deserializer::from_reader takes ownership of the reader, and into_iter takes ownership of the deserializer, returning a StreamDeserializer. That type is an iterator over Result<T, Error>. Each Ok(value) is one parsed top-level value. The iterator returns None once only whitespace remains, which is how the loop ends. A malformed value arrives as Err, and the ? operator propagates it out of the function. Calling SensorReading::deserialize(&mut deserializer) directly is different: it returns Result<T, Error> with no Option, so it cannot signal the end of the stream and reports an end-of-input error instead.
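The buffering advice above can be checked without touching the disk. The sketch below wraps an in-memory source in a hypothetical CountingReader that counts calls to read, then consumes the data one byte at a time, which approximates how a byte-oriented parser pulls input. With a real File, each counted call would be a system call. Both CountingReader and read_calls are illustrative names.

```rust
use std::io::{self, BufReader, Read};

/// Wraps a reader and counts how many times `read` is called on it.
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Consumes `data` byte by byte and returns how many `read` calls
/// reached the underlying source.
fn read_calls(data: &[u8], buffered: bool) -> io::Result<usize> {
    let mut counter = CountingReader { inner: data, calls: 0 };
    if buffered {
        // BufReader refills its buffer in large chunks.
        for byte in BufReader::new(&mut counter).bytes() {
            byte?;
        }
    } else {
        // Every byte requested goes straight to the source.
        for byte in (&mut counter).bytes() {
            byte?;
        }
    }
    Ok(counter.calls)
}
```

For a few thousand bytes of input, the unbuffered count grows with the input size while the buffered count stays at a handful of calls.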

Realistic example: early exit and aggregation

Streaming shines when you need to aggregate data or stop early. You can break out of the loop as soon as you have enough information. This saves time and I/O.

use serde::Deserialize;
use std::io::BufReader;

/// A log entry from an application server.
#[derive(Deserialize)]
struct LogEntry {
    level: String,
    message: String,
}

fn count_errors(file: &std::fs::File) -> Result<usize, Box<dyn std::error::Error>> {
    let reader = BufReader::new(file);
    let deserializer = serde_json::Deserializer::from_reader(reader);
    let mut error_count = 0;

    // Iterate through the stream, stopping if we hit a critical threshold.
    for result in deserializer.into_iter::<LogEntry>() {
        let entry = result?;
        if entry.level == "ERROR" {
            error_count += 1;
        }

        // Early exit pattern: stop processing once we find enough errors.
        if error_count > 1000 {
            break;
        }
    }

    Ok(error_count)
}

The break statement exits the loop immediately. When the function returns, the deserializer and reader are dropped. Apart from whatever BufReader had already buffered, the rest of the file is never read. You only process what you need.

The array trap: one array is one value

A common mistake is assuming that into_iter walks the elements of a JSON array. It does not.

If your file contains a single JSON array like [{...}, {...}], the whole array is one top-level value. into_iter::<SensorReading>() tries to read the entire array as a single reading and returns an error on the first item. into_iter::<Vec<SensorReading>>() parses, but it yields the entire vector as its one and only item, so nothing has been streamed.

The error manifests at runtime as a serde_json::Error about an invalid type or unexpected data. The compiler cannot catch this because the structure of the JSON is not known at compile time.

For a JSON array, write a visitor that implements visit_seq. The deserializer hands it a SeqAccess, and each call to next_element parses exactly one element.

use serde::de::{Deserializer as _, SeqAccess, Visitor};
use serde::Deserialize;
use std::io::BufReader;

#[derive(Deserialize)]
struct SensorReading {
    id: u32,
    temp: f64,
}

/// Processes array elements one at a time instead of collecting them.
struct ReadingsVisitor;

impl<'de> Visitor<'de> for ReadingsVisitor {
    type Value = ();

    fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
        formatter.write_str("a JSON array of sensor readings")
    }

    fn visit_seq<A>(self, mut seq: A) -> Result<(), A::Error>
    where
        A: SeqAccess<'de>,
    {
        // next_element returns Ok(None) at the closing bracket.
        while let Some(reading) = seq.next_element::<SensorReading>()? {
            println!("Sensor {}: {}°C", reading.id, reading.temp);
        }
        Ok(())
    }
}

fn process_array(file: &std::fs::File) -> Result<(), Box<dyn std::error::Error>> {
    let reader = BufReader::new(file);
    let mut deserializer = serde_json::Deserializer::from_reader(reader);

    // deserialize_seq drives the visitor across the array.
    deserializer.deserialize_seq(ReadingsVisitor)?;

    Ok(())
}

next_element returns Result<Option<T>, Error>. Some(value) is one element, None marks the closing bracket, and the error case covers malformed input. The SeqAccess manages the array brackets and commas internally. You only see the elements. This processes a JSON array without allocating a Vec.

Single large objects and the visitor pattern

What if you have a single massive JSON object, not a stream? For example, a configuration file with a huge nested structure, or a database dump of one record with thousands of fields.

#[derive(Deserialize)] builds the entire struct in memory. It cannot stream a single object. If the object is too large, you will still hit memory limits.

To handle a single large object, you need a custom Visitor. A visitor receives the object's fields one at a time, in the order they appear in the input. You can process fields as they arrive and skip fields you do not need. This allows you to parse a gigabyte JSON object without allocating the whole thing.

use serde::de::{Deserializer as _, IgnoredAny, MapAccess, Visitor};
use std::io::BufReader;

/// A visitor that processes a large object field by field.
struct LargeObjectVisitor;

impl<'de> Visitor<'de> for LargeObjectVisitor {
    // The value type is () because we process on the fly and return nothing.
    type Value = ();

    fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
        formatter.write_str("a JSON object")
    }

    fn visit_map<A>(self, mut map: A) -> Result<(), A::Error>
    where
        A: MapAccess<'de>,
    {
        // Iterate over key-value pairs in the object.
        while let Some(key) = map.next_key::<String>()? {
            match key.as_str() {
                "small_field" => {
                    // Deserialize and process the value.
                    let value: String = map.next_value()?;
                    println!("Small: {}", value);
                }
                "huge_blob" => {
                    // Skip the value without building any Rust data for it.
                    let _ = map.next_value::<IgnoredAny>()?;
                }
                _ => {
                    // Skip unknown fields.
                    let _ = map.next_value::<IgnoredAny>()?;
                }
            }
        }
        Ok(())
    }
}

fn parse_large_object(file: &std::fs::File) -> Result<(), Box<dyn std::error::Error>> {
    let reader = BufReader::new(file);
    // deserialize_map borrows the deserializer mutably, so it must be mut.
    let mut deserializer = serde_json::Deserializer::from_reader(reader);

    // Deserialize using the custom visitor.
    deserializer.deserialize_map(LargeObjectVisitor)?;

    Ok(())
}

The visitor implements visit_map. The deserializer calls this method when it encounters a JSON object. You iterate over keys using next_key. For each key, you call next_value to get the corresponding value. If you need the value, deserialize it into a type. If you want to skip it, use serde::de::IgnoredAny. This type consumes the JSON tokens for the value without allocating any Rust data structures.

Writing a visitor is more verbose than derive. You handle the control flow manually. Use this only when memory constraints force you to. The convention is to keep visitors small and focused. Extract helper functions for complex logic.

Pitfalls and errors

Streaming deserialization has specific failure modes.

If you forget BufReader, performance degrades drastically. The code works, but it runs slowly. The compiler does not warn you. Nothing fails, so the problem stays hidden until you measure throughput.

If you run into_iter::<T>() over a single JSON array, you get a runtime parsing error on the first item. The error message mentions an invalid type or unexpected data. The fix is a visitor that implements visit_seq.

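If the JSON stream contains a malformed object, the iterator yields an Err for it. With result? the error propagates and the loop stops. You need to decide whether to skip the bad object or abort. Serde does not skip errors automatically, and after a syntax error the parser has no reliable way to find where the next value starts, so matching on the Err inside an into_iter loop is not enough to resume. For JSON Lines input, parse line by line instead: each line is an independent document, so a bad line costs you only that line.

// reader is a BufReader over the input; lines() needs std::io::BufRead in scope.
for line in reader.lines() {
    let line = line?;
    match serde_json::from_str::<SensorReading>(&line) {
        Ok(reading) => println!("Sensor {}: {}°C", reading.id, reading.temp),
        Err(e) => eprintln!("Skipping bad line: {}", e),
    }
}

This pattern catches errors and continues. It only works when every value sits on its own line. For concatenated JSON without line breaks there is no safe point to resynchronize, and the practical choice is to abort on the first error. Test with malformed data to see how your stream behaves.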

If you use a visitor and forget to consume a value, the deserializer state gets out of sync. You must call next_value for every key you retrieve. Skipping a value requires IgnoredAny. Leaving a value unconsumed causes the next next_key to fail.

Decision matrix

Use serde_json::from_str when the data fits comfortably in memory and you need the full object graph for random access.

Use Deserializer::from_reader with into_iter when the input is a stream of concatenated JSON objects, like JSON Lines or a log stream, and you want to process items as they arrive.

Use Deserializer::from_reader with a visit_seq visitor when the input is a single JSON array and you want to iterate over elements without allocating the entire vector.

Use a custom visit_map visitor when you are dealing with a single massive object and need to process fields incrementally, or when you must skip large fields to save memory.

Match the deserializer to the shape of the data. Wrong shape means parsing errors or wasted memory.

Recap

Using Serde with streams lets your program read huge files piece by piece instead of fitting the whole thing in memory at once. It is like reading a book page by page rather than memorizing the entire text before starting. Reach for it when handling massive logs, datasets, or API responses that would otherwise exhaust your application's memory.