Grouping consecutive elements with itertools
You have a stream of log entries sorted by timestamp. The logs show bursts of activity from different services. Service A emits ten errors, then Service B emits five warnings, then Service A returns with three more errors. You need to process each burst as a unit. You want to count the errors in the first A burst, handle the B warnings, and then handle the second A burst separately. The data is already ordered by the key you care about. You don't want to shuffle everything into a hash map just to group it. You want to process the groups as they appear in the stream.
That is what group_by from the itertools crate does. It is a method on the Itertools trait, not a free function. It bundles consecutive elements that share the same key into sub-iterators. It respects the order of your data. It does not reorder elements. It does not scan the entire collection to find all matches. A group ends the moment the key changes. If the same key appears later, it starts a new group.
How group_by works
group_by takes an iterator and a closure. The closure extracts a key from each item. The method compares the key of the current item with the key of the previous item. As long as the keys match, items accumulate into the current group. The moment the key differs, the current group closes and a new group begins.
Think of sorting mail by zip code. If the mail is already sorted, you can grab all the letters for zip code 10001, then grab all the letters for 10002. You process them in piles. If a letter for 10001 appears after the 10002 pile, it starts a new pile for 10001. You don't jump back to merge it with the first pile. group_by works exactly this way. It groups consecutive runs.
The method returns a GroupBy value. GroupBy is not an iterator itself. You iterate over a reference to it, and each step yields a tuple containing the key and a sub-iterator over the items in that group. The sub-iterator is lazy. It yields items from the original iterator as you request them. This means you can process groups one by one without allocating memory for all groups at once.
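The run-by-run behavior described above can be sketched with the standard library alone. This is only an illustration of the semantics, not how itertools implements it: the helper name consecutive_runs is hypothetical, and it eagerly collects each run into a Vec, whereas group_by hands out lazy sub-iterators.

```rust
/// Hypothetical helper (not part of itertools): groups consecutive items
/// that share a key, eagerly collecting each run into a Vec.
fn consecutive_runs<T, K, F>(items: impl IntoIterator<Item = T>, mut key_of: F) -> Vec<(K, Vec<T>)>
where
    K: PartialEq,
    F: FnMut(&T) -> K,
{
    let mut runs: Vec<(K, Vec<T>)> = Vec::new();
    for item in items {
        let key = key_of(&item);
        match runs.last_mut() {
            // Same key as the current run: extend it.
            Some((last_key, run)) if *last_key == key => run.push(item),
            // Key changed (or this is the first item): start a new run.
            _ => runs.push((key, vec![item])),
        }
    }
    runs
}

fn main() {
    // The mail example: a late letter for 10001 starts a new pile.
    let zips = [10001, 10001, 10002, 10001];
    for (zip, pile) in consecutive_runs(zips, |z| *z) {
        println!("{}: {:?}", zip, pile);
    }
}
```

The late 10001 letter ends up in its own third pile, which is the same "consecutive runs only" rule that group_by follows.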
Minimal example
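Add itertools to your dependencies. Import the Itertools trait to bring group_by into scope. Chain .group_by() with a closure that extracts the key. Note that itertools 0.13 renamed the method to chunk_by and deprecated group_by, so on newer versions you will get a deprecation warning unless you use the new name.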
use itertools::Itertools;

fn main() {
    // Data must be sorted by the key for meaningful grouping.
    // Here, consecutive identical numbers form groups.
    let numbers = vec![1, 1, 2, 2, 2, 3];

    // group_by returns a GroupBy value, which is not an iterator itself.
    // Bind it to a variable so it outlives the groups that borrow from it.
    // The closure defines the key for each item.
    let groups = numbers.iter().group_by(|&x| x);

    // Iterating over &groups yields (key, sub_iterator) pairs.
    for (key, sub_iter) in &groups {
        // Collect the lazy sub-iterator to see the items in this group.
        let items: Vec<_> = sub_iter.collect();
        println!("Key {}: {:?}", key, items);
    }
}
Output:
Key 1: [1, 1]
Key 2: [2, 2, 2]
Key 3: [3]
The closure receives a reference to each item. Since numbers.iter() yields &i32, the closure sees &&i32, and the pattern |&x| strips one layer of reference, so the key is an &i32. group_by calls this closure for every element to determine the key. The key type must implement PartialEq so the method can compare keys. If your key is a custom struct, ensure it derives PartialEq. The compiler rejects the code with E0277 (trait bound not satisfied) if the key type lacks this trait.
Under the hood: lazy evaluation and borrowing
group_by is lazy. Calling .group_by() does not scan the data. It returns a GroupBy value that wraps the source iterator. GroupBy is not an iterator. It implements IntoIterator for references, which is why you write for (key, group) in &groups or call .into_iter() on it. The work happens when you pull items.
When you request the first group, the wrapper pulls the first element of the source iterator and calls your closure to get the key. It hands you that key together with a sub-iterator for the group. Each call to next on the sub-iterator pulls one more element from the source and yields it as long as the key matches. The sub-iterator holds a shared reference to the GroupBy value, which owns the source iterator and updates it through interior mutability. This borrow is crucial. All groups draw from the same underlying iterator, so none of them can own it.
Because the sub-iterator borrows the GroupBy value, that value must outlive every group it hands out. Store it in a local variable, or use a temporary only within a single statement. If the GroupBy is dropped while a group is still alive, the compiler rejects the code with a lifetime error such as E0716 (temporary value dropped while borrowed). You are allowed to request the next group while still holding an earlier one. In that case itertools buffers the elements of the unfinished group so it can still yield them later, which costs an allocation. If you consume the groups in order, or drop each one before moving on, no buffering is needed.
This borrowing model keeps every group tied to its source and ensures memory safety. It also means a Vec of Vecs does not fall out of group_by directly, because the sub-iterators are lazy views, not collections. To materialize the groups, collect each sub-iterator yourself. The pattern is to loop over the groups and collect each sub-iterator inside the loop.
Convention aside: The community imports itertools::Itertools to access these methods. You won't find group_by on the standard iterator type. The trait import is the standard way to extend iterator capabilities. Don't fight the trait system. Embrace the import.
Realistic scenario: aggregating sorted transactions
Consider a financial system processing transactions. Transactions arrive sorted by account ID. You need to calculate the total amount for each contiguous block of transactions per account. This might represent daily batches where each account's transactions are grouped together.
use itertools::Itertools;

/// Represents a financial transaction with an account ID and amount.
#[derive(Debug)]
struct Transaction {
    account_id: u32,
    amount: f64,
}

/// Summarizes transactions by grouping consecutive entries with the same account ID.
/// Assumes the input is sorted by account_id.
fn summarize_batches(transactions: Vec<Transaction>) -> Vec<(u32, f64)> {
    // group_by clusters consecutive transactions for the same account.
    // The closure extracts the account_id as the key.
    transactions
        .iter()
        .group_by(|t| t.account_id)
        .into_iter()
        .map(|(account_id, group)| {
            // Sum the amounts for this contiguous block.
            // The sub-iterator yields references to transactions.
            let total: f64 = group.map(|t| t.amount).sum();
            (account_id, total)
        })
        .collect()
}

fn main() {
    let txs = vec![
        Transaction { account_id: 1, amount: 10.0 },
        Transaction { account_id: 1, amount: 5.0 },
        Transaction { account_id: 2, amount: 20.0 },
        Transaction { account_id: 2, amount: 15.0 },
        Transaction { account_id: 1, amount: 1.0 }, // New batch for account 1
    ];
    let summary = summarize_batches(txs);
    println!("{:?}", summary);
    // Output: [(1, 15.0), (2, 35.0), (1, 1.0)]
}
Notice the output. Account 1 appears twice. The first group sums to 15.0. The second group sums to 1.0. This is correct behavior. group_by sees the key change from 2 to 1 and starts a new group. It does not merge the two account 1 blocks. If you need a single total for account 1, you must sort the data so all account 1 transactions are consecutive, or use a hash map to aggregate across the entire collection.
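For the hash map alternative, a minimal sketch using only the standard library might look like this. The helper name total_per_account is hypothetical, and the Transaction struct is repeated so the example stands on its own.

```rust
use std::collections::HashMap;

#[derive(Debug)]
struct Transaction {
    account_id: u32,
    amount: f64,
}

/// Hypothetical helper: one total per account, regardless of input order.
fn total_per_account(transactions: &[Transaction]) -> HashMap<u32, f64> {
    let mut totals = HashMap::new();
    for t in transactions {
        // entry() inserts 0.0 the first time an account is seen,
        // then the amount is added to the running total.
        *totals.entry(t.account_id).or_insert(0.0) += t.amount;
    }
    totals
}

fn main() {
    let txs = vec![
        Transaction { account_id: 1, amount: 10.0 },
        Transaction { account_id: 1, amount: 5.0 },
        Transaction { account_id: 2, amount: 20.0 },
        Transaction { account_id: 2, amount: 15.0 },
        Transaction { account_id: 1, amount: 1.0 },
    ];
    let totals = total_per_account(&txs);
    println!("account 1: {:?}", totals[&1]);
    println!("account 2: {:?}", totals[&2]);
}
```

Here both account 1 blocks fold into a single total of 16.0. The cost is that the map holds every key at once and the original batch order is lost.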
The closure |t| t.account_id runs once for every transaction. group_by keeps the current key around for comparison, so it does not call the closure twice for the same item. If extracting the key is expensive and later stages need the key again, compute it once up front. You can map the iterator to include the key before grouping. The sub-iterators then yield each item together with its stored key.
// Optimization: compute key once per item and carry it along.
transactions.iter()
    .map(|t| (t.account_id, t))
    .group_by(|(key, _)| *key)
    .into_iter()
    // ...
This pattern is useful when the key extraction involves I/O or complex calculation and the key is needed again after grouping. It trades a small amount of memory per item for not repeating that work downstream.
Pitfalls and compiler errors
Unsorted data produces fragments. If your data is [1, 2, 1], group_by produces three groups: [1], [2], [1]. It does not merge the two 1s. You must sort the data first if you want all items with the same key in a single group. Sort your data before grouping, or your groups will be fragments.
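The fragmentation is easy to reproduce. The sketch below uses the standard library's slice::chunk_by (stable since Rust 1.77) as a stand-in for group_by, since it also groups consecutive runs and needs no dependency; the helper name run_lengths is hypothetical.

```rust
/// Hypothetical helper: (value, run length) for each consecutive run.
/// Uses std's slice::chunk_by (Rust 1.77+) as a stand-in for group_by.
fn run_lengths(values: &[i32]) -> Vec<(i32, usize)> {
    values
        .chunk_by(|a, b| a == b)
        // chunk_by never yields an empty slice, so run[0] is safe.
        .map(|run| (run[0], run.len()))
        .collect()
}

fn main() {
    let mut data = vec![1, 2, 1];

    // Unsorted: the two 1s are not adjacent, so they land in separate runs.
    println!("{:?}", run_lengths(&data)); // [(1, 1), (2, 1), (1, 1)]

    // Sorting makes equal keys adjacent, so each key forms a single run.
    data.sort();
    println!("{:?}", run_lengths(&data)); // [(1, 2), (2, 1)]
}
```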
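Sub-iterator lifetimes cause borrow errors. Each sub-iterator borrows the GroupBy value that produced it, so that value must stay alive while any group is in use. If the GroupBy is dropped first, for example a temporary that ends with its statement, the compiler rejects the code with a lifetime error such as E0716 (temporary value dropped while borrowed). Bind the GroupBy to a local variable and consume the groups while it is in scope. Holding several groups at once is allowed, but it forces itertools to buffer elements. Trust the borrow checker here. It prevents you from accessing invalid memory.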
Key extraction costs add up. The closure runs once for every element. If the closure is slow, it dominates the runtime. Carry the key alongside the item if you need it again after grouping. Profile before optimizing. Measure the bottleneck.
Convention aside: itertools is the de facto standard crate for this functionality. You will see itertools in many Rust projects that do heavy iterator manipulation. Don't write a custom loop with peekable unless you have a specific reason. Readers of your code will recognize group_by. It is well-tested and optimized.
Decision: when to use group_by
Use group_by from itertools when your data is already sorted by the key and you want to process consecutive runs of identical keys. Use group_by when memory is tight and you need to stream groups without allocating a full hash map. Use std::collections::HashMap when your data is unsorted and you need to aggregate all items with the same key regardless of order. Use HashMap when the number of unique keys is small compared to the total items and you need random access to groups after grouping. Use the standard library's slice::chunk_by (stable since Rust 1.77) when you have a slice and want the groups as sub-slices (&[T]), avoiding the overhead of sub-iterators. Use slice::chunk_by when the grouping logic depends on comparing adjacent elements rather than extracting a key.
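As a sketch of the slice-based option, the standard library's slice::chunk_by takes a predicate over adjacent elements instead of a key function. The helper name ascending_runs is hypothetical, and the example assumes Rust 1.77 or later.

```rust
/// Hypothetical helper: splits a slice into maximal non-decreasing runs.
fn ascending_runs(values: &[i32]) -> Vec<&[i32]> {
    // The predicate sees each adjacent pair; a run continues while it returns true.
    values.chunk_by(|prev, next| prev <= next).collect()
}

fn main() {
    let readings = [1, 2, 3, 2, 3, 1];
    let runs = ascending_runs(&readings);
    println!("{:?}", runs); // [[1, 2, 3], [2, 3], [1]]
}
```

The runs are plain sub-slices of the input, so nothing is copied and each run supports indexing and len directly. A key-based grouping cannot express "next is at least prev", which is where the pairwise predicate earns its place.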
Pick the tool that matches your data's order and your memory budget.