How to Normalize Unicode Strings in Rust

The search box that finds nothing

You build a search feature. A user types "café" and hits enter. The database contains the exact same word, "café", but the query returns zero results. You stare at the screen. The strings look identical. You print them to the console. They look identical. You compare byte-for-byte. They differ.

The culprit isn't a typo. It's Unicode.

The user's keyboard sent "e" followed by a combining acute accent. The database stores "é" as a single precomposed character. Visually, they are the same. To the computer, they are different sequences of bytes. This is the normalization trap. It breaks equality checks, corrupts sorting, and causes silent data duplication. Normalization is the process of converting text into a standard form so that equivalent sequences become identical.

Unicode has multiple representations

Unicode allows characters to be represented in more than one way. The letter "é" can be a single code point, U+00E9. It can also be the letter "e" (U+0065) followed by a combining acute accent (U+0301). Both render as "é". Both are valid Unicode.
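
To see the two representations side by side, here is a small sketch that uses only the standard library. It prints the code points and byte length of each form. The helper name code_points is made up for this illustration.

```rust
/// Hypothetical helper: formats each code point of `s` as U+XXXX.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    let composed = "\u{00E9}";    // é as one code point
    let decomposed = "e\u{0301}"; // e + combining acute accent

    println!("{:?} ({} bytes)", code_points(composed), composed.len());
    println!("{:?} ({} bytes)", code_points(decomposed), decomposed.len());

    // Same rendering, different contents.
    assert_ne!(composed, decomposed);
}
```

Writing the literals with \u{...} escapes keeps the example independent of how your editor happens to save the file.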

Think of it like chemical formulas. Water is H₂O. You could write it as H-H-O or O-H-H, but chemists agree on H₂O as the standard form so everyone knows they're talking about the same molecule. Unicode normalization does the same for text. It defines rules for converting between representations so that "é" and "e + ́" can be treated as the same value.

The Unicode standard defines four normalization forms. NFC and NFD are the most common. NFC stands for Normalization Form Canonical Composition. It merges base characters with combining marks whenever a precomposed equivalent exists. NFD stands for Normalization Form Canonical Decomposition. It breaks precomposed characters into their base plus marks.

NFKC and NFKD add compatibility normalization. They also fold ligatures, fractions, and special forms. The ligature "ﬁ" becomes "fi". The fraction "½" becomes "1⁄2", where the slash is U+2044 FRACTION SLASH rather than the ASCII "/". These forms are aggressive. They change meaning in some contexts. Use them only when you need to flatten text for search indexing or data exchange.

Minimal example

The unicode-normalization crate provides methods for all four forms. Add it to your dependencies.

[dependencies]
unicode-normalization = "0.1"

Import the trait and call the normalization method. The method returns an iterator. Collect it into a String to get the result.

use unicode_normalization::UnicodeNormalization;

fn main() {
    // Precomposed form: single code point for é
    let s1 = "café";
    // Decomposed form: e followed by combining acute accent
    let s2 = "cafe\u{0301}";

    // Visual equality does not imply byte equality
    assert_ne!(s1, s2);

    // NFC merges base characters with combining marks
    let n1: String = s1.nfc().collect();
    let n2: String = s2.nfc().collect();

    // Normalization creates a canonical representation
    assert_eq!(n1, n2);
}

The nfc() method scans the string and applies composition rules. It finds "e" followed by "́" and replaces the pair with the single code point "é" (U+00E9). The method itself returns a lazy iterator; the collect() call consumes it and allocates a new String.

NFC is the safe default for storage and comparison. NFD shines when you need to strip accents or analyze grapheme structure.

Grapheme clusters and slicing

Normalization changes the code point count. "café" in NFC has four code points. In NFD, it has five. The visual length stays the same, but the internal representation shifts. This matters when you slice strings or count characters.

If you slice by byte index, you might cut in the middle of a combining sequence. Decomposed text is fragile. A slice that ends after "e" but before "́" silently drops the accent from the first part and leaves a dangling combining mark at the start of the rest. That mark then attaches to whatever character ends up before it, or renders on its own.

use unicode_normalization::UnicodeNormalization;

fn main() {
    let s = "cafe\u{0301}"; // NFD form

    // Slicing by byte index is dangerous with combining marks
    // This slice captures "e" but drops the accent
    let broken = &s[..4];

    // The accent is lost, changing the meaning
    println!("{}", broken); // prints "cafe"
}
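
Byte slicing has a second hazard: an index that lands inside a multi-byte code point makes the slice expression panic. The standard library's str::get and str::is_char_boundary let you check first, as this sketch shows. Note that they protect code point boundaries only, not grapheme boundaries.

```rust
fn main() {
    // Bytes 0..4 are ASCII; bytes 4..6 encode U+0301.
    let s = "cafe\u{0301}";

    // Byte 5 falls inside the two-byte combining mark.
    assert!(!s.is_char_boundary(5));

    // `&s[..5]` would panic here; `get` returns None instead.
    assert_eq!(s.get(..5), None);

    // Byte 4 is a valid code point boundary, so this slice succeeds,
    // but it still separates "e" from its accent.
    assert_eq!(s.get(..4), Some("cafe"));
}
```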

Always work with grapheme clusters when dealing with user-visible text. A grapheme cluster is what the user perceives as a single character. "é" is one grapheme, whether it's one code point or two. The unicode-segmentation crate provides grapheme iteration.

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "cafe\u{0301}";

    // Graphemes respect combining marks
    let graphemes: Vec<&str> = s.graphemes(true).collect();

    // Safe slicing by grapheme count
    let first_three: String = graphemes[..3].concat();
    println!("{}", first_three); // "caf"
}

Slice by graphemes, not by code points. Your UI won't break, and your data won't corrupt.

Realistic usage: normalize at the boundary

In a real application, you normalize at the I/O boundary. When data enters your system from a user, a file, or an API, you don't know which form it arrived in. Some platforms and input methods produce decomposed text; others produce composed text. If you store raw input, your database becomes a mix of forms. Queries fail unpredictably. Duplicates appear.

The fix is to normalize immediately upon ingestion. Create a wrapper function that handles the conversion. Keep the rest of your codebase unaware of the details.

use unicode_normalization::UnicodeNormalization;

/// Normalizes user input to NFC for consistent storage and lookup.
fn normalize_input(input: &str) -> String {
    // Collect the iterator into a new String to own the data
    input.nfc().collect()
}

fn main() {
    // Simulate input from different sources
    let ios_input = "nai\u{0308}ve"; // Decomposed ï: i + combining diaeresis
    let web_input = "na\u{00EF}ve";  // Precomposed ï

    // Normalize before storage or comparison
    let normalized_ios = normalize_input(ios_input);
    let normalized_web = normalize_input(web_input);

    // Now equality works reliably
    assert_eq!(normalized_ios, normalized_web);
}

Normalize at the edge. Keep the core clean. Your database stays consistent, and your search logic stops guessing.

Pitfalls and compiler errors

The nfc() method returns an iterator, not a String. This is a performance optimization. It yields normalized characters on demand. If you try to use the result directly where a &str is expected, the compiler rejects you with E0308 (mismatched types). You must call .collect::<String>() to materialize the result.

use unicode_normalization::UnicodeNormalization;

fn main() {
    let s = "café";

    // E0308: mismatched types
    // expected `&str`, found `Recompositions<Chars<'_>>`
    let result: &str = s.nfc();
    println!("{}", result);
}

The iterator avoids allocation until you need it. You can chain operations to skip intermediate strings. This is useful when you normalize and transform in one pass.

use unicode_normalization::UnicodeNormalization;

fn main() {
    let input = "  café  ";

    // Chain normalization with whitespace removal
    // (the filter drops every whitespace character, not only leading and trailing ones)
    // No intermediate String is allocated
    let result: String = input
        .nfc()
        .filter(|c| !c.is_whitespace())
        .collect();

    println!("{}", result); // "café"
}

Beware the difference between canonical and compatibility normalization. NFC and NFD handle canonical equivalence. They treat "é" and "e + ́" as the same. NFKC and NFKD handle compatibility equivalence. They also fold ligatures and special forms. The ligature "ﬁ" becomes "fi" under NFKC. NFC leaves "ﬁ" alone. Using NFKC can destroy intentional formatting or obscure data.

NFKC is a sledgehammer. Use it only when you need to flatten text for search indexing, and never for display.

Performance and allocation

Normalization requires scanning the string and applying rules. It's not free. The cost depends on the input. ASCII text is fast. Text with many combining marks is slower. The unicode-normalization crate relies on precomputed lookup tables, and it exposes quick checks such as is_nfc() so you can skip the work when the text is already normalized.

The iterator interface lets you control allocation. If you only need to compare two strings, you can normalize them into iterators and compare the iterators. This avoids allocating new strings entirely.

use unicode_normalization::UnicodeNormalization;

fn main() {
    let s1 = "café";
    let s2 = "cafe\u{0301}";

    // Compare normalized iterators without building intermediate Strings
    let equal = s1.nfc().eq(s2.nfc());

    assert!(equal);
}

This pattern is ideal for hot loops or large datasets. You get correctness without the memory overhead.

Chain the iterator. Skip the intermediate allocation.

Decision matrix

Use nfc() when you need a compact, standard form for storage, display, and comparison.

Use nfd() when you need to strip accents, sort by base character, or analyze combining marks individually.

Use nfkc() when you are building a search index and want to treat ligatures like "ﬁ" as their base letters "fi".

Use nfkd() when you need full decomposition including compatibility characters for low-level text processing.

Reach for unicode-segmentation when you need to slice or count user-visible characters safely.

Reach for plain string methods such as to_lowercase() when you only need case conversion; the standard library does not normalize, so normalization requires the external crate.

Where to go next

Unicode normalization ensures that different ways of writing the same character are treated as identical. It matters when comparing or storing text, because it prevents mismatches and duplicates caused by invisible encoding differences. Think of it like standardizing how you write a name so "Café" and "Cafe\u{0301}" are recognized as the same thing.