The string slice trap
You write a quick parser to extract a substring. The input is "Café". You grab the fourth character using s[3]. In Python, that works. In JavaScript, that works. In Rust, the compiler rejects you with E0277 (the trait Index<usize> is not implemented for str). You switch to slicing: s[0..4]. That compiles. You run the program. It panics: "byte index 4 is not a char boundary".
The panic happens because é takes two bytes in UTF-8: it occupies bytes 3 and 4 of the string. Index 4 lands in the middle of that character. Rust refuses to slice a string at a point that breaks a character. The language indexes strings by byte offset, not by character position. You have to ask for characters explicitly.
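As a minimal sketch of asking for that fourth character without a panic, the example below uses only standard library methods (len, chars, get). The helper name nth_char is hypothetical and exists only for illustration.

```rust
/// Returns the character at 0-based position `n`, decoding from the start.
/// `nth_char` is an illustrative helper, not a standard library function.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    // "Café" written with a precomposed é (U+00E9).
    let s = "Caf\u{e9}";

    // len() counts bytes; chars().count() counts Unicode scalar values.
    println!("bytes: {}, chars: {}", s.len(), s.chars().count());

    // get() is the non-panicking form of slicing.
    // It returns None when a bound is not on a character boundary.
    println!("get(0..4): {:?}", s.get(0..4));
    println!("get(0..3): {:?}", s.get(0..3));

    // Asking for the fourth character explicitly.
    println!("fourth char: {:?}", nth_char(s, 3));
}
```

Note that nth_char walks the string from the start on every call, so it is linear in the position requested.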
Bytes versus characters
Rust stores &str and String as UTF-8 encoded bytes. UTF-8 is a variable-length encoding. ASCII characters like A, z, or 5 take one byte. Characters from many other scripts take two or three bytes: ñ takes two, 世 takes three. Most emoji and rare symbols, like 🦀, take four bytes.
This design gives Rust strings two properties. First, ASCII is fast and compact. A string of English text uses exactly one byte per character, matching C strings. Second, you cannot jump to the Nth character in constant time. To find the fifth character, the program must scan the bytes and decode them one by one until it counts five.
Think of a string like a train. The cargo cars are bytes. The passengers are characters. Small passengers fit in one car. Large passengers need four cars chained together. If you count cars, you don't know how many passengers are on board. If you cut the train between cars, you might slice a passenger in half. Rust prevents you from cutting the train mid-passenger.
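To make the one-to-four byte range concrete, here is a small sketch that reports the UTF-8 width of each character in a string. The helper utf8_widths is a hypothetical name; char::len_utf8 is the standard library method doing the work.

```rust
/// Collects the UTF-8 width, in bytes, of every character in `s`.
/// `utf8_widths` is a hypothetical helper used only for illustration.
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

fn main() {
    // A (U+0041), ñ (U+00F1), 世 (U+4E16), 🦀 (U+1F980).
    let s = "A\u{f1}\u{4e16}\u{1f980}";
    let widths = utf8_widths(s);
    println!("widths: {:?}", widths);

    // The per-character widths always add up to the byte length.
    let total: usize = widths.iter().sum();
    println!("sum of widths: {}, s.len(): {}", total, s.len());
}
```

Counting cars (s.len()) and counting passengers (s.chars().count()) only agree when every width is 1, which is the pure ASCII case.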
Iterating with .chars()
When you need to process text logically, use .chars(). This method returns an iterator that yields char values. Each char is a Unicode scalar value: any code point except a surrogate. The iterator decodes the UTF-8 bytes on the fly and hands you complete characters.
fn main() {
    let text = "Hi 🌍";
    // .chars() decodes UTF-8 lazily and yields Unicode scalar values.
    // This is the safe way to loop over text.
    for c in text.chars() {
        // len_utf8() shows how many bytes this char occupied in the string.
        println!("Char: '{}', Bytes: {}", c, c.len_utf8());
    }
}
The output shows H and i taking one byte each. The space takes one byte. The globe emoji takes four bytes. The iterator handles the decoding. You get logical characters, not raw bytes.
Trust the iterator. It decodes UTF-8 correctly and keeps your text intact.
The char is not what you think
A char in Rust is a Unicode scalar value. That definition hides a trap. A scalar value is a code point, not necessarily a user-perceived character.
Consider the letter é. Unicode has two ways to represent it. You can use the precomposed code point é (U+00E9). Or you can use the letter e followed by a combining acute accent (U+0301). Both render as é on screen. Rust's .chars() treats them differently.
fn main() {
    let composed = "é"; // Single code point
    let decomposed = "e\u{0301}"; // Two code points: e + combining accent
    println!("Composed count: {}", composed.chars().count());
    println!("Decomposed count: {}", decomposed.chars().count());
}
The composed string yields one char. The decomposed string yields two char values. Both strings look identical when printed. If you count characters using .chars().count(), you get different results for visually identical text.
This matters for text editors, password length limits, and UI layout. If you need "grapheme clusters" (what a user sees as one character), you need the unicode-segmentation crate. The standard library stops at scalar values.
Counter-intuitive but true: char is a code point, not a grapheme. If your logic depends on user-perceived characters, reach for unicode-segmentation.
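One consequence worth seeing in code: the two spellings differ in byte length and do not compare equal, because == on strings compares bytes and the standard library performs no normalization. The sketch below uses only standard methods; the helper name bytes_and_scalars is hypothetical.

```rust
/// Returns (byte length, number of Unicode scalar values) for `s`.
/// `bytes_and_scalars` is an illustrative helper, not a library function.
fn bytes_and_scalars(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let composed = "\u{00E9}"; // precomposed é
    let decomposed = "e\u{0301}"; // e + combining acute accent

    println!("composed:   {:?}", bytes_and_scalars(composed));
    println!("decomposed: {:?}", bytes_and_scalars(decomposed));

    // String equality is a byte comparison, so these differ
    // even though they render the same on screen.
    println!("equal: {}", composed == decomposed);
}
```

If visually identical input must compare equal, the text has to be normalized first, which also requires an external crate.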
Finding byte boundaries with .char_indices()
Slicing a string requires byte indices, not character indices. If you want the first three characters of "Rust 🦀", s[0..3] happens to work, but only because those three characters are ASCII and take one byte each. The same slice on "🦀 Rust" asks for three bytes, lands inside the four-byte emoji, and panics.
Use .char_indices() to map character positions to byte offsets. This iterator yields tuples of (byte_index, char). The byte index tells you exactly where each character starts in the underlying buffer.
fn main() {
    let s = "Rust 🦀";
    // char_indices yields (byte_offset, char).
    // You need the byte offset to slice safely.
    for (i, c) in s.char_indices() {
        println!("Byte {} is '{}'", i, c);
    }
}
The output shows indices 0, 1, 2, 3 for R, u, s, t. The space is at 4. The crab emoji starts at 5 and takes four bytes. To slice the first three characters, you find the byte index of the fourth character and slice up to that point.
fn main() {
    let s = "Rust 🦀";
    // Find the byte index after the third character.
    // nth(3) gives the fourth item. We want the byte index of that item.
    let byte_index = s.char_indices().nth(3).map(|(i, _)| i).unwrap_or(s.len());
    // Slice using the byte index.
    let first_three = &s[..byte_index];
    println!("{}", first_three);
}
This approach is safe. The byte index comes from char_indices, which guarantees it lands on a character boundary. Slicing with that index never panics.
Slicing by character index is a trap. Always find the byte boundary first.
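The same idea extends to an arbitrary character range. The following is a sketch under stated assumptions, not a standard API: char_slice is a hypothetical helper that takes 0-based character positions (end exclusive), clamps positions past the end to the end of the string, and returns an empty string for an inverted range.

```rust
/// Returns the substring covering characters `start..end` (end exclusive).
/// Hypothetical helper: positions past the end clamp to the end of `s`.
fn char_slice(s: &str, start: usize, end: usize) -> &str {
    // Byte offset where the character at position `pos` begins,
    // or s.len() when `pos` is past the last character.
    let offset = |pos: usize| s.char_indices().nth(pos).map_or(s.len(), |(i, _)| i);
    let (from, to) = (offset(start), offset(end));
    if from >= to {
        ""
    } else {
        // Both offsets come from char_indices or s.len(),
        // so they are always on character boundaries.
        &s[from..to]
    }
}

fn main() {
    let s = "Rust 🦀";
    println!("{}", char_slice(s, 0, 3));
    println!("{}", char_slice(s, 5, 6));
    println!("{}", char_slice("🦀 Rust", 0, 1));
}
```

Each offset lookup scans from the start of the string, so this is linear in the positions requested. For repeated slicing of the same string, collecting the char_indices offsets once may be the better trade.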
When bytes win
.chars() decodes UTF-8. Decoding has a small cost. If you are processing raw binary data, parsing a network protocol, or optimizing a hot loop where you know the input is pure ASCII, .bytes() is faster.
.bytes() returns an iterator over u8 values. It skips decoding entirely. You get the raw bytes. This is useful for checksums, hashing, or binary formats.
fn main() {
    let s = "ABC";
    // .bytes() yields raw u8 values.
    // No UTF-8 decoding happens. This is zero-cost iteration.
    for b in s.bytes() {
        println!("Byte: 0x{:02x}", b);
    }
}
If the string contains multi-byte characters, .bytes() yields the individual bytes. You lose character semantics. A three-byte character becomes three separate u8 values. Use this only when you understand the byte layout or when performance demands it and the data is ASCII.
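As a sketch of a case where the byte view is safe even on non-ASCII input: every byte of a multi-byte UTF-8 character is 0x80 or higher, so a scan for ASCII bytes can never match part of one. The helper count_ascii_digits is a hypothetical name; bytes, is_ascii_digit, and is_ascii are standard library methods.

```rust
/// Counts ASCII digits by scanning raw bytes, with no UTF-8 decoding.
/// Hypothetical helper. Bytes belonging to multi-byte characters are
/// all >= 0x80, so they are never mistaken for an ASCII digit.
fn count_ascii_digits(s: &str) -> usize {
    s.bytes().filter(|b| b.is_ascii_digit()).count()
}

fn main() {
    let s = "世界 42";
    println!("digits: {}", count_ascii_digits(s));

    // is_ascii() tells you whether bytes and characters line up one-to-one.
    println!("pure ASCII: {}", s.is_ascii());

    // A two-byte character shows up as two separate u8 values.
    for b in "ñ".bytes() {
        println!("Byte: 0x{:02x}", b);
    }
}
```

Logic that depends on character positions, rather than on finding ASCII bytes, still needs .chars() or .char_indices().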
Convention aside: when calling .clone() on an Rc<String>, the community prefers Rc::clone(&data) over data.clone(). The explicit form signals that you are cloning the reference, not the string. Similarly, when iterating, .chars() signals text processing, while .bytes() signals binary processing. Pick the method that matches your intent.
Pitfalls and panics
Indexing a string by usize is a compile error. str does not implement Index<usize>. The compiler rejects s[6] with E0277 because the trait is missing. This prevents accidental character indexing.
Slicing with s[0..6] compiles because str implements Index<Range<usize>>. The slice checks bounds at runtime. If the range starts or ends inside a multi-byte character, the program panics with "byte index N is not a char boundary".
fn main() {
    let s = "Hello 世界";
    // This panics at runtime.
    // '世' occupies bytes 6 to 8, so index 7 falls inside it.
    // The slice boundary cuts the character.
    let _bad = &s[0..7];
}
To avoid panics, check boundaries before slicing. Use s.is_char_boundary(index). This method returns true if the index points to the start of a character or to the end of the string.
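fn main() {
    let s = "Hello 世界";
    // '世' starts at byte 6, so 6 is a boundary and 7 is not.
    for index in [6, 7] {
        if s.is_char_boundary(index) {
            println!("Safe to slice at {}", index);
        } else {
            println!("Index {} is inside a character", index);
        }
    }
}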
The check is cheap. It reads the single byte at the index and confirms that it is not a UTF-8 continuation byte. Use this when you receive byte indices from external sources or calculations.
Don't guess byte offsets. Verify with is_char_boundary or derive them from char_indices.
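When a byte index arrives from outside and may land mid-character, one option is to round it down to the nearest boundary before slicing. The sketch below builds that on is_char_boundary. Both floor_boundary and truncate_to_bytes are hypothetical helper names, not standard library functions.

```rust
/// Largest character boundary that is <= `index`, clamped to s.len().
/// Hypothetical helper built on str::is_char_boundary.
fn floor_boundary(s: &str, index: usize) -> usize {
    let mut i = index.min(s.len());
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Truncates `s` to at most `max_bytes` bytes without splitting a character.
fn truncate_to_bytes(s: &str, max_bytes: usize) -> &str {
    &s[..floor_boundary(s, max_bytes)]
}

fn main() {
    let s = "Hello 世界";
    // Index 7 is inside '世', so the cut moves back to byte 6.
    println!("[{}]", truncate_to_bytes(s, 7));
    println!("[{}]", truncate_to_bytes(s, 9));
    // An oversized limit returns the whole string.
    println!("[{}]", truncate_to_bytes(s, 100));
}
```

The loop steps back at most three bytes, since a UTF-8 character is never longer than four. If you would rather reject a bad index than adjust it, s.get(range) returns None instead of panicking.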
Decision matrix
Use .chars() when you need to process text logically, handling emojis, accented characters, and non-Latin scripts correctly.
Use .bytes() when you are working with raw binary data, parsing protocols, or optimizing a hot loop where you know the input is pure ASCII.
Use .char_indices() when you need to slice the string or map character positions back to byte offsets for substring extraction.
Use .as_bytes() when you need a &[u8] slice for APIs that expect byte arrays, accepting that you lose character boundaries.
Pick the tool that matches your data. Text gets .chars(). Bytes get .bytes(). Mixing them up is how panics happen.