How to index a String in Rust
H

8

152

I am attempting to index a string in Rust, but the compiler throws an error. My code (Project Euler problem 4, playground):

fn is_palindrome(num: u64) -> bool {
    let num_string = num.to_string();
    let num_length = num_string.len();

    for i in 0 .. num_length / 2 {
        if num_string[i] != num_string[(num_length - 1) - i] {
            return false;
        }
    }
    
    true
}

The error:

error[E0277]: the trait bound `std::string::String: std::ops::Index<usize>` is not satisfied
 --> <anon>:7:12
  |
7 |         if num_string[i] != num_string[(num_length - 1) - i] {
  |            ^^^^^^^^^^^^^
  |
  = note: the type `std::string::String` cannot be indexed by `usize`

Is there a reason why String cannot be indexed? How can I access the data then?

Hardin answered 2/7, 2014 at 22:29 Comment(1)
This answer might help: #22118721 - Schematize
E
195

Yes, indexing into a string is not available in Rust. The reason is that Rust strings are encoded in UTF-8 internally, so the concept of indexing itself would be ambiguous, and people would misuse it. Byte indexing is fast but almost always incorrect: when your text contains non-ASCII symbols, a byte index may land inside a character, which is really bad if you need text processing. Char indexing is not free, because UTF-8 is a variable-length encoding, so you have to traverse the string from the start to find the required code point.
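To make the ambiguity concrete, here is a small sketch using only the standard library. The helper name byte_len_and_char_count and the sample string are made up for illustration:

```rust
// Hypothetical helper: returns (number of bytes, number of code points).
fn byte_len_and_char_count(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // U+00E9 (e with acute accent) takes two bytes in UTF-8.
    let s = "h\u{e9}llo";

    // len() counts bytes, chars().count() counts code points.
    assert_eq!(byte_len_and_char_count(s), (6, 5));

    // Byte offset 2 falls in the middle of the two-byte character,
    // so it is not a valid place to split the string.
    assert!(s.is_char_boundary(1));
    assert!(!s.is_char_boundary(2));
    assert!(s.is_char_boundary(3));

    // get() returns None instead of panicking on an invalid boundary.
    assert_eq!(s.get(0..2), None);
    assert_eq!(s.get(0..3), Some("h\u{e9}"));
}
```

So "the character at index 2" has no single obvious meaning here: byte 2 is half of a character, while code point 2 is the first "l".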

If you are certain that your strings contain ASCII characters only, you can use the as_bytes() method on &str which returns a byte slice, and then index into this slice:

let num_string = num.to_string();

// ...

let b: u8 = num_string.as_bytes()[i];
let c: char = b as char;  // if you need to get the character as a unicode code point
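Applied to the function from the question, the byte-slice approach could look roughly like this. It is a sketch that relies on the assumption that u64::to_string() only produces ASCII digits:

```rust
// Assumes ASCII-only input, which holds for the decimal digits
// produced by u64::to_string().
fn is_palindrome(num: u64) -> bool {
    let num_string = num.to_string();
    let bytes = num_string.as_bytes();
    let len = bytes.len();

    // Compare the i-th byte from the front with the i-th byte from the back.
    (0..len / 2).all(|i| bytes[i] == bytes[len - 1 - i])
}

fn main() {
    assert!(is_palindrome(9009));
    assert!(is_palindrome(7));
    assert!(!is_palindrome(9010));
}
```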

If you do need to index code points, you have to use the chars() iterator:

num_string.chars().nth(i).unwrap()

As I said above, this requires traversing the iterator up to the i-th code point.
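A short sketch of how code point indexing differs from byte indexing on the same string. The helper nth_char is a name invented here, not a standard library function:

```rust
// Hypothetical helper: the i-th code point, or None if i is out of range.
// Every call walks the string from the start, so it costs O(i).
fn nth_char(s: &str, i: usize) -> Option<char> {
    s.chars().nth(i)
}

fn main() {
    // U+00EF (i with diaeresis) is encoded as the two bytes 0xC3 0xAF.
    let s = "na\u{ef}ve";

    // Code point index 2 is the whole character.
    assert_eq!(nth_char(s, 2), Some('\u{ef}'));
    // Byte index 2 is only the first half of its encoding.
    assert_eq!(s.as_bytes()[2], 0xC3);
    // There are 5 code points, so index 5 is out of range.
    assert_eq!(nth_char(s, 5), None);
}
```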

Finally, in many cases of text processing, it is actually necessary to work with grapheme clusters rather than with code points or bytes. With the help of the unicode-segmentation crate, you can index into grapheme clusters as well:

use unicode_segmentation::UnicodeSegmentation;

let string: String = ...;
UnicodeSegmentation::graphemes(&string, true).nth(i).unwrap()

Naturally, grapheme cluster indexing has the same cost as indexing into code points: the string has to be traversed from the start.
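To see why code points are not always what a reader would call a character, here is a standard-library-only sketch (it does not use unicode-segmentation; the helper name code_points is made up for this example):

```rust
// Hypothetical helper: number of Unicode scalar values in a string.
fn code_points(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT is displayed as one
    // user-perceived character, but it is two code points and three bytes.
    let decomposed = "e\u{301}";
    assert_eq!(code_points(decomposed), 2);
    assert_eq!(decomposed.len(), 3);

    // The precomposed form U+00E9 is a single code point.
    let precomposed = "\u{e9}";
    assert_eq!(code_points(precomposed), 1);

    // The two strings look the same on screen but compare as different.
    assert_ne!(decomposed, precomposed);
}
```

With chars().nth(1) on the decomposed string you would get the bare combining accent, which is the kind of result that grapheme cluster iteration is meant to avoid.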

Eikon answered 2/7, 2014 at 23:9 Comment(4)
FWIW, String could never be indexed. The indexing removal was only for &str. - Desiccate
I think nowadays, char_at() was also removed... (rustc 1.23.0-nightly (79cfce3d3 2017-11-12)) - Foulard
Be aware that chars().nth(i) walks an iterator, so the operation is O(n) and not O(1) as with Vec indexing. - Curia
I'm a bit confused by this snippet, which does index on String: google.github.io/comprehensive-rust/types-and-values/… - Weanling
P
50

The correct approach to doing this sort of thing in Rust is not indexing but iteration. The main problem here is that Rust's strings are encoded in UTF-8, a variable-length encoding for Unicode characters. Being variable in length, the memory position of the nth character can't be determined without looking at the string. This also means that accessing the nth character has a runtime of O(n)!

In this special case, you can iterate over the bytes, because your string is known to only contain the characters 0–9 (iterating over the characters is the more general solution but is a little less efficient).

Here is some idiomatic code to achieve this (playground):

fn is_palindrome(num: u64) -> bool {
    let num_string = num.to_string();
    let half = num_string.len() / 2;

    num_string.bytes().take(half).eq(num_string.bytes().rev().take(half))
}

We go through the bytes in the string both forwards (num_string.bytes().take(half)) and backwards (num_string.bytes().rev().take(half)) simultaneously; the .take(half) part is there to halve the amount of work done. We then simply compare one iterator to the other to check at each step that the nth and nth-last bytes are equal; if they all are, eq returns true; if not, false.
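For the more general case of arbitrary text rather than digits, the same idea can be sketched with chars() instead of bytes(). The function name is_palindrome_str is invented here, and note that it compares code points, not grapheme clusters:

```rust
// Hypothetical generalization: compares code points front-to-back with
// code points back-to-front. Combining characters are not handled.
fn is_palindrome_str(s: &str) -> bool {
    s.chars().eq(s.chars().rev())
}

fn main() {
    assert!(is_palindrome_str("9009"));
    // U+00E9 is a two-byte character; reversing by bytes would break it,
    // reversing by chars does not.
    assert!(is_palindrome_str("\u{e9}t\u{e9}"));
    assert!(!is_palindrome_str("abc"));
}
```

This version compares the whole string rather than half of it, which keeps it short at the price of doing twice the comparisons.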

Prissy answered 2/7, 2014 at 23:20 Comment(2)
FWIW, String has a direct as_bytes. Furthermore, you can use std::iter::order::equals rather than the all: equals(iter.take(n), iter.rev().take(n)). - Desiccate
BTW, convention implies importing std::iter::order and calling order::equals(..., ...) (I only didn't do this in my comment because it would've been noisy). - Desiccate
P
37

If what you are looking for is something similar to an index, you can use

.chars() and .nth() on a string.


.chars() -> Returns an iterator over the chars of a string slice.

.nth() -> Returns the nth element of the iterator, in an Option


Now you can use the above in several ways, for example:

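let s: String = String::from("abc");
let x = 1; // index of the character to fetch
// If you are sure the index is in range:
println!("{}", s.chars().nth(x).unwrap());
// Same, but with a custom panic message if it is not:
println!("{}", s.chars().nth(x).expect("message"));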
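Two things are easy to miss with nth(): it consumes the iterator as it goes, and it returns an Option that can be handled without panicking. A minimal sketch (char_at_or is a made-up helper name):

```rust
// Hypothetical helper: the i-th char, or a fallback if i is out of range.
fn char_at_or(s: &str, i: usize, default: char) -> char {
    s.chars().nth(i).unwrap_or(default)
}

fn main() {
    let s = String::from("abc");

    // nth() advances the iterator, so calling nth(0) twice on the same
    // iterator yields different elements.
    let mut it = s.chars();
    assert_eq!(it.nth(0), Some('a'));
    assert_eq!(it.nth(0), Some('b'));

    // A fresh iterator starts from the beginning again.
    assert_eq!(s.chars().nth(0), Some('a'));

    // Out-of-range lookups give None, which can be handled explicitly.
    assert_eq!(char_at_or(&s, 1, '?'), 'b');
    assert_eq!(char_at_or(&s, 10, '?'), '?');
}
```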
Pendleton answered 20/5, 2017 at 2:3 Comment(2)
It is important to note that Chars::nth(n) consumes n characters, rather than just being plain indexing. As stated by the documentation, calling nth(0) multiple times on the same iterator will return different elements. - Sestina
If you are indeed not sure whether the Nth character exists, using expect() versus unwrap() will not prevent a panic. The code will panic regardless, but expect will provide a custom panic message. See also: #61302081 - Gallardo
B
23

You can convert a String or &str to a Vec of chars and then index that Vec.

For example:

fn main() {
    let s = "Hello world!";
    let my_vec: Vec<char> = s.chars().collect();
    println!("my_vec[0]: {}", my_vec[0]);
    println!("my_vec[1]: {}", my_vec[1]);
}

Here you have a live example
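On the cost of this approach: collecting walks the string once and allocates a new buffer, and each char occupies 4 bytes regardless of how it was encoded in UTF-8. A rough sketch (to_chars is a name made up for this example):

```rust
use std::mem::size_of;

// Hypothetical helper: one O(n) pass plus one allocation.
fn to_chars(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    let s = "Hello world!";
    let chars = to_chars(s);

    // After the one-time conversion, indexing is O(1).
    assert_eq!(chars[0], 'H');
    assert_eq!(chars[chars.len() - 1], '!');

    // The original is 12 bytes of UTF-8; the Vec holds 12 chars at
    // 4 bytes each.
    assert_eq!(s.len(), 12);
    assert_eq!(size_of::<char>(), 4);
    assert_eq!(chars.len() * size_of::<char>(), 48);
}
```

That trade-off tends to pay off when you need many random lookups into the same string, and less so for a single access.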

Brandish answered 27/2, 2020 at 9:8 Comment(1)
How about the performance? I think the string bytes are copied. - Zorazorah
S
4

Indexing on String is not allowed because (please check the book):

  • it is not clear what the indexed value should be: a byte, a character, or a grapheme cluster (what we would call a letter in everyday usage)
  • strings are vectors of bytes (u8) encoded as UTF-8, and UTF-8 is a variable-length encoding, i.e. each character can take a different number of bytes, from 1 to 4. So getting a character or grapheme cluster by index requires traversing the string from the beginning (O(n) on average and in the worst case) to determine the valid byte bounds of the character or grapheme.

So if your input doesn't contain diacritics (considered as separate characters) and it's OK to approximate a letter with a character, you can use the chars() iterator and the DoubleEndedIterator trait for a two-pointer approach:

    fn is_palindrome(num: u64) -> bool {
        let s = num.to_string();
        let mut iterator = s.chars();
        loop  {
            let ch = iterator.next();
            let ch_end = iterator.next_back();
            
            if ch.is_none() || ch_end.is_none() {
                break;
            }
            if ch.unwrap() != ch_end.unwrap() {
                return false
            }
        }
        true
    }
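The same two-pointer loop can be written a bit more compactly with while let, which avoids the explicit is_none() checks and unwrap() calls. This is a sketch of an equivalent variant, not a change in behavior:

```rust
fn is_palindrome(num: u64) -> bool {
    let s = num.to_string();
    let mut chars = s.chars();

    // next() and next_back() walk toward each other. The loop stops as
    // soon as either end runs out; for an odd length the middle char is
    // taken by next() and next_back() then returns None.
    while let (Some(front), Some(back)) = (chars.next(), chars.next_back()) {
        if front != back {
            return false;
        }
    }
    true
}

fn main() {
    assert!(is_palindrome(121));
    assert!(is_palindrome(1221));
    assert!(!is_palindrome(12));
}
```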
Soledadsolely answered 27/3, 2022 at 11:16 Comment(0)
P
1

this is not suitable for all uses by any means, but if you just need to reference the previous character (or, with a little rework, the next character), then it's possible to do so without iterating through the entire str.

the scenario here is that there is a str slice, string, and pattern was found in the slice. i want to know the character immediately before the pattern.

call prev_char like prev_char(string.as_bytes(), pattern_index) where pattern index is the index of the first byte of pattern in string.

utf-8 encoding is well defined and this works just by backing up until it finds one of the starting bytes (either high order bit 0 or bits 11) and then converting that 1-4 byte [u8] slice to a str.

this code just unwraps it because the pattern was found in a valid utf-8 str to begin with, so no error is possible. if your data has not been validated it might be best to return a result rather than an Option.

enum PrevCharStates {
    Start,
    InEncoding,
}

fn prev_char(bytes: &[u8], starting_index: usize) -> Option<&str> {
    let mut ix = starting_index;
    let mut state = PrevCharStates::Start;

    while ix > 0 {
        ix -= 1;
        let byte = bytes[ix];
        match state {
            PrevCharStates::Start => {
                if byte & 0b10000000 == 0 {
                    return Some(std::str::from_utf8(&bytes[ix..starting_index]).unwrap());
                } else if byte & 0b11000000 == 0b10000000 {
                    state = PrevCharStates::InEncoding;
                }
            },
            PrevCharStates::InEncoding => {
                if byte & 0b11000000 == 0b11000000 {
                    return Some(std::str::from_utf8(&bytes[ix..starting_index]).unwrap());
                } else if byte & 0b11000000 != 0b10000000 {
                    return None;
                }
            }
        }
    }
    None
}
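For comparison, a sketch of a shorter variant built on the standard library's chars().next_back(). It takes the &str directly and returns a char rather than a &str, so the signature differs from the function above; using get() instead of direct slicing is a choice made here so that an index inside a character yields None rather than a panic:

```rust
// Variant with a different signature: returns the char that ends
// right before byte offset `index`, if there is one.
fn prev_char(s: &str, index: usize) -> Option<char> {
    // get(..index) is None when index is out of range or not on a
    // character boundary.
    s.get(..index)?.chars().next_back()
}

fn main() {
    // U+00E9 takes two bytes, so " au" starts at byte offset 5.
    let s = "caf\u{e9} au lait";
    let pattern_index = s.find(" au").unwrap();
    assert_eq!(pattern_index, 5);

    assert_eq!(prev_char(s, pattern_index), Some('\u{e9}'));
    // Nothing comes before the start of the string.
    assert_eq!(prev_char(s, 0), None);
    // Offset 4 is inside the two-byte character.
    assert_eq!(prev_char(s, 4), None);
}
```

Only the last character of the prefix is decoded, so this does not iterate through the whole string either.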
Pears answered 2/11, 2021 at 22:1 Comment(2)
This function can be written, with a slightly different signature, as string[..index].chars().next_back() (playground) - Selfanalysis
thanks. i'm pretty new to rust and seem to learn something new every day. - Pears
T
1

The code below works fine. I am not sure about its performance and big-O complexity; hopefully someone can add more information about this solution.

fn is_palindrome(num: u64) -> bool {
    let num_string = num.to_string();
    let num_length = num_string.len();
    for i in 0..num_length / 2 {
        let left = &num_string[i..i + 1];
        let right = &num_string[((num_length - 1) - i)..num_length - i];
        if left != right {
            return false;
        }
    }
    true
}
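On the open question about complexity: slicing a string by a byte range is a constant-time operation, so the loop as a whole is linear in the number of digits. The ranges are byte offsets, though, which is safe for digits but panics if a boundary falls inside a multi-byte character. A sketch of the non-panicking alternative, str::get (byte_slice is a made-up helper name):

```rust
// Hypothetical helper: the one-byte slice starting at byte offset i,
// or None if that range is out of bounds or splits a character.
fn byte_slice(s: &str, i: usize) -> Option<&str> {
    s.get(i..i + 1)
}

fn main() {
    // ASCII digits are one byte each, so every one-byte range is valid.
    assert_eq!(byte_slice("9009", 0), Some("9"));
    assert_eq!(byte_slice("9009", 3), Some("9"));

    // U+00BD (vulgar fraction one half) takes two bytes, so a one-byte
    // range would cut it in half. &s[0..1] would panic here.
    assert_eq!(byte_slice("\u{bd}", 0), None);

    // Out-of-range offsets also give None instead of panicking.
    assert_eq!(byte_slice("9009", 4), None);
}
```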
Thicket answered 30/12, 2021 at 1:9 Comment(1)
I think [i..i] directly slices into the string as bytes. To test, try ¹⁄₂₂⁄¹. I think it gives false, because your code reverses inside Unicode code points. - Isa
M
1

There are two reasons indexing does not work on strings in Rust:

  • In Rust, strings are stored as a collection of UTF-8 encoded bytes. In memory, strings are just collections of 1's and 0's. A program needs to be able to interpret those 1's and 0's and print out the correct characters. That's where encoding comes into play.

       fn main(){
           let sample:String=String::from("2bytesPerChar");
           // we could do this in many higher-level languages; in Rust it is a compile error: String cannot be indexed by an integer
           let c:char=sample[0];
       }
    

A String is a collection of bytes, so what is the length of our "2bytesPerChar"? It depends, because a char can take 1 to 4 bytes. Assume that the first character takes 2 bytes. If you want the first char of the string, sample[0] would refer to the first byte, which is only half of the first character.

  • Another reason is that there are three relevant ways a word is represented in Unicode: bytes, scalar values, and grapheme clusters. If we use indexing, Rust does not know which of these we expect to receive, so we have to use more specific methods.
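The "1 to 4 bytes" claim can be checked with char::len_utf8 from the standard library. A small sketch (utf8_widths is a name invented for this example, and the sample characters are arbitrary):

```rust
// Hypothetical helper: UTF-8 width in bytes of each char in the string.
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

fn main() {
    // 'a' (U+0061), U+00E9, U+20AC (euro sign), U+1F600 (an emoji).
    let s = "a\u{e9}\u{20ac}\u{1f600}";

    assert_eq!(utf8_widths(s), vec![1, 2, 3, 4]);

    // Four scalar values, but ten bytes in total.
    assert_eq!(s.chars().count(), 4);
    assert_eq!(s.len(), 10);
}
```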

How to access the characters in String

  • Return bytes

       for b in "dsfsd".bytes(){
           // bytes() returns an iterator over the bytes; here we print each byte
           println!("{}",b)
       }
    
  • Return scalar values:

   // we could iterate over scalar values using char methods
   for c in "kjdskj".chars(){
       println!("{}",c)
   }
  • Return grapheme clusters:

In order to keep the Rust standard library lean, the ability to iterate over grapheme clusters is not included by default. We need to import a crate:

// in Cargo.toml
   [dependencies]
   unicode-segmentation="1.7.1"

then:

   use unicode_segmentation::UnicodeSegmentation;
   // we pass true to get extended grapheme clusters
   for g in "dada".graphemes(true){
       println!("{}",g)
   }
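Besides bytes() and chars(), the standard library also has char_indices(), which pairs each scalar value with the byte offset where it starts. Those offsets are always valid slicing points. A minimal sketch (char_offsets is a made-up helper name):

```rust
// Hypothetical helper: (byte offset, char) for every scalar value.
fn char_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}

fn main() {
    // U+00E9 takes two bytes, so 'b' starts at byte offset 3.
    let s = "a\u{e9}b";

    assert_eq!(
        char_offsets(s),
        vec![(0, 'a'), (1, '\u{e9}'), (3, 'b')]
    );

    // Offsets from char_indices() fall on character boundaries, so
    // slicing with them does not panic.
    let (start, _) = char_offsets(s)[2];
    assert_eq!(&s[start..], "b");
}
```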

Motivity answered 11/4, 2022 at 1:24 Comment(0)
