Find a string starting from given index

What is the correct way to find a substring when the search should start somewhere other than index 0?

I have this code:

fn SplitFile(reader: BufReader<File>) {
  for line in reader.lines() {
    let mut l = line.unwrap();
    // l contains "06:31:53.012   index0:2015-01-06 00:00:13.084"
    ...

I need to find the third : and parse the date after it. I have no idea how to do that, because find doesn't take a parameter such as begin; see https://doc.rust-lang.org/std/string/struct.String.html#method.find.

(I know I can use a regex. I have already done that, but I'd like to compare the performance: whether parsing by hand might be quicker than using a regex.)

Larrylars answered 7/7, 2015 at 21:38 Comment(1)
What kind of begin parameter are you thinking of? If you mean that begin is an offset, then you'd just slice first and then search: s[begin..].find(...) - Widner
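A minimal sketch of the slice-then-find idea from the comment above. The helper name `find_from` is made up for illustration and is not a standard-library method. One detail worth noting is that `find` on a slice returns an index relative to that slice, so the offset has to be added back to get a position in the original string.

```rust
/// Finds `pat` in `s`, searching only from byte offset `begin` onward.
/// Returns the match position relative to the start of `s`.
/// `find_from` is a hypothetical helper name, not part of std.
fn find_from(s: &str, pat: char, begin: usize) -> Option<usize> {
    // `get` returns None instead of panicking when `begin` is out of range
    // or does not fall on a UTF-8 character boundary.
    s.get(begin..)?.find(pat).map(|idx| idx + begin)
}
```

Calling it three times, each time starting one byte past the previous match, locates the third colon; everything after that index is the timestamp.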
U
4

You are right, there doesn't appear to be any trivial way of skipping several matches when searching a string. You can do it by hand though.

fn split_file(reader: BufReader<File>) {
    for line in reader.lines() {
        let mut l = &line.as_ref().unwrap()[..]; // get a slice
        for _ in 0..3 {
            if let Some(idx) = l.find(":") {
                l = &l[idx+1..]
            } else {
                panic!("the line didn't have enough colons"); // you probably shouldn't panic
            }
        }
        // l now contains the date
        ...
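
If panicking is not wanted, the same loop can be wrapped in a function that returns an Option. This is only a sketch under that assumption; `after_nth` is a hypothetical name, not something from the standard library.

```rust
/// Returns the part of `line` after the `n`-th occurrence of `delim`,
/// or None if the line contains fewer than `n` delimiters.
/// `after_nth` is a hypothetical helper name used only for this sketch.
fn after_nth(line: &str, delim: char, n: usize) -> Option<&str> {
    let mut rest = line;
    for _ in 0..n {
        // `?` returns None early when no further delimiter is found.
        let idx = rest.find(delim)?;
        // Skip the delimiter itself; `len_utf8` keeps this correct for
        // multi-byte delimiters as well.
        rest = &rest[idx + delim.len_utf8()..];
    }
    Some(rest)
}
```

With the sample line from the question, `after_nth(line, ':', 3)` yields the text after the third colon, and a line with too few colons yields None.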

Update:

As faiface points out below, you can do this a bit cleaner with splitn():

fn split_file(reader: BufReader<File>) {
    for line in reader.lines() {
        let l = line.unwrap();
        if let Some(datetime) = l.splitn(4, ':').last() {
            // datetime now contains the timestamp string
            ...
        } else {
            panic!("line doesn't contain a timestamp");
        }
    }
}
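
One caveat with `.last()`, shown as a small sketch: `splitn` always yields at least one piece, so `.last()` returns Some even when the line has fewer than three colons, and the else branch is not reached for such lines. Taking the fourth piece with `.nth(3)` does report the malformed case. The two function names below are made up for the comparison.

```rust
/// Takes whatever piece comes last, as in the `.last()` variant.
fn timestamp_last(line: &str) -> Option<&str> {
    line.splitn(4, ':').last()
}

/// Takes the fourth piece only, as in the `.nth(3)` variant.
fn timestamp_nth(line: &str) -> Option<&str> {
    line.splitn(4, ':').nth(3)
}
```

Both agree on a well-formed line. On a short line such as "06:31", the first returns Some("31") while the second returns None.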

You should go upvote his answer.

Underhanded answered 7/7, 2015 at 21:52 Comment(3)
Thank you, I'll try that. Can you also tell me how performant that is? What does l = &l[idx+1..] do? Does it create a new slice on the stack? Does it copy the relevant bytes? I'm asking because I'm processing large files and any such extra work might hurt performance significantly. - Larrylars
Slices are references; they never copy data, nor do they allocate anything of their own. - Widner
@stej: measure, measure, measure :) In this specific case, it might well be that the bounds check in l[idx+1..] costs more than the assignment itself. You can check how a slice is implemented in the std::raw module: just an integer and a pointer. - Fremd
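The point made in these comments can be checked directly. The sketch below is not a benchmark; it only shows that a sub-slice points into the original buffer and that a `&str` is two machine words wide (a pointer and a length). The helper names are hypothetical.

```rust
/// Returns true if `sub` points into the memory owned by `whole`,
/// i.e. the slice borrowed existing bytes rather than copying them.
fn is_view_into(whole: &str, sub: &str) -> bool {
    let start = whole.as_ptr() as usize;
    let end = start + whole.len();
    let p = sub.as_ptr() as usize;
    p >= start && p + sub.len() <= end
}

/// Size of a `&str` measured in machine words (pointer plus length).
fn str_ref_size_in_words() -> usize {
    std::mem::size_of::<&str>() / std::mem::size_of::<usize>()
}
```

For `let rest = &line[22..];`, `is_view_into(&line, rest)` holds, whereas a copy made with `rest.to_string()` lives in its own allocation.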
B
5

In my opinion there is a much simpler solution to this problem: the .splitn() method. It splits a string by a given pattern into at most n pieces. For example:

let s = "ab:bc:cd:de:ef".to_string();
println!("{:?}", s.splitn(3, ':').collect::<Vec<_>>());
// ^ prints ["ab", "bc", "cd:de:ef"]

In your case, you need to split the line into 4 parts separated by ':' and take the fourth one (index 3, counting from 0):

// assuming the line is correctly formatted
let date = l.splitn(4, ':').nth(3).unwrap();

If you don't want to use unwrap (the line might not be correctly formatted):

if let Some(date) = l.splitn(4, ':').nth(3) {
    // parse the date and time
}
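
Put together with the line-reading loop from the question, this could look roughly as follows. It is a sketch, not a tuned implementation: `timestamps` is a hypothetical name, it accepts any `BufRead` (so a `BufReader<File>` works), and it assumes that malformed lines should simply be skipped.

```rust
use std::io::BufRead;

/// Collects the text after the third ':' from every line of `reader`.
/// Reading stops at the first I/O error; lines with fewer than three
/// colons are skipped. `timestamps` is a hypothetical helper name.
fn timestamps<R: BufRead>(reader: R) -> Vec<String> {
    reader
        .lines()
        .map_while(Result::ok) // stop at the first I/O error instead of unwrapping
        .filter_map(|l| l.splitn(4, ':').nth(3).map(str::to_owned))
        .collect()
}
```

Collecting into a `Vec<String>` allocates once per line, which is convenient for a demonstration; for large files it is probably better to parse each timestamp inside the loop.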
Bounder answered 9/7, 2015 at 16:29 Comment(2)
Good call. I remembered that Rust had split() but for some reason I didn't think of splitn(). BTW, it might be conceptually cleaner to use l.splitn(4, ':').last() instead of .nth(3). - Underhanded
That sounds good too. I have already implemented the other solution, but I'll probably run some benchmarks that include this one as well. - Larrylars
R
2

Just the date and not also the time, right?

let test: String = "06:31:53.012   index0:2015-01-06 00:00:13.084".into();

let maybe_date = test.split_whitespace()
    .nth(1)
    .and_then(|substring| substring.split(':').nth(1));

assert_eq!(maybe_date, Some("2015-01-06"));
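If both the date and the time are needed as separate pieces, one possible sketch combines `splitn` with `split_once`. The function name `date_and_time` is hypothetical, and the code assumes the date and time are separated by a single space, as in the sample line.

```rust
/// Splits the timestamp after the third ':' into (date, time).
/// Returns None if the line has fewer than three colons or if the
/// timestamp contains no space. `date_and_time` is a hypothetical name.
fn date_and_time(line: &str) -> Option<(&str, &str)> {
    let timestamp = line.splitn(4, ':').nth(3)?;
    timestamp.split_once(' ')
}
```

For the sample line this gives the pair ("2015-01-06", "00:00:13.084").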
Rubino answered 7/7, 2015 at 22:9 Comment(2)
I assumed the time was part of the date that stej wanted to parse. Together they represent a fully-specified timestamp. - Underhanded
It's true, I wanted to parse the time as well. This code might look good for small files and shows another way to solve my problem, so I upvoted it as well. - Larrylars
