Read YAML where integers contain underscores in Rust
I'd like to be able to read integers that contain underscores as a thousands separator:

- instrument: 5_000_000
  other_field: this string contains an _
- instrument: 5_000_000
  other_field: this string contains an _

How is this possible using serde_yaml?

Hyaluronidase answered 10/12, 2021 at 11:35 Comment(4)
As far as the YAML deserializer is concerned, 5_000_000 is and always will be a string, because it can't be an integer. Since you probably don't want to fork a custom YAML deserializer, the approach should probably be to have a FooSerialized struct with instrument defined as a String/&str, plus an implementation of TryFrom into Foo (where instrument is a u64). The conversion may result in a std::num::ParseIntError if the string can't be converted to an integer after stripping the underscores. - Standee
@Standee Nice idea, but this would mean custom code every time a field name changes, and it seems like a lot of work since we already have a lot of fields. - Hyaluronidase
You can use deserialize_with to specify a custom function for deserializing instrument and other similar fields. - Neolithic
@Hyaluronidase Instead of modifying your structs, you could also modify the Deserializer. Something similar to this: github.com/serde-rs/json/issues/833#issuecomment-981989078 - Gareri
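The FooSerialized/TryFrom idea from the first comment could look roughly like the sketch below. It assumes the two-field layout from the question; the struct names follow the comment, and the serde derive on FooSerialized is left out so that the snippet compiles with the standard library alone.

```rust
use std::convert::TryFrom;
use std::num::ParseIntError;

// Raw shape as it appears in the YAML file: `instrument` is still a string.
// Assumption: in a real program this struct would also derive
// `serde::Deserialize`; that is omitted here to keep the sketch std-only.
#[derive(Debug)]
pub struct FooSerialized {
    pub instrument: String,
    pub other_field: String,
}

// Validated shape used by the rest of the program.
#[derive(Debug, PartialEq)]
pub struct Foo {
    pub instrument: u64,
    pub other_field: String,
}

impl TryFrom<FooSerialized> for Foo {
    type Error = ParseIntError;

    fn try_from(raw: FooSerialized) -> Result<Self, Self::Error> {
        // Strip the underscores, then let the standard parser validate the rest.
        let digits: String = raw.instrument.chars().filter(|c| *c != '_').collect();
        Ok(Foo {
            instrument: digits.parse()?,
            other_field: raw.other_field,
        })
    }
}

fn main() {
    let raw = FooSerialized {
        instrument: "5_000_000".to_string(),
        other_field: "this string contains an _".to_string(),
    };
    let foo = Foo::try_from(raw).expect("valid underscored integer");
    println!("{:?}", foo);
}
```

With serde in the picture, the container attribute #[serde(try_from = "FooSerialized")] on Foo should let the conversion run automatically during deserialization, though, as noted in the reply above, every field still has to be declared twice.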

As I said in my comment, as far as the YAML deserializer is concerned, the text "5_000_000" is always a string, never an integer. So you need to tell serde that this field needs special treatment. Either create a FooSerialized struct as described in the comment (which would duplicate a lot of definitions), or use the deserialize_with attribute to customize the field's deserialization:

use serde::{de, Deserialize};

#[derive(Deserialize, Debug)]
pub struct Foo {
    #[serde(deserialize_with = "deserialize_underscored_integer")]
    pub instrument: u64,
    pub other_field: String,
}

fn deserialize_underscored_integer<'de, D, T>(deserializer: D) -> Result<T, D::Error>
where
    D: de::Deserializer<'de>,
    T: std::str::FromStr,
{
    // First, deserialize the value as a string (which might fail...)
    let s: String = de::Deserialize::deserialize(deserializer)?;

    // next, filter out the underscores (and invalid chars while we are at it),
    // collect the remaining chars into a new string, parse that as an integer
    // and return that
    s.chars()
        .filter_map(|c| match c {
            c @ '0'..='9' => Some(Ok(c)),
            '_' => None,
            _ => Some(Err(de::Error::custom("invalid char in string"))),
        })
        .collect::<Result<String, _>>()?
        .parse()
        .map_err(|_: <T as std::str::FromStr>::Err| {
            de::Error::custom("string does not represent an integer")
        })
}

fn main() {
    // This will succeed
    let inp = r#"
- instrument: 1_2_34_567_8_9____0
  other_field: this string contains an _
- instrument: 5_000_000
  other_field: this string contains an _
"#;
    println!("{:?}", serde_yaml::from_str::<Vec<Foo>>(inp).unwrap());

    // This will fail because it's not an integer
    let inp = r#"
- instrument: 5000 abcdef
  other_field: this string contains some other stuff
"#;
    println!("{:?}", serde_yaml::from_str::<Vec<Foo>>(inp).unwrap_err());

    // This looks like an integer but is not a u64
    let inp = r#"
- instrument: 5_000_000_000_000_000_000_000
  other_field: this string is too large to be a u64
"#;
    println!("{:?}", serde_yaml::from_str::<Vec<Foo>>(inp).unwrap_err());
}
Standee answered 10/12, 2021 at 16:39 Comment(1)
It is better to use the retain() method of String than to allocate a second string. Allocations are costly. - Debbydebee
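A minimal sketch of the retain() suggestion, assuming the value has already been deserialized into an owned String. The helper name parse_underscored_u64 is hypothetical and not part of the answer above.

```rust
use std::num::ParseIntError;

// Hypothetical helper: removes underscores in place, then parses as u64.
// `String::retain` edits the existing buffer, so no second String is allocated.
fn parse_underscored_u64(mut s: String) -> Result<u64, ParseIntError> {
    s.retain(|c| c != '_');
    s.parse::<u64>()
}

fn main() {
    println!("{:?}", parse_underscored_u64("5_000_000".to_string()));
    println!("{:?}", parse_underscored_u64("5000 abcdef".to_string()));
}
```

Unlike the answer's version, this leaves all validation to str::parse, which, for example, accepts a leading +. Inside a deserialize_with function the resulting ParseIntError would still need to be mapped to the deserializer's error type, for instance with de::Error::custom.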
