How do I match a CSV-style quoted string in nom?

A CSV style quoted string, for the purposes of this question, is a string in which:

  1. The string starts and ends with exactly one ".
  2. Two double quotes inside the string are collapsed to one double quote: "Alo""ha" → Alo"ha.
  3. "" on its own is an empty string.
  4. Error inputs, such as "A""" e", cannot be parsed. That input is an A" followed by the junk e".
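
To make the four rules concrete, here is a minimal sketch in plain Rust, without nom. The names parse_csv_quoted and parse_csv_quoted_exact are hypothetical and used only for illustration; they are not nom APIs, and this is just one way the rules could be read.

```rust
/// Hypothetical helper (not part of nom): parses one CSV-style quoted string
/// from the start of `input`. Returns the unescaped contents plus the
/// unconsumed remainder, or `None` if no complete quoted string is found.
fn parse_csv_quoted(input: &str) -> Option<(String, &str)> {
    // Rule 1: the string must open with a double quote.
    let body = input.strip_prefix('"')?;
    let mut out = String::new();
    let mut chars = body.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        if c != '"' {
            out.push(c);
            continue;
        }
        // Rule 2: a doubled quote stands for one literal quote.
        if let Some(&(_, '"')) = chars.peek() {
            chars.next();
            out.push('"');
        } else {
            // A lone quote closes the string; whatever follows is left over.
            return Some((out, &body[i + 1..]));
        }
    }
    // Ran out of input without seeing a closing quote.
    None
}

/// Rule 4: a whole-input match succeeds only when nothing is left over.
fn parse_csv_quoted_exact(input: &str) -> Option<String> {
    match parse_csv_quoted(input)? {
        (s, "") => Some(s),
        _ => None,
    }
}
```

With this reading, "Alo""ha" yields Alo"ha, "" yields the empty string, and "A""" e" is rejected by parse_csv_quoted_exact because the text after the first closing quote is left over.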

I've tried several things, none of which have worked fully.

The closest I've gotten, thanks to some help from user pinkieval in #nom on the Mozilla IRC:

use std::error as stderror; /* Avoids needing nightly to compile */

named!(csv_style_string<&str, String>, map_res!(
   terminated!(tag!("\""), not!(peek!(char!('"')))),
   csv_string_to_string
));

fn csv_string_to_string(s: &str) -> Result<String, Box<stderror::Error>> {
   Ok(s.to_string().replace("\"\"", "\""))
}

This does not catch the end of the string correctly.

I've also attempted to use the re_match! macro with r#""([^"]|"")*""#, but that always results in an Err::Incomplete(1).

I've determined that the given CSV example for Nom 1.0 doesn't work for a quoted CSV string as I'm describing it, but I do know implementations differ.

Rufous answered 7/6, 2018 at 11:55 Comment(2)
Not an answer to your question, but why not just use an existing high-quality CSV parser? - Pelagias
@Pelagias Two reasons: (1) I want to rewrite the CSV example for Nom to one that's correct to these specifications, thereby helping the community, (2) I have another project where I'm more interested in the PostgreSQL queries than the CSV files. - Rufous

Here is one way of doing it:

use nom::types::CompleteStr;

use nom::*;

named!(csv_style_string<CompleteStr, String>,
    delimited!(
        char!('"'),
        map!(
            many0!(
                alt!(
                    // Eat a " delimiter and the " that follows it
                    tag!("\"\"") => { |_| '"' }

                |    // Normal character
                    none_of!("\"")
                )
            ),
            // Make a string from a vector of chars
            |v| v.iter().collect::<String>()
        ),
        char!('"')
    )
);

fn main() {
    println!(r#""Alo""ha" = {:?}"#, csv_style_string(CompleteStr(r#""Alo""ha""#)));
    println!(r#""" = {:?}"#, csv_style_string(CompleteStr(r#""""#)));
    println!(r#"bad format: {:?}"#, csv_style_string(CompleteStr(r#""A""" e""#)));
}

(I wrote it in full nom, but a solution like yours, which post-processes the matched text in an external function instead of mapping each character inside the parser, would work too, and may be more efficient.)

The magic here, that would also solve your regexp issue, is to use CompleteStr. This basically tells nom that nothing will come after that input (otherwise, nom assumes you're doing a streaming parser, so more input may follow).

This is needed because we need to know what to do with a " if it is the last character fed to nom. Depending on the character that comes after it (another ", a normal character, or EOF), we have to make a different decision -- hence the Incomplete result, meaning nom does not have enough input to decide. Telling nom that EOF comes next resolves the ambiguity.
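
The ambiguity can be illustrated without nom. The sketch below is a simplified model, not nom's implementation: the Outcome enum, the scan_quoted function, and the at_eof flag are all hypothetical names, with at_eof standing in for the guarantee that CompleteStr gives.

```rust
/// Hypothetical three-way result, loosely modelled on the success, error and
/// Incomplete outcomes discussed above. This is not nom's actual type.
#[derive(Debug, PartialEq)]
enum Outcome {
    Done(String),
    Incomplete,
    Error,
}

/// Scans a quoted string at the start of `input`. `at_eof` says whether the
/// end of `input` is really the end of the data. Any text after the closing
/// quote is ignored here for brevity.
fn scan_quoted(input: &str, at_eof: bool) -> Outcome {
    // Running out of input is only an error if no more input can arrive.
    let out_of_input = if at_eof { Outcome::Error } else { Outcome::Incomplete };
    let mut chars = input.chars().peekable();
    match chars.next() {
        Some('"') => {}
        Some(_) => return Outcome::Error,
        None => return out_of_input,
    }
    let mut out = String::new();
    while let Some(c) = chars.next() {
        if c != '"' {
            out.push(c);
            continue;
        }
        match chars.peek() {
            // `""` is an escaped quote.
            Some(&'"') => {
                chars.next();
                out.push('"');
            }
            // A quote followed by anything else was the closing quote.
            Some(_) => return Outcome::Done(out),
            // The quote is the last character seen. It closes the string only
            // if the data really ends here; otherwise the next chunk might
            // begin with another `"`.
            None if at_eof => return Outcome::Done(out),
            None => return Outcome::Incomplete,
        }
    }
    out_of_input
}
```

In this model, scan_quoted("\"abc\"", false) reports Incomplete, while the same input with at_eof set to true reports Done("abc").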

Further reading on Incomplete on nom's author's blog: http://unhandledexpression.com/general/2018/05/14/nom-4-0-faster-safer-simpler-parsers.html#dealing-with-incomplete-usage


You may note that this parser does not actually reject the invalid input; it parses the beginning and returns the rest. If you use this parser as a subparser in another parser, the latter would then feed the remainder to the next subparser, which would fail in turn (because it would expect a comma), causing the overall parser to fail.

If you don't want that, you could make csv_style_string check what follows the closing quote without consuming it, with something like peek!(alt!(char!(',') | char!('\n') | eof!())).
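
As a rough illustration of that lookahead in plain Rust (not nom code), the check only inspects the first character of the remainder. The helper name at_field_end is hypothetical.

```rust
/// Hypothetical helper: returns true if `rest` (the input left over after the
/// closing quote) is empty or begins with a field terminator. It only looks
/// at the next character and consumes nothing.
fn at_field_end(rest: &str) -> bool {
    matches!(rest.chars().next(), None | Some(',') | Some('\n'))
}
```

For the invalid input "A""" e", the remainder after the first closing quote begins with a space, so the check fails and the field can be rejected on the spot.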

Zuckerman answered 9/6, 2018 at 23:17

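The other answer uses nom 4's macro style. Here is the same parser in the function-combinator style of newer nom releases.

// Import paths assume nom 7's module layout.
use nom::{
    branch::alt,
    bytes::complete::tag,
    character::complete::{char, none_of},
    combinator::{map, value},
    multi::many0,
    sequence::delimited,
    IResult,
};

fn csv_style_string(input: &str) -> IResult<&str, String> {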
    delimited(
        char('"'),
        map(
            many0(
                alt((
                    // Eat a " delimiter and the " that follows it
                    value('"', tag("\"\"")),

                    // Normal character
                    none_of("\""),
                )),
            ),
            // Make a string from a vector of chars
            |v| v.iter().collect::<String>(),
        ),
        char('"'),
    )(input)
}
Primrose answered 16/9, 2023 at 15:50