How to use PARSE dialect to read in a line from a CSV?

I'm trying to use PARSE to turn a CSV line into a Rebol block. Easy enough to write in open code, but as with other questions I am trying to learn what the dialect can do without that.

So if a line says:

"Look, that's ""MR. Fork"" to you!",Hostile Fork,,http://hostilefork.com

Then I want the block:

[{Look, that's "MR. Fork" to you!} {Hostile Fork} none {http://hostilefork.com}]

Issues to notice:

  • Embedded quotes in CSV strings are indicated with ""
  • Commas can be inside quotes and hence part of the literal, not a column separator
  • Adjacent column-separating commas indicate an empty field
  • Strings that don't contain quotes or commas can appear without quotes
  • For the moment we can keep things like http://rebol.com as STRING! instead of LOADing them into types such as URL!

To make it more uniform, the first thing I do is append a comma to the input line. Then I have a column-rule which captures a single column terminated by a comma...which may either be in quotes or not.

I know how many columns there should be due to the header line, so the code then says:

unless parse line compose [(column-count) column-rule] [
    print rejoin [{Expected } column-count { columns.}]
]

But I'm a bit stuck on writing column-rule. I need a way in the dialect to express "Once you find a quote, keep skipping quote pairs until you find a quote standing all on its own." What's a good way to do that?

Raskin asked 19/11, 2012 at 9:36

As with most parse problems, I try to build a grammar that best describes the elements of the input format.

In this case, we have nouns:

[comma ending value-chars qmark quoted-chars value header row]

Some verbs:

[row-feed emit-value]

And the operative nouns:

[current chunk current-row width]

I suppose I could break it down a little more, but this is enough to work with. First, the foundation:

comma: ","
ending: "^/"
qmark: {"}
value-chars: complement charset reduce [qmark comma ending]
quoted-chars: complement charset reduce [qmark]

Now the value structure. Quoted values are built up from chunks of valid chars or quotes as we find them:

current: chunk: none
quoted-value: [
    qmark (current: copy "")
    any [
        copy chunk some quoted-chars (append current chunk)
        |
        qmark qmark (append current qmark)
    ]
    qmark
]

value: [
    copy current some value-chars
    | quoted-value
]

emit-value: [
    (
        delimiter: comma
        append current-row current
    )
]

emit-none: [
    (
        delimiter: comma
        append current-row none
    )
]

Note that delimiter is set to ending at the beginning of each row, then changed to comma as soon as we pass a value. Thus, an input row is defined as [ending value any [comma value]].
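
To see the quoted-value rule from above in isolation, here is a minimal sketch that runs it against a single quoted field. It assumes Rebol 2 (where parse/all is available) and repeats the relevant definitions so it can be pasted into a fresh console; the sample string is made up for illustration.

```rebol
qmark: {"}
quoted-chars: complement charset reduce [qmark]

current: chunk: none
quoted-value: [
    qmark (current: copy "")
    any [
        copy chunk some quoted-chars (append current chunk)
        |
        qmark qmark (append current qmark)   ; a doubled quote stands for one literal quote
    ]
    qmark                                    ; a lone quote closes the field
]

; Hypothetical sample input: one quoted field with embedded quote pairs and a comma
sample: {"say ""hi"", ok"}

if parse/all sample quoted-value [probe current]
```

If the rule matches, current holds the field with each quote pair collapsed to a single quote and the comma kept as ordinary text.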

All that remains is to define the document structure:

current-row: none
row-feed: [
    (
        delimiter: ending
        append/only out current-row: copy []
    )
]

width: none
header: [
    (out: copy [])
    row-feed any [
        value comma
        emit-value
    ]
    value body: ending :body
    emit-value
    (width: length? current-row)
]

row: [
    row-feed width [
        delimiter [
            value emit-value
            | emit-none
        ]
    ]
]

if parse/all stream [header some row opt ending][out]

Wrap it up to shield all those words, and you have:

REBOL [
    Title: "CSV Parser"
    Date: 19-Nov-2012
    Author: "Christopher Ross-Gill"
]

parse-csv: use [
    comma ending delimiter value-chars qmark quoted-chars
    value quoted-value header row
    row-feed emit-value emit-none
    out current current-row width
][
    comma: ","
    ending: "^/"
    qmark: {"}
    value-chars: complement charset reduce [qmark comma ending]
    quoted-chars: complement charset reduce [qmark]

    current: none
    quoted-value: use [chunk][
        [
            qmark (current: copy "")
            any [
                copy chunk some quoted-chars (append current chunk)
                |
                qmark qmark (append current qmark)
            ]
            qmark
        ]
    ]

    value: [
        copy current some value-chars
        | quoted-value
    ]

    current-row: none
    row-feed: [
        (
            delimiter: ending
            append/only out current-row: copy []
        )
    ]
    emit-value: [
        (
            delimiter: comma
            append current-row current
        )
    ]
    emit-none: [
        (
            delimiter: comma
            append current-row none
        )
    ]

    width: none
    header: [
        (out: copy [])
        row-feed any [
            value comma
            emit-value
        ]
        value body: ending :body
        emit-value
        (width: length? current-row)
    ]

    row: [
        opt ending end break
        |
        row-feed width [
            delimiter [
                value emit-value
                | emit-none
            ]
        ]
    ]

    func [stream [string!]][
        if parse/all stream [header some row][out]
    ]
]
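
One possible way to call the finished function, assuming Rebol 2 and that the definition above has already been evaluated. The header row a,b,c,d is invented for this sketch (the grammar needs a header with no empty fields, followed by a newline, to learn the column count); the data row is the line from the question.

```rebol
; Assumes parse-csv (defined above) has been evaluated in this session.
data: {a,b,c,d
"Look, that's ""MR. Fork"" to you!",Hostile Fork,,http://hostilefork.com
}

result: parse-csv data

either result [
    foreach record result [probe record]   ; one block per row, header first
][
    print "Input did not match the CSV grammar."
]
```

The function returns none when the input does not match, so the either guards against iterating over a failed parse.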
Boschbok answered 19/11, 2012 at 13:42

Comment from Raskin: Fantastic response time on an answer that seems (thus far) to work on the wacky data I've given it!

I had to do this years ago, and I have since updated my functions to handle every case I have come across. I hope they are more solid now.

Note that it can handle strings with embedded newlines, but:

  1. newlines inside strings must be LF only,
  2. newlines between records must be CRLF, and
  3. you must load the file with read/binary so that Rebol does not convert newlines automatically.

(Points 1 and 2 match what Excel produces, for example.)

; Conversion function from CSV format
csv-to-block: func [
    "Convert a string of CSV formatted data to a Rebol block. First line is header."
    csv-data [string!] "CSV data."
    /separator separ [char!] "Separator to use if different from comma (,)."
    /without-header "Do not include header in the result."
    /local out line start end this-string header record value data chars spaces chars-but-space
    ; CSV format information http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
] [
    out: copy []
    separ: any [separ #","]

    ; This function replaces each doubled double-quote with a single one while copying the substring
    this-string: func [s e] [replace/all copy/part s e {""} {"}]
    ; CSV parsing rules
    header: [(line: copy []) value any [separ value | separ (append line none)] (if not without-header [append/only out line])]
    record: [(line: copy []) value any [separ value | separ (append line none)] (append/only out line)]
    value: [any spaces data any spaces (append line this-string start end)]
    data: [start: some chars-but-space any [some spaces some chars-but-space] end: | #"^"" start: any [some chars | {""} | separ | newline] end: #"^""]
    chars: complement charset rejoin [ {"} separ newline]
    spaces: charset exclude { ^-} form separ
    chars-but-space: exclude chars spaces

    parse/all csv-data [header any [newline record] any newline end]
    out
]
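
As a quick check of the function above, this sketch feeds it the single line from the question. It assumes Rebol 2 with csv-to-block already defined; because the first line is always treated as the header, the one record comes back as the first (and only) inner block.

```rebol
; Assumes csv-to-block (defined above) has been evaluated in this session.
line: {"Look, that's ""MR. Fork"" to you!",Hostile Fork,,http://hostilefork.com}

rows: csv-to-block line

; The empty third column is stored as none; the other fields are strings.
foreach field first rows [probe field]
```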

If needed, I also have the counterpart, block-to-csv.

[Edit] OK, here is the counterpart. Note that every string! value is enclosed in double quotes, and the header must be the first row of the block if you want it in the result:

block-to-csv: func [
    "Convert a block of blocks to a CSV formatted string."
    blk-data [block!] "block of data to convert"
    /separator separ "Separator to use if different from comma (,)."
    /local out csv-string record value v
] [
    out: copy ""
    separ: any [separ #","]
    ; This function converts a string to a CSV formatted one
    csv-string: func [val] [head insert next copy {""} replace/all replace/all copy val {"} {""} newline #{0A} ]
    record: [into [some [value (append out separ)]]]
    value: [set v string! (append out csv-string v) | set v any-type! (append out form v)]

    parse/all blk-data [any [record (remove back tail out append out crlf)]]
    out
]
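
A short usage sketch for block-to-csv, again assuming Rebol 2 and that the definition above has been evaluated. The sample rows are made up; the second call passes a semicolon through the /separator refinement.

```rebol
; Assumes block-to-csv (defined above) has been evaluated in this session.
rows: [
    ["name" "url"]                              ; header goes first
    [{Mr. "Fork"} "http://hostilefork.com"]     ; embedded quotes get doubled
]

prin block-to-csv rows
prin block-to-csv/separator rows #";"
```

prin is used rather than print because the function already ends each record with CRLF.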
Batho answered 20/11, 2012 at 8:52

Comment from Raskin: Hey, thanks! I actually do need a block-to-csv for this task, so if you want to edit the answer to throw that in, it would keep me from having to write it (even though it's the easier of the two).

Additionally, have a look at BrianH's %csv-tools.r script on rebol.org.

http://www.rebol.org/view-script.r?script=csv-tools.r

Great piece of code. Works with R2 and R3.

Babylon answered 6/12, 2012 at 16:42