What's the most robust way to efficiently parse CSV using awk?

The intent of this question is to provide a canonical answer.

Given a CSV such as might be generated by Excel or other tools, with embedded newlines, double quotes and/or commas in fields, and with empty fields, like:

$ cat file.csv
"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1

fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3,fld2""",

What's the most robust and efficient way, using awk, to identify the separate records and fields:

Record 1:
    $1=<rec1, fld1>
    $2=<>
    $3=<rec1","fld3.1
",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----
Record 3:
    $1=<"">
    $2=<"rec3,fld2">
    $3=<>
----

so that the rest of the awk script can use those records and fields directly.

A valid CSV would be one that conforms to RFC 4180 or can be generated by MS-Excel.

The solution must tolerate the end of record being just LF (\n), as is typical for UNIX files, rather than the CRLF (\r\n) that the standard requires and that Excel or other Windows tools would generate. It must also tolerate unquoted fields mixed with quoted fields. It specifically does not need to tolerate escaping "s with a preceding backslash (i.e. \" instead of "") as some other CSV formats allow - if you have that then adding a gsub(/\\"/,"\"\"") up front would handle it, and trying to handle both escaping mechanisms automatically in one script would make the script unnecessarily fragile and complicated.
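
For illustration, that up-front normalization could be as simple as this minimal sketch (applied before any other parsing, and assuming no other backslash escapes need to be preserved):

awk '{ gsub(/\\"/, "\"\"") } 1' file.csv    # rewrite \" as "" so the standard "" convention applies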

Bleacher answered 31/7, 2017 at 16:2 Comment(2)
See also #1560893 - Bellbottoms
related: #4205931 - Uniseptate

Updated to reflect the release of GNU awk 5.3, whose --csv option enables CSV parsing (it is also available now in Kernighan's One True Awk, so I expect the gawk 5.3 scripts below will also work there, but I don't have a copy of onetrueawk to test with):

If you have GNU awk 5.3 or later for --csv (or equivalently -k) and you neither have to retain the input quotes nor have a separator other than ,:

--csv reads multi-line records and splits quoted fields correctly.

awk --csv -v OFS=',' '
    {
        printf "Record %d:\n", NR
        for (i=1;i<=NF;i++) {
            printf "    $%d=<%s>\n", i, $i
        }
        print "----"
    }
'

This automatically reads multi-line records that have newlines within quoted fields but you can only use , as a separator and " as the field-surrounding quote with --csv.

Note that --csv automatically strips quotes from around quoted fields and converts escaped quotes ("") within fields to single quotes ("). That is usually the desired behavior but if it's not what you want there are alternatives below.
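
For example (a quick illustration of that stripping and unescaping; the input line here is made up):

$ printf '"say ""hi""",yes\n' | awk --csv '{print $1 "|" $2}'
say "hi"|yes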

If you have GNU awk 4.0 or later for FPAT:

FPAT splits quoted fields correctly but does not automatically read multi-line records so you need to implement that part.

awk -v FPAT='[^,]*|("([^"]|"")*")' -v OFS=',' '
    function readRec(      line,rec) {
        # Keep reading lines until we have an even number of quotes
        # as an incomplete record will have an uneven number.
        rec = $0
        while ( (gsub(/"/,"&",rec) % 2) && ((getline line) > 0) ) {
            rec = rec RS line
            NR--
            FNR--
        }
        $0 = rec
    }

    {
        readRec()
        printf "Record %d:\n", ++recNr
        for (i=1;i<=NF;i++) {
            # Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
            gsub(/^"|"$/, "", $i)
            gsub(/""/, "\"", $i)

            printf "    $%d=<%s>\n", i, $i
        }
        print "----"
    }
'

You can use any character as a separator and any character as the field-surrounding quote with FPAT, since you define those and write some of the code to parse the input for them, but you have to write your own code to read multi-line records.

See https://www.gnu.org/software/gawk/manual/gawk.html#More-CSV for info on the FPAT setting I use above. That would also be a good choice if you have gawk 5.3+ but --csv isn't a good option for you, e.g. because you need to retain quotes or use a different separator character.

If you have any modern* awk:

With neither --csv nor FPAT you have to implement both reading multi-line fields and splitting quoted fields correctly.

awk -F',' -v OFS=',' '
    function readRec(      line,rec) {
        # Keep reading lines until we have an even number of quotes
        # as an incomplete record will have an uneven number.
        rec = $0
        while ( (gsub(/"/,"&",rec) % 2) && ((getline line) > 0) ) {
            rec = rec RS line
            NR--
            FNR--
        }
        $0 = rec
    }

    function splitRec(      orig,tail,fpat,fldNr) {
        # You must call this function every time you change $0 to
        # repopulate the fields taking quoted fields into account.
    
        orig = tail = $0
        $0 = ""
        fpat = "([^" FS "]*)|(\"([^\"]|\"\")*\")"
        while ( (tail != "") && match(tail,fpat) ) {
            $(++fldNr) = substr(tail,RSTART,RLENGTH)
            tail = substr(tail,RSTART+RLENGTH+1)
        }
    
        # If the original $0 ended with a null field we would exit the
        # loop above before handling it so handle it here.
        if ( orig ~ ("["FS"]$") ) {
            $(++fldNr) = ""
        }
    }
    
    {
        readRec()
        splitRec()
        printf "Record %d:\n", NR
        for (i=1;i<=NF;i++) {
            # Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
            gsub(/^"|"$/, "", $i)
            gsub(/""/, "\"", $i)
    
            printf "    $%d=<%s>\n", i, $i
        }
        print "----"
    }
'

You can use any character as a separator and any character as the field-surrounding quote with the above "any modern* awk" solution, since you define those and write all of the code to parse the input for them, but you also have to write your own code to read multi-line records.

NR and FNR are decremented above because getline increments them every time it reads a line to add to the record. Changing the value of NR and FNR is undefined behavior per POSIX, though, so if your awk doesn't support that then just create and use your own Fnr and Nr (or similarly named) variables to keep track of record numbers.
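
For example, here's a sketch of readRec() rewritten to leave NR and FNR alone, tracking its own record count in a variable I'm calling recNr:

function readRec(      line,rec) {
    rec = $0
    while ( (gsub(/"/,"&",rec) % 2) && ((getline line) > 0) ) {
        rec = rec RS line   # getline bumps NR/FNR but we never rely on them
    }
    $0 = rec
    ++recNr                 # report record numbers with recNr instead of NR
}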


Some related information/scripts:

To add quotes to output fields when using --csv

By default given this input:

$ cat file
foo,"field 2 ""contains"" quotes, and comma",bar

you might expect this script which prints the first 2 fields:

awk --csv -v OFS=',' '{print $1,$2}' file

to produce this output:

foo,"field 2 ""contains"" quotes, and comma"

but it doesn't; it produces this output instead:

$ awk --csv -v OFS=',' '{print $1,$2}' file
foo,field 2 "contains" quotes, and comma

Note that field 2 from the input has now, probably undesirably, become 2 separate fields in the output: its surrounding quotes were stripped, so the , that was inside of those quotes is now a separator.

If you want the output fields quoted then you need to manually add the quotes by doing whichever of the following makes sense for your application:

To add quotes around a field or any other string:

function enquote(str) {
    gsub(/"/,"\"\"",str)
    return "\"" str "\""
}

To add quotes around all fields:

function enquote_all(   i) {
    for ( i=1; i<=NF; i++ ) {
        $i = enquote($i)
    }
}

To add quotes around only the fields that NEED to be quoted:

function enquote_needed(   i) {
    for ( i=1; i<=NF; i++ ) {
        if ( $i ~ /[\n,"]/ ) {
            $i = enquote($i)
        }
    }
}
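
For example, putting enquote() and the loop together on the file shown above (a sketch, not the only way to wire it up):

$ awk --csv -v OFS=',' '
    function enquote(str) { gsub(/"/,"\"\"",str); return "\"" str "\"" }
    { for (i=1; i<=NF; i++) $i = enquote($i); print }
' file
"foo","field 2 ""contains"" quotes, and comma","bar"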

To retain the quotes that were present in the input

You need to use readRec() and/or splitRec() above, depending on whether you're using --csv or FPAT or neither. There is no way to make --csv not remove quotes when populating fields, and you cannot use --csv together with FPAT, FS or FIELDWIDTHS as using --csv tells awk to ignore those other variables that otherwise control field splitting.

So, assuming your fields retain their original quotes, you can do the following or similar to conditionally reproduce those quotes in the output (and to also add new quotes if the field would otherwise be invalid CSV):

function dequote_cond(fldNr,    had_quotes) {
    # Run both gsubs (not short-circuited with ||) so the surrounding
    # quotes are stripped AND any "" pairs inside the field are unescaped.
    had_quotes = gsub(/^"|"$/,"",$fldNr)
    had_quotes += gsub(/""/,"\"",$fldNr)
    return (had_quotes ? 1 : 0)
}
function enquote_cond(fldNr,had_quotes,    got_quotes) {
    if ( had_quotes || ($fldNr ~ /[\n",]/) ) {
        gsub(/"/,"\"\",$fldNr)
        $fldNr = "\"" $fldNr "\""
        got_quotes = 1
    }
    return got_quotes+0     # probably unused, for symmetry with dequote_cond() 
}
...
q = dequote_cond(i)
$i = "new value"    # whatever modification your script needs to make
enquote_cond(i,q)

To print fields with quotes removed if your CSV cannot contain newlines:

  • With GNU awk 5.3 or later for --csv: Exactly the same as the above --csv case that includes newlines in fields.
  • With GNU awk 4.0 or later for FPAT: The same as the above FPAT case that includes newlines in fields except you don't need to define or call readRec() (see the sketch after this list).
  • With any modern* awk: The same as the above "any modern* awk" case that includes newlines in fields except you don't need to define or call readRec().
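
For example, a sketch of the FPAT case on single-line CSV, reusing the file shown earlier:

$ awk -v FPAT='[^,]*|("([^"]|"")*")' '{
    for (i=1; i<=NF; i++) {
        gsub(/^"|"$/, "", $i)
        gsub(/""/, "\"", $i)
        printf "    $%d=<%s>\n", i, $i
    }
}' file
    $1=<foo>
    $2=<field 2 "contains" quotes, and comma>
    $3=<bar>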

If you don't want the quotes removed then, if using --csv add quotes as described above, and if you're using either of the other solutions then don't remove the quotes with the 2 gsub()s.

To convert newlines within fields to blanks and commas within fields to semi-colons

If all you actually want to do is convert your CSV to a simpler version with individual lines by, say, replacing newlines with blanks and commas with semi-colons inside quoted fields (and have quoted fields in the output) then:

With GNU awk 3.0 or later for RT:

awk -v RS='"([^"]|"")*"' -v ORS= '{
    gsub(/\n/," ",RT)
    gsub(/,/,";",RT)
    print $0 RT
}'
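
For example (a sketch on a tiny made-up input):

$ printf '"a\nb,c",d\n' | awk -v RS='"([^"]|"")*"' -v ORS= '{
    gsub(/\n/," ",RT)
    gsub(/,/,";",RT)
    print $0 RT
}'
"a b;c",d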

Otherwise: use one of the solutions discussed earlier that use readRec() and add the 2 gsub()s above in a loop on the fields.

If you want a CSV but have a different character than , as your delimiter

If you have a CSV-like file that uses tabs or ;s or |s or some other character to separate fields then you can do the following using any awk to convert your file to CSV:

$ cat changeSeps.awk
BEGIN {
    FS = OFS = "\""

    if ( (old == "") || (new == "") ) {
        printf "Error: old=\047%s\047 and/or new=\047%s\047 separator string missing.\n", old, new |"cat>&2"
        printf "Usage: awk -v old=\047;\047 -v new=\047,\047 -f changeSeps.awk infile [> outfile]\n" |"cat>&2"
        err = 1
        exit
    }

    sanitized_old = old
    sanitized_new = new

    # Ensure all regexp and replacement chars get treated as literal
    gsub(/[^^\\]/,"[&]",sanitized_old)  # regexp: char other than ^ or \ -> [char]
    gsub(/\\/,"\\\\",sanitized_old)     # regexp: \ -> \\
    gsub(/\^/,"\\^",sanitized_old)      # regexp: ^ -> \^
    gsub(/[&]/,"\\\\&",sanitized_new)   # replacement: & -> \\&
}
{
    $0 = prev ors $0
    prev = $0
    ors = ORS
}
NF%2 {
    for ( i=1; i<=NF; i+=2 ) {
        cnt += gsub(sanitized_old,sanitized_new,$i)
    }
    print
    prev = ors = ""
}
END {
    if ( !err ) {
        printf "Converted %d \047%s\047 field separators to \047%s\047s.\n", cnt+0, old, new |"cat>&2"
    }
    exit err
}

which you'd call like this, for example, to change ;-separated to ,-separated format:

awk -v old=';' -v new=',' -f changeSeps.awk file
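
For example (the input file here is made up):

$ cat file
one;"two;2";three
$ awk -v old=';' -v new=',' -f changeSeps.awk file
one,"two;2",three
Converted 2 ';' field separators to ','s.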

If you have DOS/Windows line endings

With Windows \r\n line endings, e.g. as in a CSV exported from Excel, processing can be simpler than the above: the "newlines" within each field can often actually just be line feeds (i.e. \ns), so you can set RS="\r\n" (using GNU awk for multi-char RS) and then the \ns within fields will not be treated as line endings, and all you additionally need to parse the CSV is to set FPAT. Sometimes, though, the underlying C primitives will not pass the \rs along to gawk, and then you'd need to set -v BINMODE=3 on the gawk command line to see them.
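
A sketch of that combination (GNU awk for the multi-char RS and FPAT; add -v BINMODE=3 only if the \rs aren't coming through):

gawk -v RS='\r\n' -v FPAT='[^,]*|("([^"]|"")*")' '{
    printf "Record %d:\n", NR
    for (i=1; i<=NF; i++) printf "    $%d=<%s>\n", i, $i
    print "----"
}' file.csv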

Other notes

*I say "modern awk" above because there's apparently extremely old (i.e. circa 2000) versions of tawk and mawk1 still around which have bugs in their gsub() implementation such that gsub(/^"|"$/,"",fldStr) would not remove the start/end "s from fldStr. If you're using one of those then get a new awk, preferably gawk, as there could be other issues with them too but if that's not an option then I expect you can work around that particular bug by changing this:

gsub(/^"|"$/,"",fldStr)

to this:

sub(/^"/,"",fldStr)
sub(/"$/,"",fldStr)

Thanks to the following people for identifying and suggesting solutions to the stated issues with the original version of this answer:

  1. @mosvy for escaped double quotes within fields.
  2. @datatraveller1 for multiple contiguous pairs of escaped quotes in a field and null fields at the end of records.

Related: also see How do I use awk under cygwin to print fields from an excel spreadsheet? for how to generate CSVs from Excel spreadsheets.

Bleacher answered 31/7, 2017 at 16:6 Comment(1)
Yeah, fair enough. I was about to edit myself but upon second reading, I realized you weren't trying to say that gawk led and awk followed, and couldn't really come up with a more natural way of saying it myself either, so I will just delete my comment instead. - Dave

An improvement upon @EdMorton's FPAT solution, which should be able to handle double quotes (") escaped by doubling ("", as allowed by the CSV standard).

gawk -v FPAT='[^,]*|("[^"]*")+' ...

This STILL

  1. isn't able to handle newlines inside quoted fields, which are perfectly legit in standard CSV files.

  2. assumes GNU awk (gawk); a standard awk won't do.

Example:

$ echo 'a,,"","y""ck","""x,y,z"," ",12' |
gawk -v OFS='|' -v FPAT='[^,]*|("[^"]*")+' '{$1=$1}1'
a||""|"y""ck"|"""x,y,z"|" "|12

$ echo 'a,,"","y""ck","""x,y,z"," ",12' |
gawk -v FPAT='[^,]*|("[^"]*")+' '{
  for(i=1; i<=NF;i++){
    if($i~/"/){ $i = substr($i, 2, length($i)-2); gsub(/""/,"\"", $i) }
    print "<"$i">"
  }
}'
<a>
<>
<>
<y"ck>
<"x,y,z>
< >
<12>
Neurogram answered 13/3, 2020 at 19:10 Comment(1)
I've updated my answer to include and explain use/pros/cons of gawk --csv as well as various other awk possibilities now, including this way of defining FPAT, thanks. - Bleacher

This is exactly what csvquote is for - it makes things simple for awk and other command line data processing tools.

Some things are difficult to express in awk. Instead of running a single awk command and trying to get awk to handle the quoted fields with embedded commas and newlines, the data gets prepared for awk by csvquote, so that awk can always interpret the commas and newlines it finds as field separators and record separators. This makes the awk part of the pipeline simpler. Once awk is done with the data, it goes back through csvquote -u to restore the embedded commas and newlines inside quoted fields.

csvquote file.csv | awk -f my_awk_script | csvquote -u

EDIT:

For a complete description of csvquote, see: How it works. This also explains the `` characters which are shown in places where there was a carriage return.

csvquote file.csv | awk -f decsv.awk | csvquote -u

(for the source of decsv.awk, see the answer from Ed Morton.) Output:

Record 1:
    $1=<rec1 fld1>
    $2=<>
    $3=<rec1","fld3.1",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----
Record 3:
    $1=<"">
    $2=<"rec3fld2">
    $3=<>
----
Lucerne answered 31/1, 2022 at 11:16 Comment(1)
I've updated my answer to include and explain use/pros/cons of gawk --csv as well as various other awk possibilities now, thanks. - Bleacher

I have found csvkit a really useful toolkit for handling CSV files on the command line.

line='test,t2,t3,"t5,"'
echo "$line" | csvcut -c 4
"t5,"
echo 'foo,"field,""with"",commas",bar'  | csvcut -c 3
bar

It also contains other tools such as csvstat and csvstack, which are also very handy.
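
For example (file names here are made up):

csvstack jan.csv feb.csv > q1.csv    # concatenate CSVs that share a header
csvstat q1.csv                       # per-column summary statistics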

cat file.csv
"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1

fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3,fld2""",

csvcut -c 1 file.csv

"rec1, fld1"
"rec2, fld1.1

fld1.2"
""""""

csvcut -c 3 file.csv
"rec1"",""fld3.1
"",
fld3.2"
""
""
Bresee answered 20/4, 2022 at 5:42 Comment(1)
This question is about how to process CSVs with awk; there are lots of ways to handle CSVs with lots of other tools. I've updated my answer to include and explain use/pros/cons of gawk --csv as well as various other awk possibilities now, thanks. - Bleacher

As announced by Arnold Robbins, starting in version 5.3.0 GNU awk will have builtin support for CSV files including the edge cases discussed in this thread via the option --csv. The implementation follows Brian Kernighan's One True AWK.

Using the script provided in @Ben Hoyt's answer:

$ cat records.awk
{
    printf "Record %d:\n", NR
    for (i=1; i<=NF; i++)
        printf "    $%d=<%s>\n", i, $i
    print "----"
}

the program produces the desired output after the call:

gawk --csv -f records.awk file.csv
Rainband answered 27/10, 2023 at 21:31 Comment(1)
I've updated my answer to include and explain use/pros/cons of gawk --csv as well as various other awk possibilities now, thanks. - Bleacher

Awk (gawk) actually provides extensions, one of which is CSV processing, which in my opinion is the most robust way to do this with gawk. The extension takes care of many gotchas and parses the CSV for you.

Assuming that extension is installed, you can use awk to show all lines where a specific csv field matches 123.

Assuming test.csv contains the following:

Name,Phone
"Woo, John",425-555-1212
"James T. Kirk",123

The following will print all lines where the Phone (aka the second field) is equal to 123:

gawk -l csv 'csvsplit($0,a) && a[2] == 123 {print a[1]}'

The output is:

James T. Kirk

How does it work?

  • -l csv asks gawk to load the csv extension by looking for it in $AWKLIBPATH;
  • csvsplit($0, a) splits the current line, and stores each field into a new array named a
  • && a[2] == 123 checks that the second field is 123
  • if both conditions are true, it runs { print a[1] }, i.e. prints the first CSV field of the line.
Ponzo answered 26/7, 2022 at 9:46 Comment(1)
The library described above was never released and instead in version 5.3 gawk introduced a --csv option. I've updated my answer to include and explain use/pros/cons of gawk --csv as well as various other awk possibilities now, thanks. - Bleacher

If you're using one of the common AWK interpreters (Gawk, onetrueawk, mawk), the other solutions are your best bet. However, if you're able to use a different interpreter, frawk and GoAWK have proper CSV support built-in.

frawk is a very fast AWK implementation written in Rust. Use -i csv to process input in CSV mode. Note that frawk is not quite POSIX compatible (see differences).

GoAWK is a POSIX-compatible AWK implementation written in Go. Also supports -i csv mode, as well as -H (parse header row) with @"named_field" syntax (read more). Disclaimer: I'm the author of GoAWK.
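
For example, with GoAWK's -H mode fields can be selected by header name via @"name" (a sketch based on the GoAWK docs; the file is made up):

$ cat people.csv
name,phone
"Woo, John",425-555-1212
$ goawk -i csv -H '@"phone" == "425-555-1212" { print @"name" }' people.csv
Woo, John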

With file.csv as per the question, you can simply use an AWK script with a regular for loop over the fields as follows:

$ cat records.awk
{
    printf "Record %d:\n", NR
    for (i=1; i<=NF; i++)
        printf "    $%d=<%s>\n", i, $i
    print "----"
}

Then use either frawk -i csv or goawk -i csv to get the expected output. For example:

$ frawk -i csv -f records.awk file.csv
Record 1:
    $1=<rec1, fld1>
    $2=<>
    $3=<rec1","fld3.1
",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----
Record 3:
    $1=<"">
    $2=<"rec3,fld2">
    $3=<>
----

$ goawk -i csv -f records.awk file.csv 
Record 1:
... same as above ...
----
Choir answered 16/1, 2023 at 4:15 Comment(1)
I've updated my answer to include and explain use/pros/cons of gawk --csv as well as various other awk possibilities now, thanks. - Bleacher

Once you get into complicated CSV, it really is easier to use a proper parser. While the gawk scripts using FPAT are amazing, they are not a complete RFC 4180 solution.

My go-to for CSV parsing at the command prompt is Ruby. There are others (Perl and Python are wonderful) but Ruby is closest to Awk syntax.

Here is a minimal Ruby script to parse your example:

ruby -r csv -e '
CSV.parse($<.read).
    each_with_index{|record, idx| 
        puts "Record #{idx+1}:"
        record.each_with_index{|fld, i| 
            puts "\t$#{i+1}=<#{fld}>"
        }
        puts "----"
    }

' file

With your example input, it prints:

Record 1:
    $1=<rec1, fld1>
    $2=<>
    $3=<rec1","fld3.1
",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----
Record 3:
    $1=<"">
    $2=<"rec3,fld2">
    $3=<>
----

Note that nothing at all needed to be done to the records or fields since they were properly parsed with default values. (Well, you need the +1 since Ruby arrays are 0-based.)

Loading answered 26/2, 2023 at 2:3 Comment(1)
This question is about how to process CSVs with awk; there are lots of ways to handle CSVs with lots of other tools. I've updated my answer to include and explain use/pros/cons of gawk --csv as well as various other awk possibilities now, thanks. - Bleacher
