Can awk deal with CSV file that contains comma inside a quoted field?

I am using awk to compute the sum of one column in a CSV file. The data format is something like:

id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99

I was using this awk script to count the sum:

awk -F, '{sum+=$3} END {print sum}'

Some of the values in the name field contain commas, and this breaks my awk script. My question is: can awk handle this problem? If so, how?

Thank you.

Sulemasulf answered 29/6, 2010 at 6:35 Comment(0)

You can write a function in awk like the one below:

$ awk 'function isnum(x){return(x==x+0)} BEGIN{print isnum("hello"),isnum("-42")}'
0 1

You can incorporate this function into your script and check whether the third field is numeric or not. If it is not numeric, go to the 4th field; if the 4th field in turn is not numeric, go to the 5th, and so on, until you reach a numeric value. A loop will probably help here, as in the sketch below; once found, add the value to the sum.
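
For the sample data, a minimal sketch of that loop (my own illustration, not part of the original answer) could look like this; it assumes the value is the first numeric field at or after $3, and the header line is skipped naturally because it contains no numeric field:

awk -F, '
function isnum(x) { return (x == x+0) }
{
    # scan from field 3 rightward; the first numeric field is the value
    for (i = 3; i <= NF; i++)
        if (isnum($i)) { sum += $i; next }
}
END { print sum }' file.csv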

Outdoors answered 30/6, 2010 at 17:14 Comment(3)
This is really clunky, it fails if the field is not a number. @Steve's answer is far better. – Vauban
Not only that, it will seemingly succeed if the string contains a number. Hardly ever read such a bad accepted answer. – Knickerbocker
The other problem with this answer is that if the "value" is missing in a line, it would assume the "id" is the value unless more logic is put into the awk program to say "if the item on the line is numeric and not the first element..." – Mycah

One way, using GNU awk and FPAT:

awk 'BEGIN { FPAT = "([^, ]+)|(\"[^\"]+\")" } { sum+=$3 } END { print sum }' file.txt

Result:

192
Cinnamon answered 18/10, 2012 at 14:26 Comment(5)
The FPAT approach is great but it only works when the FS is a single char, since you can't negate an RE. It doesn't work when the FS is a string, as in this case where AFAIK it's ", ", so though it works with the specific sample input data posted, it'd identify too many fields given an input line where a field contains spaces but isn't included in quotes. – Sexism
Well, there's no one standard for CSV so YMMV, but normally quotes are used when you need to include a field separator within a field, not when you just have spaces. MS-Excel, for example, would not use quotes when saving as CSV format if a cell contained spaces, only if it contained a comma. – Sexism
This is great, except you need to be able to match totally empty fields: FPAT = "([^, ]*)|(\"[^\"]+\")". Otherwise it mismatches fields on lines like 22,,,"some string" – Vauban
Since fields included in double-quotes can contain double-quotes themselves (which are then escaped by an extra double-quote), I go with this pattern: FPAT = "([^,]*)|(\"([^\"]|(\"\"))*\")" – Aubarta
@EdMorton yes, you can negate a regular expression by switching the final and non-final states in the corresponding automaton, but that usually results in very ugly regular expressions. – Brantbrantford

I am using

`FPAT="([^,]+)|(\"[^\"]+\")"`

to define the fields with gawk. I found that when a field is empty, this doesn't recognize the correct number of fields, because "+" requires at least one character in the field. I changed it to:

`FPAT="([^,]*)|(\"[^\"]*\")"`

replacing "+" with "*", and it works correctly.
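
A quick way to see the difference (my own demo, assuming a reasonably recent gawk; handling of FPAT patterns that can match the empty string has varied across gawk releases):

$ printf '22,,,"some, string"\n' |
  gawk 'BEGIN { FPAT = "([^,]*)|(\"[^\"]*\")" }
        { print NF " fields"; for (i = 1; i <= NF; i++) print i ": <" $i ">" }'

With the "*" version this should report 4 fields, two of them empty; the original "+" version skips the empty ones.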

I found that the GNU Awk User Guide had this problem too: https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html

Heartrending answered 8/8, 2015 at 4:19 Comment(2)
This solution honestly needs to be pinned somewhere. You'd expect the default behaviour to count the empty fields as well! – Endoparasite
I just have to emphasize how brilliant this update to the GNU example is, in the hopes that they'll update their example. – Franzen

You're probably better off doing it in Perl with Text::CSV, since that's a fast and robust solution.
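
A sketch of what that might look like (my own illustration, not part of the original answer; allow_whitespace deals with the spaces after the commas in the sample, and the regex skips the non-numeric header row):

perl -MText::CSV -le '
    my $csv = Text::CSV->new({ allow_whitespace => 1 });
    my $sum = 0;
    while (my $row = $csv->getline(\*STDIN)) {
        $sum += $row->[2] if $row->[2] =~ /^-?[0-9]+$/;  # header row is not numeric
    }
    print $sum;' < file.csv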

Atlantic answered 30/6, 2010 at 17:32 Comment(2)
Yes, I agree with you; I just wonder how awk handles this problem. :) – Sulemasulf
See the answer I posted for how to identify the fields in general, but for your specific problem the answer given by @HaiVu is the correct one. – Sexism

For an input file as simple as that, you can just write a small function to convert all of the real FSs outside of the quotes to some other value (I chose RS, since the record separator cannot be part of the record) and then use that as the FS, e.g.:

$ cat decsv.awk
BEGIN{ fs=FS; FS=RS }   # remember the real FS, then split fields on RS instead

{
   decsv()

   for (i=1;i<=NF;i++) {
       printf "Record %d, Field %d is <%s>\n", NR, i, $i
   }
   print ""
}

# Convert every real FS that lies outside the quotes to RS, then reassign $0
# so awk resplits the record using RS as the field separator.
function decsv(         curr,head,tail)
{
   tail = $0
   while ( match(tail,/"[^"]+"/) ) {            # find the next quoted field
       head = substr(tail, 1, RSTART-1)         # everything before it
       gsub(fs,RS,head)                         # convert FSs outside the quotes
       curr = curr head substr(tail, RSTART, RLENGTH)
       tail = substr(tail, RSTART + RLENGTH)
   }
   gsub(fs,RS,tail)                             # convert FSs after the last quoted field
   $0 = curr tail                               # resplit with FS=RS
}

$ cat file
id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99

$ awk -F", " -f decsv.awk file
Record 1, Field 1 is <id>
Record 1, Field 2 is <name>
Record 1, Field 3 is <value>

Record 2, Field 1 is <1>
Record 2, Field 2 is <foo>
Record 2, Field 3 is <17>

Record 3, Field 1 is <2>
Record 3, Field 2 is <bar>
Record 3, Field 3 is <76>

Record 4, Field 1 is <3>
Record 4, Field 2 is <"I am the, question">
Record 4, Field 3 is <99>

It only becomes complicated when you have to deal with embedded newlines and embedded escaped quotes within the quotes, and even then it's not too hard; it's all been done before...

See What's the most robust way to efficiently parse CSV using awk? for more information.

Sexism answered 18/10, 2012 at 14:20 Comment(0)

You can help awk work with data fields that contain commas (or newlines) by using a small script I wrote called csvquote. It replaces the offending commas inside quoted fields with nonprinting characters. If you need to, you can later restore those commas - but in this case, you don't need to.

Here is the command:

csvquote inputfile.csv | awk -F, '{sum+=$3} END {print sum}'

See https://github.com/dbro/csvquote for the code.
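
If a later step did need the commas back, the project's README shows the substitution being undone with a second pass through the tool (its -u option), along these lines:

csvquote inputfile.csv | awk -F, '{ print $2 }' | csvquote -u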

Joshua answered 4/5, 2013 at 21:8 Comment(0)

Recent versions of awk and gawk (as of 2023) support this natively now! Even the newlines. From the manual:

For decades, anyone wishing to work with CSV files and awk had to “roll their own” solution. (For an example, see Defining Fields by Content). In 2023, Brian Kernighan decided to add CSV support to his version of awk. In order to keep up, gawk too provides the same support as his version. To use CSV data, invoke gawk with either of the -k or --csv options

https://www.gnu.org/software/gawk/manual/html_node/Comma-Separated-Fields.html
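
With a new enough gawk, the one-liner from the question then reduces to something like this (a sketch; --csv takes over the field splitting, so -F is dropped; note that strict CSV has no space after the separating comma, so the question's sample, with a space before each opening quote, may not parse as quoted):

gawk --csv '{ sum += $3 } END { print sum }' file.csv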

Throughout answered 28/2 at 3:57 Comment(0)

You can always tackle the problem at the source: put quotes around the name field, just like the "I am the, question" field. This is much easier than spending your time coding workarounds.

Update (as Dennis requested): a simple example.

$ s='id, "name1,name2", value
1, foo, 17
2, bar, 76
3, "I am the, question", 99'

$ echo $s|awk -F'"' '{ for(i=1;i<=NF;i+=2) print $i}'
id,
, value 1, foo, 17 2, bar, 76 3,
, 99

$ echo $s|awk -F'"' '{ for(i=2;i<=NF;i+=2) print $i}'
name1,name2
I am the, question

As you can see, by setting the delimiter to double quote, the fields that belong in quotes are always at even-numbered positions. Since the OP doesn't have the luxury of modifying the source data, this method will not be appropriate for him.

Gilletta answered 29/6, 2010 at 7:19 Comment(2)
Perhaps it would be helpful if you showed how to handle the quoted field. – Classieclassification
Thanks, Dennis. But the csv file is generated by the client, so I can do nothing about the format of the file. :( – Sulemasulf

This article did help me solve this same data field issue. Most CSVs will put quotes around fields with spaces or commas within them. This messes up the field count for awk unless you filter them out.

If you need the data within those fields that contain the garbage, this is not for you. ghostdog74 provided the answer, which empties that field but maintains the total field count in the end, which is key to keeping the data output consistent. I did not like how that solution introduced new lines, though.

This is the version of the solution I used. The first three fields never had this problem in the data. The fourth field, containing the customer name, often did, but I needed that data. The remaining fields that exhibited the problem I could throw away without issue because they were not needed in my report output. So I first sed out the 4th field's garbage very specifically and remove the first two instances of quotes. Then I apply what ghostdog74 gave to empty the remaining fields that have commas within them; this also removes the quotes, but I use printf to keep the data in a single record. I start off with 85 fields and end up with 85 fields in all cases across my 8000+ lines of messy data. A perfect score!

grep -i $1 $dbfile | sed 's/\, Inc.//;s/, LLC.//;s/, LLC//;s/, Ltd.//;s/\"//;s/\"//' | awk -F'"' '{ for(i=1;i<=NF;i+=2) printf ($i);printf ("\n")}' > $tmpfile

The solution that empties the fields with commas within them but also maintains the record, of course is:

awk -F'"' '{ for(i=1;i<=NF;i+=2) printf ($i);printf ("\n")}

Megs of thanks to ghostdog74 for the great solution!

NetsGuy256

Butler answered 18/10, 2012 at 0:35 Comment(1)
printf is a builtin, not a function, so the "("s don't do what you think and are inappropriate. Also, the synopsis for printf is "printf fmt, values"; doing "printf values" with user input is dangerous and to be avoided. Finally, don't hard-code ORS via printf "\n"; just use print "" and let the ORS expand naturally. – Sexism

FPAT is the elegant solution because it can handle the dreaded commas-within-quotes problem, but to sum a column of numbers in the last column, regardless of the number of preceding separators, $NF works well:

awk -F"," '{sum+=$NF} END {print sum}'

To access the second to last column, you would use this:

awk -F"," '{sum+=$(NF-1)} END {print sum}'

Slurp answered 6/4, 2014 at 1:59 Comment(0)

If you know for sure that the 'value' column is always the last column:

awk -F, '{sum+=$NF} END {print sum}'

NF represents the number of fields, so $NF is the last column.

Unlookedfor answered 30/6, 2010 at 17:44 Comment(0)

Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.

perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new({allow_whitespace => 1})} if($csv->parse($_)){@f=$csv->fields();$sum+=$f[2]} END{print $sum}' file

allow_whitespace is needed since the input data has whitespace surrounding the comma separators. Very old versions of Text::CSV_XS may not support this option.

I provided more explanation of Text::CSV_XS within my answer here: parse csv file using gawk

Pandurate answered 3/11, 2015 at 0:37 Comment(0)

You could try piping the file through a Perl regex to convert the quoted commas into something else, like a |.

cat test.csv | perl -p -e "s/(\".+?)(,)(.+?\")/\1\|\3/g" | awk -F, '{...

The above regex assumes there is always a comma within the double quotes, so more work would be needed to make the comma optional.
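
A sketch that drops that assumption by rewriting every comma inside each quoted field, rather than capturing around a single one (again using | as the stand-in character):

cat test.csv | perl -pe 's/"[^"]*"/ ($x = $&) =~ tr!,!|!; $x /ge' | awk -F, '{sum+=$3} END {print sum}'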

Sherilyn answered 9/9, 2021 at 20:27 Comment(0)