CSV Parsing with double quotes
Asked Answered
B

13

4

I am trying to use C# to parse CSV. I used regular expressions to find "," and read string if my header counts were equal to my match count.

Now this will not work if I have a value like:

"a",""b","x","y"","c"

then my output is:

'a'
'"b'
'x'
'y"'
'c'

but what I want is:

'a'
'"b","x","y"'
'c'

Is there any regex or any other logic I can use for this ?

Blen answered 25/11, 2008 at 8:6 Comment(5)
@Matt: Not everyone is a native English speaker. Don't know what caused the edit mania. I restored it back to a version that reflects the original intention, as the question's meaning started to degrade.Rooker
@xyz: Sorry for the anarchic changes that were made to your question with no good reason. I hope this has come to an end now.Rooker
Your CSV is invalid, it should be "a","""b"",""x"",""y""","c"Stabler
Here is how I wrote my version. I recommend against using RegEx for this. Once you write some low-level routines, it really isn't difficult to scan the string character by character.Tharpe
Here's my open source library for CSV parsing: code.google.com/p/csharp-csv-readerVerb
S
12

CSV, when dealing with things like multi-line, quoted, different delimiters* etc - can get trickier than you might think... perhaps consider a pre-rolled answer? I use this, and it works very well.

*=remember that some locales use [tab] as the C in CSV...

Su answered 25/11, 2008 at 8:34 Comment(3)
My locale uses semicolons for the "C".. don't get me started with Excel and comma separated files that aren't parsed correctly because the comma is an actual comma ;)Teens
+1 Thanks for the Lumenworks CSV reader suggestion Marc, it works nicely. BTW it is available on NuGet: Install-Package LumenWorksCsvReaderWynne
@Jonathan ah, awesome - that's great, didn't know thatSu
R
10

CSV is a great example for code reuse - No matter which one of the csv parsers you choose, don't choose your own. Stop Rolling your own CSV parser

Raphael answered 6/1, 2009 at 20:35 Comment(0)
S
3

I would use FileHelpers if I were you. Regular Expressions are fine but hard to read, especially if you go back, after a while, for a quick fix.

Just for sake of exercising my mind, quick & dirty working C# procedure:

public static List<string> SplitCSV(string line)
{
    if (string.IsNullOrEmpty(line))
        throw new ArgumentException();

    List<string> result = new List<string>();

    bool inQuote = false;
    StringBuilder val = new StringBuilder();

    // parse line
    foreach (var t in line.Split(','))
    {
        int count = t.Count(c => c == '"');

        if (count > 2 && !inQuote)
        {
            inQuote = true;
            val.Append(t);
            val.Append(',');
            continue;
        }

        if (count > 2 && inQuote)
        {
            inQuote = false;
            val.Append(t);
            result.Add(val.ToString());
            continue;
        }

        if (count == 2 && !inQuote)
        {
            result.Add(t);
            continue;
        }

        if (count == 2 && inQuote)
        {
            val.Append(t);
            val.Append(',');
            continue;
        }
    }

    // remove quotation
    for (int i = 0; i < result.Count; i++)
    {
        string t = result[i];
        result[i] = t.Substring(1, t.Length - 2);
    }

    return result;
}
Sulfaguanidine answered 25/11, 2008 at 11:11 Comment(1)
I really disliked FileHelpers. Too much manual configuration.Powder
G
2

There's an oft quoted saying:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (Jamie Zawinski)

Given that there's no official standard for CSV files (instead there are a large number of slightly incompatible styles), you need to make sure that what you implement suits the files you will be receiving. No point in implementing anything fancier than what you need - and I'm pretty sure you don't need Regular Expressions.

Here's my stab at a simple method to extract the terms - basically, it loops through the line looking for commas, keeping track of whether the current index is within a string or not:

    public IEnumerable<string> SplitCSV(string line)
    {
        int index = 0;
        int start = 0;
        bool inString = false;

        foreach (char c in line)
        {
            switch (c)
            {
                case '"':
                    inString = !inString;
                    break;

                case ',':
                    if (!inString)
                    {
                        yield return line.Substring(start, index - start);
                        start = index + 1;
                    }
                    break;
            }
            index++;
        }

        if (start < index)
            yield return line.Substring(start, index - start);
    }

Standard caveat - untested code, there may be off-by-one errors.

Limitations

  • The quotes around a value aren't removed automatically.
    To do this, add a check just before the yield return statement near the end.

  • Single quotes aren't supported in the same way as double quotes
    You could add a separate boolean inSingleQuotedString, renaming the existing boolean to inDoubleQuotedString and treating both the same way. (You can't make the existing boolean do double work because you need the string to end with the same quote that started it.)

  • Whitespace isn't automatically removed
    Some tools introduce whitespace around the commas in CSV files to "pretty" the file; it then becomes difficult to tell intentional whitespace from formatting whitespace.

Galileo answered 25/11, 2008 at 8:53 Comment(7)
Don't forget that multi-line is also an option for quoted csv, and you'd need to test it with escaped quotes like "some""data" etc...Su
And then there are those people who roll their own string handling routine every time because they've heard that regular expressions just don't cut it. If you know the data your are dealing with regexes are fine. @Blen did not ask how to write a full-fledged ready-for-all-eventualities CSV parser.Rooker
What about 'inString' variable? Some part off logic is missing.Sulfaguanidine
@Tmoaluk: I agree that Regexes are a great tool, when used in the right place - but they are too frequently used in other places as well. When the problem is this simple, I feel that "Do the simplest thing that could possibly work" does not mean a Regex. YMMV.Galileo
Still doesn't work. This code leaves the quotes around double-quoted strings, which is unwanted behavior.Perversion
@Perversion - You're correct, the quotes aren't removed by this. I don't want to make the code more complicated, but will make a note of this limitation.Galileo
@Tomalak: I think this solution is probably more performant, just as easy to maintain and easy to extent (maybe simpler, I know a lot of accomplished developers who rather not dissect others regex's) and less prone for abuse (by nature it's lazy loaded through the Enumerator)Shredding
M
1

In order to have a parseable CSV file, any double quotes inside a value need to be properly escaped somehow. The two standard ways to do this are by representing a double quote either as two double quotes back to back, or a backslash double quote. That is one of the following two forms:

""

\"

In the second form your initial string would look like this:

"a","\"b\",\"x\",\"y\"","c"

If your input string is not formatted against some rigorous format like this then you have very little chance of successfully parsing it in an automated environment.

Morrissey answered 25/11, 2008 at 8:13 Comment(2)
no, i don't think that's true. in his example, as long as you assume that the CSV is valid for as long as possible (and not just give up at the '"",b' part), then you could still parse that.Citronellal
There is every chance that you can achieve the right result - its just more work. I have code that successfully does this (though not using regex).Reichel
R
1

If all your values are guaranteed to be in quotes, look for values, not for commas:

("".*?""|"[^"]*")

This takes advantage of the fact that "the earliest longest match wins" - it looks for double quoted values first, and with a lower priority for normal quoted values.

If you don't want the enclosing quote to be part of the match, use:

"(".*?"|[^"]*)"

and go for the value in match group 1.

As I said: Prerequisite for this to work is well-formed input with guaranteed quotes or double quotes around each value. Empty values must be quoted as well! A nice side-effect is that it does not care for the separator char. Commas, TABs, semi-colons, spaces, you name it. All will work.

Rooker answered 25/11, 2008 at 8:17 Comment(1)
Thank you...for your informative reply ,this seems work for me.Blen
P
1

Try CsvHelper (a library I maintain) or FastCsvReader. Both work well. CsvHelper does writing also. Like everyone else has been saying, don't roll your own. :P

Powder answered 19/1, 2010 at 16:4 Comment(3)
Hmm, that really doesn't work in my scenario. I have a line I read from a text file, and I want to parse that string. I don't want a StreamReader as input source for a parser, I want to give it a string and have it return an array of field values...Flintlock
If you are reading line by line, you could run into issues if a field has a line break in it. You can read a string by using a StringReader. CsvHelper uses a TextReader, not a StreamReader.Powder
Duly noted. It's a design decision whether to support multiline values.Flintlock
E
1

FileHelpers supports multiline fields.

You could parse files like these:

a,"line 1
line 2
line 3"
b,"line 1
line 2
line 3"

Here is the datatype declaration:

[DelimitedRecord(",")]
public class MyRecord
{ 
 public string field1;
 [FieldQuoted('"', QuoteMode.OptionalForRead, MultilineMode.AllowForRead)]
 public string field2;
}

Here is the usage:

static void Main()
{
 FileHelperEngine engine = new FileHelperEngine(typeof(MyRecord));
 MyRecord[] res = engine.ReadFile("file.csv");       
}
Eminence answered 25/8, 2010 at 16:8 Comment(0)
S
0

Well, I'm no regex wiz, but I'm certain they have an answer for this.

Procedurally it's going through letter by letter. Set a variable, say dontMatch, to FALSE.

Each time you run into a quote toggle dontMatch.

each time you run into a comma, check dontMatch. If it's TRUE, ignore the comma. If it's FALSE, split at the comma.

This works for the example you give, but the logic you use for quotation marks is fundamentally faulty - you must escape them or use another delimiter (single quotes, for instance) to set major quotations apart from minor quotations.

For instance,

"a", ""b", ""c", "d"", "e""

will yield bad results.

This can be fixed with another patch. Rather than simply keeping a true false you have to match quotes.

To match quotes you have to know what was last seen, which gets into pretty deep parsing territory. You'll probably, at that point, want to make sure your language is designed well, and if it is you can use a compiler tool to create a parser for you.

-Adam

Sporades answered 25/11, 2008 at 8:15 Comment(0)
B
0

FileHelpers for .Net is your friend.

Bell answered 25/11, 2008 at 8:47 Comment(0)
J
0

See the link "Regex fun with CSV" at:

http://snippets.dzone.com/posts/show/4430

Janinajanine answered 25/11, 2008 at 9:1 Comment(0)
B
0

The Lumenworks CSV parser (open source, free but needs a codeproject login) is by far the best one I've used. It'll save you having to write the regex and is intuitive to use.

Blockbusting answered 25/11, 2008 at 9:27 Comment(0)
B
0

I have just try your regular expression in my code..its work fine for formated text with quote ...

but wondering if we can parse below value by Regex..

"First_Bat7679",""NAME","ENAME","FILE"","","","From: "DDD,_Ala%as"@sib.com"

I am looking for result as:

'First_Bat7679'
'"NAME","ENAME","FILE"'
''
''
'From: "DDD,_Ala%as"@sib.com'

Thanx

Blen answered 25/11, 2008 at 12:7 Comment(3)
No, that is impossible with my solution, because my regex depends on the quotes to delimit the values. You should really think about using one of the other solutions (e.g. use a parser).Rooker
If you can do anything about it, change the CSV format to something less ambiguous. Don't use commas or quotes when both can occur within the values, or at least consistently escape commas and quotes within values.Rooker
If you use a delimiter in your DSV that isn't going to occur in your values (or is going to occur very rarely, and you just escape it if that happens), you don't even need to worry about the quotes. Colons are pretty common as a delimiter, in my experience.Housel

© 2022 - 2024 — McMap. All rights reserved.