Split a string ignoring quoted sections
Asked Answered
F

13

22

Given a string like this:

a,"string, with",various,"values, and some",quoted

What is a good algorithm to split this based on commas while ignoring the commas inside the quoted sections?

The output should be an array:

[ "a", "string, with", "various", "values, and some", "quoted" ]

Fannie answered 8/8, 2008 at 18:4 Comment(3)
What if an odd number of quotes appear in the original string?Wongawonga
That would imply an improperly quoted input string, so throwing an exception would be an option.Markusmarl
how would you put a quote inside the quotes?Auriscope
W
2

If my language of choice didn't offer a way to do this without thinking then I would initially consider two options as the easy way out:

  1. Pre-parse and replace the commas within the string with another control character then split them, followed by a post-parse on the array to replace the control character used previously with the commas.

  2. Alternatively split them on the commas then post-parse the resulting array into another array checking for leading quotes on each array entry and concatenating the entries until I reached a terminating quote.

These are hacks however, and if this is a pure 'mental' exercise then I suspect they will prove unhelpful. If this is a real world problem then it would help to know the language so that we could offer some specific advice.

Wightman answered 8/8, 2008 at 18:18 Comment(1)
I am looking for an algorithm for a similar problem where I have to process huge text files (in GBs). These text files contain qualified data i.e. the field/record delimiter is part of the data when enclosed by single/double quote. I am looking for algorithm which can help me process these files in parallel (by multiple threads). The language we use is Java. Let me know if you have any suggestionsNewsman
M
21

Looks like you've got some good answers here.

For those of you looking to handle your own CSV file parsing, heed the advice from the experts and Don't roll your own CSV parser.

Your first thought is, "I need to handle commas inside of quotes."

Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."

It's a road to madness. Don't write your own. Find a library with an extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free FileHelpers library.

Minuend answered 8/8, 2008 at 19:36 Comment(3)
a great link on secretgeek - very amusing. but it only answers the question for those using .NET sadly.Beanfeast
True; though the advice applies to devs everywhere: don't role your own CSV parser. Ruby has one built-in, and there are libraries out there for Python, C++, most any widely used language.Minuend
And although SecretGeek doesn't seem to know it, there's one built in to VB.NET too. msdn.microsoft.com/en-us/library/…Detective
D
6

Python:

import csv
reader = csv.reader(open("some.csv"))
for row in reader:
    print row
Dehumidifier answered 8/8, 2008 at 21:7 Comment(1)
I consider this as the best answer. Is exactly what I need!Tumbledown
M
2

Of course using a CSV parser is better but just for the fun of it you could:

Loop on the string letter by letter.
    If current_letter == quote : 
        toggle inside_quote variable.
    Else if (current_letter ==comma and not inside_quote) : 
        push current_word into array and clear current_word.
    Else 
        append the current_letter to current_word
When the loop is done push the current_word into array 
Mathre answered 8/8, 2008 at 18:12 Comment(0)
W
2

If my language of choice didn't offer a way to do this without thinking then I would initially consider two options as the easy way out:

  1. Pre-parse and replace the commas within the string with another control character then split them, followed by a post-parse on the array to replace the control character used previously with the commas.

  2. Alternatively split them on the commas then post-parse the resulting array into another array checking for leading quotes on each array entry and concatenating the entries until I reached a terminating quote.

These are hacks however, and if this is a pure 'mental' exercise then I suspect they will prove unhelpful. If this is a real world problem then it would help to know the language so that we could offer some specific advice.

Wightman answered 8/8, 2008 at 18:18 Comment(1)
I am looking for an algorithm for a similar problem where I have to process huge text files (in GBs). These text files contain qualified data i.e. the field/record delimiter is part of the data when enclosed by single/double quote. I am looking for algorithm which can help me process these files in parallel (by multiple threads). The language we use is Java. Let me know if you have any suggestionsNewsman
E
1

The author here dropped in a blob of C# code that handles the scenario you're having a problem with:

CSV File Imports in .Net

Shouldn't be too difficult to translate.

Endurance answered 8/8, 2008 at 18:13 Comment(0)
P
1

What if an odd number of quotes appear in the original string?

This looks uncannily like CSV parsing, which has some peculiarities to handling quoted fields. The field is only escaped if the field is delimited with double quotations, so:

field1, "field2, field3", field4, "field5, field6" field7

becomes

field1

field2, field3

field4

"field5

field6" field7

Notice if it doesn't both start and end with a quotation, then it's not a quoted field and the double quotes are simply treated as double quotes.

Insedently my code that someone linked to doesn't actually handle this correctly, if I recall correctly.

Palmetto answered 8/8, 2008 at 22:21 Comment(0)
C
1

Here's a simple python implementation based on Pat's pseudocode:

def splitIgnoringSingleQuote(string, split_char, remove_quotes=False):
    string_split = []
    current_word = ""
    inside_quote = False
    for letter in string:
      if letter == "'":
        if not remove_quotes:
           current_word += letter
        if inside_quote:
          inside_quote = False
        else:
          inside_quote = True
      elif letter == split_char and not inside_quote:
        string_split.append(current_word)
        current_word = ""
      else:
        current_word += letter
    string_split.append(current_word)
    return string_split
Charybdis answered 5/10, 2010 at 5:43 Comment(0)
C
0

I use this to parse strings, not sure if it helps here; but with some minor modifications perhaps?

function getstringbetween($string, $start, $end){
    $string = " ".$string;
    $ini = strpos($string,$start);
    if ($ini == 0) return "";
    $ini += strlen($start);   
    $len = strpos($string,$end,$ini) - $ini;
    return substr($string,$ini,$len);
}

$fullstring = "this is my [tag]dog[/tag]";
$parsed = getstringbetween($fullstring, "[tag]", "[/tag]");

echo $parsed; // (result = dog) 

/mp

Corydalis answered 8/8, 2008 at 18:36 Comment(0)
D
0

Here's a simple algorithm:

  1. Determine if the string begins with a '"' character
  2. Split the string into an array delimited by the '"' character.
  3. Mark the quoted commas with a placeholder #COMMA#
    • If the input starts with a '"', mark those items in the array where the index % 2 == 0
    • Otherwise mark those items in the array where the index % 2 == 1
  4. Concatenate the items in the array to form a modified input string.
  5. Split the string into an array delimited by the ',' character.
  6. Replace all instances in the array of #COMMA# placeholders with the ',' character.
  7. The array is your output.

Heres the python implementation:
(fixed to handle '"a,b",c,"d,e,f,h","i,j,k"')

def parse_input(input):

    quote_mod = int(not input.startswith('"'))

    input = input.split('"')
    for item in input:
        if item == '':
            input.remove(item)
    for i in range(len(input)):
        if i % 2 == quoted_mod:
            input[i] = input[i].replace(",", "#COMMA#")

    input = "".join(input).split(",")
    for item in input:
        if item == '':
            input.remove(item)
    for i in range(len(input)):
        input[i] = input[i].replace("#COMMA#", ",")
    return input

# parse_input('a,"string, with",various,"values, and some",quoted')
#  -> ['a,string', ' with,various,values', ' and some,quoted']
# parse_input('"a,b",c,"d,e,f,h","i,j,k"')
#  -> ['a,b', 'c', 'd,e,f,h', 'i,j,k']
Dermatoglyphics answered 8/8, 2008 at 18:40 Comment(0)
K
0

This is a standard CSV-style parse. A lot of people try to do this with regular expressions. You can get to about 90% with regexes, but you really need a real CSV parser to do it properly. I found a fast, excellent C# CSV parser on CodeProject a few months ago that I highly recommend!

Kiowa answered 8/8, 2008 at 18:52 Comment(2)
There's also one in the .NET framework of course. Even though it's in Microsoft.VisualBasic you could still use it from C#. msdn.microsoft.com/en-us/library/…Detective
Thanks! As a C# guy, I always forget that there are a bunch of useful VB libraries out there that I can use. Honestly, I think they are poorly named, since they are not really VB. They are just .NET.Kiowa
E
0

Here's one in pseudocode (a.k.a. Python) in one pass :-P

def parsecsv(instr):
    i = 0
    j = 0

    outstrs = []

    # i is fixed until a match occurs, then it advances
    # up to j. j inches forward each time through:

    while i < len(instr):

        if j < len(instr) and instr[j] == '"':
            # skip the opening quote...
            j += 1
            # then iterate until we find a closing quote.
            while instr[j] != '"':
                j += 1
                if j == len(instr):
                    raise Exception("Unmatched double quote at end of input.")

        if j == len(instr) or instr[j] == ',':
            s = instr[i:j]  # get the substring we've found
            s = s.strip()    # remove extra whitespace

            # remove surrounding quotes if they're there
            if len(s) > 2 and s[0] == '"' and s[-1] == '"':
                s = s[1:-1]

            # add it to the result
            outstrs.append(s)

            # skip over the comma, move i up (to where
            # j will be at the end of the iteration)
            i = j+1

        j = j+1

    return outstrs

def testcase(instr, expected):
    outstr = parsecsv(instr)
    print outstr
    assert expected == outstr

# Doesn't handle things like '1, 2, "a, b, c" d, 2' or
# escaped quotes, but those can be added pretty easily.

testcase('a, b, "1, 2, 3", c', ['a', 'b', '1, 2, 3', 'c'])
testcase('a,b,"1, 2, 3" , c', ['a', 'b', '1, 2, 3', 'c'])

# odd number of quotes gives a "unmatched quote" exception
#testcase('a,b,"1, 2, 3" , "c', ['a', 'b', '1, 2, 3', 'c'])
Eight answered 8/8, 2008 at 19:15 Comment(0)
U
0

I just couldn't resist to see if I could make it work in a Python one-liner:

arr = [i.replace("|", ",") for i in re.sub('"([^"]*)\,([^"]*)"',"\g<1>|\g<2>", str_to_test).split(",")]

Returns ['a', 'string, with', 'various', 'values, and some', 'quoted']

It works by first replacing the ',' inside quotes to another separator (|), splitting the string on ',' and replacing the | separator again.

University answered 8/8, 2008 at 19:51 Comment(1)
How do you know there aren't any | in the original string? What about escaping quotes inside quoted strings?Detective
D
0

Since you said language agnostic, I wrote my algorithm in the language that's closest to pseudocode as posible:

def find_character_indices(s, ch):
    return [i for i, ltr in enumerate(s) if ltr == ch]


def split_text_preserving_quotes(content, include_quotes=False):
    quote_indices = find_character_indices(content, '"')

    output = content[:quote_indices[0]].split()

    for i in range(1, len(quote_indices)):
        if i % 2 == 1: # end of quoted sequence
            start = quote_indices[i - 1]
            end = quote_indices[i] + 1
            output.extend([content[start:end]])

        else:
            start = quote_indices[i - 1] + 1
            end = quote_indices[i]
            split_section = content[start:end].split()
            output.extend(split_section)

        output += content[quote_indices[-1] + 1:].split()                                                                 

    return output
Decant answered 15/10, 2018 at 2:28 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.