Keep double quotes in a text file using csv reader

Hi, I have a text file containing the string:

hello,"foo, bar"

I want to split it into a list like this:

['hello', '"foo, bar"']

Is there a way I can achieve this?

This is what I am trying at the moment:

for line in sys.stdin:
    csv_file = StringIO.StringIO(line)
    csv_reader = csv.reader(csv_file)

I want the line to be split into two strings, i.e.:

'hello' and '"foo, bar"'

Faustinafaustine answered 14/4, 2016 at 16:28 Comment(12)
What result do you expect? - Pupa
The same as the input. - Faustinafaustine
The code works fine for me, as it should: csv.reader does not split on commas inside double quotes. - Denominational
Then why use csv.reader()? - Pupa
To split them into a list. - Faustinafaustine
Are you sure it is not ['hello', 'foo, bar']? - Denominational
Yes. The input is hello,"foo, bar" and I want it to be split into two strings: hello and "foo, bar". - Faustinafaustine
It should be split into ['hello', 'foo, bar']. How are you running it? - Denominational
I want to keep "foo, bar", quotes included. - Faustinafaustine
Run the code on the sample input and add it to your question exactly as you ran it, with the output copied and pasted. - Denominational
I have edited the question. I hope it makes more sense now. - Faustinafaustine
I don't know of any way to wrap values in extra quotes using the csv lib; you will have to do it yourself. - Denominational

Say you read a row from a CSV:

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3
import csv

infile = StringIO('hello,"foo, bar"')
reader = csv.reader(infile)
row = next(reader)  # row is ['hello', 'foo, bar']

The second value in the row is foo, bar rather than "foo, bar". This isn't a Python oddity; it's a reasonable interpretation of CSV syntax. The quotes probably weren't placed there to be part of the value, but to show that foo, bar is a single value that shouldn't be split into foo and bar at the comma (,). An alternative would be to escape the comma when creating the CSV file, so the line would look like:

hello,foo\, bar

So it's quite a strange request to want to keep those quotes. If we knew more about your use case and the bigger picture, we could help you better. What are you trying to achieve? Where does the input file come from? Is it really a CSV, or is it some other syntax that looks similar?

For example, if you know that every line consists of two values separated by a comma, and that the first value never contains a comma, then you can just split on the first comma:

print('hello,"foo, bar"'.split(',', 1))  # => ['hello', '"foo, bar"']

But I doubt the input is that restricted, which is why things like quotes are needed to resolve ambiguities.

If you're trying to write to a CSV again, then the quotes will be recreated as you're doing so. They don't have to be there in the intermediate list:

outfile = StringIO()
writer = csv.writer(outfile)
writer.writerow(row)
print(outfile.getvalue())

This will print

hello,"foo, bar"

You can customise the exact CSV output by setting a new dialect.
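
As a rough sketch of what that could look like, assuming Python 3: the dialect name always_quoted below is made up for this illustration, while quoting=csv.QUOTE_ALL is a standard option of the csv module. A dialect that quotes every field would write the row like this:

```python
import csv
import io

# 'always_quoted' is a hypothetical dialect name chosen for this example.
csv.register_dialect('always_quoted', quoting=csv.QUOTE_ALL)

row = ['hello', 'foo, bar']
outfile = io.StringIO()
csv.writer(outfile, 'always_quoted').writerow(row)
result = outfile.getvalue()
print(result)  # "hello","foo, bar" plus the default '\r\n' line terminator
```

Note that this quotes hello as well, so it changes the output rather than reproducing the original line.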

If you want to grab the individual values in the row with the appropriate quoting rules applied to them, it's possible, but it's a bit of a hack:

# We're going to write individual strings, so we don't want a line terminator
csv.register_dialect('no_line_terminator', lineterminator='')

def maybe_quote_string(s):
    out = StringIO()

    # writerow iterates over its argument, so don't give it a plain string
    # or it'll break it up into characters
    csv.writer(out, 'no_line_terminator').writerow([s])

    return out.getvalue()

print(maybe_quote_string('foo, bar'))
print([maybe_quote_string(s) for s in row])

The output is:

"foo, bar"
['hello', '"foo, bar"']

This is the closest I can come to answering your question. It doesn't really keep the double quotes; it removes them and adds them back, likely using the same rules that put them there in the first place.
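
To illustrate that caveat, here is a small self-contained sketch, assuming Python 3. The helper name requote is hypothetical and simply repeats the idea of maybe_quote_string. A field that was quoted in the input without needing quotes does not get them back:

```python
import csv
import io

def requote(field):
    # Hypothetical helper: write one field with the default quoting rules
    # (QUOTE_MINIMAL) and no line terminator.
    out = io.StringIO()
    csv.writer(out, lineterminator='').writerow([field])
    return out.getvalue()

line = '"hello","foo, bar"'          # both fields are quoted in the input
row = next(csv.reader([line]))
requoted = [requote(field) for field in row]

print(row)       # ['hello', 'foo, bar']
print(requoted)  # ['hello', '"foo, bar"']
```

The quotes around hello are lost because the writer only quotes fields that need it.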

I'll say it again: you're probably headed down the wrong path with this question. Others will probably agree, and that's why you're struggling to get good answers. What is the bigger problem that you're trying to solve? We can help you better with that.

Crider answered 15/4, 2016 at 9:39 Comment(2)
I liked your answer, but I figured out a different way to do it. - Faustinafaustine
@Faustinafaustine Can you please tell us how you solved the problem? - Greenshank

It depends on your use case. If the only "s are there for values containing commas (e.g. "foo, bar"), then you can use a csv writer to put them back in.

import io
import csv

infile = io.StringIO('hello,"foo, bar"')
outfile = io.StringIO()
reader = csv.reader(infile)
for row in reader:
    inList = row
    break
print(inList)
# As an output string
writer = csv.writer(outfile)
writer.writerow(inList)
outList = outfile.getvalue().strip()
print(outList)
# As a List
outList = []
for i in range(len(inList)):
    outfile = io.StringIO()
    writer = csv.writer(outfile)
    writer.writerow([inList[i]])
    outList.append(outfile.getvalue().strip())
print(outList)

Output

['hello', 'foo, bar']
hello,"foo, bar"
['hello', '"foo, bar"']

However, if there are other, unnecessary "s that you want to preserve (e.g. '"hello","foo, bar",humbug'), and every field containing a , is correctly wrapped in "s, then you can split the line on , and look for 'broken' fields (ones that start with " but don't end with ") and rejoin them with the pieces that follow:

line = '"hello","foo, bar",humbug'
fields = line.split(',')
print(fields)
values = []
i = 0
while i < len(fields):
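    # If a field doesn't start with a ", or starts and ends with "s
    # (startswith/endswith also cope with empty fields such as in 'a,,b')
    if not fields[i].startswith('"') or (len(fields[i]) > 1 and fields[i].endswith('"')):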
        values.append(fields[i])        # It's a stand alone value
        i += 1
        continue
    value = fields[i]           # A value that has been split
    i += 1
    while i < len(fields):
        value += ',' + fields[i]
        i += 1
        if value[-1] == '"':     # The last part would have ended in a "
            break
    values.append(value)
print(values)

Output

['"hello"', '"foo', ' bar"', 'humbug']
['"hello"', '"foo, bar"', 'humbug']
Shani answered 4/1, 2022 at 0:35 Comment(0)

Alright, this took a long time to work out and it is in no way pretty, but:

>>> import re
>>> s = 'hello,"foo, bar"'
>>>
>>> # Swap each quoted chunk for a comma-free placeholder
>>> replacements = {}
>>> m = re.search('"[^"]*"', s)
>>> while m:
...     key = 'unique_phrase_%d_' % len(replacements)
...     replacements[key] = m.group(0)
...     s = re.sub('"[^"]*"', key, s, count=1)
...     m = re.search('"[^"]*"', s)
...
>>> # Split on commas, then put the quoted chunks back
>>> final_list = []
>>> for element in s.split(","):
...     for key, quoted in replacements.items():
...         element = element.replace(key, quoted)
...     final_list.append(element)
...
>>> print(final_list)
['hello', '"foo, bar"']

It looks ugly to me, but I couldn't find a clear way to make it more Pythonic.
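
One possibly more compact variant of the same idea is sketched below. It is not the approach above, but a single regular expression that splits only on commas outside quotes. It assumes the quotes in each line are balanced; the names COMMA_OUTSIDE_QUOTES and split_keeping_quotes are made up for this example.

```python
import re

# A comma is outside a quoted field when the rest of the line after it
# contains an even number of double quotes. Assumes balanced quotes.
COMMA_OUTSIDE_QUOTES = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

def split_keeping_quotes(line):
    # Hypothetical helper: split on unquoted commas, leaving quotes in place.
    return COMMA_OUTSIDE_QUOTES.split(line)

first = split_keeping_quotes('hello,"foo, bar"')
second = split_keeping_quotes('"hello","foo, bar",humbug')
print(first)   # ['hello', '"foo, bar"']
print(second)  # ['"hello"', '"foo, bar"', 'humbug']
```

Unlike the csv module, this does no unescaping at all; it only decides where to cut.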

Dion answered 14/4, 2016 at 17:31 Comment(0)

A little late to the party, but the csv library has a quoting parameter which should do what you want (set it to csv.QUOTE_NONE).

Cristoforo answered 15/4, 2016 at 8:43 Comment(1)
That would cause it to split into "foo and bar", wouldn't it? - Ohalloran
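
A quick check, assuming Python 3's csv module, suggests the comment is right. With quoting=csv.QUOTE_NONE the reader treats the quote characters as ordinary text, so the quotes are kept but the comma inside them still splits the field:

```python
import csv

line = 'hello,"foo, bar"'
# QUOTE_NONE tells the reader not to give quote characters any special meaning.
row = next(csv.reader([line], quoting=csv.QUOTE_NONE))
print(row)  # ['hello', '"foo', ' bar"']
```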
