Is there a way to read a multi-line csv file in Apache Beam using the ReadFromText transform (Python)?

Is there a way to read a multi-line CSV file using the ReadFromText transform in Python? I have a file that contains a single CSV record spanning two physical lines. I am trying to make Apache Beam read the input as one line, but cannot get it to work.

import apache_beam

def print_each_line(line):
    print(line)

path = './input/testfile.csv'
# Here are the contents of testfile.csv
# foo,bar,"blah blah
# more blah blah",baz

p = apache_beam.Pipeline()

(p
 | 'ReadFromFile' >> apache_beam.io.ReadFromText(path)
 | 'PrintEachLine' >> apache_beam.Map(print_each_line)
 )

p.run()

# Here is the output:
# foo,bar,"blah blah
# more blah blah",baz

The above code parses the input as two lines even though the standard for multi-line csv files is to wrap multi-line elements within double-quotes.

Cabinetwork answered 19/4, 2018 at 5:7 Comment(3)
You need a PCollection with only one line in it. Am I right? - Lynnet
@ArjunKay Yes, the input I currently have is one line, but Beam treats it as two. - Cabinetwork
Do you guys know if support for multi-line CSV has improved in newer versions, given that this was asked long ago? I couldn't find a lot of relevant material. - Frigorific

Beam's ReadFromText doesn't parse CSV; it splits the file on newlines regardless of quoting. You can, however, use Python's csv.reader. Here's an example:

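import csv
import io

import apache_beam

def print_each_line(line):
  print(line)

p = apache_beam.Pipeline()

(p
 | apache_beam.Create(["test.csv"])
 # FileSystems.open returns a binary stream; on Python 3, csv.reader needs
 # text, so wrap the stream in io.TextIOWrapper.
 | apache_beam.FlatMap(lambda filename:
     csv.reader(io.TextIOWrapper(
         apache_beam.io.filesystems.FileSystems.open(filename))))
 | apache_beam.Map(print_each_line))

p.run()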

Output:

['foo', 'bar', 'blah blah\nmore blah blah', 'baz']
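
To illustrate why csv.reader yields a single record here while newline splitting yields two, here is a minimal standard-library sketch with no Beam involved. The sample string mirrors testfile.csv from the question; the variable names are illustrative only.

```python
import csv
import io

# Same contents as testfile.csv in the question: one record, two physical lines.
sample = 'foo,bar,"blah blah\nmore blah blah",baz\n'

# Splitting on newlines (roughly what a line-based reader does) ignores quoting.
physical_lines = sample.splitlines()

# csv.reader tracks quotes, so the embedded newline stays inside one field.
records = list(csv.reader(io.StringIO(sample, newline='')))

print(len(physical_lines))
print(len(records))
```

The quoted field keeps its embedded newline, so the parser sees one record with four fields, whereas the line-based split sees two fragments.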
Reproval answered 20/4, 2018 at 23:21 Comment(0)

None of the other answers worked for me, but this did:

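import csv
import io

import apache_beam as beam
from apache_beam.io import WriteToText

# Assumes p is a beam.Pipeline and known_args holds parsed --input/--output arguments.
(
  p
  | beam.Create([known_args.input])
  | beam.FlatMap(lambda filename:
      csv.reader(io.TextIOWrapper(beam.io.filesystems.FileSystems.open(filename))))
  | "Take only name" >> beam.Map(lambda x: x[0])
  | WriteToText(known_args.output)
)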
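
The likely reason the io.TextIOWrapper wrapper matters is that, on Python 3, csv.reader requires text while file-system APIs commonly hand back bytes. The sketch below shows that difference using only the standard library; io.BytesIO is a stand-in for the real stream, which is an assumption for illustration.

```python
import csv
import io

# Stand-in for a binary file stream (assumption: the real stream reads like BytesIO).
binary_stream = io.BytesIO(b'foo,bar,"blah blah\nmore blah blah",baz\n')

# Passing the binary stream straight to csv.reader fails on Python 3.
try:
    next(csv.reader(binary_stream))
    bytes_error = None
except csv.Error as exc:
    bytes_error = str(exc)

# Wrapping it in TextIOWrapper decodes bytes to str, which csv.reader accepts.
binary_stream.seek(0)
text_stream = io.TextIOWrapper(binary_stream, encoding='utf-8', newline='')
rows = list(csv.reader(text_stream))

print(bytes_error)
print(rows)
```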
Destructible answered 8/11, 2020 at 5:17 Comment(0)

ReadFromText parses a text file as newline-delimited elements. So ReadFromText treats two lines as two elements. If you would like to have the contents of the file as a single element, you could do the following:

import apache_beam as beam

# Read the whole file into a single string so it becomes one element.
with open(path) as f:
    contents = [f.read()]

p = beam.Pipeline()
p | beam.Create(contents)
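
Once the file contents are a single string element, they still need to be split into CSV records. A minimal sketch of that step, using only the standard library, might look like the following; parse_csv_text is a hypothetical helper name, and the sample string is made up for illustration.

```python
import csv
import io

def parse_csv_text(contents):
    """Yield one list of fields per CSV record in a whole-file string."""
    # newline='' leaves line endings untouched so csv.reader handles them itself.
    for row in csv.reader(io.StringIO(contents, newline='')):
        yield row

# Illustrative input: the record from the question plus one ordinary record.
whole_file = 'foo,bar,"blah blah\nmore blah blah",baz\nq,r,s,t\n'
rows = list(parse_csv_text(whole_file))
print(rows)
```

In a pipeline, a generator like this could be applied with beam.FlatMap(parse_csv_text) after the Create step. Note that this approach reads the whole file into memory on the machine that builds the pipeline, so it is only suitable for small files.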
Lynnet answered 22/4, 2018 at 5:24 Comment(0)
