Spark fails to read CSV when last column name contains spaces

I have a CSV that looks like this:

+-----------------+-----------------+-----------------+
| Column One      | Column Two      | Column Three    |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+

In plain text, it actually looks like this:

Column One,Column Two,Column Three
This is a value,This is a value,This is a value
This is a value,This is a value,This is a value
This is a value,This is a value,This is a value

My spark.read method looks like this:

val df = spark.read
    .format("csv")
    .schema(schema)
    .option("quote", "\"")
    .option("escape", "\"")
    .option("header", "true")
    .option("multiLine", "true")
    .option("mode", "DROPMALFORMED")
    .load(inputFilePath)
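
For reference, here's the schema object being passed in (it's also shared in the comments below):

import org.apache.spark.sql.types.{StructType, StructField, StringType}

val schema = StructType(Array(
  StructField("Column One", StringType, true),
  StructField("Column Two", StringType, true),
  StructField("Column Three", StringType, true)
))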

When multiLine is set to true, the df loads as empty. It loads fine when multiLine is set to false, but I need multiLine set to true.

If I change the name of Column Three to ColumnThree, and also update that in the schema object, then it works fine. It seems like multiLine is being applied to the header row! I was hoping that wouldn't be the case when header is also set to true.

Any ideas how to get around this? Should I be using the univocity parser instead of the default commons?

UPDATE:

I don't know why that mocked data was working fine. Here's a closer representation of the data:

CSV (Just 1 header and 1 line of data...):

Digital ISBN,Print ISBN,Title,Price,File Name,Description,Book Cover File Name
97803453308,test,This is English,29.99,qwe_1.txt,test,test

Schema & the spark.read method:

val df = spark.read
  .format("csv")
  .schema(StructType(Array(
    StructField("Digital ISBN", StringType, true),
    StructField("Print ISBN", StringType, true),
    StructField("Title", StringType, true),
    StructField("File Name", StringType, true),
    StructField("Price", StringType, true),
    StructField("Description", StringType, true),
    StructField("Book Cover File Name", StringType, true)
  )))
  .option("quote", "\"")
  .option("escape", "\"")
  .option("header", "true")
  .option("multiLine", "true")
  .option("mode", "DROPMALFORMED")
  .load(inputFilePath)

df.show() result in spark-shell:

+------------+----------+-----+---------+-----+-----------+--------------------+
|Digital ISBN|Print ISBN|Title|File Name|Price|Description|Book Cover File Name|
+------------+----------+-----+---------+-----+-----------+--------------------+
+------------+----------+-----+---------+-----+-----------+--------------------+

UPDATE 2:

I think I found "what's different". When I copy the data out of the CSV and save it to another CSV, it works fine. But the original CSV (which was saved by Excel) fails... The CSV saved by Excel is 1290 bytes, while the CSV I created myself (which works fine) is 1292 bytes...

UPDATE 3:

I opened the two files mentioned in Update 2 in Vim and noticed that the CSV saved by Excel had ^M instead of newlines. All of my testing prior to this was flawed, because it was always comparing a CSV originally saved by Excel against a CSV created from Sublime, and Sublime wasn't showing the difference. I'm sure there's a setting or package I can install to see that, because I use Sublime as my go-to one-off file editor...
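
For anyone debugging something similar, here's a quick sketch for checking a file's line endings from Scala (the file name is hypothetical, not from the original post):

import scala.io.Source

// Count raw CR and LF characters to see which line endings the file uses.
// LF only -> Unix; CR+LF -> Windows; CR only -> old-Mac style (shows as ^M in Vim).
val raw = Source.fromFile("books.csv", "ISO-8859-1").mkString
println(s"CR: ${raw.count(_ == '\r')}, LF: ${raw.count(_ == '\n')}")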

Not sure if I should close this question since the title is misleading. Then again, there's gotta be some value to someone out there lol...

Sealed answered 22/5, 2018 at 23:33 Comment(5)
I tested without the schema and it works perfectly. So can you share the schema? – Unionism
StructType(Array(StructField("Column One", StringType, true), StructField("Column Two", StringType, true), StructField("Column Three", StringType, true))) – Sealed
It works for me. What Spark and Scala versions are you working on? – Unionism
@LXXIII Which Spark version are you using? Because in Spark v2.3.0, your code works fine. – Abran
@Abran 2.2.0. Granted, I simplified my code for this post. But now I realize I should have tested this simplified code before posting, and if it worked, eliminated the differences until the real issue bubbled up. Okay, I've got some iterating to do haha. – Sealed

Since the question has a few up-votes, here's the resolution to the original problem as an answer...

Newlines in files saved in the Windows world contain both carriage return and line feed. Spark (running on Linux) sees this as a malformed row and drops it, because in its world, newlines are just line feed.
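
One possible workaround (a sketch, not part of the original resolution) is to normalize the line endings before Spark reads the file:

import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

// Rewrite the file with Unix newlines; inputFilePath is the same
// variable used in the question, and the ".lf" suffix is made up here.
val bytes = Files.readAllBytes(Paths.get(inputFilePath))
val normalized = new String(bytes, StandardCharsets.UTF_8).replaceAll("\r\n?", "\n")
val fixedPath = inputFilePath + ".lf"
Files.write(Paths.get(fixedPath), normalized.getBytes(StandardCharsets.UTF_8))

val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("multiLine", "true")
  .load(fixedPath)

If I remember correctly, Spark 3.0+ also exposes a lineSep option on the CSV reader, which lets you declare the line delimiter explicitly instead of rewriting the file.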

Lessons:

  • It's important to be familiar with the origin of the file you're working with.
  • When debugging data processing issues, work with an editor that shows carriage returns.
Sealed answered 11/12, 2018 at 6:19 Comment(0)

I was facing the same issue, with the multiLine option being applied to the header. I solved it by adding the additional option to ignore trailing white space; presumably the stray carriage return at the end of the header line then gets stripped as white space.

  .option("header", true)
  .option("multiLine", true)
  .option("ignoreTrailingWhiteSpace", true)
Rickirickie answered 29/10, 2018 at 18:27 Comment(0)
