How to parse a CSV that uses ^A (i.e. \001) as the delimiter with spark-csv?

I'm terribly new to Spark, Hive, big data, Scala, and all of it. I'm trying to write a simple function that takes an sqlContext, loads a CSV file from S3, and returns a DataFrame. The problem is that this particular CSV uses the ^A (i.e. \001) character as the delimiter, and the dataset is huge, so I can't just run s/\001/,/g on it. Besides, the fields might contain commas or other characters I might otherwise use as a delimiter.

I know that the spark-csv package I'm using has a delimiter option, but I don't know how to set it so that it reads \001 as a single character and not something like an escaped 0, 0, and 1. Should I perhaps use a hiveContext or something?

Townes asked 15/3, 2016 at 9:47

As you noted, spark-csv does have a delimiter option (see its GitHub page). Use it like this:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .option("delimiter", "\u0001")
    .load("cars.csv")
Galloromance answered 15/3, 2016 at 9:55 Comment(6)
Thank you! I didn't know about the \u0 thing. Could you explain a bit more exactly what it means and does? I'm guessing 'u' is for Unicode, but I want to understand this properly. – Townes
Well, the \ character marks the beginning of an escape sequence, meaning that what follows is not part of the string literally but has a special meaning. The u means that the next four hex digits are the Unicode code point of a character, and 0001 is the code point of that special character. So it simply inserts that one special character into the string. – Galloromance
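A quick way to check that claim in the Scala REPL (the value name here is just illustrative):

val delim = "\u0001"        // the compiler replaces the escape with one character
println(delim.length)       // prints 1 -- the string holds a single character
println(delim.head.toInt)   // prints 1 -- its Unicode code point is U+0001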
Use '\x01' as the delimiter in case you are using PySpark. – Abduce
Did the above solution, .option("delimiter", "\u0001"), work for anyone? It's giving me this error: java.lang.IllegalArgumentException: Unsupported special character for delimiter: \u0001 at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:106) at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83) at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39) – Summersummerhouse
If you are using Spark 2.x, then you are using the built-in CSV parser, which, as of now, does not support setting arbitrary characters as the delimiter. – Galloromance
This answer worked for me: https://mcmap.net/q/333521/-custom-delimiter-csv-reader-spark – Anne
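For what it's worth, the kind of workaround behind that link can be sketched roughly like this: read the file as plain text and split each line on \u0001 yourself. This is only a sketch under assumptions (the path is hypothetical and exactly three string columns are assumed); it also ignores CSV quoting rules, which is usually acceptable for \u0001-delimited data:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._   // needed for .map on a Dataset and for .toDF

// Read each line as a plain string, bypassing the CSV parser entirely.
val lines = spark.read.textFile("s3://my-bucket/data.csv")  // hypothetical path

val df = lines
  .map { line =>
    // The -1 limit keeps trailing empty fields that split() would otherwise drop.
    val fields = line.split("\u0001", -1)
    (fields(0), fields(1), fields(2))  // assumes exactly three columns
  }
  .toDF("col1", "col2", "col3")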

With Spark 2.x and the CSV API, use the sep option:

val df = spark.read
  .option("sep", "\u0001")
  .csv("path_to_csv_files")
Auricula answered 7/5, 2019 at 16:46
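A slightly fuller sketch, combining sep with the header and schema-inference options shown in the earlier answer (the path is a placeholder, as above):

val df = spark.read
  .option("sep", "\u0001")        // \u0001 is the single ^A control character
  .option("header", "true")       // use the first line of each file as the header
  .option("inferSchema", "true")  // let Spark guess the column types
  .csv("path_to_csv_files")

df.printSchema()
df.show(5)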
