How to parse a CSV that uses ^A (i.e. \001) as the delimiter with spark-csv?

I'm terribly new to Spark, Hive, big data, Scala, and all of it. I'm trying to write a simple function that takes an sqlContext, loads a CSV file from S3, and returns a DataFrame. The problem is that this particular CSV uses the ^A (i.e. \001) character as the delimiter, and the dataset is huge, so I can't just run s/\001/,/g on it. Besides, the fields might contain commas or other characters I might otherwise use as a delimiter.

I know that the spark-csv package I'm using has a delimiter option, but I don't know how to set it so that it reads \001 as a single character and not something like an escaped 0, 0, and 1. Should I perhaps use a hiveContext or something?

Townes asked 15/3, 2016 at 9:47

As you noted, spark-csv does have a delimiter option (see its GitHub page). Use it like this:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .option("delimiter", "\u0001")
    .load("cars.csv")
Galloromance answered 15/3, 2016 at 9:55 Comment(6)
Thank you! I didn't know about the \u0 thing. Could you explain a bit more exactly what it means and does? I'm guessing 'u' is for Unicode, but I want to understand this properly. – Townes
Well, the \ character marks the beginning of an escape sequence, meaning that what follows is not part of the string literally but has a special meaning. The u means that the next four hex digits are the Unicode code point of a character, and 0001 is the code point of that special character. So it simply inserts that one special character into the string. – Galloromance
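A quick way to check that claim in the Scala REPL (the value name here is just illustrative):

val delim = "\u0001"        // the compiler replaces the escape with one character
println(delim.length)       // prints 1 -- the string holds a single character
println(delim.head.toInt)   // prints 1 -- its Unicode code point is U+0001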
Use '\x01' as the delimiter in case you are using PySpark. – Abduce
Did the above solution, .option("delimiter", "\u0001"), work for anyone? It's giving me this error: java.lang.IllegalArgumentException: Unsupported special character for delimiter: \u0001 at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:106) at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83) at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39) – Summersummerhouse
If you are using Spark 2.x, then you are using the built-in CSV parser, which, as of now, does not support setting arbitrary characters as the delimiter. – Galloromance
This answer worked for me: https://mcmap.net/q/333521/-custom-delimiter-csv-reader-spark – Anne
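For what it's worth, the kind of workaround behind that link can be sketched roughly like this: read the file as plain text and split each line on \u0001 yourself. This is only a sketch under assumptions (the path is hypothetical and exactly three string columns are assumed); it also ignores CSV quoting rules, which is usually acceptable for \u0001-delimited data:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._   // needed for .map on a Dataset and for .toDF

// Read each line as a plain string, bypassing the CSV parser entirely.
val lines = spark.read.textFile("s3://my-bucket/data.csv")  // hypothetical path

val df = lines
  .map { line =>
    // The -1 limit keeps trailing empty fields that split() would otherwise drop.
    val fields = line.split("\u0001", -1)
    (fields(0), fields(1), fields(2))  // assumes exactly three columns
  }
  .toDF("col1", "col2", "col3")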

With Spark 2.x and the CSV API, use the sep option:

val df = spark.read
  .option("sep", "\u0001")
  .csv("path_to_csv_files")
Auricula answered 7/5, 2019 at 16:46
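A slightly fuller sketch, combining sep with the header and schema-inference options shown in the earlier answer (the path is a placeholder, as above):

val df = spark.read
  .option("sep", "\u0001")        // \u0001 is the single ^A control character
  .option("header", "true")       // use the first line of each file as the header
  .option("inferSchema", "true")  // let Spark guess the column types
  .csv("path_to_csv_files")

df.printSchema()
df.show(5)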
