Adding a custom delimiter adds double quotes in the final Spark data frame CSV output
I have a data frame where I am replacing the default delimiter , with |^|. It is working fine and I am getting the expected result, except where , is found in a record. For example, I have one such record like the one below:

4295859078|^|914|^|INC|^|Balancing Item - Non Operating Income/(Expense),net|^||^||^|IIII|^|False|^||^||^||^||^|False|^||^||^||^||^|505096|^|505074|^|505074|^|505096|^|505096|^||^|505074|^|True|^||^|3014960|^||^|I|!|

So there is a , in the 4th field.

Now I am doing this to replace the ,:

val dfMainOutputFinal = dfMainOutput.na.fill("").select(
  $"DataPartition",
  $"StatementTypeCode",
  concat_ws("|^|",
    dfMainOutput.schema.fieldNames
      .filter(_ != "DataPartition")
      .map(c => col(c)): _*
  ).as("concatenated"))

val headerColumn = df.columns.filter(v => (!v.contains("^") && !v.contains("_c"))).toSeq

val header = headerColumn.dropRight(1).mkString("", "|^|", "|!|")

val dfMainOutputFinalWithoutNull = dfMainOutputFinal
  .withColumn("concatenated", regexp_replace(col("concatenated"), "null", ""))
  .withColumnRenamed("concatenated", header)


dfMainOutputFinalWithoutNull.repartition(1).write.partitionBy("DataPartition","StatementTypeCode")
  .format("csv")
  .option("nullValue", "")
  .option("header", "true")
  .option("codec", "gzip")
  .save("s3://trfsmallfffile/FinancialLineItem/output")

And I get output like this in the saved output part file:

"4295859078|^|914|^|INC|^|Balancing Item - Non Operating Income/(Expense),net|^||^||^|IIII|^|false|^||^||^||^||^|false|^||^||^||^||^|505096|^|505074|^|505074|^|505096|^|505096|^||^|505074|^|true|^||^|3014960|^||^|I|!|"

My problem is the double quotes at the start and end of the result.

If I remove the comma, then I get the correct result, like below:

4295859078|^|914|^|INC|^|Balancing Item - Non Operating Income/(Expense)net|^||^||^|IIII|^|false|^||^||^||^||^|false|^||^||^||^||^|505096|^|505074|^|505074|^|505096|^|505096|^||^|505074|^|true|^||^|3014960|^||^|I|!|
Joan answered 29/10, 2017 at 16:15

This is a standard CSV feature. If the delimiter occurs in the actual data (referred to as delimiter collision), the field is enclosed in quotes.
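The quoting rule behind this can be sketched in plain Scala. Note that writeField and the sample record below are illustrative helpers, not part of Spark's API:

```scala
// Illustrative sketch of the RFC 4180-style rule that CSV writers
// (including Spark's) apply: a field containing the output delimiter
// or a double quote is wrapped in double quotes, with embedded
// quotes doubled.
def writeField(field: String, delimiter: Char): String =
  if (field.contains(delimiter) || field.contains('"'))
    "\"" + field.replace("\"", "\"\"") + "\""
  else
    field

val record = "914|^|Balancing Item - Non Operating Income/(Expense),net|!|"

// The embedded , collides with the default delimiter, so the whole
// concatenated field gets quoted:
println(writeField(record, ','))
// A delimiter that never occurs in the data leaves the field as-is:
println(writeField(record, '\u0001'))
```

This is why your single concatenated column ends up wrapped in double quotes: the , inside the 4th field collides with Spark's default CSV delimiter.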

You can try

df.write.option("delimiter", somechar)

where somechar should be a character that does not occur in your data.
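One way to verify that, sketched in plain Scala (pickDelimiter and the sample rows are hypothetical, not Spark API): scan a sample of your output rows and keep the first candidate character that occurs in none of them.

```scala
// Hypothetical helper: returns the first candidate delimiter that
// appears in none of the sampled rows, or None if all collide.
def pickDelimiter(rows: Seq[String], candidates: Seq[Char]): Option[Char] =
  candidates.find(c => rows.forall(row => !row.contains(c)))

val sample = Seq(
  "4295859078|^|914|^|Income/(Expense),net|!|", // contains ,
  "4295859079|^|915|^|Some;other;field|!|"      // contains ;
)

// , and ; both collide with the sample, so the non-printable
// \u0001 (^A) is the first safe choice:
val chosen = pickDelimiter(sample, Seq(',', ';', '\u0001'))
// chosen == Some('\u0001')
```

The non-printable ^A character is a common last resort precisely because it almost never appears in real text data.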

EDIT:

A more robust solution would be to disable quoting entirely, since you are writing a dataframe with only one column.

dfMainOutputFinalWithoutNull.repartition(1)
  .write.partitionBy("DataPartition","StatementTypeCode")
  .format("csv")
  .option("nullValue", "")
  .option("quoteMode", "NONE")
//.option("delimiter", ";")           // assuming `;` is not present in data
  .option("header", "true")
  .option("codec", "gzip")
  .save("s3://trfsmallfffile/FinancialLineItem/output")
Gingivitis answered 29/10, 2017 at 17:15
It's not. You are doing a concat_ws with |^|; the result is a single column. When you write it using the Spark CSV package, the default delimiter is a comma, which is also present in your data, and that is why the field gets enclosed. You have to change the delimiter while writing to HDFS. – Gingivitis
Add .option("delimiter", ";") where you're writing the dataframe. – Gingivitis
Sorry, I did not get your solution. Are you suggesting that I change my delimiter? – Joan
@Anupam By using this option and selecting a delimiter that does not exist in the data (;?), this problem will be avoided. – Racemic
@Anupam You should also check out the quoteMode property for Spark CSV. I believe you can set it to .option("quoteMode", "NONE"), which will be a more robust solution than this. – Gingivitis
@Gingivitis .option("delimiter", ";") is not working because ; is present in some of the records. – Joan
@Anupam Use option("delimiter", "\001"). This is the non-printable character ^A, and it shouldn't be present in your data unless something is wrong with it. – Gingivitis
This does not work for me. The following is my code: df.coalesce(1).write.format("csv").option("quoteMode", "NONE").option("delimiter","\t").save(r'path'). The text column is still enclosed in double quotes, and I am pretty sure my text does not contain a tab, as I clean it during preprocessing. – Alys