I have a data frame in which I am replacing the default delimiter , with |^|.
It works fine and I get the expected result, except where a , is found in the records.
For example, I have one such record, shown below:
4295859078|^|914|^|INC|^|Balancing Item - Non Operating Income/(Expense),net|^||^||^|IIII|^|False|^||^||^||^||^|False|^||^||^||^||^|505096|^|505074|^|505074|^|505096|^|505096|^||^|505074|^|True|^||^|3014960|^||^|I|!|
So there is a , in the 4th field.
This is what I am doing to replace the , delimiter:
// Join every column except DataPartition into one "|^|"-delimited string column.
val dfMainOutputFinal = dfMainOutput.na.fill("").select(
  $"DataPartition",
  $"StatementTypeCode",
  concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition").map(c => col(c)): _*).as("concatenated"))
// Build the header line from the source column names.
val headerColumn = df.columns.filter(v => (!v.contains("^") && !v.contains("_c"))).toSeq
val header = headerColumn.dropRight(1).mkString("", "|^|", "|!|")
// Strip the literal text "null" and use the header line as the column name.
val dfMainOutputFinalWithoutNull = dfMainOutputFinal
  .withColumn("concatenated", regexp_replace(col("concatenated"), "null", ""))
  .withColumnRenamed("concatenated", header)
dfMainOutputFinalWithoutNull.repartition(1).write.partitionBy("DataPartition","StatementTypeCode")
  .format("csv")
  .option("nullValue", "")
  .option("header", "true")
  .option("codec", "gzip")
  .save("s3://trfsmallfffile/FinancialLineItem/output")
And I get output like this in the saved part file:
"4295859078|^|914|^|INC|^|Balancing Item - Non Operating Income/(Expense),net|^||^||^|IIII|^|false|^||^||^||^||^|false|^||^||^||^||^|505096|^|505074|^|505074|^|505096|^|505096|^||^|505074|^|true|^||^|3014960|^||^|I|!|"
My problem is the " at the start and end of the result.
If I remove the comma, I get the correct result, like below:
4295859078|^|914|^|INC|^|Balancing Item - Non Operating Income/(Expense)net|^||^||^|IIII|^|false|^||^||^||^||^|false|^||^||^||^||^|505096|^|505074|^|505074|^|505096|^|505096|^||^|505074|^|true|^||^|3014960|^||^|I|!|
You are concatenating the columns using concat_ws with |^|. The result is a single column. When you write it using the Spark CSV package, the default delimiter is , which is also present in your data, and that's why the value gets enclosed in quotes. You have to change the delimiter while writing to HDFS. – Gingivitis
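
To illustrate the comment above, here is a minimal sketch of the write step with the CSV delimiter changed. It is one possible approach, not necessarily what the commenter had in mind. The object name SingleColumnWriter, the method write, and the choice of "\u0001" as delimiter are assumptions made for this example; the only requirement is that the delimiter is a single character that never occurs in the data.

```scala
import org.apache.spark.sql.DataFrame

object SingleColumnWriter {
  // Assumption: this control character never appears in any record,
  // so the CSV writer has no reason to quote the concatenated value.
  val SafeDelimiter: String = "\u0001"

  // Writes the already concatenated data frame (partition columns plus one
  // string column) using the same options as in the question, except that
  // the delimiter is no longer the default comma.
  def write(df: DataFrame, outputPath: String): Unit =
    df.repartition(1)
      .write
      .partitionBy("DataPartition", "StatementTypeCode")
      .format("csv")
      .option("delimiter", SafeDelimiter)
      .option("nullValue", "")
      .option("header", "true")
      .option("codec", "gzip")
      .save(outputPath)
}
```

It would be called as SingleColumnWriter.write(dfMainOutputFinalWithoutNull, "s3://trfsmallfffile/FinancialLineItem/output"). Because only one data column is written, the delimiter itself never shows up in the output; it just stops the comma from triggering quoting. Values that contain a " or a line break may still be quoted, so this sketch assumes the data has neither.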