Hi, I have the output of my Spark DataFrame, which creates a folder structure containing many part files. Now I have to merge all the part files inside each folder and rename that single file to match the folder path name.
This is how I do the partitioning:
df.write.partitionBy("DataPartition","PartitionYear")
  .format("csv")
  .option("nullValue", "")
  .option("header", "true")
  .option("codec", "gzip")
  .save("hdfs:///user/zeppelin/FinancialLineItem/output")
It creates a folder structure like this:
hdfs:///user/zeppelin/FinancialLineItem/output/DataPartition=Japan/PartitionYear=1971/part-00001-87a61115-92c9-4926-a803-b46315e55a08.c000.csv.gz
hdfs:///user/zeppelin/FinancialLineItem/output/DataPartition=Japan/PartitionYear=1971/part-00002-87a61115-92c9-4926-a803-b46315e55a08.c001.csv.gz
I have to create the final file like this:
hdfs:///user/zeppelin/FinancialLineItem/output/Japan.1971.currenttime.csv.gz
No part files here; both 001 and 002 are merged into one.
My data is very big (300 GB, or 35 GB zipped), so coalesce(1) and repartition(1) become very slow.
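One workaround, since coalesce(1) is too slow, is to let Spark write the part files in parallel as above and then merge each leaf folder into a single file with Hadoop's `FileUtil.copyMerge` (available in Hadoop 2.x, removed in Hadoop 3). Gzip streams can simply be concatenated, so the merged file is still a valid `.csv.gz`. This is only a sketch using one partition folder from my paths as an example; the destination name is built the way I want it (partition values plus current time):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)

// source: one leaf partition folder full of part files
val srcDir = new Path("hdfs:///user/zeppelin/FinancialLineItem/output/DataPartition=Japan/PartitionYear=1971")

// destination: single file named from the partition values plus current time
val dstFile = new Path(s"hdfs:///user/zeppelin/FinancialLineItem/output/Japan.1971.${System.currentTimeMillis}.csv.gz")

// copyMerge concatenates every file under srcDir into dstFile;
// deleteSource = true removes the part files afterwards
FileUtil.copyMerge(fs, srcDir, fs, dstFile, true, conf, null)
```

To cover all folders, you would list the `DataPartition=*/PartitionYear=*` directories with `fs.globStatus` and run the same merge for each one.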
I have seen one solution here: Write single CSV file using spark-csv, but I am not able to implement it. Please help me with it.
repartition throws an error:
error: value repartition is not a member of org.apache.spark.sql.DataFrameWriter[org.apache.spark.sql.Row]
dfMainOutputFinalWithoutNull.write.repartition("DataPartition","StatementTypeCode")
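The error occurs because repartition is a method on the DataFrame, not on the DataFrameWriter, so it has to be called before .write. A sketch with the same column names as above (assuming repartitioning by the partition columns is what is intended):

```scala
import org.apache.spark.sql.functions.col

// Repartition the DataFrame first, then write. All rows with the same
// (DataPartition, StatementTypeCode) values hash to the same shuffle
// partition, so each output folder ends up with a single part file.
dfMainOutputFinalWithoutNull
  .repartition(col("DataPartition"), col("StatementTypeCode"))
  .write
  .partitionBy("DataPartition", "StatementTypeCode")
  .format("csv")
  .option("nullValue", "")
  .option("header", "true")
  .option("codec", "gzip")
  .save("hdfs:///user/zeppelin/FinancialLineItem/output")
```

Note this still shuffles all the data, so it will be slow for 300 GB; the merge-after-write approach avoids that shuffle.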
Comments:
- rdd/df.repartition(x), where x is the number of files you want to create for that rdd/df – Bedfellow
- Does dfMainOutputFinalWithoutNull.coalesce(5).write.partitionBy("DataPartition","StatementTypeCode") work? – Pelagia
- Use s3a and the nearest region for the s3 bucket – Bedfellow