How to merge all part files in a folder created by a Spark DataFrame and rename them to the folder name in Scala

Hi, I have output from my Spark DataFrame that creates a folder structure with many part files. Now I have to merge all the part files inside each folder and rename that single file to the folder path name.

This is how I do the partitioning:

df.write.partitionBy("DataPartition","PartitionYear")
  .format("csv")
  .option("nullValue", "")
  .option("header", "true")/
  .option("codec", "gzip")
  .save("hdfs:///user/zeppelin/FinancialLineItem/output")

It creates a folder structure like this:

hdfs:///user/zeppelin/FinancialLineItem/output/DataPartition=Japan/PartitionYear=1971/part-00001-87a61115-92c9-4926-a803-b46315e55a08.c000.csv.gz
hdfs:///user/zeppelin/FinancialLineItem/output/DataPartition=Japan/PartitionYear=1971/part-00002-87a61115-92c9-4926-a803-b46315e55a08.c001.csv.gz

I have to create the final file like this:

hdfs:///user/zeppelin/FinancialLineItem/output/Japan.1971.currenttime.csv.gz

No part files here; both 001 and 002 are merged into one file.

My data is very big (300 GB raw, about 35 GB gzipped), so coalesce(1) and repartition become very slow.

I have seen one solution here: Write single CSV file using spark-csv, but I am not able to implement it. Please help me with it.

Repartition throws an error:

error: value repartition is not a member of org.apache.spark.sql.DataFrameWriter[org.apache.spark.sql.Row]
       dfMainOutputFinalWithoutNull.write.repartition("DataPartition","StatementTypeCode")
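
For what it's worth, repartition is defined on the DataFrame itself, not on the DataFrameWriter returned by .write, so it has to be called before .write. A minimal sketch reusing the write options from above (column names follow the question; repartitioning by the same columns used in partitionBy is a common way to get roughly one file per partition directory, at the cost of a full shuffle):

import org.apache.spark.sql.functions.col

// repartition belongs on the DataFrame, *before* .write.
// Shuffling by the same columns used in partitionBy routes all rows for a
// given (DataPartition, PartitionYear) pair into a single task, so each
// output directory typically ends up with one part file.
df.repartition(col("DataPartition"), col("PartitionYear"))
  .write.partitionBy("DataPartition", "PartitionYear")
  .format("csv")
  .option("nullValue", "")
  .option("header", "true")
  .option("codec", "gzip")
  .save("hdfs:///user/zeppelin/FinancialLineItem/output")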
Pelagias answered 18/10, 2017 at 14:17 Comment(13)
I suppose that your motivation to merge the files is to process them outside of Spark. In that case the approach should be to merge them outside of Spark, since merging gives up the distributed layout of your data, which is in essence the reason to process it with Spark.Rooftree
Why do you have to merge all the files? The files being split into parts is ideal for reading with Spark. Also, HDFS is not meant to hold single large files like this, so if you are going to do it, it should be saved to the head node of your cluster. Is that an option instead of HDFS?Docker
@Anupam ok - why do you want to have them merged in a single file?Rooftree
@DanCiborowski-MSFT I have to deliver these files to clients and they want them in this format. Can we at least control the number of files per partition, for example 5 files per partition? Currently it creates more than 200 files even for a partition that has only 1 GB of data.Pelagias
@AlexandreDupriez That's the requirement. Can we at least rename all the files, e.g. with a running number as a suffix? And if we can't merge all the part files, can we control the number of files per partition?Pelagias
Currently it creates more than 200 files per partition probably because you are running a group-by (shuffling) kind of transformation, which defaults to 200 shuffle partitions. You can certainly limit that with rdd/df.repartition(x), where x is the number of files you want to create for that rdd/df.Bedfellow
@Bedfellow Here is the detailed question that my colleague has posted: #46754932Pelagias
@Bedfellow Will dfMainOutputFinalWithoutNull.coalesce(5).write.partitionBy("DataPartition","StatementTypeCode") work?Pelagias
It should help. Make sure you use s3a and the nearest region for the S3 bucket.Bedfellow
I had a discussion with the EMR support team and they confirmed that s3a is deprecated there now; we should use s3 only.Pelagias
Check this answer on removing empty partitions or reducing the number of partitions.Bedfellow
@Bedfellow repartition is not working in my case; I have updated my question, please suggest.Pelagias
Let us continue this discussion in chat.Bedfellow

Try this from the head node outside of Spark...

hdfs dfs -getmerge <src> <localdst>

https://hadoop.apache.org/docs/r1.2.1/file_system_shell.html#getmerge

"Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally addnl can be set to enable adding a newline character at the end of each file."

Docker answered 19/10, 2017 at 18:1 Comment(2)
I have so many folders, approx 5K; how can I rename the files?Pelagias
This is a different question than you started with. For this function you provide the src directory, not the src files.Docker
