Exporting spark dataframe to .csv with header and specific filename
I am trying to export data from a Spark dataframe to a .csv file:

df.coalesce(1)\
  .write\
  .format("com.databricks.spark.csv")\
  .option("header", "true")\
  .save(output_path)

It creates a file named "part-r-00001-512872f2-9b51-46c5-b0ee-31d626063571.csv".

I want the filename to be "part-r-00000.csv" or "part-00000.csv"

As the file is being created on AWS S3, I am limited in how I can use os.system commands.

How can I set the file name while keeping the header in the file?

Thanks!

Wittenburg asked 6/2, 2018 at 21:13 Comment(5)
Does it matter what it's called on HDFS if you can pull it down and name it whatever you like? hadoop fs -cat path/to/output/part-r-* > path/to/local/file.csv will dump all the parts from Hadoop into one file on your local disk. – Spasm
I'm trying to build an automated PySpark ETL job, so I'm avoiding manual and local-filesystem interventions. My ask is simple: when exporting data from a Spark dataframe to a .csv file, the file name should be 'part-00000'. Appreciate your response. – Wittenburg
I've never seen this type of filename before. Maybe the file already exists and this long name was made to avoid overwriting the file? – Spasm
No, I cleaned up the target folder and reran the job; that didn't help. – Wittenburg
AFAIK, the file name is part-#{partition number}-#{random uuid}-#{something}. The UUID allows multiple executors to write to the same directory without worrying about writing to the same file. – Delusive
Well, even though my question got a -3 rating, here is the solution that helped me address the problem. Being a techie, I care more about code and logic than about grammar; at least for me, a small amount of context is enough to understand the problem.

Coming to the solution:

When we create a .csv file from a Spark dataframe, the output file is by default named part-x-yyyyy, where:

1) x is either 'm' or 'r', depending on whether the job was map-only or included a reduce phase
2) yyyyy is the mapper or reducer task number, which can be 00000 or a random-looking identifier

To rename the output file, running an HDFS command via os.system should do the trick:

import os

# set the source (staging) folder path here, with a trailing slash
output_path_stage = "/staging/folder/"
# set the target folder path here, with a trailing slash
output_path = "/target/folder/"

# build the HDFS move command line
cmd2 = "hdfs dfs -mv " + output_path_stage + "part-*" + " " + output_path + "new_name.csv"

# execute the command
os.system(cmd2)

FYI: if we use the rdd.saveAsTextFile option, the file gets created with no header. If we use coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(output_path), the file gets created with a random part-x name. The solution above gives us a .csv file with the header and delimiter we want, along with the required file name.
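If shelling out is undesirable (for example, because the output lives on S3), the same rename can be done through the Hadoop FileSystem API exposed via PySpark's JVM gateway. This is only a sketch under assumptions, not part of the original answer: it relies on PySpark's private _jvm/_jsc internals, assumes an active SparkSession named spark, and reuses the placeholder paths from the snippet above.

# Sketch: rename the coalesced part file via the Hadoop FileSystem API
# (works for hdfs:// paths and, with the right connector, s3:// or s3a:// paths)
hadoop_fs = spark.sparkContext._jvm.org.apache.hadoop.fs
conf = spark.sparkContext._jsc.hadoopConfiguration()

staging_dir = hadoop_fs.Path(output_path_stage)
fs = staging_dir.getFileSystem(conf)

# locate the single part file produced by coalesce(1)
part_file = fs.globStatus(hadoop_fs.Path(output_path_stage + "part-*"))[0].getPath()

# rename returns False (rather than raising) if the destination already exists
fs.rename(part_file, hadoop_fs.Path(output_path + "new_name.csv"))

Note that on S3 a "rename" is implemented as copy-plus-delete under the hood, so it is not atomic, but it avoids any dependency on a local shell.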

Wittenburg answered 7/2, 2018 at 16:30 Comment(4)
Hey dude, sorry people dinged you for grammar. That's bad behavior on their part. – Delusive
Hmm. Shoot. I came across your question and was hoping to be able to set it, rather than use commands to rename on the filesystem. Time to go digging! – Delusive
It's not looking like it's possible with the libraries as-is, but people point to this for more info about adding this functionality: wiki.apache.org/hadoop/… – Delusive
Would you be able to locate the output file and rename it? E.g. use glob to get the file and then rename or create a copy of it with the intended filename? – Reddin
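Picking up that last comment's idea, here is a minimal glob-and-rename sketch. It assumes the output directory is on a local filesystem (glob cannot see into S3; there you would list and copy objects with an S3 client instead), and the paths are hypothetical placeholders:

import glob
import shutil

# assumes the Spark output was written to a local directory
matches = glob.glob("/tmp/spark_output/part-*.csv")
if matches:
    # move the single coalesced part file to the intended filename
    shutil.move(matches[0], "/tmp/spark_output/new_name.csv")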
