Exporting spark dataframe to .csv with header and specific filename
I am trying to export data from a Spark dataframe to a .csv file:

df.coalesce(1)\
  .write\
  .format("com.databricks.spark.csv")\
  .option("header", "true")\
  .save(output_path)

It creates a file named "part-r-00001-512872f2-9b51-46c5-b0ee-31d626063571.csv".

I want the filename to be "part-r-00000.csv" or "part-00000.csv"

As the file is being created on AWS S3, I am limited in how I can use os.system commands.

How can I set the file name while keeping the header in the file?

Thanks!

Wittenburg asked 6/2, 2018 at 21:13 Comment(5)
Does it matter what it's called on HDFS if you can pull it down and name it whatever you like? hadoop fs -cat path/to/output/part-r-* > path/to/local/file.csv will dump all the parts from Hadoop into one file on your local disk. – Spasm
I'm trying to build an automated PySpark ETL job, so I'm avoiding manual and local-filesystem interventions. My ask is simple: when exporting data from a Spark dataframe to a .csv file, the file name should be 'part-00000'. Appreciate your response. – Wittenburg
I've never seen this type of filename before. Maybe the file already exists and this long name was made to avoid overwriting the file? – Spasm
No, I cleaned up the target folder and reran the job; that didn't help. – Wittenburg
AFAIK, the file name is part-#{partition number}-#{random uuid}-#{something}. The UUID allows multiple executors to write to the same directory without worrying about writing to the same file. – Delusive
Well, even though my question got a -3 rating, here is the solution that helped me address the problem. Being a techie, I care more about code and logic than about grammar; at least for me, a small amount of context is enough to understand the problem.

Coming to the solution:

When we create a .csv file from a Spark dataframe, the output file is by default named part-x-yyyyy, where:

1) x is either 'm' or 'r', depending on whether the job was map-only or included a reduce phase
2) yyyyy is the mapper or reducer task number, which can be 00000 or a random-looking identifier

To rename the output file, running an HDFS command via os.system should do the trick:

import os

# set the source (staging) folder path here, with a trailing slash
output_path_stage = "/staging/folder/"
# set the target folder path here, with a trailing slash
output_path = "/target/folder/"

# build the HDFS move command line
cmd2 = "hdfs dfs -mv " + output_path_stage + "part-*" + " " + output_path + "new_name.csv"

# execute the command
os.system(cmd2)

FYI: if we use the rdd.saveAsTextFile option, the file gets created with no header. If we use coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(output_path), the file gets created with a random part-x name. The solution above gives us a .csv file with the header and delimiter we want, along with the required file name.
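If shelling out is undesirable (for example, because the output lives on S3), the same rename can be done through the Hadoop FileSystem API exposed via PySpark's JVM gateway. This is only a sketch under assumptions, not part of the original answer: it relies on PySpark's private _jvm/_jsc internals, assumes an active SparkSession named spark, and reuses the placeholder paths from the snippet above.

# Sketch: rename the coalesced part file via the Hadoop FileSystem API
# (works for hdfs:// paths and, with the right connector, s3:// or s3a:// paths)
hadoop_fs = spark.sparkContext._jvm.org.apache.hadoop.fs
conf = spark.sparkContext._jsc.hadoopConfiguration()

staging_dir = hadoop_fs.Path(output_path_stage)
fs = staging_dir.getFileSystem(conf)

# locate the single part file produced by coalesce(1)
part_file = fs.globStatus(hadoop_fs.Path(output_path_stage + "part-*"))[0].getPath()

# rename returns False (rather than raising) if the destination already exists
fs.rename(part_file, hadoop_fs.Path(output_path + "new_name.csv"))

Note that on S3 a "rename" is implemented as copy-plus-delete under the hood, so it is not atomic, but it avoids any dependency on a local shell.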

Wittenburg answered 7/2, 2018 at 16:30 Comment(4)
Hey dude, sorry people dinged you for grammar. That's bad behavior on their part. – Delusive
Hmm. Shoot. I came across your question and was hoping to be able to set it, rather than use commands to rename on the filesystem. Time to go digging! – Delusive
It's not looking like it's possible with the libraries as-is, but people point to this for more info about adding this functionality: wiki.apache.org/hadoop/… – Delusive
Would you be able to locate the output file and rename it? E.g. use glob to get the file and then rename or create a copy of it with the intended filename? – Reddin
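Picking up that last comment's idea, here is a minimal glob-and-rename sketch. It assumes the output directory is on a local filesystem (glob cannot see into S3; there you would list and copy objects with an S3 client instead), and the paths are hypothetical placeholders:

import glob
import shutil

# assumes the Spark output was written to a local directory
matches = glob.glob("/tmp/spark_output/part-*.csv")
if matches:
    # move the single coalesced part file to the intended filename
    shutil.move(matches[0], "/tmp/spark_output/new_name.csv")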
