Spark: can you include partition columns in output files?


Asked 10/1, 2018 at 14:54 Answered 10/1, 2018 at 15:1

I am using Spark to write out data into partitions. Given a dataset with two columns (foo, bar), if I do df.write.mode("overwrite").format("csv").partitionBy("foo").save("/tmp/output"), I get an output of

/tmp/output/foo=1/X.csv
/tmp/output/foo=2/Y.csv
...

However, the output CSV files only contain the value for bar, not foo. I know the value of foo is already captured in the directory name foo=N, but is it possible to also include the value of foo in the CSV file?
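For reference, the partition value can be recovered from the file path itself, since the directory name carries it as key=value. Here is a minimal sketch in plain Python (no Spark involved); the helper name partition_values is hypothetical and only illustrates the foo=N naming shown above:

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Collect key=value directory components of a path into a dict.

    Hypothetical helper for illustration; values are returned as strings.
    """
    values = {}
    # Only look at the directories, not the file name itself.
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep and key:
            values[key] = value
    return values


print(partition_values("/tmp/output/foo=1/X.csv"))  # prints {'foo': '1'}
```

This only reads the directory names; it does not put foo back inside the CSV files, which is what the question is asking for.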

Amor answered 10/1, 2018 at 14:54 Comment(0)

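Only if you copy the column under a different name and partition by the copy. Spark drops the partition column from the data files, so the original foo stays in the CSV while the copy foo_ ends up in the directory name: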

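from pyspark.sql.functions import col

# Partition by a copy (foo_) so the original foo column stays in the CSV files.
(df
    .withColumn("foo_", col("foo"))
    .write.mode("overwrite")
    .format("csv").partitionBy("foo_").save("/tmp/output"))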

Goldberg answered 10/1, 2018 at 15:1 Comment(1)

Is this still the best solution to date? What if I want the exact same name? – Soapberry 5/9, 2023 at 12:32
