How to write a dataframe into a single .parquet file (both data and metadata in one file) in HDFS?

df.show() --> 2 rows
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+

df.rdd.getNumPartitions() shows the dataframe has a single partition:

>>> df.rdd.getNumPartitions()
1

df.write.save("/user/hduser/data_check/test.parquet", format="parquet")

If I use the above command to write a parquet file to HDFS, it creates a directory "test.parquet" in HDFS, and inside that directory multiple files (a .parquet data file and metadata files) are saved.

Found 4 items
-rw-r--r-- 3 bimodjoul biusers   0 2017-03-15 06:47 /user/hduser/data_check/test.parquet/_SUCCESS
-rw-r--r-- 3 bimodjoul biusers 494 2017-03-15 06:47 /user/hduser/data_check/test.parquet/_common_metadata
-rw-r--r-- 3 bimodjoul biusers 862 2017-03-15 06:47 /user/hduser/data_check/test.parquet/_metadata
-rw-r--r-- 3 bimodjoul biusers 885 2017-03-15 06:47 /user/hduser/data_check/test.parquet/part-r-00000-f83a2ffd-38bb-4c76-9f4c-357e43d9708b.gz.parquet

How can I write the dataframe into a single .parquet file (both data and metadata in one file) in HDFS, rather than a folder containing multiple files?

Help would be much appreciated.

Sophist answered 15/3, 2017 at 7:36 Comment(5)
Use coalesce(1) to get a single file. – Schulman
Why do you need one file? If you need it just to move it along, then use the .gz.parquet file, as it should have everything you need. The other files are generated in the process for various purposes. – Possessive
Hi @Ashish Singh, I have tried the two commands below: df.coalesce(1).write.save("/user/hduser/data_check/test_3.parquet", format="parquet") and df.coalesce(1).write.parquet("/user/hduser/data_check/test_4.parquet"). These commands also write a directory containing a parquet data file and metadata files. – Sophist
Like this: hadoop fs -ls /user/hduser/data_check/test_3.parquet
Found 4 items
-rw-r--r-- 3 bimodjoul biusers   0 2017-03-15 09:02 /user/hduser/data_check/test_3.parquet/_SUCCESS
-rw-r--r-- 3 bimodjoul biusers 494 2017-03-15 09:02 /user/hduser/data_check/test_3.parquet/_common_metadata
-rw-r--r-- 3 bimodjoul biusers 862 2017-03-15 09:02 /user/hduser/data_check/test_3.parquet/_metadata
-rw-r--r-- 3 bimodjoul biusers 885 2017-03-15 09:02 /user/hduser/data_check/test_3.parquet/part-r-00000-6593ef9d-45c1-49a3-9b23-a783a9075c24.gz.parquet – Sophist
@ShivaRam did this answer your question? If yes, please respond with the solution if you have one. – Jimmiejimmy

Use coalesce(1) before writing; it collapses the dataframe to a single partition, so only one part file is produced.

df.coalesce(1).write.parquet("/user/hduser/data_check/test.parquet")
Alcaraz answered 24/5, 2018 at 21:9 Comment(1)
Your call order didn't work for me. I had to do df.coalesce(1).write. – Haslet
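Note that even with coalesce(1), Spark still writes a directory: it just contains one part file plus the sidecar files. A common workaround is to move that lone part file out afterwards and delete the directory. A local sketch of the pattern (the file names are made up; on a real cluster the same moves are done with hdfs dfs -mv and hdfs dfs -rm -r):

```shell
# Simulate the directory layout Spark produces (local files stand in for HDFS).
mkdir -p test.parquet
touch test.parquet/_SUCCESS
printf 'data' > test.parquet/part-r-00000-abc.gz.parquet

# Promote the lone part file to a bare single file, then drop the directory.
mv test.parquet/part-*.parquet single.parquet
rm -r test.parquet
```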

This should solve the problem.

df.coalesce(1).write.parquet(parquet_file_path)

To append to an existing parquet output instead of overwriting it:

df.write.mode('append').parquet("/tmp/output/people.parquet")
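The coalesce(1) variants above still leave the _SUCCESS and summary-metadata sidecars in the output directory. These can usually be suppressed through Hadoop/Parquet settings before writing; a sketch, with the caveat that the property names below should be verified against your Spark and Hadoop versions:

```python
# Assumed property names; set on the SparkContext's Hadoop configuration.
hadoop_conf = sc._jsc.hadoopConfiguration()
# Skip the _metadata / _common_metadata summary files.
hadoop_conf.set("parquet.enable.summary-metadata", "false")
# Skip the _SUCCESS marker file.
hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
```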
Cowes answered 27/7, 2020 at 13:52 Comment(0)
