How to write a dataframe into a single .parquet file (both data and metadata in one file) in HDFS?

df.show() --> 2 rows
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+

df.rdd.getNumPartitions() shows the dataframe has a single partition:

>>> df.rdd.getNumPartitions()
1

df.write.save("/user/hduser/data_check/test.parquet", format="parquet")

If I use the above command to write a parquet file to HDFS, it creates a directory "test.parquet" in HDFS, and inside that directory multiple files (a .parquet data file and metadata files) are saved.

Found 4 items
-rw-r--r-- 3 bimodjoul biusers   0 2017-03-15 06:47 /user/hduser/data_check/test.parquet/_SUCCESS
-rw-r--r-- 3 bimodjoul biusers 494 2017-03-15 06:47 /user/hduser/data_check/test.parquet/_common_metadata
-rw-r--r-- 3 bimodjoul biusers 862 2017-03-15 06:47 /user/hduser/data_check/test.parquet/_metadata
-rw-r--r-- 3 bimodjoul biusers 885 2017-03-15 06:47 /user/hduser/data_check/test.parquet/part-r-00000-f83a2ffd-38bb-4c76-9f4c-357e43d9708b.gz.parquet

How can I write the dataframe into a single .parquet file (both data and metadata in one file) in HDFS, rather than a folder containing multiple files?

Help would be much appreciated.

Sophist answered 15/3, 2017 at 7:36 Comment(5)
Use coalesce(1) to get a single file. – Schulman
Why do you need one file? If you need it just to move it along, then use the .gz.parquet file, as it should have everything you need. The other files are generated in the process for various purposes. – Possessive
Hi @Ashish Singh, I have tried the two commands below: df.coalesce(1).write.save("/user/hduser/data_check/test_3.parquet", format="parquet") and df.coalesce(1).write.parquet("/user/hduser/data_check/test_4.parquet"). These commands also write a directory containing a parquet data file and metadata files. – Sophist
Like this: hadoop fs -ls /user/hduser/data_check/test_3.parquet
Found 4 items
-rw-r--r-- 3 bimodjoul biusers   0 2017-03-15 09:02 /user/hduser/data_check/test_3.parquet/_SUCCESS
-rw-r--r-- 3 bimodjoul biusers 494 2017-03-15 09:02 /user/hduser/data_check/test_3.parquet/_common_metadata
-rw-r--r-- 3 bimodjoul biusers 862 2017-03-15 09:02 /user/hduser/data_check/test_3.parquet/_metadata
-rw-r--r-- 3 bimodjoul biusers 885 2017-03-15 09:02 /user/hduser/data_check/test_3.parquet/part-r-00000-6593ef9d-45c1-49a3-9b23-a783a9075c24.gz.parquet – Sophist
@ShivaRam did this answer your question? If yes, please respond with the solution if you have one. – Jimmiejimmy

Use coalesce(1) before writing; it collapses the dataframe to a single partition, so only one part file is produced.

df.coalesce(1).write.parquet("/user/hduser/data_check/test.parquet")
Alcaraz answered 24/5, 2018 at 21:9 Comment(1)
Your call order didn't work for me. I had to do df.coalesce(1).write. – Haslet
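Note that even with coalesce(1), Spark still writes a directory: it just contains one part file plus the sidecar files. A common workaround is to move that lone part file out afterwards and delete the directory. A local sketch of the pattern (the file names are made up; on a real cluster the same moves are done with hdfs dfs -mv and hdfs dfs -rm -r):

```shell
# Simulate the directory layout Spark produces (local files stand in for HDFS).
mkdir -p test.parquet
touch test.parquet/_SUCCESS
printf 'data' > test.parquet/part-r-00000-abc.gz.parquet

# Promote the lone part file to a bare single file, then drop the directory.
mv test.parquet/part-*.parquet single.parquet
rm -r test.parquet
```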

This should solve the problem.

df.coalesce(1).write.parquet(parquet_file_path)

To append to an existing parquet output instead of overwriting it:

df.write.mode('append').parquet("/tmp/output/people.parquet")
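The coalesce(1) variants above still leave the _SUCCESS and summary-metadata sidecars in the output directory. These can usually be suppressed through Hadoop/Parquet settings before writing; a sketch, with the caveat that the property names below should be verified against your Spark and Hadoop versions:

```python
# Assumed property names; set on the SparkContext's Hadoop configuration.
hadoop_conf = sc._jsc.hadoopConfiguration()
# Skip the _metadata / _common_metadata summary files.
hadoop_conf.set("parquet.enable.summary-metadata", "false")
# Skip the _SUCCESS marker file.
hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
```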
Cowes answered 27/7, 2020 at 13:52 Comment(0)
