How do I write the data in a DataFrame into a single .parquet file (both data and metadata in one file) in HDFS?
df.show() --> 2 rows
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+
The DataFrame has a single partition:
>>> df.rdd.getNumPartitions()
1
df.write.save("/user/hduser/data_check/test.parquet", format="parquet")
When I use the above command to write the DataFrame to HDFS, it creates a directory "test.parquet" rather than a single file, and inside that directory multiple files are saved: a .parquet part file holding the data plus separate metadata files.
Found 4 items
-rw-r--r-- 3 bimodjoul biusers 0 2017-03-15 06:47
/user/hduser/data_check/test.parquet/_SUCCESS
-rw-r--r-- 3 bimodjoul biusers 494 2017-03-15 06:47
/user/hduser/data_check/test.parquet/_common_metadata
-rw-r--r-- 3 bimodjoul biusers 862 2017-03-15 06:47
/user/hduser/data_check/test.parquet/_metadata
-rw-r--r-- 3 bimodjoul biusers 885 2017-03-15 06:47
/user/hduser/data_check/test.parquet/part-r-00000-f83a2ffd-38bb-4c76-9f4c-357e43d9708b.gz.parquet
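For reference, a minimal sketch of the write calls involved (assuming Spark 2.x with a SparkSession; the rows match the df.show() output above). Both calls produce a directory rather than a bare file, even though the DataFrame has only one partition; coalescing only controls how many part files end up inside that directory.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("single-parquet-check").getOrCreate()

# Recreate the two-row DataFrame shown above.
df = spark.createDataFrame(
    [("Alyssa", None, [3, 9, 15, 20]), ("Ben", "red", [])],
    ["name", "favorite_color", "favorite_numbers"],
)

# Shorthand equivalent of df.write.save(path, format="parquet"); this still
# writes a directory containing a part file plus the _SUCCESS marker (and,
# depending on the Spark version, summary metadata files).
df.write.mode("overwrite").parquet("/user/hduser/data_check/test.parquet")

# Coalescing to one partition guarantees a single part-*.parquet file inside
# the directory, but the directory layout itself does not change.
df.coalesce(1).write.mode("overwrite").parquet("/user/hduser/data_check/test.parquet")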
How can I write the data in the DataFrame into a single .parquet file (both data and metadata in one file) in HDFS, rather than a folder containing multiple files?
Help would be much appreciated.