What is the difference between bucketBy and partitionBy in Spark?

Asked 19/5, 2021 at 8:21 Answered 19/5, 2021 at 13:31

apache-spark hadoop pyspark hdfs partitioning

For example, I want to save a table, what is the difference between the two strategies?

bucketBy:

import org.apache.spark.sql.SaveMode

someDF.write.format("parquet")
      .bucketBy(4, "country")
      .mode(SaveMode.Overwrite)
      .saveAsTable("someTable")

partitionBy:

someDF.write.format("parquet")
      .partitionBy("country") // <-- this is the only difference
      .mode(SaveMode.Overwrite)
      .saveAsTable("someTable")

I guess that bucketBy in the first case creates 4 directories with countries, while partitionBy creates as many directories as there are unique values in the column "country". Is this understanding correct?

Costar answered 19/5, 2021 at 8:21 Comment(1)

This is already answered. I hope this link helps. https://mcmap.net/q/150839/-what-is-the-difference-between-partitioning-and-bucketing-a-table-in-hive – Holcomb 21/5, 2021 at 13:3

Some differences:

bucketBy is only applicable to file-based data sources in combination with DataFrameWriter.saveAsTable(), i.e. when saving to a Spark managed table, whereas partitionBy can be used when writing to any file-based data source.
bucketBy is intended for the write-once, read-many-times scenario, where the up-front cost of creating a persistent bucketised version of a data source pays off by avoiding a costly shuffle on read in later jobs, whereas partitionBy is useful for meeting the data layout requirements of downstream consumers of a Spark job's output.
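The first point can be checked with a small local experiment. This is only a sketch: the sample rows, the object name BucketVsPartitionWrite, the helper run and the temporary directory are made up for illustration, and it assumes a local SparkSession with Spark on the classpath. In the Spark versions I am aware of, the path-based write with bucketBy is rejected with an AnalysisException, while the other two writes succeed.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import scala.util.Try

object BucketVsPartitionWrite {
  // Hypothetical helper: tries three writes and reports which ones succeeded, as
  // (partitionBy to a path, bucketBy to a path, bucketBy via saveAsTable).
  def run(spark: SparkSession, baseDir: String): (Boolean, Boolean, Boolean) = {
    import spark.implicits._
    // Made-up sample data standing in for someDF from the question.
    val someDF = Seq(("US", 1), ("DE", 2), ("FR", 3)).toDF("country", "value")

    // partitionBy with a plain path-based write.
    val partitionedToPath = Try(
      someDF.write.partitionBy("country").mode(SaveMode.Overwrite).parquet(s"$baseDir/partitioned"))

    // bucketBy with a path-based write; expected to fail, since bucketing needs saveAsTable.
    val bucketedToPath = Try(
      someDF.write.bucketBy(4, "country").mode(SaveMode.Overwrite).parquet(s"$baseDir/bucketed"))

    // bucketBy with saveAsTable, as in the question.
    val bucketedToTable = Try(
      someDF.write.format("parquet").bucketBy(4, "country").mode(SaveMode.Overwrite).saveAsTable("someTable"))

    (partitionedToPath.isSuccess, bucketedToPath.isSuccess, bucketedToTable.isSuccess)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("bucket-vs-partition").getOrCreate()
    val baseDir = java.nio.file.Files.createTempDirectory("spark-write-demo").toString
    println(run(spark, baseDir))
    spark.stop()
  }
}
```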

I guess that bucketBy in the first case creates 4 directories with countries, while partitionBy creates as many directories as there are unique values in the column "country". Is this understanding correct?

Yes, for partitionBy. bucketBy, however, does not create directories: it distributes the rows across 4 buckets, which are stored as bucket files (Parquet by default).

Marble answered 19/5, 2021 at 13:31 Comment(0)

Unlike bucketing in Apache Hive, Spark SQL creates bucket files based on both the number of buckets and the number of partitions of the DataFrame being written. In other words, the number of bucket files can be as high as the number of buckets multiplied by the number of task writers (one per partition).
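The file count described above can be illustrated with a Spark-free model. This is a simplified sketch, not Spark's implementation: BucketFileModel, bucketOf and bucketFiles are hypothetical names, and String.hashCode stands in for the Murmur3-based hash Spark uses to assign bucket ids. The model assumes each write task emits one file per bucket for which it holds at least one row.

```scala
object BucketFileModel {
  // Hypothetical stand-in for Spark's bucket assignment: maps a value to a
  // bucket id in [0, numBuckets). Spark uses a Murmur3-based hash instead.
  def bucketOf(value: String, numBuckets: Int): Int =
    Math.floorMod(value.hashCode, numBuckets)

  // `tasks` holds, per write task, the bucketing-column values of its rows.
  // Returns the distinct (taskId, bucketId) pairs, i.e. one entry per file.
  def bucketFiles(tasks: Seq[Seq[String]], numBuckets: Int): Set[(Int, Int)] = {
    val files = for {
      (rows, taskId) <- tasks.zipWithIndex
      value <- rows
    } yield (taskId, bucketOf(value, numBuckets))
    files.toSet
  }

  def main(args: Array[String]): Unit = {
    // Three write tasks (DataFrame partitions) with made-up country values.
    val tasks = Seq(Seq("US", "DE", "FR"), Seq("US", "IN"), Seq("BR"))
    val files = bucketFiles(tasks, numBuckets = 4)
    println(s"files written: ${files.size}, upper bound: ${4 * tasks.size}")
  }
}
```

The upper bound of buckets times tasks is only reached when every task holds rows for every bucket; with fewer distinct values per task, fewer files are written.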

You can also use bucketBy together with partitionBy, in which case each partition (the last-level partition in the case of multilevel partitioning) will have 'n' buckets.
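A minimal sketch of combining the two, assuming a local SparkSession. The userId column, the sample rows, the object name and the table name are hypothetical. A second column is used for bucketing because, as far as I know, Spark rejects a bucketing column that is also a partition column.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitionAndBucket {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("partition-and-bucket").getOrCreate()
    import spark.implicits._

    // Made-up sample data; userId is a hypothetical column used only for bucketing.
    val someDF = Seq(("US", 1L), ("US", 2L), ("DE", 3L)).toDF("country", "userId")

    someDF.write.format("parquet")
      .partitionBy("country")  // one directory per distinct country value
      .bucketBy(4, "userId")   // rows inside each country directory are split into 4 buckets
      .mode(SaveMode.Overwrite)
      .saveAsTable("someTablePartitionedAndBucketed")

    spark.stop()
  }
}
```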

Photochronograph answered 19/5, 2021 at 9:41 Comment(0)
