Here are some additional details and differences at the code level:
Only the function definitions are shown here; for the full implementation, see Spark's GitHub repository.
Below are the different repartition methods available on a DataFrame:
(The full implementation is in Spark's Dataset class.)
def repartition(numPartitions: Int): Dataset[T]
Calling the above method on a DataFrame returns a new Dataset that has exactly numPartitions partitions.
def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
The above method returns a new Dataset partitioned by the given partitioning expressions into numPartitions partitions. The resulting Dataset is hash partitioned.
def repartition(partitionExprs: Column*): Dataset[T]
The above method returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as the number of partitions. The resulting Dataset is hash partitioned.
def repartitionByRange(numPartitions: Int, partitionExprs: Column*): Dataset[T]
The above method returns a new Dataset partitioned by the given partitioning expressions into numPartitions partitions. The resulting Dataset is range partitioned.
def repartitionByRange(partitionExprs: Column*): Dataset[T]
The above method returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as the number of partitions. The resulting Dataset is range partitioned.
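To make the difference between these overloads concrete, here is a minimal sketch, assuming a local SparkSession and Spark on the classpath. The object name, the sample data, and the "bucket" column are hypothetical and only there for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DataFrameRepartitionSketch {
  // Returns the partition count produced by each repartition variant.
  def counts(): Map[String, Int] = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("repartition-sketch")
      .master("local[4]")
      .getOrCreate()
    try {
      // Hypothetical sample data: ids 0..999 plus a derived "bucket" column.
      val df = spark.range(0, 1000).withColumn("bucket", col("id") % 10)

      // Only a partition count: rows are spread evenly across 8 partitions.
      val byCount = df.repartition(8)

      // Count plus expression: hash partitioned, so rows with the same
      // "bucket" value end up in the same partition.
      val byHash = df.repartition(8, col("bucket"))

      // Range partitioned on "id": each partition holds a contiguous range of ids.
      val byRange = df.repartitionByRange(8, col("id"))

      Map(
        "byCount" -> byCount.rdd.getNumPartitions,
        "byHash"  -> byHash.rdd.getNumPartitions,
        "byRange" -> byRange.rdd.getNumPartitions
      )
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    counts().foreach { case (name, n) => println(s"$name -> $n partitions") }
}
```

The first two variants should report 8 partitions. Range partitioning picks its boundaries by sampling the data, so treat the count it reports as data dependent rather than guaranteed.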
For coalesce, however, a DataFrame has only the following method:
def coalesce(numPartitions: Int): Dataset[T]
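The above method returns a new Dataset that has exactly numPartitions partitions when fewer partitions are requested. If a larger number is requested, the Dataset stays at its current number of partitions.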
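As a quick check of how DataFrame coalesce behaves, here is a small sketch under the same assumptions (local SparkSession, hypothetical object name and data). It starts from 8 partitions and asks coalesce for both fewer and more.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameCoalesceSketch {
  // Returns (partitions after coalesce(2), partitions after coalesce(16)).
  def counts(): (Int, Int) = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("coalesce-sketch")
      .master("local[4]")
      .getOrCreate()
    try {
      // Start from a DataFrame with a known partition count of 8.
      val df = spark.range(0, 1000).repartition(8)

      // Asking for fewer partitions merges existing ones without a full shuffle.
      val fewer = df.coalesce(2).rdd.getNumPartitions

      // Asking for more partitions than we have leaves the count unchanged.
      val more = df.coalesce(16).rdd.getNumPartitions

      (fewer, more)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (fewer, more) = counts()
    println(s"coalesce(2) -> $fewer, coalesce(16) -> $more")
  }
}
```

Here coalesce(2) should yield 2 partitions, while coalesce(16) should stay at 8, since the DataFrame version has no shuffle flag to increase the count.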
Below are the methods available for repartition and coalesce on an RDD:
(The full implementation is in Spark's RDD class.)
def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null)
    : RDD[T]
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
Basically, the repartition method calls coalesce with shuffle set to true.
So if we call coalesce on an RDD with shuffle set to true, we can increase the number of partitions too!
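The effect of the shuffle flag can be seen with a short sketch, again assuming a local SparkSession; the object name and the sample RDD are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object RddCoalesceSketch {
  // Returns the partition count after each operation on a 4-partition RDD.
  def counts(): Map[String, Int] = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("rdd-coalesce-sketch")
      .master("local[4]")
      .getOrCreate()
    try {
      // Hypothetical sample RDD with 4 partitions.
      val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)

      Map(
        // Default shuffle = false: partitions can only be merged.
        "coalesce(2)" -> rdd.coalesce(2).getNumPartitions,
        // Without a shuffle, asking for more partitions has no effect.
        "coalesce(8)" -> rdd.coalesce(8).getNumPartitions,
        // With shuffle = true the count can grow.
        "coalesce(8, shuffle = true)" -> rdd.coalesce(8, shuffle = true).getNumPartitions,
        // repartition delegates to coalesce(numPartitions, shuffle = true).
        "repartition(8)" -> rdd.repartition(8).getNumPartitions
      )
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    counts().foreach { case (op, n) => println(s"$op -> $n partitions") }
}
```

With these inputs, coalesce(2) should give 2 partitions, coalesce(8) should stay at 4, and both coalesce(8, shuffle = true) and repartition(8) should give 8.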