Difference between sc.broadcast and the broadcast function in Spark SQL
I have used sc.broadcast for lookup files to improve performance.

I also came to know that there is a function called broadcast in the Spark SQL functions.

What is the difference between the two?

Which one should I use for broadcasting the reference/lookup tables?

Islet answered 29/10, 2016 at 15:5 Comment(0)

If you want to achieve a broadcast join in Spark SQL you should use the broadcast function (combined with the desired spark.sql.autoBroadcastJoinThreshold configuration). It will:

  • Mark the given relation for broadcasting.
  • Adjust the SQL execution plan.
  • When the output relation is evaluated, take care of collecting the data, broadcasting it, and applying the correct join mechanism.

SparkContext.broadcast is used to handle local objects and is not applicable for use with Spark DataFrames.
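For example, a minimal sketch of the broadcast function in use (the DataFrame names and paths are made up for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("broadcast-join").getOrCreate()

    val ordersDF    = spark.read.parquet("/data/orders")     // large fact table
    val countriesDF = spark.read.parquet("/data/countries")  // small lookup table

    // Mark the small relation for broadcasting; the planner can then pick a
    // broadcast hash join instead of shuffling both sides.
    val joined = ordersDF.join(broadcast(countriesDF), Seq("country_code"))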

Giron answered 29/10, 2016 at 16:50 Comment(0)

Short answer:

1) The org.apache.spark.sql.functions.broadcast() function is a user-supplied, explicit hint for a given SQL join.

2) sc.broadcast is for broadcasting a read-only shared variable (see the sketch below).
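For #2, a minimal sketch of a read-only broadcast variable (the lookup map and sample codes are made up for illustration):

    // sc is an existing SparkContext
    val lookup = Map("IN" -> "India", "US" -> "United States")  // small local object
    val bcLookup = sc.broadcast(lookup)  // shipped once to each executor

    val codes = sc.parallelize(Seq("IN", "US", "IN"))
    val names = codes.map(code => bcLookup.value.getOrElse(code, "unknown"))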


More details about the broadcast function (#1):

Here is the Scaladoc from sql/execution/SparkStrategies.scala, which says:

  • Broadcast: if one side of the join has an estimated physical size that is smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold, or if that side has an explicit broadcast hint (e.g. the user applied the [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side of the join will be broadcasted and the other side will be streamed, with no shuffling performed. If both sides of the join are eligible to be broadcasted then the …
  • Shuffle hash join: if the average size of a single partition is small enough to build a hash table.
  • Sort merge: if the matching join keys are sortable.
  • If there is no joining keys, Join implementations are chosen with the following precedence:
    • BroadcastNestedLoopJoin: if one side of the join could be broadcasted
    • CartesianProduct: for Inner join
    • BroadcastNestedLoopJoin
  • The canBroadcast method quoted below controls this behavior based on the size we set for spark.sql.autoBroadcastJoinThreshold; by default it is 10 MB.

Note: smallDataFrame.join(largeDataFrame) does not do a broadcast hash join, but largeDataFrame.join(smallDataFrame) does.
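Following up on that note, a sketch of forcing the broadcast with an explicit hint so the join order no longer matters (the DataFrames are hypothetical; explain() lets you verify which strategy was chosen):

    import org.apache.spark.sql.functions.broadcast

    // May or may not broadcast, depending on size estimates and join order:
    smallDataFrame.join(largeDataFrame, "key").explain()

    // With the explicit hint the marked side is broadcast regardless of order;
    // the physical plan should show BroadcastHashJoin:
    largeDataFrame.join(broadcast(smallDataFrame), "key").explain()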

    /** Matches a plan whose output should be small enough to be used in broadcast join. */
    private def canBroadcast(plan: LogicalPlan): Boolean = {
      plan.statistics.isBroadcastable ||
        plan.statistics.sizeInBytes <= conf.autoBroadcastJoinThreshold
    }
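
A minimal sketch of tuning that threshold at runtime (the values are illustrative; spark is a SparkSession):

    // Default is 10 MB (10485760 bytes); raise it to 50 MB:
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

    // Or disable automatic broadcast joins entirely:
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)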

Note: some of these configurations will be deprecated in coming versions of Spark.

Katlaps answered 1/11, 2016 at 15:48 Comment(1)
@BdEngineer: a broadcast variable is for reading data from local variables, not a distributed structure like a Dataset/DataFrame. That is what Jacek is saying, and it is 100% correct. – Katlaps