
Pipelines using the SCD processor fail with “Cannot broadcast the table that is larger than 8GB”

  • January 27, 2022
  • 0 replies
  • 411 views

AkshayJadhav
StreamSets Employee

What is Broadcast Join?

Broadcast join is an important part of Spark SQL’s execution engine. When used, it performs a join on two relations by first broadcasting the smaller one to all Spark executors, then evaluating the join criteria against each executor’s partitions of the other relation. When the broadcast relation is small enough, broadcast joins are fast, as they require minimal data shuffling. Above a certain threshold, however, broadcast joins tend to be less reliable and less performant than shuffle-based join algorithms, due to bottlenecks in network and memory usage.
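The build-and-probe pattern described above can be sketched in plain Python (illustration only, not Spark code; the names `small_table` and `partitions` are hypothetical): every executor gets a full copy of the small relation, builds a hash map from it, and probes that map with its local partition of the large relation, so the large relation never moves across the network.

```python
from collections import defaultdict

def broadcast_hash_join(small_table, partitions, key_small, key_large):
    """Toy broadcast hash join: small_table is a list of dict rows,
    partitions is a list of partitions (lists of dict rows) of the
    large relation."""
    # Build phase: index the broadcast (small) relation by its join key.
    index = defaultdict(list)
    for row in small_table:
        index[row[key_small]].append(row)

    # Probe phase: each partition of the large relation is joined locally
    # against the broadcast copy, with no shuffle of the large relation.
    joined = []
    for partition in partitions:
        for row in partition:
            for match in index.get(row[key_large], []):
                joined.append({**match, **row})
    return joined

small = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
parts = [[{"id": 1, "v": 10}], [{"id": 2, "v": 20}, {"id": 3, "v": 30}]]
result = broadcast_hash_join(small, parts, "id", "id")
# Rows with id 1 and 2 match; id 3 has no partner in the small relation.
```

This also shows why the technique degrades for large relations: the build phase materializes the entire small table in every executor’s memory.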

Spark estimates the size of each table before deciding whether to use a broadcast join. Sometimes the estimate is wrong, and Spark attempts a broadcast join with a table larger than the hard-coded maximum size of 8 GB.

To work around this, you can apply the following Spark configuration, which tells Spark to disable automatic broadcast joins:

spark.sql.autoBroadcastJoinThreshold = -1
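One way to apply this setting is through PySpark, either when building the session or on an already-running one (a sketch, assuming a PySpark deployment; the app name is hypothetical):

```python
from pyspark.sql import SparkSession

# Set the threshold to -1 at session creation to disable broadcast joins.
spark = (
    SparkSession.builder
    .appName("scd-pipeline")  # hypothetical app name
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .getOrCreate()
)

# spark.sql.autoBroadcastJoinThreshold is a runtime SQL conf, so it can
# also be changed on an existing session:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```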


However, if your “master” and “changes” table schemas are identical, Spark might choose a broadcast join anyway. In that scenario, make a schema change to the master table (e.g., add one of the SCD tracking columns), and Spark should no longer choose a broadcast join.
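The schema-change workaround might look like the following (a sketch, assuming PySpark; the table names and the column name `scd_active_flag` are hypothetical, and it requires a running Spark session, so it is not runnable standalone):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

master_df = spark.table("master")    # hypothetical table name
changes_df = spark.table("changes")  # hypothetical table name

# Add one SCD tracking column to the master table so its schema is no
# longer identical to the changes table, steering the planner away from
# a broadcast join between the two.
master_df = master_df.withColumn("scd_active_flag", F.lit(True))
```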

