What is Broadcast Join?
Broadcast join is an important part of Spark SQL’s execution engine. When used, it performs a join on two relations by first broadcasting the smaller one to all Spark executors, then evaluating the join criteria with each executor’s partitions of the other relation. When the broadcasted relation is small enough, broadcast joins are fast, as they require minimal data shuffling. Above a certain threshold, however, broadcast joins tend to be less reliable or performant than shuffle-based join algorithms, due to bottlenecks in-network and memory usage
Spark attempts to estimate the size of tables before deciding whether or not to use a broadcast join. Sometimes Spark estimates incorrectly and tries to use a broadcast join with a table larger than the hard-coded max size of 8GB.
To work around this, you can try applying the Spark config. This will tell Spark to disable broadcast joins.
spark.sql.autoBroadcastJoinThreshold = -1
However, If your “master” and “changes” table schemas are identical, it’s possible Spark might choose a broadcast join anyway. In that scenario make a schema change to the master table (e.g. add one of the SCD tracking columns) and Spark should no longer choose a broadcast to join.