Spark Shuffle Deep Dive: How Shuffle Works in Spark
This document gives a detailed overview of Spark shuffle: its major components (the shuffle writers and the shuffle reader), the serialization strategies (JavaSerializer and KryoSerializer), the three shuffle writer algorithms and when each is used, and how to optimize shuffle operations for performance. Key suggestions: minimize shuffles, use caching judiciously, and prefer KryoSerializer for better efficiency.
Implementations
• SortShuffleManager (extends ShuffleManager)
• Three writers (optimized for different scenarios)
• SortShuffleWriter: uses ExternalSorter
• BypassMergeSortShuffleWriter: no sorter
• UnsafeShuffleWriter: uses ShuffleExternalSorter
• One reader
• BlockStoreShuffleReader, which uses
• ExternalAppendOnlyMap
• ExternalSorter (if ordering is required)
6. Writer Output Example (Shuffle Files)
[Diagram: Mapper 1 and Mapper 2 each write one data file containing Partitions 1-3 and one index file with Offsets 1-3; Reducers 1-3 each locate and read their partition from every mapper's data file. Number of partitions == number of reducers.]
7. Three Shuffle Writers
• Different writer algorithms
• SortShuffleWriter
• BypassMergeSortShuffleWriter
• UnsafeShuffleWriter
• Used in different situations (optimizations)
• Things to consider
• Reduce the total number of files
• Reduce serialization/deserialization when possible
8. When Are the Different Writers Used?
• Small number of partitions?
---> BypassMergeSortShuffleWriter
• Able to sort records in serialized form?
---> UnsafeShuffleWriter
• Otherwise
---> SortShuffleWriter
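The selection logic above can be sketched in a few lines of Scala. This is a simplified paraphrase of what SortShuffleManager.registerShuffle decides; ShuffleDep and chooseWriter are illustrative names, not Spark API.

// Simplified sketch of the writer-selection rules (illustrative, not Spark API).
case class ShuffleDep(
    mapSideCombine: Boolean,
    hasAggregator: Boolean,
    numPartitions: Int,
    serializerSupportsRelocation: Boolean)

def chooseWriter(dep: ShuffleDep, bypassMergeThreshold: Int = 200): String =
  if (!dep.mapSideCombine && dep.numPartitions <= bypassMergeThreshold)
    "BypassMergeSortShuffleWriter"   // few partitions, no map-side combine
  else if (dep.serializerSupportsRelocation && !dep.hasAggregator &&
           dep.numPartitions <= (1 << 24)) // serialized-sort partition limit
    "UnsafeShuffleWriter"
  else
    "SortShuffleWriter"              // the general fallback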
9. BypassMergeSortShuffleWriter
One temp file for each partition, then merge them
[Diagram: the mapper's BypassMergeSortShuffleWriter writes one temp file per partition (Temp File: Partition 0 ... Temp File: Partition X), then merges them into a single data file plus an index file.]
10. BypassMergeSortShuffleWriter (cont’d)
Used when
• No map-side combine
• Number of partitions ≤ spark.shuffle.sort.bypassMergeThreshold
Pros
• Simple
Cons
• 1-to-1 mapping between temp files and partitions
• Many temp files when there are many partitions
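As a concrete illustration of these conditions (assuming a local SparkContext), groupByKey performs no map-side combine, so with few partitions it can take the bypass path, while reduceByKey combines on the map side and cannot:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("bypass-demo").setMaster("local[2]"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// No map-side combine and 100 <= bypassMergeThreshold (default 200):
// eligible for BypassMergeSortShuffleWriter.
val grouped = pairs.groupByKey(100)

// reduceByKey enables map-side combine, so the bypass writer cannot be used.
val reduced = pairs.reduceByKey(_ + _, 100)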
11. SortShuffleWriter
• Why sort?
• Sort records by partition id, to separate records of different partitions
• Reduce the number of files: number of spill files < number of partitions
• Buffer (in memory):
• PartitionedAppendOnlyMap (when there is map-side combine)
• PartitionedPairBuffer (when there is no map-side combine)
[Diagram: the mapper's SortShuffleWriter feeds records through ExternalSorter's in-memory buffer; sorted spill files are merged into a single data file plus an index file.]
12. SortShuffleWriter (cont’d)
Used when
• There is map-side combine, or there are many partitions
• Or the serializer does not support record relocation
Pros
• Flexible; supports all shuffle situations
Cons
• Serializes/deserializes records multiple times
Internal configs that control spill behavior (inside Spillable.scala):
spark.shuffle.spill.initialMemoryThreshold
spark.shuffle.spill.numElementsForceSpillThreshold
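These internal configs can be set like any other Spark property. A hedged example: the 5 MB value matches the usual default for the initial memory threshold; the element count is purely illustrative, not tuning advice.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // In-memory size the buffer may reach before spilling is considered.
  .set("spark.shuffle.spill.initialMemoryThreshold", (5L * 1024 * 1024).toString)
  // Force a spill after this many records, regardless of memory.
  .set("spark.shuffle.spill.numElementsForceSpillThreshold", "10000000")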
13. UnsafeShuffleWriter
• Records are serialized once, then stored in memory pages
• 8-byte record pointer (pointing to: memory page + offset)
• All record pointers are stored in a long array
• Sort the record pointers (the long array), not the records
• Small memory footprint
• Fits the CPU cache better
• Sorter class: ShuffleExternalSorter
[Diagram: serialized records live in memory pages (Page 1, Page 2, ...); the 8-byte record pointers are stored and sorted as a long array.]
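A minimal sketch of the pointer packing, modeled on the layout of Spark's PackedRecordPointer ([24-bit partition id][13-bit page number][27-bit offset in page]); the helper names are illustrative:

// Pack a record's partition and location into one Long.
def pack(partitionId: Int, pageNumber: Int, offsetInPage: Long): Long =
  (partitionId.toLong << 40) |
  (pageNumber.toLong << 27) |
  (offsetInPage & ((1L << 27) - 1))

def partitionOf(packed: Long): Int = (packed >>> 40).toInt

// Because the partition id sits in the most significant bits, sorting the
// raw long array groups pointers by partition without touching the records.
val pointers = Array(pack(2, 0, 64), pack(0, 1, 128), pack(1, 0, 0))
java.util.Arrays.sort(pointers)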
14. UnsafeShuffleWriter (cont’d)
Used when
• The serializer supports record relocation
• There is no aggregator
Pros
• Single serialization; no deserialization/re-serialization when merging spill files
• Sorting is CPU-cache friendly
Cons
• Not supported with the default serializer (JavaSerializer); supported with KryoSerializer
15. Serializer: JavaSerializer
• Default serializer in Spark
• spark.serializer=org.apache.spark.serializer.JavaSerializer
• Uses object references in the serialized stream
• Writes a reference instead of the whole object for repeated (identical) objects
• Does not support record relocation
• Records cannot be moved within the serialized stream because of object references
• Pros: supports serialization in all situations
• Cons: performance is not good
16. Serializer: KryoSerializer
• Uses the Kryo library
• Does not use object references in the serialized stream by default
• Supports record relocation
• Because there are no object references, each serialized object is independent
• Classes must be registered explicitly for serialization; otherwise Kryo writes the fully qualified class name for each serialized object
• Pros: performance is good for common classes and registered classes (see KryoSerializer.scala)
• Cons: performance is bad for unregistered custom classes; register them explicitly
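A typical setup looks like the following; the Click class is a made-up example standing in for your own types.

import org.apache.spark.SparkConf

case class Click(userId: Long, url: String) // hypothetical custom class

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registered classes get a compact id instead of their fully qualified name.
  .registerKryoClasses(Array(classOf[Click]))
  // Optional: fail fast when an unregistered class is serialized.
  .set("spark.kryo.registrationRequired", "true")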
17. Shuffle Reader: BlockStoreShuffleReader
[Diagram: BlockStoreShuffleReader fetches its partition from every mapper's data file, using the index file offsets to locate it. Fetched blocks are merged by the Aggregator via ExternalAppendOnlyMap (using HashComparator, spilling to spill files as needed) and exposed as an iterator; if ordering by key is required, the records additionally pass through ExternalSorter.]
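To make the reader path concrete (assuming a local SparkContext): reduceByKey defines an aggregator, so the fetched blocks are merged through ExternalAppendOnlyMap, while sortByKey sets a key ordering and therefore also exercises ExternalSorter.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("reader-demo").setMaster("local[2]"))
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3)))

// Aggregator on the reduce side: blocks merged via ExternalAppendOnlyMap.
val counts = pairs.reduceByKey(_ + _)

// Key ordering on the reduce side: records also run through ExternalSorter.
val sorted = counts.sortByKey()
sorted.collect().foreach(println)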
18. External Shuffle Service
• YarnShuffleService / MesosExternalShuffleService
• YarnShuffleService: runs inside the YARN Node Manager as an AuxiliaryService
• Runs on each machine in the YARN/Mesos cluster
• Reads shuffle files from local disk and streams them to reducers
• Uses a file name convention to locate shuffle files (ExternalShuffleBlockResolver)
• "shuffle_" + shuffleId + "_" + mapId + "_0.index"
• "shuffle_" + shuffleId + "_" + mapId + "_0.data"
19. Suggestions / Takeaway
• Shuffle is expensive; avoid unnecessary shuffles
• Shuffle vs cache (Dataset.persist(…))
• Shuffle files provide the full data set for the next stage's execution
• Caching may not be necessary when there is a shuffle (unless you want cache replicas)
• Use KryoSerializer if possible
• Tune the relevant configs
• spark.shuffle.sort.bypassMergeThreshold
• spark.shuffle.spill.initialMemoryThreshold
• spark.shuffle.spill.numElementsForceSpillThreshold
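Putting the suggestions together in one place; the values shown are illustrative starting points (200 and 5 MB match the usual defaults), not recommendations.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.shuffle.sort.bypassMergeThreshold", "200")
  .set("spark.shuffle.spill.initialMemoryThreshold", (5L * 1024 * 1024).toString)
  .set("spark.shuffle.spill.numElementsForceSpillThreshold", "10000000")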