Spark Shuffle Deep Dive: How Shuffle Works in Spark
This document gives a detailed overview of Spark shuffle: its major components (the shuffle writers and the shuffle reader), the serialization strategies (JavaSerializer and KryoSerializer), the three shuffle writer algorithms and when each is used, and how to optimize shuffle operations for performance. Key suggestions: minimize shuffles, use caching judiciously, and prefer KryoSerializer for better efficiency.
Implementations
• SortShuffleManager (extends ShuffleManager)
• Three writers (optimized for different scenarios)
• SortShuffleWriter: uses ExternalSorter
• BypassMergeSortShuffleWriter: no sorter
• UnsafeShuffleWriter: uses ShuffleExternalSorter
• One reader
• BlockStoreShuffleReader, which uses
• ExternalAppendOnlyMap
• ExternalSorter (if ordering is required)
6. Writer Output Example (Shuffle Files)
[Diagram: Mapper 1 and Mapper 2 each write one data file containing Partitions 1-3 and one index file with Offsets 1-3; Reducers 1-3 each locate and read their partition from every mapper's data file. Number of partitions == number of reducers.]
7. Three Shuffle Writers
• Different writer algorithms
• SortShuffleWriter
• BypassMergeSortShuffleWriter
• UnsafeShuffleWriter
• Used in different situations (optimizations)
• Things to consider
• Reduce the total number of files
• Reduce serialization/deserialization when possible
8. When Are the Different Writers Used?
• Small number of partitions?
---> BypassMergeSortShuffleWriter
• Able to sort records in serialized form?
---> UnsafeShuffleWriter
• Otherwise
---> SortShuffleWriter
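The selection logic above can be sketched in a few lines of Scala. This is a simplified paraphrase of what SortShuffleManager.registerShuffle decides; ShuffleDep and chooseWriter are illustrative names, not Spark API.

// Simplified sketch of the writer-selection rules (illustrative, not Spark API).
case class ShuffleDep(
    mapSideCombine: Boolean,
    hasAggregator: Boolean,
    numPartitions: Int,
    serializerSupportsRelocation: Boolean)

def chooseWriter(dep: ShuffleDep, bypassMergeThreshold: Int = 200): String =
  if (!dep.mapSideCombine && dep.numPartitions <= bypassMergeThreshold)
    "BypassMergeSortShuffleWriter"   // few partitions, no map-side combine
  else if (dep.serializerSupportsRelocation && !dep.hasAggregator &&
           dep.numPartitions <= (1 << 24)) // serialized-sort partition limit
    "UnsafeShuffleWriter"
  else
    "SortShuffleWriter"              // the general fallback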
9. BypassMergeSortShuffleWriter
One temp file for each partition, then merge them
[Diagram: the mapper's BypassMergeSortShuffleWriter writes one temp file per partition (Temp File: Partition 0 ... Temp File: Partition X), then merges them into a single data file plus an index file.]
10. BypassMergeSortShuffleWriter (cont’d)
Used when
• No map-side combine
• Number of partitions ≤ spark.shuffle.sort.bypassMergeThreshold
Pros
• Simple
Cons
• 1-to-1 mapping between temp files and partitions
• Many temp files when there are many partitions
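As a concrete illustration of these conditions (assuming a local SparkContext), groupByKey performs no map-side combine, so with few partitions it can take the bypass path, while reduceByKey combines on the map side and cannot:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("bypass-demo").setMaster("local[2]"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// No map-side combine and 100 <= bypassMergeThreshold (default 200):
// eligible for BypassMergeSortShuffleWriter.
val grouped = pairs.groupByKey(100)

// reduceByKey enables map-side combine, so the bypass writer cannot be used.
val reduced = pairs.reduceByKey(_ + _, 100)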
11. SortShuffleWriter
• Why sort?
• Sort records by partition id, to separate records of different partitions
• Reduce the number of files: number of spill files < number of partitions
• Buffer (in memory):
• PartitionedAppendOnlyMap (when there is map-side combine)
• PartitionedPairBuffer (when there is no map-side combine)
[Diagram: the mapper's SortShuffleWriter feeds records through ExternalSorter's in-memory buffer; sorted spill files are merged into a single data file plus an index file.]
12. SortShuffleWriter (cont’d)
Used when
• There is map-side combine, or there are many partitions
• Or the serializer does not support record relocation
Pros
• Flexible; supports all shuffle situations
Cons
• Serializes/deserializes records multiple times
Internal configs that control spill behavior (inside Spillable.scala):
spark.shuffle.spill.initialMemoryThreshold
spark.shuffle.spill.numElementsForceSpillThreshold
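These internal configs can be set like any other Spark property. A hedged example: the 5 MB value matches the usual default for the initial memory threshold; the element count is purely illustrative, not tuning advice.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // In-memory size the buffer may reach before spilling is considered.
  .set("spark.shuffle.spill.initialMemoryThreshold", (5L * 1024 * 1024).toString)
  // Force a spill after this many records, regardless of memory.
  .set("spark.shuffle.spill.numElementsForceSpillThreshold", "10000000")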
13. UnsafeShuffleWriter
• Records are serialized once, then stored in memory pages
• 8-byte record pointer (pointing to: memory page + offset)
• All record pointers are stored in a long array
• Sort the record pointers (the long array), not the records
• Small memory footprint
• Fits the CPU cache better
• Sorter class: ShuffleExternalSorter
[Diagram: serialized records live in memory pages (Page 1, Page 2, ...); the 8-byte record pointers are stored and sorted as a long array.]
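A minimal sketch of the pointer packing, modeled on the layout of Spark's PackedRecordPointer ([24-bit partition id][13-bit page number][27-bit offset in page]); the helper names are illustrative:

// Pack a record's partition and location into one Long.
def pack(partitionId: Int, pageNumber: Int, offsetInPage: Long): Long =
  (partitionId.toLong << 40) |
  (pageNumber.toLong << 27) |
  (offsetInPage & ((1L << 27) - 1))

def partitionOf(packed: Long): Int = (packed >>> 40).toInt

// Because the partition id sits in the most significant bits, sorting the
// raw long array groups pointers by partition without touching the records.
val pointers = Array(pack(2, 0, 64), pack(0, 1, 128), pack(1, 0, 0))
java.util.Arrays.sort(pointers)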
14. UnsafeShuffleWriter (cont’d)
Used when
• The serializer supports record relocation
• There is no aggregator
Pros
• Single serialization; no deserialization/re-serialization when merging spill files
• Sorting is CPU-cache friendly
Cons
• Not supported with the default serializer (JavaSerializer); supported with KryoSerializer
15. Serializer: JavaSerializer
• Default serializer in Spark
• spark.serializer=org.apache.spark.serializer.JavaSerializer
• Uses object references in the serialized stream
• Writes a reference instead of the whole object for repeated (identical) objects
• Does not support record relocation
• Records cannot be moved within the serialized stream because of object references
• Pros: supports serialization in all situations
• Cons: performance is not good
16. Serializer: KryoSerializer
• Uses the Kryo library
• Does not use object references in the serialized stream by default
• Supports record relocation
• Because there are no object references, each serialized object is independent
• Classes must be registered explicitly for serialization; otherwise Kryo writes the fully qualified class name for each serialized object
• Pros: performance is good for common classes and registered classes (see KryoSerializer.scala)
• Cons: performance is bad for unregistered custom classes; register them explicitly
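A typical setup looks like the following; the Click class is a made-up example standing in for your own types.

import org.apache.spark.SparkConf

case class Click(userId: Long, url: String) // hypothetical custom class

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registered classes get a compact id instead of their fully qualified name.
  .registerKryoClasses(Array(classOf[Click]))
  // Optional: fail fast when an unregistered class is serialized.
  .set("spark.kryo.registrationRequired", "true")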
17. Shuffle Reader: BlockStoreShuffleReader
[Diagram: BlockStoreShuffleReader fetches its partition from every mapper's data file, using the index file offsets to locate it. Fetched blocks are merged by the Aggregator via ExternalAppendOnlyMap (using HashComparator, spilling to spill files as needed) and exposed as an iterator; if ordering by key is required, the records additionally pass through ExternalSorter.]
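To make the reader path concrete (assuming a local SparkContext): reduceByKey defines an aggregator, so the fetched blocks are merged through ExternalAppendOnlyMap, while sortByKey sets a key ordering and therefore also exercises ExternalSorter.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("reader-demo").setMaster("local[2]"))
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3)))

// Aggregator on the reduce side: blocks merged via ExternalAppendOnlyMap.
val counts = pairs.reduceByKey(_ + _)

// Key ordering on the reduce side: records also run through ExternalSorter.
val sorted = counts.sortByKey()
sorted.collect().foreach(println)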
18. External Shuffle Service
• YarnShuffleService / MesosExternalShuffleService
• YarnShuffleService: runs inside the YARN Node Manager as an AuxiliaryService
• Runs on each machine in the YARN/Mesos cluster
• Reads shuffle files from local disk and streams them to reducers
• Uses a file name convention to locate shuffle files (ExternalShuffleBlockResolver)
• "shuffle_" + shuffleId + "_" + mapId + "_0.index"
• "shuffle_" + shuffleId + "_" + mapId + "_0.data"
19. Suggestions / Takeaway
• Shuffle is expensive; avoid unnecessary shuffles
• Shuffle vs cache (Dataset.persist(…))
• Shuffle files provide the full data set for the next stage's execution
• Caching may not be necessary when there is a shuffle (unless you want cache replicas)
• Use KryoSerializer if possible
• Tune the relevant configs
• spark.shuffle.sort.bypassMergeThreshold
• spark.shuffle.spill.initialMemoryThreshold
• spark.shuffle.spill.numElementsForceSpillThreshold
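Putting the suggestions together in one place; the values shown are illustrative starting points (200 and 5 MB match the usual defaults), not recommendations.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.shuffle.sort.bypassMergeThreshold", "200")
  .set("spark.shuffle.spill.initialMemoryThreshold", (5L * 1024 * 1024).toString)
  .set("spark.shuffle.spill.numElementsForceSpillThreshold", "10000000")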