Spark Shuffle Deep Dive
Bo Yang
Content
• Overview
• Major Classes
• Shuffle Writer
• Spark Serializer
• Shuffle Reader
• External Shuffle Service
• Suggestions
Shuffle Overview
Map-side output (word-count style example with fruit counts):
• Mapper 1: Orange 3, Apple 2, Peach 5, Pear 1
• Mapper 2: Peach 3, Banana 2, Grape 5
Reduce-side input after the shuffle (each key goes to exactly one reducer, and its values are combined):
• Reducer 1: Apple 2, Peach 8, Pear 1
• Reducer 2: Grape 5, Orange 3
• Reducer 3: Banana 2
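A minimal Scala sketch of the kind of job that produces this shuffle (the fruit counts match the diagram above; the RDD and app name are illustrative, not part of the original deck):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(
    new SparkConf().setAppName("shuffle-overview-demo").setMaster("local[2]"))  // local mode just for the sketch
  // Two input partitions play the roles of Mapper 1 and Mapper 2
  val mapSide = sc.parallelize(Seq(
      ("Orange", 3), ("Apple", 2), ("Peach", 5), ("Pear", 1),   // Mapper 1
      ("Peach", 3), ("Banana", 2), ("Grape", 5)                 // Mapper 2
    ), numSlices = 2)
  // reduceByKey triggers a shuffle; each key is routed to one of 3 reducers
  val reduceSide = mapSide.reduceByKey(_ + _, numPartitions = 3)
  reduceSide.collect().foreach(println)   // e.g. (Peach,8) merges values from both mappers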
High Level Abstraction
• Pluggable Interface: ShuffleManager
• registerShuffle(…)
• getWriter(…)
• getReader(…)
• Configurable: spark.shuffle.manager=xxx
• Mapper: ShuffleWriter
• write(records: Iterator)
• Reducer: ShuffleReader
• read(): Iterator
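A simplified Scala sketch of this pluggable interface (signatures are trimmed relative to Spark's actual ShuffleManager trait, which also exposes unregisterShuffle, shuffleBlockResolver and stop):

  trait ShuffleManager {
    // Driver side: called when a ShuffleDependency is created; returns a handle for later use
    def registerShuffle[K, V, C](shuffleId: Int, dependency: ShuffleDependency[K, V, C]): ShuffleHandle
    // Map task side: obtain a writer for this map task's output
    def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext): ShuffleWriter[K, V]
    // Reduce task side: obtain a reader for a range of reduce partitions
    def getReader[K, C](handle: ShuffleHandle, startPartition: Int, endPartition: Int, context: TaskContext): ShuffleReader[K, C]
  }

  // Selected via configuration, e.g. spark.shuffle.manager=sort (SortShuffleManager, the default)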
Implementations
• SortShuffleManager (extends ShuffleManager)
• Three Writers (optimized for different scenarios)
• SortShuffleWriter: uses ExternalSorter
• BypassMergeSortShuffleWriter: no sorter
• UnsafeShuffleWriter: uses ShuffleExternalSorter
• One Reader
• BlockStoreShuffleReader, uses
• ExternalAppendOnlyMap
• ExternalSorter (if ordering)
Writer Output Example (Shuffle Files)
Each mapper leaves behind one data file and one index file:
• Mapper 1: a data file holding Partition 1, Partition 2 and Partition 3 back to back, plus an index file holding Offset 1, Offset 2 and Offset 3 marking where each partition starts
• Mapper 2: the same layout, in its own data file and index file
Reducer 1, Reducer 2 and Reducer 3 each fetch their partition's byte range from every mapper's data file.
Number of Partitions == Number of Reducers
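A hedged Scala sketch of how a reducer locates its bytes, assuming (as in Spark's IndexShuffleBlockResolver) that the index file stores cumulative byte offsets into the data file; the helper name is illustrative:

  import java.io.{DataInputStream, FileInputStream}

  // Returns the (start, end) byte range in one mapper's data file for a given reduce partition.
  def partitionByteRange(indexFile: String, reduceId: Int): (Long, Long) = {
    val in = new DataInputStream(new FileInputStream(indexFile))
    try {
      in.skip(reduceId * 8L)        // each stored offset is an 8-byte long
      val start = in.readLong()     // where this reducer's partition begins
      val end = in.readLong()       // where the next partition begins, i.e. where this one ends
      (start, end)
    } finally in.close()
  }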
Three Shuffle Writers
• Different Writer Algorithms
• SortShuffleWriter
• BypassMergeSortShuffleWriter
• UnsafeShuffleWriter
• Used in different situations (optimizations)
• Things to consider
• Reduce total number of files
• Reduce serialization/deserialization when possible
When Are the Different Writers Used?
• Small number of partitions?
---> BypassMergeSortShuffleWriter
• Able to sort records in serialized form?
---> UnsafeShuffleWriter
• Otherwise
---> SortShuffleWriter
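A simplified Scala sketch of this decision (modeled on SortShuffleManager.registerShuffle; the function and string results are illustrative, and the real check for the serialized path also caps the number of partitions):

  def chooseWriter(dep: ShuffleDependency[_, _, _], numPartitions: Int, conf: SparkConf): String = {
    val bypassThreshold = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
    if (!dep.mapSideCombine && numPartitions <= bypassThreshold) {
      "BypassMergeSortShuffleWriter"   // few partitions, no map-side combine
    } else if (dep.serializer.supportsRelocationOfSerializedObjects && dep.aggregator.isEmpty) {
      "UnsafeShuffleWriter"            // records can be sorted in serialized form
    } else {
      "SortShuffleWriter"              // general fallback
    }
  }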
BypassMergeSortShuffleWriter
One temp file per partition, then merge them:
• The mapper's BypassMergeSortShuffleWriter writes each record to the temp file of its partition (Temp File: Partition 0, Partition 1, …, Partition X)
• The temp files are then merged, in partition order, into a single data file plus an index file
BypassMergeSortShuffleWriter (cont’d)
Used when
• No map side combine
• Number of partitions < spark.shuffle.sort.bypassMergeThreshold
Pros
• Simple
Cons
• 1 to 1 mapping between temp file and partition
• Many temp files
SortShuffleWriter
• Why sort?
• Sort records by partition ID, so that records belonging to the same partition are grouped together
• Reduce number of files: number of spill files < number of partitions
• Buffer (in memory):
• PartitionedAppendOnlyMap (when there is map side combine)
• PartitionedPairBuffer (when there is no map side combine)
Mapper flow: SortShuffleWriter feeds records into the ExternalSorter buffer; when the buffer fills, a sorted spill file is written; the sorted spill files are finally merged into one data file plus an index file.
SortShuffleWriter (cont’d)
Used when (the fallback case)
• There is a map side combine, or there are many partitions, or
• The serializer does not support record relocation
Pros
• Flexible, supports all shuffle situations
Cons
• Serializes/deserializes records multiple times
Internal configuration settings that control spill behavior
(inside Spillable.scala; see the sketch below):
spark.shuffle.spill.initialMemoryThreshold
spark.shuffle.spill.numElementsForceSpillThreshold
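A hedged sketch of setting these spill controls on a SparkConf (the values are illustrative, not tuned recommendations):

  val conf = new SparkConf()
    // Bytes the in-memory collection may reach before spilling is first considered
    .set("spark.shuffle.spill.initialMemoryThreshold", "5242880")
    // Force a spill once this many records are buffered, regardless of memory use
    .set("spark.shuffle.spill.numElementsForceSpillThreshold", "10000000")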
UnsafeShuffleWriter
• Each record is serialized once, then stored in memory pages
• 8-byte record pointer (pointing to: memory page + offset)
• All record pointers are stored in a long array
• Sorting works on the record pointers (the long array), not on the records themselves
• Small memory footprint
• Fits the CPU cache better
• Sorter class: ShuffleExternalSorter
Memory layout: serialized records live in memory pages (Memory Page 1, Memory Page 2, …, Memory Page xxx); for each record an 8-byte pointer (Record 1, Record 2, …) is stored and sorted as a long array.
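A hedged Scala sketch of the pointer packing (Spark's PackedRecordPointer documents a 24-bit partition id, 13-bit page number and 27-bit offset within the page; the helper functions below are illustrative):

  // Pack partition id + page number + offset-in-page into one 64-bit pointer.
  def pack(partitionId: Int, pageNumber: Int, offsetInPage: Long): Long =
    (partitionId.toLong << 40) | (pageNumber.toLong << 27) | (offsetInPage & ((1L << 27) - 1))

  // Sorting the long array by its high bits orders records by partition id
  // without ever touching the serialized record bytes.
  def partitionOf(pointer: Long): Int = (pointer >>> 40).toInt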
UnsafeShuffleWriter (cont’d)
Used when
• Serializer supports record relocation
• No aggregator
Pros
• Single serialization, no deserialization/serialization for merging spill files
• Sorting is CPU cache friendly
Cons
• Not available with the default serializer (JavaSerializer); works with KryoSerializer
Serializer: JavaSerializer
• Default serializer in Spark
• spark.serializer=org.apache.spark.serializer.JavaSerializer
• Uses object references in the serialized stream
• Writes a back reference instead of the whole object when the same object repeats
• Does not support record relocation
• Records cannot be moved within the serialized stream because of the object references
• Pros: supports serialization in all situations
• Cons: relatively poor performance
Serializer: KryoSerializer
• Uses the Kryo library
• Does not use object references in the serialized stream by default
• Supports record relocation
• Because there are no object references, each serialized object is independent
• Classes should be explicitly registered for serialization; otherwise Kryo writes the
fully qualified class name with every serialized object
• Pros: good performance for common classes and registered classes (see
KryoSerializer.scala)
• Cons: poor performance for custom classes that are not registered; register them
explicitly (see the example below)
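A Scala example of switching to Kryo and registering application classes (MyEvent and MyAggregate are placeholder classes defined only for this sketch):

  import org.apache.spark.SparkConf

  case class MyEvent(user: String, value: Int)        // placeholder application classes
  case class MyAggregate(user: String, total: Long)

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Without registration, Kryo writes the fully qualified class name with every object
    .registerKryoClasses(Array(classOf[MyEvent], classOf[MyAggregate]))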
Shuffle Reader: BlockStoreShuffleReader
• Each mapper (Mapper 1, Mapper 2) leaves behind a data file (Partition 1, Partition 2, Partition 3) and an index file (Offset 1, Offset 2, Offset 3).
• The reducer's BlockStoreShuffleReader fetches its partition's blocks from every mapper's data file.
• Aggregator present: fetched records are merged in an ExternalAppendOnlyMap (using a HashComparator), spilling to spill files when memory runs low, and exposed as an iterator.
• If ordering by key is required: the records are additionally sorted with an ExternalSorter before the final iterator is returned.
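A small Scala example of what puts a reducer on each path (operations chosen for illustration; sc is an existing SparkContext):

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  // Aggregator defined -> the read side merges blocks with ExternalAppendOnlyMap
  val counts = pairs.reduceByKey(_ + _)
  // Key ordering required -> the read side additionally sorts with ExternalSorter
  val sorted = pairs.sortByKey()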
External Shuffle Service
• YarnShuffleService / MesosExternalShuffleService
• YarnShuffleService runs inside the YARN Node Manager as an
AuxiliaryService
• Runs on each machine in the YARN/Mesos cluster
• Serves shuffle files from local disk and streams them to reducers
• Uses a file name convention to locate shuffle files
(ExternalShuffleBlockResolver)
• "shuffle_" + shuffleId + "_" + mapId + "_0.index"
• "shuffle_" + shuffleId + "_" + mapId + "_0.data"
Suggestions / Takeaway
• Shuffle is expensive, avoid unnecessary shuffle
• Shuffle vs Cache (Dataset.persist(…))
• Shuffle files provide full data set for next stage execution
• Caching may not be necessary when a shuffle already exists (unless you want cached replicas)
• Use KryoSerializer if possible
• Tune the relevant configuration settings
• spark.shuffle.sort.bypassMergeThreshold
• spark.shuffle.spill.initialMemoryThreshold
• spark.shuffle.spill.numElementsForceSpillThreshold
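A hedged spark-submit example combining these knobs (the class name and jar are placeholders; values are illustrative, not tuned recommendations):

  spark-submit \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.shuffle.sort.bypassMergeThreshold=200 \
    --conf spark.shuffle.spill.numElementsForceSpillThreshold=10000000 \
    --class com.example.MyShuffleJob my-shuffle-job.jar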
