Optimizations in
Apache Spark
Presented By: Sarfaraz Hussain
Software Consultant
Knoldus Inc.
About Knoldus
Knoldus is a technology consulting firm with a focus on modernizing digital systems
at the pace your business demands.
DevOps
Functional. Reactive. Cloud Native
Agenda
01 Spark Execution Model
02 Optimizing Shuffle Operations
03 Optimizing Functions
04 SQL vs RDD
05 Logical & Physical Plan
06 Optimizing Joins
RDD
[Diagram: words such as Apple, Banana, Orange, Cat, Dog, Cow distributed across the partitions of an RDD]
Spark Execution Model
● Two kinds of operations:
1. Transformations
2. Actions
● Dependencies are divided into two types:
1. Narrow Dependency
2. Wide Dependency
● Stages
DAG
Stage Details
Narrow Transformations:
- map
- mapValues
- flatMap
- filter
- mapPartitions

Wide Transformations:
- cogroup
- groupWith
- join
- leftOuterJoin
- rightOuterJoin
- groupByKey
- reduceByKey
- combineByKey
- distinct
- intersection
- repartition
- coalesce (narrow by default; wide only when called with shuffle = true)
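To see where a wide transformation introduces a stage boundary, `toDebugString` prints the lineage with its shuffle dependencies. A quick sketch, assuming a SparkContext `sc`:

```scala
// Assumes a SparkContext `sc` (e.g. spark.sparkContext).
val lineage = sc.parallelize(Seq("a", "b", "a"))
  .map(w => (w, 1))     // narrow: stays in the same stage
  .reduceByKey(_ + _)   // wide: shuffles, starting a new stage

// The indentation in the output marks the stage boundary
// introduced by the ShuffledRDD.
println(lineage.toDebugString)
```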
Shuffle Operations
What is Shuffle?
- Shuffles are data transfers between different executors of a Spark cluster.
Shuffle Operations
1. To which executors does the data need to be sent?
2. How should the data be sent?
GroupByKey
Shuffle Operations
Where to send data?
- Partitioner - The partitioner defines how records are distributed across partitions, and thus which records
are processed by each task.
Partitioner
Types of partitioner:
- Hash Partitioner: Uses Java's Object.hashCode method to determine the partition as:
partition = key.hashCode() % numPartitions.
- Range Partitioner: Partitions data based on a set of sorted ranges of keys; tuples whose
keys fall in the same range end up on the same machine. This method is suitable where
there is a natural ordering in the keys and the keys are non-negative.
Example:
Hash Partitioner - GroupByKey, ReduceByKey
Range Partitioner - SortByKey
Further reading: https://www.edureka.co/blog/demystifying-partitioning-in-spark
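A minimal sketch of explicit partitioning, assuming a SparkContext `sc`:

```scala
import org.apache.spark.HashPartitioner

// Assumes a SparkContext `sc`.
val pairs = sc.parallelize(Seq(("Apple", 1), ("Cow", 2), ("Apple", 3)))

// Hash-partition into 4 partitions: partition = key.hashCode() % 4.
val hashed = pairs.partitionBy(new HashPartitioner(4))

// reduceByKey reuses the existing partitioner, so no further shuffle.
val counts = hashed.reduceByKey(_ + _)
println(counts.partitioner) // Some(HashPartitioner)
```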
Co-partitioned RDD
RDDs are co-partitioned if they are partitioned by the same partitioner.
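A sketch of why co-partitioning matters, assuming a SparkContext `sc` (the partition count is illustrative):

```scala
import org.apache.spark.HashPartitioner

// Two RDDs partitioned by the same partitioner are co-partitioned.
val p = new HashPartitioner(8)

val left  = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(p).persist()
val right = sc.parallelize(Seq(("a", "x"), ("b", "y"))).partitionBy(p).persist()

// Matching keys already live in the same partition number on both sides,
// so the join is a narrow dependency and avoids a full shuffle.
val joined = left.join(right)
```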
Co-located RDD
Partitions are co-located if they are both loaded into the memory of the same machine
(executor).
Shuffle Operations
How to send data?
- Serialization - It is the mechanism of representing an object as a stream of bytes,
transferring it through the network, and then reconstructing the same object and its
state on another computer.
Serializer in Spark
- Types of serializers in Spark:
- Java: slow, but robust
- Kryo: fast, but has a few problems (e.g., custom classes should be registered)
Further Reading: https://spark.apache.org/docs/latest/tuning.html#data-serialization
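A minimal sketch of switching to Kryo; the app name and the `Fruit` case class are illustrative only:

```scala
import org.apache.spark.SparkConf

// Hypothetical class used for illustration only.
case class Fruit(name: String, count: Int)

val conf = new SparkConf()
  .setAppName("kryo-demo") // illustrative app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo write small IDs instead of full class names.
  .registerKryoClasses(Array(classOf[Fruit]))
```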
Optimizing Functions In Transformation
map vs mapPartitions
- map applies the given function at the per-element level, while mapPartitions
applies it at the partition level.
- map: Applies a transformation function to each item of the RDD and returns the result
as a new RDD.
- mapPartitions: The function is called only once for each partition. The entire content of the
respective partition is available as a sequential stream of values via the input
argument (Iterator[T]).
- https://stackoverflow.com/questions/21185092/apache-spark-map-vs-mappartitions
map vs mapPartitions
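A minimal sketch of the difference, assuming an RDD[String] named `words`:

```scala
// map: the function runs once per element.
val lengths = words.map(_.length)

// mapPartitions: the function runs once per partition and receives the
// whole partition as an Iterator[String]. Expensive setup (here a
// NumberFormat; in practice often a DB or HTTP connection) happens once
// per partition rather than once per record.
val formatted = words.mapPartitions { iter =>
  val fmt = java.text.NumberFormat.getInstance()
  iter.map(w => s"$w has ${fmt.format(w.length)} characters")
}
```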
SQL vs RDD
| SQL | RDD |
|-----|-----|
| SQL is a high-level API. | RDD is a low-level API. |
| SQL focuses on "WHAT". | RDD focuses on "HOW". |
| Spark takes care of optimizing most SQL queries. | Optimizing RDDs is the developer's responsibility. |
| SQL is declarative. | RDDs are imperative, i.e. we need to specify each step of the computation. |
| SQL knows about your data. | RDDs don't know anything about your data. |
| Involves little serialization/deserialization, since the Catalyst Optimizer takes care of optimizing it. | RDDs involve a lot of serialization/deserialization. |
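The same word count written both ways, as a sketch (assumes a SparkSession `spark`): declarative via the DataFrame API, imperative via the RDD API.

```scala
// Assumes a SparkSession `spark`.
import spark.implicits._

val df = Seq("Apple", "Banana", "Apple").toDF("fruit")

// Declarative (SQL/DataFrame): say WHAT you want; Catalyst decides HOW.
val sqlCounts = df.groupBy("fruit").count()

// Imperative (RDD): spell out every step of the computation yourself.
val rddCounts = df.rdd
  .map(row => (row.getString(0), 1L))
  .reduceByKey(_ + _)
```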
Logical & Physical Plan
● Logical Plan
- Unresolved Logical Plan OR Parsed Logical Plan
- Resolved Logical Plan OR Logical Plan OR Analyzed Logical Plan
- Optimized Logical Plan
● Catalog
● Catalyst Optimizer
● Tungsten
● Physical Plan
Logical & Physical Plan
https://blog.knoldus.com/understanding-sparks-logical-and-physical-plan-in-laymans-term/
Catalyst Optimizer and Tungsten
Codegen
Once the best Physical Plan is selected, it is time to generate the executable
code (a DAG of RDDs) for the query, to be executed in the cluster in a distributed
fashion. This process is called Codegen, and it is the job of Spark's Tungsten
Execution Engine.
Let’s see them in action!
Unresolved Logical Plan
Resolved Logical Plan
Optimized Logical Plan
Physical Plan
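These four plans can be inspected directly. A minimal sketch, reusing the `df` DataFrame from the SQL vs RDD sketch above:

```scala
// Assumes the DataFrame `df` from the earlier sketch.
val query = df.groupBy("fruit").count()

// Prints the Parsed (unresolved) Logical Plan, the Analyzed (resolved)
// Logical Plan, the Optimized Logical Plan, and the Physical Plan.
query.explain(true)
// Spark 3.x alternative: query.explain("extended")
```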
Optimizing Joins
Types of Joins -
a. Shuffle hash Join
b. Sort-merge Join
c. Broadcast Join
Shuffle hash Join
- Used when the join keys are not sortable.
- Used when Sort-merge Join is disabled, i.e.
- spark.sql.join.preferSortMergeJoin is false.
- One side must be much smaller (at least 3 times) than the other.
- The smaller side is built into an in-memory hash map.
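Since Spark 3.0 a shuffle hash join can also be requested explicitly with a join hint; without it, Spark only picks this strategy under the conditions above. A sketch with hypothetical DataFrames `ordersDf` and `customersDf`:

```scala
// `ordersDf` and `customersDf` are hypothetical DataFrames.
val joined = ordersDf.join(customersDf.hint("shuffle_hash"), "customer_id")
```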
Sort-merge Join
- spark.sql.join.preferSortMergeJoin is true by default.
- The default join implementation.
- Join keys must be sortable.
- In our previous example, a Sort-merge Join took place.
- Use bucketing: pre-shuffle and sort the data by the join key, as sketched in the Bucketing slide below.
Bucketing
- Bucketing pre-computes the shuffle and stores the data as a pre-bucketed input table, thus
avoiding a shuffle at each stage.
- SET spark.sql.sources.bucketing.enabled = TRUE
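A sketch of bucketing with illustrative table and column names: write both sides pre-bucketed and pre-sorted by the join key, so a later sort-merge join skips the shuffle.

```scala
// `ordersDf` and `customersDf` are hypothetical DataFrames.
ordersDf.write
  .bucketBy(16, "customer_id")
  .sortBy("customer_id")
  .saveAsTable("orders_bucketed")

customersDf.write
  .bucketBy(16, "customer_id")
  .sortBy("customer_id")
  .saveAsTable("customers_bucketed")

// A sort-merge join between the bucketed tables needs no shuffle.
val joined = spark.table("orders_bucketed")
  .join(spark.table("customers_bucketed"), "customer_id")
```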
Broadcast Join
- Broadcast the smaller DataFrame to all worker nodes.
- Perform a map-side join.
- No shuffle operation takes place.
- spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)
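A minimal sketch, with illustrative DataFrame names: force-broadcast the smaller side, or rely on the auto-broadcast threshold.

```scala
import org.apache.spark.sql.functions.broadcast

// Each executor receives a full copy of the broadcast side,
// so the join happens map-side with no shuffle.
val joined = ordersDf.join(broadcast(customersDf), "customer_id")

// Or rely on auto-broadcast below the threshold (10 MB = 10485760 bytes by default):
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)
```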
Caching/Persisting
a. It keeps the lineage intact.
b. Data is cached in the executors' memory and fetched from the cache on reuse.
c. If some cached partitions are lost, they can be recomputed from scratch via the lineage,
while the surviving cached partitions are not recomputed (handled automatically by Spark).
d. Subsequent uses of the RDD do not recompute anything before the point where it was cached.
e. The cache is cleared when the SparkContext is destroyed.
f. Persisting is unreliable.
g. data.persist() OR data.cache()
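A minimal sketch, assuming an RDD named `data`:

```scala
import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY) on RDDs.
val cached = data.persist(StorageLevel.MEMORY_AND_DISK)

cached.count()  // first action computes and caches the partitions
cached.count()  // later actions read from the cache

cached.unpersist()  // release the storage when no longer needed
```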
Checkpointing
a. It breaks the lineage.
b. Data is written to and fetched from HDFS or the local file system.
c. Data cannot be recomputed from scratch if some partitions are lost,
as the lineage chain is completely gone.
d. Checkpointed data can be reused in subsequent job runs.
e. Checkpointed data is persistent and is not removed when the SparkContext is
destroyed.
f. Checkpointing is reliable.
Checkpointing
spark.sparkContext.setCheckpointDir("/hdfs_directory/") // must be set before checkpointing
myRdd.checkpoint()  // marks the RDD; data is written at the next action
myRdd.count()       // any action materializes the checkpoint
df.rdd.checkpoint() // for a DataFrame, checkpoint its underlying RDD
Why checkpoint?
- A busy cluster.
- Expensive, long-running computations.
Thank You!
https://www.linkedin.com/in/sarfaraz-hussain-8123b4132/
sarfaraz.hussain@knoldus.com
