Improving Apache Spark Downscaling
WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
Christopher Crosbie, Google
Ben Sidhom, Google
Improving Spark Downscaling
#UnifiedDataAnalytics #SparkAISummit
Long History of Solving Data Problems
Timeline, 2000-2020: Google research and open source lineage (GFS, MapReduce, Dremel, Flume Java, Millwheel, BigTable, TensorFlow) alongside the Google Cloud products that grew out of it (BigQuery, Pub/Sub, Dataflow, Bigtable, Cloud ML, Dataproc).
Apache Airflow, Cloud ML Engine, Cloud Dataflow, Cloud Data Fusion, Cloud Composer
Who are we and what is Cloud Dataproc?
Google Cloud Platform’s fully-managed Apache Spark and Apache Hadoop service
● Rapid cluster creation
● Familiar open source tools
● Customizable machines
● Ephemeral clusters on-demand
● Tightly integrated with other Google Cloud Platform services
Cloud Dataproc: Open source solutions with GCP
Taking the best of open source and opening up access to the best of GCP.
Diagram of open source components (such as WebHCat) alongside GCP services: BigQuery, Cloud Datastore, Cloud Bigtable, Compute Engine, Kubernetes Engine, Cloud Dataflow, Cloud Dataproc, Cloud Functions, Cloud Machine Learning Engine, Cloud Pub/Sub, Key Management Service, Cloud Spanner, Cloud SQL, BigQuery Transfer Service, Cloud Translation API, Cloud Vision API, Cloud Storage.
Dataproc Autoscaling GA
Jobs are “fire and forget”: no need to manually intervene when a cluster is over or under capacity. Choose the balance between standard and preemptible workers and save resources (quota & cost) at any point in time.
Complicating Spark Downscaling
Without autoscaling: submit job, monitor resource usage, adjust cluster size.
With autoscaling: submit jobs.
Scaling is based on the difference between YARN pending and available memory:
● If more memory is needed, scale up.
● If there is excess memory, scale down.
● Obey VM limits and scale based on the scale factor.
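The arithmetic behind these rules can be pictured with a small sketch. This is an illustrative Scala model of the behavior described above, not the actual Dataproc autoscaler; the names (Policy, scaleUpFactor, memPerWorkerMb, and so on) are assumptions.

// Illustrative sketch of the scaling decision described above (not Dataproc's code).
object AutoscalerSketch {
  case class Policy(scaleUpFactor: Double,
                    scaleDownFactor: Double,
                    minWorkers: Int,
                    maxWorkers: Int)

  /** Recommended change in worker count: positive means scale up, negative scale down. */
  def recommendedDelta(pendingMemMb: Long,
                       availableMemMb: Long,
                       memPerWorkerMb: Long,
                       currentWorkers: Int,
                       policy: Policy): Int = {
    val pressureMb = pendingMemMb - availableMemMb  // > 0: more memory needed, < 0: excess
    val factor = if (pressureMb > 0) policy.scaleUpFactor else policy.scaleDownFactor
    val rawDelta = math.round(factor * pressureMb.toDouble / memPerWorkerMb).toInt
    // Obey VM limits: clamp the target cluster size to [minWorkers, maxWorkers].
    val target = (currentWorkers + rawDelta).max(policy.minWorkers).min(policy.maxWorkers)
    target - currentWorkers
  }
}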
Autoscaling policies: fine-grained control
Decision flow: Is there too much or too little YARN memory? If not, do nothing. If so, is the cluster already at the maximum number of nodes? If yes, do not autoscale; if no, determine the type and scale of nodes to modify and autoscale the cluster.
Spark Autoscaling Challenges
● YARN infrastructure complexities
● Finding processed data (shuffle files, cached RDDs, etc.)
● Optimizing costs
YARN
YARN-based managed Spark
Diagram: clients reach a Dataproc cluster through the Cloud Dataproc API (clusters, jobs) or directly over SSH; the cluster runs the Dataproc image (Apache Spark, Apache Hadoop, Apache Hive, ...) and the Dataproc agent on Compute Engine nodes, with HDFS on persistent disks, a cluster bucket in Cloud Storage, and user data in Cloud Storage.
YARN pain points
Management is difficult
Clusters are complicated and run more components than any single job or model requires, and operating them calls for hard-to-find experts.
Complicated OSS software stack
Version and dependency management is hard, and you have to understand how to tune multiple components for efficiency.
Isolation is hard
I have to think about my jobs to size clusters, and isolating jobs requires additional steps.
Multiple k8s options
Moving the OSS ecosystem to Kubernetes offers customers a range of options depending on their needs and core expertise.

                            DIY k8s              k8s Dataproc                 k8s Dataproc + vendor components
Runs OSS on k8s?            Yes - self-managed   Yes - managed k8s clusters   Yes - managed k8s clusters
SLAs                        GKE only             Dataproc cluster             Dataproc cluster and components
OSS components              Community only       Google optimized             Google optimized + vendor optimized
In-depth component support  No                   No                           Yes
Integrated management       No                   Yes                          Yes
Integrated security         No                   Yes                          Yes
Hybrid/cross-cloud support  No                   Yes                          Yes
How we are making this happen
• Kubernetes Operators - an application control plane for complex applications
  – The language of Kubernetes allows extending its vocabulary through Custom Resource Definitions (CRDs)
  – A Kubernetes Operator is an app-specific control plane running in the cluster
    • CRD: app-specific vocabulary
    • CR: instance of a CRD
    • CR Controller: interpreter and reconciliation loop for CRs
  – The cluster can now speak the app-specific words through the Kubernetes API
Diagram: the Kubernetes control plane (master) and data plane (nodes), with the MyApp control plane running in the cluster and a MyApp API exposed alongside the Kubernetes API so clients can CRUD MyApp resources.
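To make the "reconciliation loop" idea concrete, here is a purely illustrative Scala sketch; it uses no real Kubernetes client library, and the SparkApplication types and ClusterActions trait are stand-ins for what a CRD and its generated client code would provide.

// Illustrative reconciliation loop for a hypothetical "SparkApplication" custom resource.
case class SparkApplicationSpec(image: String, executorCount: Int)   // desired state
case class SparkApplicationStatus(runningExecutors: Int)             // observed state
case class SparkApplication(name: String,
                            spec: SparkApplicationSpec,
                            status: SparkApplicationStatus)

trait ClusterActions {
  def launchExecutor(app: SparkApplication): Unit
  def removeExecutor(app: SparkApplication): Unit
}

class SparkApplicationController(actions: ClusterActions) {
  // Called whenever the watched CR changes (or on a periodic resync): compare the
  // desired state in spec with the observed state in status and act to close the gap.
  def reconcile(app: SparkApplication): Unit = {
    val diff = app.spec.executorCount - app.status.runningExecutors
    if (diff > 0) (1 to diff).foreach(_ => actions.launchExecutor(app))
    else if (diff < 0) (1 to -diff).foreach(_ => actions.removeExecutor(app))
    // diff == 0: nothing to do; the cluster already matches the spec.
  }
}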
● Integrates with BigQuery, Google’s serverless data warehouse
● Provides Google Cloud Storage as a replacement for HDFS
● Ships logs to Stackdriver Monitoring
  ○ via a Prometheus server with the Stackdriver sidecar
● Contains sparkctl, a command-line tool that simplifies client-local application dependencies in a Kubernetes environment
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
Deployment options
Key benefits for autoscaling
1. Deploy unified resource management
Get away from two separate cluster management interfaces to manage open source components. Offers one central view for easy management.
2. Isolate Spark jobs and resources
Remove the headaches of version and dependency management; instead, move models and ETL pipelines from dev to production without added work.
3. Build resilient infrastructure
Don’t worry about sizing and building clusters, manipulating Dockerfiles, or messing around with Kubernetes networking configurations. It just works.
Helpful, but it does not solve our core problem...
Finding the processed data
What exactly is a shuffle & why do we care? Rob Wynne
A Brief History of Spark Shuffle
● Shuffle files to local storage on the executors
● Executors responsible for serving the files
● Loss of an executor meant loss of the shuffle files
● Result: poor auto-scaling
○ Pathological loop: scale down, lose work, re-compute, trigger scale up…
● Depended on driver GC event to clean up shuffle files
Today: Dynamic allocation and “external” shuffle
● Executors no longer need to serve data
● “External” shuffle is not exactly external
○ Only executors can be released
○ Can scale up & down executors but not the machines
● Still depends on driver GC event to clean up shuffle files
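For reference, this dynamic allocation plus external shuffle service setup is switched on with standard Spark settings; a minimal sketch (the executor bounds are arbitrary example values):

import org.apache.spark.SparkConf

// Dynamic allocation backed by the NodeManager-hosted external shuffle service:
// executors can be released while their shuffle files stay readable on the machine.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")   // example bound
  .set("spark.dynamicAllocation.maxExecutors", "50")  // example bound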
Spark’s shuffle code today
private[spark] trait ShuffleManager {
  def registerShuffle[K, V, C](shuffleId: Int, numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle

  def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext,
      metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V]

  def getReader[K, C](handle: ShuffleHandle, startPartition: Int, endPartition: Int,
      context: TaskContext, metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C]

  def unregisterShuffle(shuffleId: Int): Boolean
  def shuffleBlockResolver: ShuffleBlockResolver
  def stop(): Unit
}
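This trait is what makes shuffle pluggable: Spark instantiates whichever class the spark.shuffle.manager setting names. As a sketch, a custom backend would be wired in through configuration (the class name below is hypothetical):

import org.apache.spark.SparkConf

// Point Spark at a custom ShuffleManager implementation.
// "org.example.HcfsShuffleManager" is an illustrative class name, not a shipped one.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "org.example.HcfsShuffleManager")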
Continued..
/**
 * Obtained inside a map task to write out records to the shuffle system.
 */
private[spark] abstract class ShuffleWriter[K, V] {
  /** Write a sequence of records to this task's output */
  @throws[IOException]
  def write(records: Iterator[Product2[K, V]]): Unit

  /** Close this writer, passing along whether the map completed */
  def stop(success: Boolean): Option[MapStatus]
}
Continued..
/** Write a bunch of records to this task's output */
override def write(records: Iterator[Product2[K, V]]): Unit = {
  sorter = if (dep.mapSideCombine) {
    new ExternalSorter[K, V, C](
      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
  } else {
    // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
    // care whether the keys get sorted in each partition; that will be done on the reduce side
    // if the operation being run is sortByKey.
    new ExternalSorter[K, V, V](
      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
  }
  sorter.insertAll(records)
  ...
Continued..
// Don't bother including the time to open the merged output file in the shuffle write time,
// because it just opens a single file, so is typically too fast to measure accurately
// (see SPARK-3570).
val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
val tmp = Utils.tempFileWith(output)
try {
  val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
  val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
  shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
  mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
} finally {
  if (tmp.exists() && !tmp.delete()) {
    logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
  }
}
}
Continued..
// Note: Changes to the format in this file should be kept in sync with
// org.apache.spark.network.shuffle.ExternalShuffleBlockResolver#getSortBasedShuffleBlockData().
private[spark] class IndexShuffleBlockResolver(
    conf: SparkConf,
    _blockManager: BlockManager = null)
  extends ShuffleBlockResolver
  ...
Problems with This
● Rapid downscaling infeasible
○ Scaling down entire nodes is hard
● Preemptible VMs & Spot Instances
Optimizing Costs
Preemptible VMs and Spot instances
PVMs: up to 80% cheaper for short-lived instances. They can be pulled at any time and are guaranteed to be removed at least once in 24 hours. Spot pricing is based on a Vickrey auction.
Diagram: a shuffle sits between Stage 1 and Stage 2.
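For orientation, the shuffle in this diagram is what any wide transformation introduces between stages; a minimal, illustrative word count (the input path is a placeholder):

import org.apache.spark.sql.SparkSession

// reduceByKey is a wide dependency: stage 1 writes partitioned map output, stage 2
// fetches and reduces it. Losing that map output forces stage 1 to be recomputed.
val spark = SparkSession.builder().appName("shuffle-demo").getOrCreate()
val counts = spark.sparkContext
  .textFile("gs://example-bucket/logs/*.txt")  // placeholder input path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                          // shuffle boundary: stage 1 -> stage 2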
How can we fix this?
Make intermediate shuffle data external to both the executor and the machine itself
Where we started
class HcfsShuffleWriter[K, V, C] extends ShuffleWriter[K, V] {
  override def write(records: Iterator[Product2[K, V]]): Unit = {
    // Sort/aggregate exactly as the built-in writer does (C with map-side combine, V without).
    val sorter = new ExternalSorter[K, V, C/V](...)
    sorter.insertAll(records)
    val partitionIter = sorter.partitionedIter
    // Open an output stream against the Hadoop-compatible file system (HCFS) target.
    val hcfsStream = …
    val countingStream = new CountingOutputStream(hcfsStream)
    val framedOutput = new FramingOutputStream(countingStream)
    try {
      for ((partition, iter) <- partitionIter) {
        // Write partition to external storage
      }
    } finally {
      framedOutput.closeUnderlying()
    }
  }
}
Alpha: HDFS not quite ready for prime time
● RPC overhead to HDFS or persistent storage
● Especially poor performance with misaligned partition/block sizes
  ○ HDFS, GCS, etc. have different block size expectations
● Loss of the implicit in-memory page cache
● Possible slowness in cleaning up shuffle files
● Namenode contention when reading shuffle files (HDFS)
  ○ Added an index caching layer to mitigate this (see the sketch after this list)
● Additional metadata tracking
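As a rough illustration of that index caching layer (the cache shape and loader signature here are assumptions, not the actual implementation), the idea is to memoize each shuffle index so it is read from the distributed file system only once:

import java.util.concurrent.ConcurrentHashMap

// Hypothetical index cache: remember per-(shuffleId, mapId) partition offsets so each
// index file is fetched from HDFS/GCS once, easing namenode contention on reads.
class ShuffleIndexCache(loadFromDfs: (Int, Int) => Array[Long]) {
  private val cache = new ConcurrentHashMap[(Int, Int), Array[Long]]()

  def partitionOffsets(shuffleId: Int, mapId: Int): Array[Long] =
    cache.computeIfAbsent((shuffleId, mapId), key => loadFromDfs(key._1, key._2))
}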
Object Storage?
Apache Crail (Incubating) is a high-performance distributed data store designed for fast sharing of ephemeral data in distributed data processing workloads:
● Fast
● Heterogeneous
● Modular
What about Google Cloud Bigtable?
A consistent, low-latency, high-throughput, and scalable wide-column database service.
Back to basics - NFS
● Shuffle to Elastifile
  ○ Cloud-based NFS service (scales horizontally)
  ○ Tailored to random access patterns and small files
  ○ NFS looks like a local FS, but is not: be careful with commit semantics and speculative execution (see the sketch after this list)
● Still a performance hit, but factors better than HDFS
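To make the commit-semantics caveat concrete, here is a minimal sketch of one common pattern for keeping speculative task attempts from corrupting shuffle output on an NFS-like file system: each attempt writes an attempt-private temporary file and then renames it into place. The names and layout are assumptions, not the actual Dataproc/Elastifile code.

import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical commit helper: write each attempt to its own temp file, then rename
// into the final location, so readers only ever observe complete files.
object NfsShuffleCommit {
  def commit(shuffleDir: Path, mapId: Int, attemptId: Long, bytes: Array[Byte]): Unit = {
    val finalPath = shuffleDir.resolve(s"shuffle_$mapId.data")
    val tmpPath   = shuffleDir.resolve(s"shuffle_$mapId.attempt_$attemptId.tmp")
    Files.write(tmpPath, bytes)                  // attempt-private temp file
    if (Files.exists(finalPath)) {
      Files.deleteIfExists(tmpPath)              // another (speculative) attempt already committed
    } else {
      // Rename is atomic on POSIX-style file systems; NFS needs extra care here, which
      // is exactly the commit-semantics caveat called out above.
      Files.move(tmpPath, finalPath, StandardCopyOption.ATOMIC_MOVE)
    }
  }
}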
Goal: OSS Disaggregated Shuffle Architecture
Diagram: a Kubernetes cluster running the Spark driver pod and executors with a shuffle offload component (WIP), writing shuffle data to an Elastifile virtual machine group and on to cloud (object) storage.
Use the cloud to fix the cloud?
DON’T FORGET TO RATE AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT