Improving Apache Spark Downscaling
WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
Christopher Crosbie, Google
Ben Sidhom, Google
Improving Spark Downscaling
#UnifiedDataAnalytics #SparkAISummit
Long History of Solving Data Problems
Timeline, 2000-2020: Google research and open source lineage (GFS, MapReduce, Dremel, Flume Java, Millwheel, BigTable, TensorFlow) alongside the Google Cloud products that grew out of it (BigQuery, Pub/Sub, Dataflow, Bigtable, Cloud ML, Dataproc).
Apache Airflow, Cloud ML Engine, Cloud Dataflow, Cloud Data Fusion, Cloud Composer
Who are we and what is Cloud Dataproc?
Google Cloud Platform’s fully-managed Apache Spark and Apache Hadoop service
● Rapid cluster creation
● Familiar open source tools
● Customizable machines
● Ephemeral clusters on-demand
● Tightly integrated with other Google Cloud Platform services
Cloud Dataproc: Open source solutions with GCP
Taking the best of open source and opening up access to the best of GCP.
Diagram of open source components (such as WebHCat) alongside GCP services: BigQuery, Cloud Datastore, Cloud Bigtable, Compute Engine, Kubernetes Engine, Cloud Dataflow, Cloud Dataproc, Cloud Functions, Cloud Machine Learning Engine, Cloud Pub/Sub, Key Management Service, Cloud Spanner, Cloud SQL, BigQuery Transfer Service, Cloud Translation API, Cloud Vision API, Cloud Storage.
Dataproc Autoscaling GA
Jobs are “fire and forget”: no need to manually intervene when a cluster is over or under capacity. Choose the balance between standard and preemptible workers and save resources (quota & cost) at any point in time.
Complicating Spark Downscaling
Without autoscaling: submit job, monitor resource usage, adjust cluster size.
With autoscaling: submit jobs.
Scaling is based on the difference between YARN pending and available memory:
● If more memory is needed, scale up.
● If there is excess memory, scale down.
● Obey VM limits and scale based on the scale factor.
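The arithmetic behind these rules can be pictured with a small sketch. This is an illustrative Scala model of the behavior described above, not the actual Dataproc autoscaler; the names (Policy, scaleUpFactor, memPerWorkerMb, and so on) are assumptions.

// Illustrative sketch of the scaling decision described above (not Dataproc's code).
object AutoscalerSketch {
  case class Policy(scaleUpFactor: Double,
                    scaleDownFactor: Double,
                    minWorkers: Int,
                    maxWorkers: Int)

  /** Recommended change in worker count: positive means scale up, negative scale down. */
  def recommendedDelta(pendingMemMb: Long,
                       availableMemMb: Long,
                       memPerWorkerMb: Long,
                       currentWorkers: Int,
                       policy: Policy): Int = {
    val pressureMb = pendingMemMb - availableMemMb  // > 0: more memory needed, < 0: excess
    val factor = if (pressureMb > 0) policy.scaleUpFactor else policy.scaleDownFactor
    val rawDelta = math.round(factor * pressureMb.toDouble / memPerWorkerMb).toInt
    // Obey VM limits: clamp the target cluster size to [minWorkers, maxWorkers].
    val target = (currentWorkers + rawDelta).max(policy.minWorkers).min(policy.maxWorkers)
    target - currentWorkers
  }
}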
Autoscaling policies: fine-grained control
Decision flow: Is there too much or too little YARN memory? If not, do nothing. If so, is the cluster already at the maximum number of nodes? If yes, do not autoscale; if no, determine the type and scale of nodes to modify and autoscale the cluster.
Spark Autoscaling Challenges
● YARN infrastructure complexities
● Finding processed data (shuffle files, cached RDDs, etc.)
● Optimizing costs
YARN
YARN-based managed Spark
Diagram: clients reach a Dataproc cluster through the Cloud Dataproc API (clusters, jobs) or directly over SSH; the cluster runs the Dataproc image (Apache Spark, Apache Hadoop, Apache Hive, ...) and the Dataproc agent on Compute Engine nodes, with HDFS on persistent disks, a cluster bucket in Cloud Storage, and user data in Cloud Storage.
YARN pain points
Management is difficult
Clusters are complicated and run more components than any single job or model requires, and operating them calls for hard-to-find experts.
Complicated OSS software stack
Version and dependency management is hard, and you have to understand how to tune multiple components for efficiency.
Isolation is hard
I have to think about my jobs to size clusters, and isolating jobs requires additional steps.
Multiple k8s options
Moving the OSS ecosystem to Kubernetes offers customers a range of options depending on their needs and core expertise.

                            DIY k8s              k8s Dataproc                 k8s Dataproc + vendor components
Runs OSS on k8s?            Yes - self-managed   Yes - managed k8s clusters   Yes - managed k8s clusters
SLAs                        GKE only             Dataproc cluster             Dataproc cluster and components
OSS components              Community only       Google optimized             Google optimized + vendor optimized
In-depth component support  No                   No                           Yes
Integrated management       No                   Yes                          Yes
Integrated security         No                   Yes                          Yes
Hybrid/cross-cloud support  No                   Yes                          Yes
How we are making this happen
• Kubernetes Operators - an application control plane for complex applications
  – The language of Kubernetes allows extending its vocabulary through Custom Resource Definitions (CRDs)
  – A Kubernetes Operator is an app-specific control plane running in the cluster
    • CRD: app-specific vocabulary
    • CR: instance of a CRD
    • CR Controller: interpreter and reconciliation loop for CRs
  – The cluster can now speak the app-specific words through the Kubernetes API
Diagram: the Kubernetes control plane (master) and data plane (nodes), with the MyApp control plane running in the cluster and a MyApp API exposed alongside the Kubernetes API so clients can CRUD MyApp resources.
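To make the "reconciliation loop" idea concrete, here is a purely illustrative Scala sketch; it uses no real Kubernetes client library, and the SparkApplication types and ClusterActions trait are stand-ins for what a CRD and its generated client code would provide.

// Illustrative reconciliation loop for a hypothetical "SparkApplication" custom resource.
case class SparkApplicationSpec(image: String, executorCount: Int)   // desired state
case class SparkApplicationStatus(runningExecutors: Int)             // observed state
case class SparkApplication(name: String,
                            spec: SparkApplicationSpec,
                            status: SparkApplicationStatus)

trait ClusterActions {
  def launchExecutor(app: SparkApplication): Unit
  def removeExecutor(app: SparkApplication): Unit
}

class SparkApplicationController(actions: ClusterActions) {
  // Called whenever the watched CR changes (or on a periodic resync): compare the
  // desired state in spec with the observed state in status and act to close the gap.
  def reconcile(app: SparkApplication): Unit = {
    val diff = app.spec.executorCount - app.status.runningExecutors
    if (diff > 0) (1 to diff).foreach(_ => actions.launchExecutor(app))
    else if (diff < 0) (1 to -diff).foreach(_ => actions.removeExecutor(app))
    // diff == 0: nothing to do; the cluster already matches the spec.
  }
}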
● Integrates with BigQuery, Google’s serverless data warehouse
● Provides Google Cloud Storage as a replacement for HDFS
● Ships logs to Stackdriver Monitoring
  ○ via a Prometheus server with the Stackdriver sidecar
● Contains sparkctl, a command-line tool that simplifies client-local application dependencies in a Kubernetes environment
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
Deployment options
Key benefits for autoscaling
1. Deploy unified resource management
Get away from two separate cluster management interfaces to manage open source components. Offers one central view for easy management.
2. Isolate Spark jobs and resources
Remove the headaches of version and dependency management; instead, move models and ETL pipelines from dev to production without added work.
3. Build resilient infrastructure
Don’t worry about sizing and building clusters, manipulating Dockerfiles, or messing around with Kubernetes networking configurations. It just works.
Helpful, but it does not solve our core problem...
Finding the processed data
What exactly is a shuffle & why do we care? Rob Wynne
A Brief History of Spark Shuffle
● Shuffle files to local storage on the executors
● Executors responsible for serving the files
● Loss of an executor meant loss of the shuffle files
● Result: poor auto-scaling
○ Pathological loop: scale down, lose work, re-compute, trigger scale up…
● Depended on driver GC event to clean up shuffle files
Today: Dynamic allocation and “external” shuffle
● Executors no longer need to serve data
● “External” shuffle is not exactly external
○ Only executors can be released
○ Can scale up & down executors but not the machines
● Still depends on driver GC event to clean up shuffle files
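For reference, this dynamic allocation plus external shuffle service setup is switched on with standard Spark settings; a minimal sketch (the executor bounds are arbitrary example values):

import org.apache.spark.SparkConf

// Dynamic allocation backed by the NodeManager-hosted external shuffle service:
// executors can be released while their shuffle files stay readable on the machine.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")   // example bound
  .set("spark.dynamicAllocation.maxExecutors", "50")  // example bound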
Spark’s shuffle code today
private[spark] trait ShuffleManager {
  def registerShuffle[K, V, C](shuffleId: Int, numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle

  def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext,
      metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V]

  def getReader[K, C](handle: ShuffleHandle, startPartition: Int, endPartition: Int,
      context: TaskContext, metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C]

  def unregisterShuffle(shuffleId: Int): Boolean
  def shuffleBlockResolver: ShuffleBlockResolver
  def stop(): Unit
}
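This trait is what makes shuffle pluggable: Spark instantiates whichever class the spark.shuffle.manager setting names. As a sketch, a custom backend would be wired in through configuration (the class name below is hypothetical):

import org.apache.spark.SparkConf

// Point Spark at a custom ShuffleManager implementation.
// "org.example.HcfsShuffleManager" is an illustrative class name, not a shipped one.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "org.example.HcfsShuffleManager")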
Continued..
/**
 * Obtained inside a map task to write out records to the shuffle system.
 */
private[spark] abstract class ShuffleWriter[K, V] {
  /** Write a sequence of records to this task's output */
  @throws[IOException]
  def write(records: Iterator[Product2[K, V]]): Unit

  /** Close this writer, passing along whether the map completed */
  def stop(success: Boolean): Option[MapStatus]
}
Continued..
/** Write a bunch of records to this task's output */
override def write(records: Iterator[Product2[K, V]]): Unit = {
  sorter = if (dep.mapSideCombine) {
    new ExternalSorter[K, V, C](
      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
  } else {
    // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
    // care whether the keys get sorted in each partition; that will be done on the reduce side
    // if the operation being run is sortByKey.
    new ExternalSorter[K, V, V](
      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
  }
  sorter.insertAll(records)
  ...
Continued..
// Don't bother including the time to open the merged output file in the shuffle write time,
// because it just opens a single file, so is typically too fast to measure accurately
// (see SPARK-3570).
val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
val tmp = Utils.tempFileWith(output)
try {
  val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
  val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
  shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
  mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
} finally {
  if (tmp.exists() && !tmp.delete()) {
    logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
  }
}
}
Continued..
// Note: Changes to the format in this file should be kept in sync with
// org.apache.spark.network.shuffle.ExternalShuffleBlockResolver#getSortBasedShuffleBlockData().
private[spark] class IndexShuffleBlockResolver(
    conf: SparkConf,
    _blockManager: BlockManager = null)
  extends ShuffleBlockResolver
  ...
Problems with This
● Rapid downscaling infeasible
○ Scaling down entire nodes is hard
● Preemptible VMs & Spot Instances
Optimizing Costs
Preemptible VMs and Spot instances
PVMs: up to 80% cheaper for short-lived instances. They can be pulled at any time and are guaranteed to be removed at least once in 24 hours. Spot pricing is based on a Vickrey auction.
Diagram: a shuffle sits between Stage 1 and Stage 2.
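For orientation, the shuffle in this diagram is what any wide transformation introduces between stages; a minimal, illustrative word count (the input path is a placeholder):

import org.apache.spark.sql.SparkSession

// reduceByKey is a wide dependency: stage 1 writes partitioned map output, stage 2
// fetches and reduces it. Losing that map output forces stage 1 to be recomputed.
val spark = SparkSession.builder().appName("shuffle-demo").getOrCreate()
val counts = spark.sparkContext
  .textFile("gs://example-bucket/logs/*.txt")  // placeholder input path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                          // shuffle boundary: stage 1 -> stage 2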
How can we fix this?
Make intermediate shuffle data external to both the executor and the machine itself
Where we started
class HcfsShuffleWriter[K, V, C] extends ShuffleWriter[K, V] {
  override def write(records: Iterator[Product2[K, V]]): Unit = {
    // Sort/aggregate exactly as the built-in writer does (C with map-side combine, V without).
    val sorter = new ExternalSorter[K, V, C/V](...)
    sorter.insertAll(records)
    val partitionIter = sorter.partitionedIter
    // Open an output stream against the Hadoop-compatible file system (HCFS) target.
    val hcfsStream = …
    val countingStream = new CountingOutputStream(hcfsStream)
    val framedOutput = new FramingOutputStream(countingStream)
    try {
      for ((partition, iter) <- partitionIter) {
        // Write partition to external storage
      }
    } finally {
      framedOutput.closeUnderlying()
    }
  }
}
Alpha: HDFS not quite ready for prime time
● RPC overhead to HDFS or persistent storage
● Especially poor performance with misaligned partition/block sizes
  ○ HDFS, GCS, etc. have different block size expectations
● Loss of the implicit in-memory page cache
● Possible slowness in cleaning up shuffle files
● Namenode contention when reading shuffle files (HDFS)
  ○ Added an index caching layer to mitigate this (see the sketch after this list)
● Additional metadata tracking
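As a rough illustration of that index caching layer (the cache shape and loader signature here are assumptions, not the actual implementation), the idea is to memoize each shuffle index so it is read from the distributed file system only once:

import java.util.concurrent.ConcurrentHashMap

// Hypothetical index cache: remember per-(shuffleId, mapId) partition offsets so each
// index file is fetched from HDFS/GCS once, easing namenode contention on reads.
class ShuffleIndexCache(loadFromDfs: (Int, Int) => Array[Long]) {
  private val cache = new ConcurrentHashMap[(Int, Int), Array[Long]]()

  def partitionOffsets(shuffleId: Int, mapId: Int): Array[Long] =
    cache.computeIfAbsent((shuffleId, mapId), key => loadFromDfs(key._1, key._2))
}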
Object Storage?
Apache Crail (Incubating) is a high-performance distributed data store designed for fast sharing of ephemeral data in distributed data processing workloads:
● Fast
● Heterogeneous
● Modular
What about Google Cloud Bigtable?
A consistent, low-latency, high-throughput, and scalable wide-column database service.
Back to basics - NFS
● Shuffle to Elastifile
  ○ Cloud-based NFS service (scales horizontally)
  ○ Tailored to random access patterns and small files
  ○ NFS looks like a local FS, but is not: be careful with commit semantics and speculative execution (see the sketch after this list)
● Still a performance hit, but factors better than HDFS
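To make the commit-semantics caveat concrete, here is a minimal sketch of one common pattern for keeping speculative task attempts from corrupting shuffle output on an NFS-like file system: each attempt writes an attempt-private temporary file and then renames it into place. The names and layout are assumptions, not the actual Dataproc/Elastifile code.

import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical commit helper: write each attempt to its own temp file, then rename
// into the final location, so readers only ever observe complete files.
object NfsShuffleCommit {
  def commit(shuffleDir: Path, mapId: Int, attemptId: Long, bytes: Array[Byte]): Unit = {
    val finalPath = shuffleDir.resolve(s"shuffle_$mapId.data")
    val tmpPath   = shuffleDir.resolve(s"shuffle_$mapId.attempt_$attemptId.tmp")
    Files.write(tmpPath, bytes)                  // attempt-private temp file
    if (Files.exists(finalPath)) {
      Files.deleteIfExists(tmpPath)              // another (speculative) attempt already committed
    } else {
      // Rename is atomic on POSIX-style file systems; NFS needs extra care here, which
      // is exactly the commit-semantics caveat called out above.
      Files.move(tmpPath, finalPath, StandardCopyOption.ATOMIC_MOVE)
    }
  }
}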
Goal: OSS Disaggregated Shuffle Architecture
Diagram: a Kubernetes cluster running the Spark driver pod and executors with a shuffle offload component (WIP), writing shuffle data to an Elastifile virtual machine group and on to cloud (object) storage.
Use the cloud to fix the cloud?
DON’T FORGET TO RATE AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT