The document discusses Google's Cloud Dataproc, a fully-managed service for Apache Spark and Hadoop, detailing its integration with various Google Cloud products and the management of cluster resources through autoscaling. It addresses the complexities and challenges of optimizing Spark jobs, particularly focusing on shuffle processing, data management, and the transition to Kubernetes for enhanced control and performance. Additionally, it highlights advancements in external shuffling mechanisms to improve efficiency and reduce the impact of scaling operations.
Who are we, and what is Cloud Dataproc?
Google Cloud Platform’s fully-managed Apache Spark and Apache Hadoop service.
Rapid cluster creation
Familiar open source tools
Customizable machines
Ephemeral clusters on demand
Tightly integrated with other Google Cloud Platform services
Cloud Dataproc: open source solutions with GCP
Taking the best of open source, and opening up access to the best of GCP.
Diagram: open source components (e.g. WebHCat) on Cloud Dataproc connect to GCP services including BigQuery, Cloud Datastore, Cloud Bigtable, Compute Engine, Kubernetes Engine, Cloud Dataflow, Cloud Functions, Cloud Machine Learning Engine, Cloud Pub/Sub, Key Management Service, Cloud Spanner, Cloud SQL, BigQuery Transfer Service, Cloud Translation API, Cloud Vision API, and Cloud Storage.
Dataproc Autoscaling (GA)
Jobs are “fire and forget”: no need to manually intervene when a cluster is over or under capacity.
Choose the balance between standard and preemptible workers.
Save resources (quota and cost) at any point in time.
Without autoscaling: submit job, monitor resource usage, adjust cluster size.
With autoscaling: just submit jobs.
Complicating Spark downscaling
Autoscaling policies: fine-grained control
Scaling is based on the difference between YARN pending and available memory:
If more memory is needed, scale up.
If there is excess memory, scale down.
Obey VM limits and scale based on the scale factor.
Decision flow: if there is neither too much nor too little YARN memory, do nothing; if the cluster is already at the maximum number of nodes, do not autoscale; otherwise, determine the type and number of nodes to modify and autoscale the cluster.
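As a rough illustration of that flow, here is a minimal Scala sketch of the decision. The names (YarnMetrics, ScalingPolicy, recommendDelta) are invented for this example; the actual Dataproc autoscaler is configured through autoscaling policies, not through code like this.

// Hypothetical sketch only: these types and the function are invented for illustration.
case class YarnMetrics(pendingMemoryMb: Long, availableMemoryMb: Long)

case class ScalingPolicy(
    memoryPerWorkerMb: Long, // YARN memory contributed by one worker VM
    scaleUpFactor: Double,   // fraction of the memory shortfall to add
    scaleDownFactor: Double, // fraction of the memory surplus to remove
    minWorkers: Int,
    maxWorkers: Int)

// Returns the recommended change in worker count (positive = scale up, negative = scale down).
def recommendDelta(metrics: YarnMetrics, currentWorkers: Int, policy: ScalingPolicy): Int = {
  val gapMb = metrics.pendingMemoryMb - metrics.availableMemoryMb
  val rawDelta =
    if (gapMb > 0)      // more memory is needed: scale up
      math.ceil(gapMb * policy.scaleUpFactor / policy.memoryPerWorkerMb).toInt
    else if (gapMb < 0) // excess memory: scale down
      -math.ceil(-gapMb * policy.scaleDownFactor / policy.memoryPerWorkerMb).toInt
    else 0              // balanced: do nothing
  // Obey the configured VM limits before acting.
  val target = math.min(math.max(currentWorkers + rawDelta, policy.minWorkers), policy.maxWorkers)
  target - currentWorkers
}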
YARN-based managed Spark
Architecture diagram: a Dataproc cluster runs the Dataproc image (Apache Spark, Apache Hadoop, Apache Hive, ...) on Compute Engine nodes, with HDFS on Persistent Disk and a cluster bucket in Cloud Storage. Clients reach the cluster through the Cloud Dataproc API (clusters, jobs) or directly over SSH; the Dataproc agent runs on the cluster, and user data lives in Cloud Storage.
YARN pain points
Management is difficult: clusters are complicated and include more components than a given job or model requires, and running them calls for hard-to-find experts.
Complicated OSS software stack: version and dependency management is hard, and you have to understand how to tune multiple components for efficiency.
Isolation is hard: I have to think about my jobs to size clusters, and isolating jobs requires additional steps.
Multiple k8s options
Moving the OSS ecosystem to Kubernetes offers customers a range of options depending on their needs and core expertise.
Comparison (DIY k8s | Dataproc on k8s | Dataproc on k8s + vendor components):
Runs OSS on k8s?: Yes, self-managed | Yes, managed k8s clusters | Yes, managed k8s clusters
SLAs: GKE only | Dataproc cluster | Dataproc cluster and component
OSS components: Community only | Google optimized | Google optimized + vendor optimized
In-depth component support: No | No | Yes
Integrated management: No | Yes | Yes
Integrated security: No | Yes | Yes
Hybrid/cross-cloud support: No | Yes | Yes
How we are making this happen
• Kubernetes Operators: an application control plane for complex applications
– The language of Kubernetes allows extending its vocabulary through Custom Resource Definitions (CRDs)
– A Kubernetes Operator is an app-specific control plane running in the cluster
• CRD: app-specific vocabulary
• CR: instance of a CRD
• CR Controller: interpreter and reconciliation loop for CRs
– The cluster can now speak the app-specific words through the Kubernetes API
Diagram: the Kubernetes control plane (master) exposes the standard Kubernetes API plus the MyApp API; the MyApp control plane runs on the data plane (nodes) and handles CRUD operations on MyApp resources.
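To make the CR/controller split concrete, here is a minimal, hypothetical Scala sketch of a reconciliation loop. The names (MyAppSpec, MyAppStatus, observe, converge) are invented for illustration; a real operator watches the Kubernetes API through a client library instead of stubs like these.

// Hypothetical sketch: names are invented; a real operator uses a Kubernetes client library.
case class MyAppSpec(replicas: Int)        // desired state, declared in the custom resource (CR)
case class MyAppStatus(readyReplicas: Int) // observed state, reported back on the CR

// Placeholder for reading the actual cluster state via the Kubernetes API.
def observe(): MyAppStatus = MyAppStatus(readyReplicas = 0)

// Placeholder for creating/updating/deleting resources to match the spec.
def converge(spec: MyAppSpec, status: MyAppStatus): Unit =
  println(s"reconciling: want ${spec.replicas} replicas, have ${status.readyReplicas}")

// The controller's reconciliation loop: compare desired vs. observed state and act on any drift.
def reconcile(spec: MyAppSpec, iterations: Int): Unit =
  for (_ <- 1 to iterations) {
    val status = observe()
    if (status.readyReplicas != spec.replicas) converge(spec, status)
  }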
● Integrates with BigQuery, Google’s serverless data warehouse
● Provides Google Cloud Storage as a replacement for HDFS
● Ships logs to Stackdriver Monitoring
○ via a Prometheus server with the Stackdriver sidecar
● Contains sparkctl, a command-line tool that simplifies handling client-local application dependencies in a Kubernetes environment
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
Key benefits for autoscaling
1. Deploy unified resource management
Get away from two separate cluster management interfaces for managing open source components; one central view makes management easy.
2. Isolate Spark jobs and resources
Remove the headaches of version and dependency management; instead, move models and ETL pipelines from dev to production without added work.
3. Build resilient infrastructure
Don’t worry about sizing and building clusters, manipulating Dockerfiles, or messing around with Kubernetes networking configurations. It just works.
A Brief History of Spark Shuffle
● Shuffle files were written to local storage on the executors
● Executors were responsible for serving the files
● Loss of an executor meant loss of its shuffle files
● Result: poor auto-scaling
○ Pathological loop: scale down, lose work, re-compute, trigger scale up…
● Depended on a driver GC event to clean up shuffle files
Today: dynamic allocation and “external” shuffle
● Executors no longer need to serve data
● “External” shuffle is not exactly external
○ Only executors can be released
○ Can scale executors up and down, but not the machines
● Still depends on a driver GC event to clean up shuffle files
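For reference, a minimal sketch of turning this mode on in Spark, using the standard dynamic-allocation and external-shuffle-service settings; the min/max executor values here are illustrative.

import org.apache.spark.sql.SparkSession

// Dynamic allocation with the node-local "external" shuffle service, as described above:
// executors can be released because the shuffle service keeps serving their shuffle files.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-example")
  .config("spark.shuffle.service.enabled", "true")      // external shuffle service on each node
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "1")  // illustrative bounds
  .config("spark.dynamicAllocation.maxExecutors", "50")
  .getOrCreate()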
Continued..
/**
 * Obtained inside a map task to write out records to the shuffle system.
 */
private[spark] abstract class ShuffleWriter[K, V] {
  /** Write a sequence of records to this task's output */
  @throws[IOException]
  def write(records: Iterator[Product2[K, V]]): Unit

  /** Close this writer, passing along whether the map completed */
  def stop(success: Boolean): Option[MapStatus]
}
Continued..
/** Write a bunch of records to this task's output */
override def write(records: Iterator[Product2[K, V]]): Unit = {
  sorter = if (dep.mapSideCombine) {
    new ExternalSorter[K, V, C](
      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
  } else {
    // In this case we pass neither an aggregator nor an ordering to the sorter, because we
    // don't care whether the keys get sorted in each partition; that will be done on the
    // reduce side if the operation being run is sortByKey.
    new ExternalSorter[K, V, V](
      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
  }
  sorter.insertAll(records)
  ...
Continued..
// Don't bother including the time to open the merged output file in the shuffle write time,
// because it just opens a single file, so is typically too fast to measure accurately
// (see SPARK-3570).
val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
val tmp = Utils.tempFileWith(output)
try {
  val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
  val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
  shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
  mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
} finally {
  if (tmp.exists() && !tmp.delete()) {
    logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
  }
}
}
Continued..
// Note: Changes to the format in this file should be kept in sync with
// org.apache.spark.network.shuffle.ExternalShuffleBlockResolver#getSortBasedShuffleBlockData().
private[spark] class IndexShuffleBlockResolver(
    conf: SparkConf,
    _blockManager: BlockManager = null)
  extends ShuffleBlockResolver
  ...
Problems with this
● Rapid downscaling is infeasible
○ Scaling down entire nodes is hard
● Preemptible VMs & Spot Instances
Preemptible VMs and Spot instances
PVMs are up to 80% cheaper for short-lived instances. They can be pulled at any time and are always terminated within 24 hours. Spot pricing is based on a Vickrey auction.
How can we fix this?
Make intermediate shuffle data external to both the executor and the machine itself.
Where we started
class HcfsShuffleWriter[K, V, C] extends ShuffleWriter[K, V] {
  override def write(records: Iterator[Product2[K, V]]): Unit = {
    // ExternalSorter is [K, V, C] with a map-side combine, [K, V, V] without.
    val sorter = new ExternalSorter[K, V, C](...)
    sorter.insertAll(records)
    val partitionIter = sorter.partitionedIter
    val hcfsStream = ...
    val countingStream = new CountingOutputStream(hcfsStream)
    val framedOutput = new FramingOutputStream(countingStream)
    try {
      for ((partition, iter) <- partitionIter) {
        // Write partition to external storage
      }
    } finally {
      framedOutput.closeUnderlying()
    }
  }
}
Alpha: HDFS not quite ready for prime time
● RPC overhead to HDFS or persistent storage
● Especially poor performance with misaligned partition/block sizes
○ HDFS, GCS, etc. have different expectations of block size
● Loss of the implicit in-memory page cache
● Possible slowness in cleaning up shuffle files
● Namenode contention when reading shuffle files (HDFS)
○ Added an index caching layer to mitigate this
● Additional metadata tracking
Apache Crail (Incubating) is a high-performance distributed data store designed for fast sharing of ephemeral data in distributed data processing workloads.
● Fast
● Heterogeneous
● Modular
What about Google Cloud Bigtable?
A scalable, wide-column database service with consistently low latency and high throughput.
Back to basics: NFS
● Shuffle to Elastifile
○ Cloud-based NFS service (scales horizontally)
○ Tailored to random access patterns and small files
○ NFS looks like a local FS, but is not; be careful with commit semantics and speculative execution (see the sketch after this list)
● Still a performance hit, but factors better than HDFS
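As an illustration of the commit-semantics point above, here is a minimal, hypothetical Scala sketch of the write-to-temp-then-rename pattern that keeps speculative task attempts from exposing partial output. The function and paths are invented for this example, and whether the final rename is truly atomic depends on the NFS server and mount.

import java.io.OutputStream
import java.nio.file.{Files, Paths, StandardCopyOption}
import java.util.UUID

// Hypothetical sketch: write a shuffle block to a temp file, then rename it into place so
// readers (and duplicate speculative attempts) only ever see fully written blocks.
def commitBlock(dir: String, blockId: String)(writeData: OutputStream => Unit): Unit = {
  val tmp = Paths.get(dir, s"$blockId.${UUID.randomUUID()}.tmp")
  val out = Files.newOutputStream(tmp)
  try writeData(out) finally out.close()
  // The move makes the block visible in one step; atomicity depends on the filesystem/mount.
  Files.move(tmp, Paths.get(dir, blockId), StandardCopyOption.ATOMIC_MOVE)
}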
Goal: OSS Disaggregated Shuffle
Architecture diagram: a Kubernetes cluster with the Spark driver pod and executors plus a shuffle offload component (WIP), a virtual machine group running Elastifile, and cloud (object) storage.