Large Scale Learning with Apache Spark
Sandy Ryza, Data Science, Cloudera
Me
● Data scientist at Cloudera
● Recently led Apache Spark development at Cloudera
● Before that, committing on Apache Hadoop
● Before that, studying combinatorial optimization and distributed systems at Brown
Large Scale Learning
Sometimes you find yourself with lots of stuff:
Network Packets → Detect Network Intrusions
Credit Card Transactions → Detect Fraud
Movie Viewings → Recommend Movies
Two Main Problems
● Designing a system for processing huge data in parallel
● Taking advantage of it with algorithms that work well in parallel
System Requirements
● Scalability
● Programming model that abstracts away distributed ugliness
● Data-scientist friendly
  ○ High-level operators
  ○ Interactive shell (REPL)
● Efficiency for iterative algorithms
MapReduce
[Diagram: many parallel map tasks feeding into a smaller set of reduce tasks]
Key advances by MapReduce:
• Data locality: automatic split computation and launching of mappers close to the data they read
• Fault tolerance: writing out intermediate results and restartable mappers made it possible to run on commodity hardware
• Linear scalability: the combination of locality and a programming model that forces developers to write generally scalable solutions to problems
Spark: Easy and Fast Big Data
• Easy to develop
  • Rich APIs in Java, Scala, Python
  • Interactive shell
• Fast to run
  • General execution graphs
  • In-memory storage
2-5× less code; up to 10× faster on disk, 100× in memory
What is Spark?
Spark is a general-purpose computation framework geared towards massive data - more flexible than MapReduce.
Extra properties:
• Leverages distributed memory
• Full directed-graph expressions for data-parallel computations
• Improved developer experience
Yet retains: linear scalability, fault tolerance, and data locality
Spark introduces the concept of the RDD to take advantage of memory
RDD = Resilient Distributed Dataset
• Defined by parallel transformations on data in stable storage
RDDs
bigfile.txt → lines
val lines = sc.textFile("bigfile.txt")
RDDs
bigfile.txt → lines → numbers
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
RDDs
bigfile.txt → lines → numbers → sum
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
numbers.sum()
RDDs
[Diagram: bigfile.txt in HDFS → lines → numbers, each RDD split into partitions; the sum is returned to the driver]
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
numbers.sum()
Shuffle
[Diagram: as above, but sorting requires shuffling data across partitions before the sum is returned to the driver]
val lines = sc.textFile("bigfile.txt")
val sorted = lines.sort()
sorted.sum()
Persistence and Fault Tolerance
• User decides whether and how to persist
  • Disk
  • Memory
  • Transient (recomputed on each use)
Observation: provides fault tolerance through the concept of lineage
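A minimal sketch of these options (not from the deck; the RDD names and file are illustrative, and cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)):

import org.apache.spark.storage.StorageLevel

val numbers = sc.textFile("bigfile.txt").map(_.toDouble)

// Memory: keep the computed partitions around for reuse
numbers.cache()

// Disk: a different RDD persisted to local disk instead
val positives = numbers.filter(_ > 0).persist(StorageLevel.DISK_ONLY)

// Transient: no persist() call at all, so this RDD is recomputed
// from its lineage every time an action touches it
val squares = numbers.map(x => x * x)
squares.sum()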
Lineage
• Reconstruct partitions that go down using the original steps we used to create them
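A quick, illustrative way to see the lineage Spark would replay to rebuild a lost partition (a small sketch, not from the deck):

val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)

// toDebugString prints the chain of transformations (the lineage)
// that Spark uses to recompute any partition that goes down
println(numbers.toDebugString)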
RDDs
[Diagram: bigfile.txt in HDFS → lines → numbers, split into partitions; the sum is returned to the driver]
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toInt)
numbers.cache().sum()
numbers.sum()
[Diagram: bigfile.txt → lines → numbers (cached partitions) → sum at the driver]
Easy
• Multi-language support
• Interactive shell

Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
Out of the Box Functionality
• Hadoop Integration
  • Works with Hadoop Data
  • Runs under YARN
• Libraries
  • MLlib
  • Spark Streaming
  • GraphX (alpha)
• Roadmap
  • Language support:
    • Improved Python support
    • SparkR
    • Java 8
  • Schema support in Spark’s APIs
  • Better ML:
    • Sparse Data Support
    • Model Evaluation Framework
    • Performance Testing
So back to ML
Of the out-of-the-box libraries, MLlib is the focus for the rest of this talk.
Spark MLlib
Supervised
  Discrete - Classification
    ● Logistic regression (and regularized variants)
    ● Linear SVM
    ● Naive Bayes
    ● Random decision forests (soon)
  Continuous - Regression
    ● Linear regression (and regularized variants)
Unsupervised
  Discrete - Clustering
    ● K-means
  Continuous - Dimensionality reduction, matrix factorization
    ● Principal component analysis / singular value decomposition
    ● Alternating least squares
Why Cluster Big Data?
● Learn the structure of your data
● Interpret new data as it relates to this structure
Anomaly Detection
● Anomalies as data points far away from any cluster
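A hedged sketch of that idea (not from the deck): score each point by its distance to the nearest K-means center and flag the farthest ones. It assumes the Array[Double]-based MLlib API used on the "Using it" slide; the file name, k, iteration count, and threshold are all illustrative.

import org.apache.spark.mllib.clustering.KMeans

// Parse whitespace-separated numeric features (file name is illustrative)
val points = sc.textFile("events.txt").map(_.split(' ').map(_.toDouble))
val model = KMeans.train(points, 5, 20)

// Euclidean distance helper
def distance(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// A point far from every cluster center is an anomaly candidate
// (the threshold here is purely illustrative)
val threshold = 10.0
val anomalies = points.filter { p =>
  model.clusterCenters.map(c => distance(p, c)).min > threshold
}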
Feature Learning
[Image slides: clustering image patches to learn image patch features]
Train a classifier on each cluster
Using it
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(_.split(' ').map(_.toDouble))

// Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)
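A hedged follow-up (not on the slide) showing what the trained model can be used for, assuming the same Array[Double]-based MLlib API as above:

// Assign a new point to its nearest cluster
// (the point's dimensionality must match the training data)
val clusterId = clusters.predict(Array(0.0, 0.0, 0.0))

// Within-set sum of squared errors - a rough way to compare
// different runs or different values of k
val wssse = clusters.computeCost(parsedData)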
K-Means
● Alternate between two steps:
  o Assign each point to a cluster based on existing centers
  o Recompute cluster centers from the points in each cluster
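A minimal, non-distributed sketch of that alternation (illustrative only; MLlib's real parallel implementation follows on the next slides):

def kmeansStep(points: Seq[Array[Double]],
               centers: Seq[Array[Double]]): Seq[Array[Double]] = {
  def dist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Step 1: assign each point to the closest existing center
  val assigned = points.groupBy(p => centers.indices.minBy(i => dist(p, centers(i))))

  // Step 2: recompute each center as the mean of its assigned points
  // (a center with no points keeps its old position)
  centers.indices.map { i =>
    val members = assigned.getOrElse(i, Seq(centers(i)))
    val dims = centers(i).length
    Array.tabulate(dims)(d => members.map(_(d)).sum / members.length)
  }
}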
K-Means - very parallelizable
● Alternate between two steps:
  o Assign each point to a cluster based on existing centers
    ▪ Process each data point independently
  o Recompute cluster centers from the points in each cluster
    ▪ Average across partitions
// Find the sum and count of points mapping to each center
val totalContribs = data.mapPartitions { points =>
  val k = centers.length
  val dims = centers(0).vector.length
  val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
  val counts = Array.fill(k)(0L)
  points.foreach { point =>
    val (bestCenter, cost) = KMeans.findClosest(centers, point)
    costAccum += cost
    sums(bestCenter) += point.vector
    counts(bestCenter) += 1
  }
  val contribs = for (j <- 0 until k) yield {
    (j, (sums(j), counts(j)))
  }
  contribs.iterator
}.reduceByKey(mergeContribs).collectAsMap()
// Update the cluster centers and costs
var changed = false
var j = 0
while (j < k) {
  val (sum, count) = totalContribs(j)
  if (count != 0) {
    sum /= count.toDouble
    val newCenter = new BreezeVectorWithNorm(sum)
    if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
      changed = true
    }
    centers(j) = newCenter
  }
  j += 1
}
if (!changed) {
  logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
}
cost = costAccum.value
The Problem
● K-Means is very sensitive to the initial set of centers chosen.
● The best existing algorithm for choosing centers is highly sequential.
K-Means++
● Start with a random point from the dataset
● Pick another one randomly, with probability proportional to its squared distance from the closest center already chosen
● Repeat until k initial centers are chosen
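A small, local sketch of that seeding step (illustrative, not MLlib's code), using the squared-distance weighting described above:

import scala.util.Random

def kmeansPlusPlusInit(points: Seq[Array[Double]], k: Int, rand: Random): Seq[Array[Double]] = {
  def dist2(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // First center: a uniformly random point from the dataset
  val centers = scala.collection.mutable.ArrayBuffer(points(rand.nextInt(points.length)))
  while (centers.length < k) {
    // Weight each point by its squared distance to the closest chosen center
    val weights = points.map(p => centers.map(c => dist2(p, c)).min)
    val total = weights.sum
    // Sample the next center with probability proportional to its weight
    var r = rand.nextDouble() * total
    var i = 0
    while (i < points.length - 1 && r > weights(i)) {
      r -= weights(i)
      i += 1
    }
    centers += points(i)
  }
  centers.toSeq
}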
K-Means++
● The initial clustering has an expected cost within an O(log k) factor of the optimum
K-Means++
● Requires k passes over the data
K-Means||
● Do only a few (~5) passes
● Sample m points on each pass
● Oversample
● Run K-Means++ on the sampled points to find initial centers
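A simplified, local sketch of the oversampling passes (illustrative only; rounds and l are assumed parameters, and the final step of weighting the candidates and reducing them to k centers with K-Means++ is only noted in a comment):

import scala.util.Random

def kmeansParallelCandidates(points: Seq[Array[Double]],
                             rounds: Int, l: Int, rand: Random): Seq[Array[Double]] = {
  def dist2(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  var candidates = Seq(points(rand.nextInt(points.length)))
  for (_ <- 1 to rounds) {
    // Cost of each point = squared distance to its closest candidate so far
    val costs = points.map(p => candidates.map(c => dist2(p, c)).min)
    val total = costs.sum
    // Keep each point independently with probability ~ l * cost / total,
    // so each pass oversamples roughly l new candidates
    val sampled = points.zip(costs).collect {
      case (p, c) if rand.nextDouble() < l * c / total => p
    }
    candidates = candidates ++ sampled
  }
  // K-Means|| then weights each candidate by the number of points closest to
  // it and runs K-Means++ on that small weighted set to pick the k centers
  candidates
}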
Then on the real data...