Apache Spark MLlib - Random Forest and Decision Trees
Austin SIGKDD Spark talk
Random Forest and Decision Trees in
Spark MLlib
Tuhin Mahmud
Sep 20th, 2017
About Me
Tuhin Mahmud
Advisory Software Engineer, IBM
https://www.linkedin.com/in/tuhinmahmud/
Agenda
● Spark and MLlib Overview
● Decision Trees in Spark MLlib
● Random Forest in Spark MLlib
● Demo
What is MLlib?
● MLlib is Spark’s machine learning (ML) library.
● Its goal is to make practical machine learning scalable and easy.
● ML algorithms include classification, regression, decision trees and random forests, recommendation, clustering, topic modeling (LDA), distributed linear algebra (SVD, PCA), and many more.
https://spark.apache.org/docs/latest/ml-guide.html
Spark MLlib Trajectory
https://www.slideshare.net/databricks/apache-spark-mllib-20-preview-data-science-and-production
Spark MLlib
● Initially developed by the MLbase team in AMPLab (UC Berkeley) - supported only Scala and Java
● Shipped with Spark v0.8 (Sep 2013)
● Current release: Spark 2.2.0 (Jul 11, 2017)
● MLlib 2.2:
○ DataFrame-based API becomes the primary API for MLlib
■ org.apache.spark.ml and pyspark.ml
○ Ease of use
■ Java, Scala, Python, R
■ Interoperates with NumPy in Python
■ Works with any Hadoop data source (e.g. HDFS, HBase, or local files), making it easy to plug into Hadoop workflows.
Spark MLlib in production
● ML persistence
○ Saving and loading models and pipelines
● Pipelines
○ MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. The key concepts introduced by the Pipelines API are mostly inspired by the scikit-learn project.
Some concepts[5]
● DataFrame
● Transformer - transforms one DataFrame into another
● Estimator - fits on a DataFrame to produce a Transformer
● Pipeline - chains multiple Transformers and Estimators together to specify an ML workflow (see the sketch below)
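A minimal sketch of how these pieces fit together in PySpark, assuming a SparkSession and a DataFrame df with a string "label" column and a vector "features" column are already available (the column names here are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import DecisionTreeClassifier

# StringIndexer is an Estimator: fit() on a DataFrame produces a Transformer.
indexer = StringIndexer(inputCol="label", outputCol="indexed")

# DecisionTreeClassifier is also an Estimator; fit() produces a classification model.
dt = DecisionTreeClassifier(labelCol="indexed", featuresCol="features")

# A Pipeline chains the stages together and is itself an Estimator.
pipeline = Pipeline(stages=[indexer, dt])

# Fitting runs each stage in order and returns a PipelineModel (a Transformer).
model = pipeline.fit(df)
predictions = model.transform(df)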
MLlib 2.2.0 API
1. MLlib: RDD-based API (maintenance mode)
2. MLlib: DataFrame-based API
Why is MLlib switching to the DataFrame-based API?
● DataFrames provide a more user-friendly API than RDDs.
The many benefits of DataFrames include Spark Data Sources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
● The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
● DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.
What is “Spark ML”?
● “Spark ML” is not an official name but occasionally used to refer to the MLlib DataFrame-based API.
https://spark.apache.org/docs/2.2.0/ml-guide.html
Agenda
● Spark and MLlib Overview
● Decision Tree in Spark MLlib
● Random Forest in Spark MLlib
● Demo
Decision Tree
http://www.saedsayad.com/decision_tree.htm
Decision Tree Applied
Machine Learning With Random Forests And Decision Trees: A Visual Guide For Beginners - by Scott Hartshorn
Decision Tree packages in MLlib
1. from pyspark.mllib.tree import DecisionTree, DecisionTreeModel (RDD-based)
2. from pyspark.ml.classification import DecisionTreeClassifier (DataFrame-based)
https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#decision-trees
DecisionTree Classifier (MLlib Dataframe based API)
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> from pyspark.ml.classification import DecisionTreeClassifier
>>> df = spark.createDataFrame([
... (1.0, Vectors.dense(1.0)),
... (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)
>>> dt = DecisionTreeClassifier(maxDepth=2, labelCol="indexed")
>>> model = dt.fit(td)
>>> model.numNodes
https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassifier
Decision Tree
Hyper-parameters for decision trees in MLlib (see the sketch after this list):
● numClasses: How many classes are we trying to classify?
● categoricalFeaturesInfo: A specification declaring which features are categorical and should not be treated as numbers
● impurity: A measure of the homogeneity of the labels at the node. Currently in Spark, there are two measures of impurity with respect to classification: Gini and Entropy
● maxDepth: A stopping criterion which limits the depth of constructed trees. Generally, deeper trees lead to more accurate results but run the risk of overfitting.
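As a rough sketch of how these hyper-parameters are passed to the RDD-based API (the training RDD of LabeledPoints, here called data, and the categorical feature map are assumed for illustration):

from pyspark.mllib.tree import DecisionTree

# data: RDD[LabeledPoint], assumed to be prepared already.
model = DecisionTree.trainClassifier(
    data,
    numClasses=2,                    # binary classification
    categoricalFeaturesInfo={0: 3},  # e.g. feature 0 is categorical with 3 categories
    impurity="gini",                 # or "entropy"
    maxDepth=5)                      # stop growing trees beyond depth 5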
Decision Trees are prone to Overfitting
Reference: Machine Learning - Decision Trees and Random Forests - by Loonycorn
Spark MLlib -Decision Tree Example Notebook
1. Notebook using the small golf play dataset
http://nbviewer.jupyter.org/github/tuhinmahmud/sigkdd_austin/blob/master/SparkMllibPyspark.golf.ipynb
image:http://i2.cdn.turner.com/dr/pga/sites/default/files/articles/bro-hof-17th-bubba-072411-640x360.jpg?1311698162
Agenda
● Spark and MLlib Overview
● Decision Tree in Spark MLlib
● Random Forest in Spark MLlib
● Demo
Random Forest
Reference: Machine Learning - Decision Trees and Random Forests - by Loonycorn
Random Forest
The name “Random Forest” comes from combining the randomness used to pick subsets of the data with an ensemble of decision trees.
A random forest (RF) is a collection of tree predictors
f(x, T, Θk), k = 1, …, K
where the Θk are i.i.d. random vectors.
The trees are combined by (see the toy sketch below)
● voting (for classification)
● averaging (for regression).
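A toy illustration of the two combination rules (plain Python, not Spark; the per-tree predictions are made up for illustration):

from collections import Counter

tree_class_preds = [1.0, 0.0, 1.0, 1.0, 0.0]       # per-tree class predictions
tree_regression_preds = [2.3, 2.9, 2.5, 3.1, 2.7]  # per-tree numeric predictions

# Classification: majority vote across trees.
forest_class = Counter(tree_class_preds).most_common(1)[0][0]            # -> 1.0

# Regression: average across trees.
forest_value = sum(tree_regression_preds) / len(tree_regression_preds)   # -> 2.7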
Random Forest
http://www.math.usu.edu/adele/RandomForests/ENAR.pdf
Random Forest - Out of bag Error
Out-of-bag Error Estimate
● Average over the cases within each class to get a classwise out-of-bag error
rate.
● Average over all cases to get an overall out-of-bag error rate.
In random forests, there is no need for cross-validation to get an unbiased estimate of the test set error. It is estimated internally, during the run.
RandomForest in MLlib
DataFrame-based API
class pyspark.ml.classification.RandomForestClassifier(self, featuresCol="features", labelCol="label", predictionCol="prediction",
probabilityCol="probability", rawPredictionCol="rawPrediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0,
maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity="gini", numTrees=20, featureSubsetStrategy="auto",
seed=None, subsamplingRate=1.0)
>>> import numpy
>>> from numpy import allclose
>>> from pyspark.ml.classification import RandomForestClassifier
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> df = spark.createDataFrame([
... (1.0, Vectors.dense(1.0)),
... (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)
>>> rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
>>> model = rf.fit(td)
https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#random-forest-classifier
Random Forest - parameters
RandomForest - some parameters (from the older RDD-based API, but most also apply to the DataFrame-based API)
● numTrees: Number of trees in the resulting forest. Increasing the number of trees decreases model variance.
● featureSubsetStrategy: Specifies how many of the features are considered as split candidates when training a single tree.
● seed: Seed for random number generator initialization, since RandomForest depends on random selection of features and rows.
Random Forest - parameters
Spark also provides additional parameters to stop tree growing and produce fine-grained trees (see the sketch below):
● minInstancesPerNode: A node is not split further if the split would produce a left or right child containing fewer observations than this value. The default is 1, but for regression problems or large trees the value should typically be higher.
● minInfoGain: Minimum information gain a split must achieve. Default value is 0.0.
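A hedged sketch of setting these parameters through the DataFrame-based API (the training DataFrame td with an "indexed" label column is assumed, as in the earlier example; the specific values are illustrative):

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(
    labelCol="indexed",
    numTrees=50,                    # more trees -> lower variance, more compute
    featureSubsetStrategy="sqrt",   # consider sqrt(#features) at each split
    minInstancesPerNode=5,          # do not create children with fewer than 5 rows
    minInfoGain=0.01,               # require at least this much gain per split
    seed=42)
model = rf.fit(td)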
MLlib - Labeled point vector (RDD based)
Labeled point vector
● Prior to running any supervised machine learning algorithm using Spark MLlib, we must convert our dataset
into a labeled point vector.
○ val higgs = response.zip(features).map {
      case (response, features) => LabeledPoint(response, features)
    }
    higgs.setName("higgs").cache()
● An example of a labeled point vector follows:
(1.0, [0.123, 0.456, 0.567, 0.678, ..., 0.789])
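The same idea in the Python RDD-based API (a minimal sketch; sc is an existing SparkContext and the feature values are made up):

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Each training example pairs a label with a feature vector.
training = sc.parallelize([
    LabeledPoint(1.0, Vectors.dense([0.123, 0.456, 0.567, 0.678, 0.789])),
    LabeledPoint(0.0, Vectors.dense([0.1, 0.2, 0.3, 0.4, 0.5])),
])
training.setName("training").cache()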
MLlib - data caching
Data caching
Many machine learning algorithms are iterative in nature and thus require multiple passes over the data.
Spark provides a way to persist the data in case we need to iterate over it. Spark also publishes several
StorageLevels to allow storing data with various options:
● NONE: No caching at all
● MEMORY_ONLY: Caches RDD data only in memory
● DISK_ONLY: Writes cached RDD data to disk and releases it from memory
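A minimal sketch of caching an RDD with an explicit storage level (assuming an existing RDD called training, as in the previous slide):

from pyspark import StorageLevel

# Keep data in memory only; partitions that do not fit are recomputed when needed.
training.persist(StorageLevel.MEMORY_ONLY)

# Alternatively, keep cached partitions on disk instead of in memory:
# training.persist(StorageLevel.DISK_ONLY)

# Release the cached data when it is no longer needed.
training.unpersist()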
Making sense of a tragedy - titanic dataset
image:http://www.oceangate.com/images/expeditions/titanic/titanic-sinking-wikimedia-commons.jpg
● Notebook using the Titanic dataset and the MLlib Spark DataFrame-based APIs for Decision Tree and Random Forest
http://nbviewer.jupyter.org/github/tuhinmahmud/sigkdd_austin/blob/master/SparkMlLibTitanicNewDFbasedAPI.ipynb
Notebooks
1. Notebook using the small golf play dataset
http://nbviewer.jupyter.org/github/tuhinmahmud/sigkdd_austin/blob/master/SparkMllibPyspark.golf.ipynb
2. Notebook using the Titanic dataset and the MLlib Spark DataFrame-based APIs for Decision Tree and Random Forest
http://nbviewer.jupyter.org/github/tuhinmahmud/sigkdd_austin/blob/master/SparkMlLibTitanicNewDFbasedAPI.ipynb
THANK YOU!
Back slide : Spark Stack
Reference
1. https://www.slideshare.net/databricks/apache-spark-mllib-20-preview-data-science-and-production
2. https://spark.apache.org/mllib/
3. https://spark.apache.org/docs/latest/ml-guide.html
4. http://www.saedsayad.com/decision_tree.htm
5. http://www.math.usu.edu/adele/RandomForests/ENAR.pdf
6. https://spark.apache.org/docs
7. https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#random-forest-classifier
8. https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassifier
