Data Science with Apache Spark - Crash Course - HS16SJ
Robert	Hryniewicz
Data	Evangelist
@RobHryniewicz
Hands-on	Intro	to	Data	Science
with	Apache	Spark
Crash Course
2 ©	Hortonworks	Inc.	2011	–2016.	All	Rights	Reserved
Plan for Today
• Data Science & ML
• ML Examples
• Overview of ML methods
• K-means, Decision Trees & Random Forests
• Spark MLlib & ML
• Lab Overview
Data	Science	Examples
Predictive Analytics Pre-requisites
Sales	Play	4:	Predictive	Analytics
Predictive Analytics Process and Tools
Machine	Learning
“… science of how
computers learn without
being explicitly
programmed” – Andrew Ng
Machine	Learning	Methods
Supervised vs Unsupervised Learning
à Supervised: examples are labeled.
à Unsupervised: examples are not labeled.
Supervised Learning vs Unsupervised Learning
CLASSIFICATION
Identifying the category to which an object belongs.
Applications:	spam	detection,	image	recognition,	...
Algorithms:	k-nn,	decision	trees,	random	forest,	...
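As a toy illustration of the first algorithm listed above, here is a minimal pure-Python sketch of k-nearest-neighbours classification. The data and the helper name are invented for this example; at scale you would use Spark MLlib instead:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    # Sort labeled points by distance to the query and keep the k nearest
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy labeled set: two well-separated groups
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
         ((3.0, 3.0), "b"), ((3.1, 3.2), "b"), ((2.9, 3.1), "b")]
print(knn_predict(train, (0.15, 0.1)))  # → a
```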
REGRESSION
Predicting	a	continuous-valued	attribute	
associated	with	an	object.
Applications:	drug	response,	stock	prices,	…
Algorithms: linear	regression,	…
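For the single-feature case, ordinary least squares has a closed form. A pure-Python sketch on invented data (Spark's LinearRegression handles the general, distributed case):

```python
def fit_line(xs, ys):
    # Closed-form OLS for y = a*x + b with a single feature
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.0, 8.1]   # roughly y = 2x
a, b = fit_line(xs, ys)     # a ≈ 2.01, b ≈ 0.0
```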
CLUSTERING
Automatic	grouping	of	similar	objects	into	sets.
Applications:	customer	segmentation,	topic	modeling,	…
Algorithms: k-means,	LDA,	…
COLLABORATIVE	FILTERING
Fill	in	the	missing	entries	of	a	user-item	association	matrix.
Applications:	Product	recommendation,	…
Algorithms: Alternating Least Squares (ALS)
DIMENSIONALITY	REDUCTION
Reducing	the	number	of	random	variables	to	consider.
Applications:	visualization,	increased	efficiency,	…
Algorithms: PCA,	t-SNE,	…
PREPROCESSING
Feature	extraction	and	normalization
Applications: transforming input data, such as text, into a form usable by ML algorithms
Algorithms:	TF-IDF,	word2vec,	one	hot	encoding,	…
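The first algorithm listed can be sketched in a few lines of plain Python (the toy documents are invented here; note that Spark's own IDF uses a smoothed variant, log((n+1)/(df+1)), rather than the textbook log(n/df) below):

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists; returns one {term: weight} dict per document
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return out

docs = [["spark", "ml", "pipeline"], ["spark", "sql"], ["random", "forest", "ml"]]
w = tf_idf(docs)
# Within doc 0, "pipeline" (1 doc) outweighs "spark" (2 docs): rarer terms score higher
```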
MODEL	SELECTION
Comparing,	validating	and	choosing	parameters	and	models.
Applications:	improved	accuracy	via	parameter	tuning
Algorithms:	grid	search,	metrics	…
Spark	MLlib
Spark	Machine	Learning	Library
à Clustering
– k-means	clustering
– latent	Dirichlet allocation	(LDA)
à Dimensionality	reduction
– singular value decomposition (SVD)
– principal	component	analysis	(PCA)
à Feature	Extractors	&	Transformers
– word2vec
à Basic	statistics
– summary	statistics
– hypothesis	testing
– random	number	generation
à Classification	and	regression
– linear	models	(SVMs,	log	&	linear	regression)
– decision	trees
– ensembles	of	trees	(Random	Forests	&	GBTs)
à Collaborative	filtering
– alternating	least	squares	(ALS)
K-Means	Clustering
(Unsupervised	Learning)
Why K-Means?
à Simple & fast algorithm for finding clusters
à Common technique for anomaly detection
à Drawbacks
– Doesn't work well with non-spherical cluster shapes
– Number of clusters and initial seed values must be specified beforehand
– Strong sensitivity to outliers and noise
– Prone to getting stuck in local optima
Initialize Cluster Centers
Randomly	pick	3	
cluster	centers.
Assign Each Point
Assign	each	point	
to	the	nearest	
cluster	center.
Recompute Cluster Centers
Move each cluster center to the mean of the points assigned to it.
K-means Clustering
San Francisco
Outline Each Neighborhood
Folium: choropleth map
SF Neighborhood Centers Calculated with K-Means
Sample Dataset – K-Means
0.0, 0.0, 0.0
0.1, 0.1, 0.1
0.2, 0.2, 0.2
3.0, 3.0, 3.0
3.1, 3.1, 3.1
3.2, 3.2, 3.2
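A minimal pure-Python k-means run on the six rows above makes the three steps (initialize, assign, recompute) concrete. For a deterministic sketch it seeds the centers with the first k points; real implementations, including Spark MLlib's (which defaults to k-means||), use smarter random initialization:

```python
import math

def kmeans(points, k, iters=20):
    # Step 1: initialize centers (first k points here; normally random / k-means++)
    centers = list(points[:k])
    for _ in range(iters):
        # Step 2: assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Step 3: move each center to the mean of its assigned points
        centers = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

points = [(0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
          (3.0, 3.0, 3.0), (3.1, 3.1, 3.1), (3.2, 3.2, 3.2)]
centers = kmeans(points, k=2)   # converges to (0.1, 0.1, 0.1) and (3.1, 3.1, 3.1)
```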
Decision	Trees	&	Random	Forests	
(Supervised	Learning)
Why	Decision	Trees?
à Simple	to	understand	and	interpret. (And	explain	to	executives.)
à Requires little data preparation. (Other techniques often require data
normalization, creation of dummy variables, and removal of blank values.)
à Performs	well	with	large	datasets.
Visual	Intro	to	Decision	Trees
à http://www.r2d3.us/visual-intro-to-machine-learning-part-1
Random Forest (Ensemble Model)
à Main	idea:	build	an	ensemble	of	simple	decision	trees
à Each	tree	is	simple	and	less	likely	to	overfit
à Classify/predict	by	voting	between	all	trees
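The three bullets above can be sketched with depth-1 "trees" (decision stumps) on a single invented feature: each stump trains on a bootstrap sample, and the forest predicts by majority vote. This is only a toy; Spark's RandomForestClassifier does the real, multi-feature version:

```python
import random
from collections import Counter
from statistics import mean

def train_stump(sample):
    # A depth-1 "tree": split at the sample mean, predict each side's majority label
    t = mean(x for x, _ in sample)
    left = Counter(y for x, y in sample if x < t)
    right = Counter(y for x, y in sample if x >= t)
    l = left.most_common(1)[0][0] if left else right.most_common(1)[0][0]
    r = right.most_common(1)[0][0] if right else l
    return lambda v: l if v < t else r

def random_forest(data, n_trees=25, seed=7):
    rng = random.Random(seed)
    # Each tree sees a bootstrap sample (drawn with replacement)
    trees = [train_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]
    # Classify by majority vote across all trees
    return lambda v: Counter(t(v) for t in trees).most_common(1)[0][0]

data = [(0.1, 0), (0.3, 0), (0.4, 0), (2.1, 1), (2.4, 1), (2.6, 1)]
predict = random_forest(data)
```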
Decision	Tree	vs	Random	Forest
Why Do Ensembles Work?
à Overcome the limitations of a single hypothesis (single Decision Tree vs Model Averaging)
Diabetes	Dataset	– Decision	Trees	/	Random	Forest
Labeled	set	with	8	Features
-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333
+1 1:-0.882353 2:-0.145729 3:0.0819672 4:-0.414141 5:-1 6:-0.207153 7:-0.766866 8:-0.666667
-1 1:-0.0588235 2:0.839196 3:0.0491803 4:-1 5:-1 6:-0.305514 7:-0.492741 8:-0.633333
+1 1:-0.882353 2:-0.105528 3:0.0819672 4:-0.535354 5:-0.777778 6:-0.162444 7:-0.923997 8:-1
-1 1:-1 2:0.376884 3:-0.344262 4:-0.292929 5:-0.602837 6:0.28465 7:0.887276 8:-0.6
+1 1:-0.411765 2:0.165829 3:0.213115 4:-1 5:-1 6:-0.23696 7:-0.894962 8:-0.7
-1 1:-0.647059 2:-0.21608 3:-0.180328 4:-0.353535 5:-0.791962 6:-0.0760059 7:-0.854825 8:-0.833333
...
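Each row above is in LIBSVM format: a label followed by 1-based index:value pairs. A minimal parser sketch (in practice Spark loads this format directly, e.g. via MLUtils.loadLibSVMFile):

```python
def parse_libsvm_line(line):
    # "<label> <index>:<value> ..." -> (label, {index: value}); indices are 1-based
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx)] = float(val)
    return label, features

label, feats = parse_libsvm_line(
    "+1 1:-0.882353 2:-0.145729 3:0.0819672 4:-0.414141 5:-1 "
    "6:-0.207153 7:-0.766866 8:-0.666667")
# label == 1.0, feats maps feature indices 1..8 to their values
```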
Machine	Learning	in	Spark
Spark	Ecosystem
Spark	Core
Spark	SQL Spark	Streaming MLlib GraphX
Machine Learning with Spark (MLlib & ML)
à MLlib
– Original “lower-level” API
– Built on top of RDDs
– In maintenance mode starting with Spark 2.0
à ML
– Newer “higher-level” API for constructing workflows
– Built on top of DataFrames
à Both are implemented to take advantage of data parallelism
Supervised Learning: End-to-End Flow
à Training (batch): Data items → Feature Extraction → Feature Matrix (training set) + Labels → Train the Model → Model
à Predicting (real time or batch): Data item → Feature Extraction → Feature Vector → Model → Label
Spark ML: Spark API for building ML pipelines
à Pipeline (on TRAIN): Input DataFrame → Feature transform 1 → Feature transform 2 → Combine features → Random Forest → Pipeline Model
à Pipeline Model (on TEST): Input DataFrame → Output DataFrame (PREDICTIONS)
Spark ML Pipeline
à Pipeline includes both fit() and transform() methods
– fit() is for training
– transform() is for prediction
à Input DataFrame (TRAIN) → Pipeline.fit() → Pipeline Model
à Input DataFrame (TEST) → PipelineModel.transform() → Output DataFrame (PREDICTIONS)
model = pipe.fit(trainData) # Train model
results = model.transform(testData) # Test model
Spark ML – Simple Random Forest Example
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, Tokenizer, HashingTF, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

indexer = StringIndexer(inputCol="district", outputCol="dis-inx")
parser = Tokenizer(inputCol="text-field", outputCol="words")
hashingTF = HashingTF(numFeatures=50, inputCol="words", outputCol="hash-inx")
vecAssembler = VectorAssembler(
    inputCols=["dis-inx", "hash-inx"],
    outputCol="features")
rf = RandomForestClassifier(numTrees=100, labelCol="label", seed=42)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
model = pipe.fit(trainData)          # Train model
results = model.transform(testData)  # Test model
Apache	Zeppelin	– A	Modern	Web-based	Data	Science	Studio
à Data	exploration	and	discovery
à Visualization
à Deeply	integrated	with	Spark	and	Hadoop
à Pluggable	interpreters
à Multiple	languages	in	one	notebook:	R,	Python,	Scala
Exporting ML Models - PMML
à Predictive	Model	Markup	Language	(PMML)
à Supported	models
– K-Means	
– Linear	Regression
– Ridge	Regression	
– Lasso
– SVM
– Binary logistic regression
Additional Resources
• Machine	Learning
• Natural	Language	Processing	(NLP)
• Scalable	Machine	Learning
• Introduction	to	Statistics
Lab Overview
tinyurl.com/hwx-intro-to-ml-with-spark
Hortonworks	Community	Connection
Read access for everyone, join to participate and be recognized
• Full	Q&A	Platform	(like	StackOverflow)
• Knowledge	Base	Articles
• Code	Samples	and	Repositories
Community	Engagement
community.hortonworks.com
7,500+
Registered	Users
15,000+
Answers
20,000+
Technical	Assets
One Website!
Robert	Hryniewicz
@RobHryniewicz
Thanks!
