Data Science with Apache Spark - Crash Course - HS16SJ
Robert	Hryniewicz
Data	Evangelist
@RobHryniewicz
Hands-on	Intro	to	Data	Science
with	Apache	Spark
Crash Course
2 ©	Hortonworks	Inc.	2011	–2016.	All	Rights	Reserved
Plan for Today
• Data Science & ML
• ML Examples
• Overview of ML methods
• K-means, Decision Trees & Random Forests
• Spark MLlib & ML
• Lab Overview
Data	Science	Examples
Predictive Analytics Pre-requisites
Sales	Play	4:	Predictive	Analytics
Predictive Analytics Process and Tools
Machine	Learning
“… science of how
computers learn without
being explicitly
programmed” – Andrew Ng
Machine	Learning	Methods
Supervised vs Unsupervised Learning
à Supervised: examples are labeled.
à Unsupervised: examples are not labeled.
Supervised Learning vs Unsupervised Learning
CLASSIFICATION
Identifying the category to which an object belongs.
Applications:	spam	detection,	image	recognition,	...
Algorithms:	k-nn,	decision	trees,	random	forest,	...
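As a toy illustration of the first algorithm listed above, here is a minimal pure-Python sketch of k-nearest-neighbours classification. The data and the helper name are invented for this example; at scale you would use Spark MLlib instead:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    # Sort labeled points by distance to the query and keep the k nearest
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy labeled set: two well-separated groups
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
         ((3.0, 3.0), "b"), ((3.1, 3.2), "b"), ((2.9, 3.1), "b")]
print(knn_predict(train, (0.15, 0.1)))  # → a
```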
REGRESSION
Predicting	a	continuous-valued	attribute	
associated	with	an	object.
Applications:	drug	response,	stock	prices,	…
Algorithms: linear	regression,	…
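For the single-feature case, ordinary least squares has a closed form. A pure-Python sketch on invented data (Spark's LinearRegression handles the general, distributed case):

```python
def fit_line(xs, ys):
    # Closed-form OLS for y = a*x + b with a single feature
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.0, 8.1]   # roughly y = 2x
a, b = fit_line(xs, ys)     # a ≈ 2.01, b ≈ 0.0
```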
CLUSTERING
Automatic	grouping	of	similar	objects	into	sets.
Applications:	customer	segmentation,	topic	modeling,	…
Algorithms: k-means,	LDA,	…
COLLABORATIVE	FILTERING
Fill	in	the	missing	entries	of	a	user-item	association	matrix.
Applications:	Product	recommendation,	…
Algorithms: Alternating Least Squares (ALS)
DIMENSIONALITY	REDUCTION
Reducing	the	number	of	random	variables	to	consider.
Applications:	visualization,	increased	efficiency,	…
Algorithms: PCA,	t-SNE,	…
PREPROCESSING
Feature	extraction	and	normalization
Applications: transforming input data, such as text, into a form usable by ML algorithms
Algorithms:	TF-IDF,	word2vec,	one	hot	encoding,	…
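The first algorithm listed can be sketched in a few lines of plain Python (the toy documents are invented here; note that Spark's own IDF uses a smoothed variant, log((n+1)/(df+1)), rather than the textbook log(n/df) below):

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists; returns one {term: weight} dict per document
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return out

docs = [["spark", "ml", "pipeline"], ["spark", "sql"], ["random", "forest", "ml"]]
w = tf_idf(docs)
# Within doc 0, "pipeline" (1 doc) outweighs "spark" (2 docs): rarer terms score higher
```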
MODEL	SELECTION
Comparing,	validating	and	choosing	parameters	and	models.
Applications:	improved	accuracy	via	parameter	tuning
Algorithms:	grid	search,	metrics	…
Spark	MLlib
Spark	Machine	Learning	Library
à Clustering
– k-means	clustering
– latent	Dirichlet allocation	(LDA)
à Dimensionality	reduction
– singular value decomposition (SVD)
– principal	component	analysis	(PCA)
à Feature	Extractors	&	Transformers
– word2vec
à Basic	statistics
– summary	statistics
– hypothesis	testing
– random	number	generation
à Classification	and	regression
– linear	models	(SVMs,	log	&	linear	regression)
– decision	trees
– ensembles	of	trees	(Random	Forests	&	GBTs)
à Collaborative	filtering
– alternating	least	squares	(ALS)
K-Means	Clustering
(Unsupervised	Learning)
Why K-Means?
à Simple & fast algorithm for finding clusters
à Common technique for anomaly detection
à Drawbacks
– Doesn't work well with non-spherical cluster shapes
– Number of clusters and initial seed values must be specified beforehand
– Strong sensitivity to outliers and noise
– Prone to getting stuck in local optima
Initialize Cluster Centers
Randomly	pick	3	
cluster	centers.
Assign Each Point
Assign	each	point	
to	the	nearest	
cluster	center.
Recompute Cluster Centers
Move each cluster center to the mean of the points assigned to it.
K-means Clustering
San Francisco
Outline Each Neighborhood
Folium: choropleth map
SF Neighborhood Centers Calculated with K-Means
Sample Dataset – K-Means
0.0, 0.0, 0.0
0.1, 0.1, 0.1
0.2, 0.2, 0.2
3.0, 3.0, 3.0
3.1, 3.1, 3.1
3.2, 3.2, 3.2
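A minimal pure-Python k-means run on the six rows above makes the three steps (initialize, assign, recompute) concrete. For a deterministic sketch it seeds the centers with the first k points; real implementations, including Spark MLlib's (which defaults to k-means||), use smarter random initialization:

```python
import math

def kmeans(points, k, iters=20):
    # Step 1: initialize centers (first k points here; normally random / k-means++)
    centers = list(points[:k])
    for _ in range(iters):
        # Step 2: assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Step 3: move each center to the mean of its assigned points
        centers = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

points = [(0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
          (3.0, 3.0, 3.0), (3.1, 3.1, 3.1), (3.2, 3.2, 3.2)]
centers = kmeans(points, k=2)   # converges to (0.1, 0.1, 0.1) and (3.1, 3.1, 3.1)
```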
Decision	Trees	&	Random	Forests	
(Supervised	Learning)
Why	Decision	Trees?
à Simple	to	understand	and	interpret. (And	explain	to	executives.)
à Requires little data preparation. (Other techniques often require data
normalization, creation of dummy variables, and removal of blank values.)
à Performs	well	with	large	datasets.
Visual	Intro	to	Decision	Trees
à http://www.r2d3.us/visual-intro-to-machine-learning-part-1
Random Forest (Ensemble Model)
à Main	idea:	build	an	ensemble	of	simple	decision	trees
à Each	tree	is	simple	and	less	likely	to	overfit
à Classify/predict	by	voting	between	all	trees
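The three bullets above can be sketched with depth-1 "trees" (decision stumps) on a single invented feature: each stump trains on a bootstrap sample, and the forest predicts by majority vote. This is only a toy; Spark's RandomForestClassifier does the real, multi-feature version:

```python
import random
from collections import Counter
from statistics import mean

def train_stump(sample):
    # A depth-1 "tree": split at the sample mean, predict each side's majority label
    t = mean(x for x, _ in sample)
    left = Counter(y for x, y in sample if x < t)
    right = Counter(y for x, y in sample if x >= t)
    l = left.most_common(1)[0][0] if left else right.most_common(1)[0][0]
    r = right.most_common(1)[0][0] if right else l
    return lambda v: l if v < t else r

def random_forest(data, n_trees=25, seed=7):
    rng = random.Random(seed)
    # Each tree sees a bootstrap sample (drawn with replacement)
    trees = [train_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]
    # Classify by majority vote across all trees
    return lambda v: Counter(t(v) for t in trees).most_common(1)[0][0]

data = [(0.1, 0), (0.3, 0), (0.4, 0), (2.1, 1), (2.4, 1), (2.6, 1)]
predict = random_forest(data)
```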
Decision	Tree	vs	Random	Forest
Why Do Ensembles Work?
à Overcome the limitations of a single hypothesis (single Decision Tree vs Model Averaging)
Diabetes	Dataset	– Decision	Trees	/	Random	Forest
Labeled	set	with	8	Features
-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333
+1 1:-0.882353 2:-0.145729 3:0.0819672 4:-0.414141 5:-1 6:-0.207153 7:-0.766866 8:-0.666667
-1 1:-0.0588235 2:0.839196 3:0.0491803 4:-1 5:-1 6:-0.305514 7:-0.492741 8:-0.633333
+1 1:-0.882353 2:-0.105528 3:0.0819672 4:-0.535354 5:-0.777778 6:-0.162444 7:-0.923997 8:-1
-1 1:-1 2:0.376884 3:-0.344262 4:-0.292929 5:-0.602837 6:0.28465 7:0.887276 8:-0.6
+1 1:-0.411765 2:0.165829 3:0.213115 4:-1 5:-1 6:-0.23696 7:-0.894962 8:-0.7
-1 1:-0.647059 2:-0.21608 3:-0.180328 4:-0.353535 5:-0.791962 6:-0.0760059 7:-0.854825 8:-0.833333
...
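Each row above is in LIBSVM format: a label followed by 1-based index:value pairs. A minimal parser sketch (in practice Spark loads this format directly, e.g. via MLUtils.loadLibSVMFile):

```python
def parse_libsvm_line(line):
    # "<label> <index>:<value> ..." -> (label, {index: value}); indices are 1-based
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx)] = float(val)
    return label, features

label, feats = parse_libsvm_line(
    "+1 1:-0.882353 2:-0.145729 3:0.0819672 4:-0.414141 5:-1 "
    "6:-0.207153 7:-0.766866 8:-0.666667")
# label == 1.0, feats maps feature indices 1..8 to their values
```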
Machine	Learning	in	Spark
Spark	Ecosystem
Spark	Core
Spark	SQL Spark	Streaming MLlib GraphX
Machine Learning with Spark (MLlib & ML)
à MLlib
– Original “lower-level” API
– Built on top of RDDs
– In maintenance mode starting with Spark 2.0
à ML
– Newer “higher-level” API for constructing workflows
– Built on top of DataFrames
à Both are implemented to take advantage of data parallelism
Supervised Learning: End-to-End Flow
à Training (batch): Data items → Feature Extraction → Feature Matrix (training set) + Labels → Train the Model → Model
à Predicting (real time or batch): Data item → Feature Extraction → Feature Vector → Model → Label
Spark ML: Spark API for building ML pipelines
à Pipeline (on TRAIN): Input DataFrame → Feature transform 1 → Feature transform 2 → Combine features → Random Forest → Pipeline Model
à Pipeline Model (on TEST): Input DataFrame → Output DataFrame (PREDICTIONS)
Spark ML Pipeline
à Pipeline includes both fit() and transform() methods
– fit() is for training
– transform() is for prediction
à Input DataFrame (TRAIN) → Pipeline.fit() → Pipeline Model
à Input DataFrame (TEST) → PipelineModel.transform() → Output DataFrame (PREDICTIONS)
model = pipe.fit(trainData) # Train model
results = model.transform(testData) # Test model
Spark ML – Simple Random Forest Example
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, Tokenizer, HashingTF, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

indexer = StringIndexer(inputCol="district", outputCol="dis-inx")
parser = Tokenizer(inputCol="text-field", outputCol="words")
hashingTF = HashingTF(numFeatures=50, inputCol="words", outputCol="hash-inx")
vecAssembler = VectorAssembler(
    inputCols=["dis-inx", "hash-inx"],
    outputCol="features")
rf = RandomForestClassifier(numTrees=100, labelCol="label", seed=42)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
model = pipe.fit(trainData)          # Train model
results = model.transform(testData)  # Test model
Apache	Zeppelin	– A	Modern	Web-based	Data	Science	Studio
à Data	exploration	and	discovery
à Visualization
à Deeply	integrated	with	Spark	and	Hadoop
à Pluggable	interpreters
à Multiple	languages	in	one	notebook:	R,	Python,	Scala
Exporting ML Models - PMML
à Predictive	Model	Markup	Language	(PMML)
à Supported	models
– K-Means	
– Linear	Regression
– Ridge	Regression	
– Lasso
– SVM
– Binary logistic regression
Additional Resources
• Machine	Learning
• Natural	Language	Processing	(NLP)
• Scalable	Machine	Learning
• Introduction	to	Statistics
Lab Overview
tinyurl.com/hwx-intro-to-ml-with-spark
Hortonworks	Community	Connection
Read access for everyone, join to participate and be recognized
• Full	Q&A	Platform	(like	StackOverflow)
• Knowledge	Base	Articles
• Code	Samples	and	Repositories
Community	Engagement
community.hortonworks.com
7,500+
Registered	Users
15,000+
Answers
20,000+
Technical	Assets
One Website!
Robert	Hryniewicz
@RobHryniewicz
Thanks!
