Demystifying AI, Machine Learning and Deep Learning
© 2017 MapR Technologies
Applying Machine Learning to IoT: End-to-End Distributed Pipeline for Real-Time Uber Data Using Apache APIs: Kafka, Spark, HBase
Carol McDonald
@caroljmcdonald
Agenda
•  What is AI?
•  Why now?
•  What is Machine Learning?
–  Examples
•  What is Deep Learning?
–  Examples
What is AI?
AI NSA MIT Late 80s
Problems with hard coded Rules
•  Rules are written by hand and depend on a human expert
–  difficult to maintain
–  give a one-size-fits-all decision! (a 2-times overdose is treated the same as a 38-times overdose)
•  Machine learning uses data and statistics
–  can give sorted probabilities, can precisely match/target individuals
What is Machine Learning?
[Diagram: Data (contains patterns) → Train Algorithm (finds patterns) → Build Model; New Data → Use Model (prediction function f(X), recognizes patterns) → Predictions]
Why all the buzz now?
What has changed?
What has changed in the past 10 years?
Distributed computing
Streaming analytics
Improved machine learning
Distribute Computation
Driver sends
Program tasks
Data Distributed
across Cluster
Result
Apache Spark Distributed Datasets
[Diagram: a distributed dataset partitioned across three nodes, each running one executor: P4 on the first node, P1 and P3 on the second, P2 on the third]
Partition 1: 8213034705, 95, 2.927373, jake7870, 0……
Partition 2: 8213034705, 115, 2.943484, Davidbresler2, 1….
Partition 3: 8213034705, 100, 2.951285, gladimacowgirl, 58…
Partition 4: 8213034705, 117, 2.998947, daysrus, 95….
•  Data read into Memory Cache
•  Partitioned across a cluster
•  Operated on in parallel
•  Cached in memory for iterations
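The bullets above can be illustrated with a minimal sketch in plain Python rather than Spark itself. The helper names (`partition`, `partition_max`, `parallel_max`) are hypothetical, and the sample rows reuse only the first four fields of the partitions shown on the slide; the meaning of each column is not stated there, so the "largest value in the second column" task is just an assumed example of per-partition work.

```python
from concurrent.futures import ThreadPoolExecutor

# First four fields of the sample rows shown on the slide.
rows = [
    (8213034705, 95, 2.927373, "jake7870"),
    (8213034705, 115, 2.943484, "Davidbresler2"),
    (8213034705, 100, 2.951285, "gladimacowgirl"),
    (8213034705, 117, 2.998947, "daysrus"),
]


def partition(data, num_partitions):
    """Split the data into round-robin slices, one per partition."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def partition_max(part):
    """Task run on each partition: largest value in the second column."""
    return max(row[1] for row in part)


def parallel_max(data, num_partitions=2):
    """Run the task on every partition in parallel, then combine the results."""
    parts = [p for p in partition(data, num_partitions) if p]
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partial_results = list(pool.map(partition_max, parts))
    # The "driver" combines the per-partition results into one answer.
    return max(partial_results)
```

Calling `parallel_max(rows, 2)` processes two partitions of two rows each and combines their partial maxima. A real Spark job would distribute the partitions across executors on different nodes instead of threads in one process.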
Streaming Analytics
GPUs speed up Multi core servers for parallel processing
Cluster of GPUs 1 million times faster than Cray-1
Mythbusters explain Parallel graphics with GPU vs Sequential CPU
•  Painting a smiley face with a sequential paint gun
Mythbusters explain Parallel graphics with GPU
•  Painting a smiley face with one blast from a parallel paint gun!
Machine Learning
Types of Machine learning
Supervised Machine Learning
Supervised
•  Classification
–  Naïve Bayes
–  SVM
–  Random Decision Forests
•  Regression
–  Linear
–  Logistic
Machine Learning
Unsupervised
•  Clustering
–  K-means
•  Dimensionality reduction
–  Principal Component Analysis
–  SVD
Label
Supervised Algorithms use labeled data
[Diagram: Data with features X1, X2 and label Y → Build Model, f(X1, X2) = Y; New Data with features X1, X2 → Use Model → Predict Y]
ML Discovery Model Building
[Diagram, data discovery and model creation: Historical Data → Feature Extraction → Training Set → Model Training/Building, and Test Set → Test Model → Predictions → Evaluate Results]
[Diagram, production: New Data (Uber trips stream topic) → Feature Extraction → Deployed Model → Insights]
●  Churn Modelling
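The discovery half of this workflow (split historical data, train on one part, evaluate on the other) can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the toy one-feature dataset, the hypothetical helper names, and the deliberately trivial "model" that places a threshold midway between the two class means.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle the historical data and split it into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train_threshold_model(training_set):
    """'Training': place a threshold midway between the two class means."""
    positives = [x for x, label in training_set if label == 1]
    negatives = [x for x, label in training_set if label == 0]
    threshold = (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2
    return lambda x: 1 if x > threshold else 0


def accuracy(model, test_set):
    """Evaluate: fraction of test rows whose prediction matches the label."""
    correct = sum(1 for x, label in test_set if model(x) == label)
    return correct / len(test_set)


# Toy historical data: (feature value, label).
historical = [(float(x), 0) for x in range(0, 10)] + [(float(x), 1) for x in range(20, 30)]
training_set, test_set = train_test_split(historical)
model = train_threshold_model(training_set)
```

The test set is held back during training, so `accuracy(model, test_set)` estimates how the model would behave on new data before it is deployed.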
Supervised Machine Learning: Classification & Regression
Classification: identifies the category for an item
Classification: Definition
Form of ML that:
•  Identifies which category an item belongs to
•  Uses supervised learning algorithms
–  Data is labeled
Sentiment
If it Walks/Swims/Quacks Like a Duck …… Then It Must Be a Duck
swims
walks
quacks
Features:
walks
quacks
swims
Features:
Debit Card Fraud Example
•  What are we trying to predict?
–  This is the Label or Target outcome:
–  Fraud or Not Fraud
•  What are the “if questions” or properties we can use to predict?
–  These are the Features:
–  Is the amount spent today > historical average?
–  Is the region unusual for the card's history?
–  Is the merchant known or not?
Decision Tree For Classification
•  Tree of decisions about features
•  Evaluates IF THEN ELSE questions about the features
•  Gives the probability of a correct decision
[Diagram: decision tree with YES/NO branches. Questions: Is the amount spent in 24 hours > average? Is the number of states the card was used from > 2? Are there multiple purchases today from risky merchants? Leaves: Fraud 90%, Not Fraud 50%, Fraud 90%, Not Fraud 30%]
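A decision tree of this kind is just nested IF THEN ELSE logic, which the sketch below writes out directly. The branch layout is an assumption: the slide's diagram does not survive in text form, so which question sits under which branch, and which leaf belongs to which answer, is guessed here. The function and parameter names are hypothetical.

```python
def classify_transaction(amount_24h, average_amount, states_used, risky_purchases_today):
    """Walk an assumed layout of the tree; return (label, leaf percentage as a fraction)."""
    if amount_24h > average_amount:
        # Assumed left subtree: was the card used from more than 2 states?
        if states_used > 2:
            return ("Fraud", 0.90)
        return ("Not Fraud", 0.50)
    # Assumed right subtree: multiple purchases today from risky merchants?
    if risky_purchases_today > 1:
        return ("Fraud", 0.90)
    return ("Not Fraud", 0.30)
```

For example, `classify_transaction(900.0, 120.0, 3, 0)` follows the first branch twice and returns `("Fraud", 0.9)`. In practice the questions and thresholds are learned from labeled data rather than written by hand.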
Real Time Credit Card Fraud Detection with Apache Spark Streaming
1.  Get credit card transaction event data
2.  Read card holder profile
3.  Calculate history features
4.  Publish alerts for fraud and enriched events
https://mapr.com/blog/real-time-credit-card-fraud-detection-apache-spark-and-event-streaming/
Classification Identifies Category
•  Classification:
–  identifies which category a new item belongs to
•  Who will ( buy, churn, get admitted to hospital ) ?
•  What is the mood of this comment?
•  Retail Example:
–  Which promotion draws more customers ?
•  Healthcare Example:
–  Suggest Patient diagnosis
–  Identify patients with high readmission risk
Classification Probability: Logistic Regression Example
Predicts the probability that an item belongs to a category
[Plot: logistic curve. X axis: features (transaction amount, type of store, time and location difference from the last transaction). Y axis (label): probability of fraud, from 0 (Not Fraud) to 1 (Fraud), with .5 marked]
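As a minimal sketch of the idea, logistic regression passes a weighted sum of the features through the logistic (sigmoid) function to get a probability between 0 and 1. The function names, the weights, and the 0.5 decision threshold below are illustrative assumptions, not values from the slide; in practice the weights are learned from labeled transactions.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fraud_probability(features, weights, bias):
    """Weighted sum of the features, squashed into a probability."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(probability, threshold=0.5):
    """Turn the probability into a label using a decision threshold."""
    return "Fraud" if probability >= threshold else "Not Fraud"
```

Because the output is a probability rather than a hard yes/no, transactions can be ranked by risk, and the threshold can be moved to trade false alarms against missed fraud.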
Supervised Learning: Classification Probability
•  Logistic Regression (and other algorithms):
–  Predicts the probability that an item belongs to a category (e.g. probability of fraud)
•  What is the probability that someone will (buy, churn, get admitted to hospital)?
•  Probability that a customer will renew service
•  Healthcare:
–  Probability of readmission
Regression Predicts an Amount, Estimates the Relationship between X & Y
Sales price = intercept + coeff1 * X1 + coeff2 * X2
[Plot: regression line through data points. Y axis (label): price of house. X axis: features X1, X2 (square feet, number of bedrooms, location). Each data point: sum of x, price]
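The sales-price formula can be sketched directly in Python. The `predict` function applies an intercept and one coefficient per feature, and `fit_one_feature` shows, for the single-feature case only, how ordinary least squares estimates the intercept and slope from data. The function names and all numbers are made up for illustration.

```python
def predict(intercept, coeffs, features):
    """Sales price = intercept + coeff1 * X1 + coeff2 * X2 (+ ...)."""
    return intercept + sum(c * x for c, x in zip(coeffs, features))


def fit_one_feature(xs, ys):
    """Ordinary least squares for a single feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up training data: square feet and sale price.
square_feet = [1000.0, 1500.0, 2000.0]
prices = [150000.0, 200000.0, 250000.0]
intercept, slope = fit_one_feature(square_feet, prices)
```

With several features, the same least-squares idea applies, but the coefficients are usually found with a linear algebra routine or a library rather than these closed-form sums.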
Regression Predicts by estimating the relationship between variables
•  Regression predicts a numeric value (e.g. price)
•  What will be the (revenue, product demand, sales, # churners)?
•  Retail Example:
–  Sales based on an event
•  Healthcare Example:
–  Days of hospital stay
What is Unsupervised Machine Learning?
Machine Learning
Unsupervised
•  Clustering
–  K-means
•  Dimensionality reduction
–  Principal Component Analysis
–  SVD
Supervised
•  Classification
–  Naïve Bayes
–  SVM
–  Random Decision Forests
•  Regression
–  Linear
–  Logistic
Unsupervised Algorithms use Unlabeled data
[Diagram: Customer purchase data (contains patterns) → Train Algorithm (finds patterns) → Build Model → Customer Groups; New Customer Purchase Data → Use Model (recognizes patterns) → Similar Customer Group]
Unsupervised Machine Learning: Clustering
Clustering: group news articles into different categories
Unsupervised Learning
Learning structure from unlabeled examples
NBA Players
http://www.sloansportsconference.com/wp-content/uploads/2012/03/Alagappan-Muthu-EOSMarch2012PPT.pdf
Clustering: Definition
•  Groups objects into clusters of high similarity
–  Customer segmentation
–  Text categorization
–  Recommendations
•  Anomaly detection: find what’s not similar
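Grouping objects by similarity can be sketched with K-means, the clustering algorithm named earlier in the deck. This is a minimal plain-Python version under stated assumptions: the function names are hypothetical, similarity is squared Euclidean distance, and the starting centroids are passed in by the caller rather than chosen at random.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for c in range(len(centroids)):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        else:
            new_centroids.append(centroids[c])  # keep a centroid that has no members
    return new_centroids


def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = assign(points, centroids)
    for _ in range(max_iterations):
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
        labels = assign(points, centroids)
    return centroids, labels
```

A point that is far from every centroid is a candidate anomaly, which is one simple way to use the same model to find what is not similar.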
Clustering Groups objects into Clusters of high similarity
•  What are the groups of (customers, patients..) with similar (behavior, purchases, symptoms, illness…)?
•  Healthcare:
–  Patient similarity
•  Retail:
–  Group customers by purchases.
Bank Customer Segmentation: Bank Products, Card Purchases
Association, Co-Occurrence, Market Basket Recommendations
•  Retail
–  Products which are purchased together
•  Take action:
–  Store layouts
–  Which products to put on specials, promote, coupons…
•  Healthcare
–  “Patients like mine” cohorts
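Co-occurrence is, at its simplest, a count of how often each pair of items appears in the same basket. The sketch below uses only the Python standard library; the function name and the sample baskets are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair a stable order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Invented sample baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
]
pair_counts = co_occurrence(baskets)
```

The most frequent pairs (`pair_counts.most_common()`) are the candidates for store-layout or promotion decisions. Real market basket analysis usually also normalizes these counts, since very popular items co-occur with everything.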
Deep Learning
Deep Learning
Multilayered neural networks
The Network is trained with images
Neural network neuron or node
Each node takes input data, applies a weight to each input, and outputs a confidence score to the next layer
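A single node can be sketched as a weighted sum followed by an activation function. The sigmoid activation and the `bias` term are assumptions here, since the slide does not say which activation the pictured network uses, and the function names are hypothetical.

```python
import math


def neuron(inputs, weights, bias=0.0):
    """One node: weighted sum of the inputs, squashed to a score between 0 and 1."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weight_rows, biases):
    """A layer is several nodes that all read the same inputs."""
    return [neuron(inputs, weights, bias) for weights, bias in zip(weight_rows, biases)]
```

The list returned by `layer` becomes the input of the next layer, which is how scores flow forward through a multilayered network.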
Each node outputs a confidence score to the next layer
Errors are calculated at the output layer
Errors are sent back through the network
This process is repeated, adjusting weights, until the output is correct
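The loop of computing an output, measuring the error, and adjusting the weights can be sketched for the smallest possible case: one sigmoid node learning the logical OR function. This is a simplification, assuming a single node with no hidden layers, so the error is applied directly to that node's weights rather than propagated back through several layers. The learning rate, epoch limit, and function names are illustrative choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, bias, inputs):
    """Forward pass: the node's output for one example."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, inputs)))


def train(examples, learning_rate=0.5, max_epochs=5000):
    """Repeat forward pass, error calculation, and weight adjustment until correct."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        for inputs, target in examples:
            output = forward(weights, bias, inputs)
            error = output - target  # error measured at the output
            # Adjust each weight against the error, in proportion to its input.
            weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
        # Stop once every example is classified correctly.
        if all(round(forward(weights, bias, x)) == t for x, t in examples):
            break
    return weights, bias


# Logical OR as a tiny labeled training set: (inputs, target).
or_examples = [((0.0, 0.0), 0), ((0.0, 1.0), 1), ((1.0, 0.0), 1), ((1.0, 1.0), 1)]
trained_weights, trained_bias = train(or_examples)
```

A deep network repeats the same idea with many nodes and layers, using backpropagation to work out how much each weight contributed to the output error.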
This process is repeated with lots of images
Deep Learning
During this process layers learn the optimal features for the model
Deep Learning Features
•  Advantage:
–  Features do not have to be predetermined
•  Disadvantage:
–  Decisions are a black box
Feature
Decisions
?
Deep Learning in the News!
•  FINANCE: 3/27/17 - Hedge funds have been trying to teach computers to think like traders for years. (Bloomberg)
•  AUTONOMOUS DRIVING: 4/3/17 - Daimler… to deploy autonomous taxis that customers can hail using a smartphone app by the start of the next decade. (Fortune)
•  HEALTHCARE: 3/28/17 - deep learning is being applied to processing medical images … eye disease … skin cancer (MIT Technology Review)
•  VOICE RECOGNITION: 3/31/17 - IBM research … advancing speech recognition by applying deep learning into acoustic and language models (InfoQ)
Deep Neural Networks
•  Classification and forecasting
Convolutional Neural Networks for Images
•  Insights from image & video files
Ex. PATIENT MORTALITY PREDICTION
Scientific Reports | 7: 1648 | DOI:10.1038/s41598-017-01931-w
www.nature.com/scientificreports
Precision Radiology: Predicting longevity using feature engineering and deep learning methods in a radiomics framework
Luke Oakden-Rayner¹,², Gustavo Carneiro³, Taryn Bessen¹, Jacinto C. Nascimento⁴, Andrew P. Bradley⁵ & Lyle J. Palmer²
Precision medicine approaches rely on obtaining precise knowledge of the true state of health of an
individual patient, which results from a combination of their genetic risks and environmental exposures.
This approach is currently limited by the lack of effective and efficient non-invasive medical tests to
define the full range of phenotypic variation associated with individual health. Such knowledge is
critical for improved early intervention, for better treatment decisions, and for ameliorating the steadily
worsening epidemic of chronic disease. We present proof-of-concept experiments to demonstrate how
routinely acquired cross-sectional CT imaging may be used to predict patient longevity as a proxy for
overall individual health and disease status using computer image analysis techniques. Despite the
limitations of a modest dataset and the use of off-the-shelf machine learning methods, our results are
comparable to previous ‘manual’ clinical methods for longevity prediction. This work demonstrates
that radiomics techniques can be used to extract biomarkers relevant to one of the most widely used
outcomes in epidemiological and clinical research – mortality, and that deep learning with convolutional
neural networks can be usefully applied to radiomics research. Computer image analysis applied
to routinely collected medical images offers substantial potential to enhance precision medicine
initiatives.
Measuring phenotypic variation in precision medicine
Precision medicine has become a key focus of modern bioscience and medicine, and involves “prevention and treatment strategies that take individual variability into account”, through the use of “large-scale biologic databases … powerful methods for characterizing patients … and computational tools for analysing large sets of data”¹. The variation within individuals that enables the identification of patient subgroups for precision medicine strategies is termed the “phenotype”. The observable phenotype reflects both genomic variation and the accumulated lifestyle and environmental exposures that impact biological function - the exposome².
Precision medicine relies upon the availability of useful biomarkers, defined as “a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacological responses to a therapeutic intervention”³. A ‘good’ biomarker has the following characteristics: it is sensitive, specific, predictive, robust, bridges clinical and preclinical health states, and is non-invasive⁴.
Genomics can produce good biomarkers useful for precision medicine⁵. There has been significant success in exploring human genetic variation in the field of genomics, where data-driven methods have highlighted the role of human genetic variation in disease diagnosis, prognosis, and treatment response⁶. However, for the chronic and age-related diseases which account for the majority of morbidity and mortality in developed nations⁷ and worldwide⁸, the majority (70–90%) of observable phenotypic variation is related to non-genetic determinants⁹.
¹Department of Radiology, Royal Adelaide Hospital, North Terrace, Adelaide, 5000, Australia. ²School of Public Health, The University of Adelaide, North Terrace, Adelaide, 5000, Australia. ³School of Computer Science, The University of Adelaide, North Terrace, Adelaide, 5000, Australia. ⁴Instituto Superior Técnico, Lisbon, Portugal. ⁵School of Information Technology and Electrical Engineering, The University of Queensland, Building 78, St Lucia, QLD 4072, Queensland, Australia. Correspondence and requests for materials should be addressed to L.O.
Received: 8 December 2016
Accepted: 6 April 2017
Published: xx xx xxxx
Oakden-Rayner, et al., Scientific Reports, May 2017
Figure 4. Images at the level of the proximal left anterior descending coronary artery, with the most strongly
predicted mortality and survival cases selected by averaging the predictions from the deep learning and
engineered feature models. The mortality cases (left side) demonstrate prominent visual changes of emphysema,
cardiomegaly, vascular disease and osteopaenia. The survival cases (right side) appear visually less diseased and
frail.
Mortality Survival
Example: Exploiting Unstructured Data
http://www.economist.com/news/science-and-technology/21664943-computers-can-recognise-complication-diabetes-can-lead-blindness-now - Sep 19, 2015
Diabetic Retinopathy:
•  Challenging to diagnose from an image (84% consensus)
•  Crowd-sourced to Kaggle
•  Deep learning and convolutional NN used to classify image data
•  Winning model showed 85% accuracy rate
Recurrent Neural Networks for Sequenced data
•  Sequence of events and language applications
To Learn More:
•  MapR Quick Start solutions
https://mapr.com/solutions/big-data-and-hadoop-quick-start-solutions/
•  Customer 360, Recommendation Engine, Log Analysis, Risk, Deep Learning
MapR Deep Learning QSS
[Diagram: Training Images… (Category 1 … Category N) and a New Image to Classify, producing Category Probabilities; Kubernetes with a MASTER NODE and POD 1, POD 2, POD 3 holding Parameter Server 1, TF Trainer 1, and TF Trainer 2; MapR-FS underneath]
MapR Converged Data Platform: Enterprise Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams)
Global Namespace, High Availability, Data Protection, Multi-tenancy, Unified Security
Fit your business model
Common Use Cases
•  Churn prediction
•  Customer clustering
•  Product recommendation
•  Budget optimization
•  ETA
•  Sales prediction
•  Pricing model
•  …
Cost function -- real business impact
•  Leverage A/B testing
90+% of effort is logistics, not learning
Big Data – Machine Learning Cycle
[Diagram: a Machine Learning cycle around Big Data with the stages Identify a problem, Prepare Data, Model Data, Get Insight, Test a Solution, Evaluate, Monitor, Deploy]
Reference: head of Machine Learning at Uber
End to End Streaming Analytics Example Application
https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-1/
MapR Blog
• https://www.mapr.com/blog/
…helping you put data technology to work
●  Find answers
●  Ask technical questions
●  Join on-demand training course discussions
●  Follow release announcements
●  Share and vote on product ideas
●  Find Meetup and event listings
Connect with fellow Apache Hadoop and Spark professionals
community.mapr.com
We reinvented the data platform
for next-gen intelligent applications & Data Science
On-Premise, In the Cloud, Hybrid
[Diagram: Webscale Storage, NoSQL, Messaging, Streaming, and Multiple Processing Engines, with Real Time, Unified Security, Multi-tenancy, and Disaster Recovery]
Multiple compute engines and tools operating concurrently
Immediate access to vast amounts of diverse data
Low latency for millisecond responsiveness
Support diverse workloads simultaneously
Able to be a reliable system of record
Enterprise grade reliability
Q&A
ENGAGE WITH US
