Complete Data Science
and Machine Learning
Using Python
By
Jitesh Khurkhuriya
© Jitesh Khurkhuriya
Data Growth – IDC-Seagate November, 2018
175 ZB by 2025
1 ZB = 1,000,000,000 TB
Majority of the data is unstructured
It’s not just about…..
Application of Data Science and Machine Learning
• Automobile: Preventive Maintenance
• Banking: Fraud/Default Prevention
• Healthcare: Predict Disease
• Media: Content Personalisation
• Telecom: Increase sale
Heard On The Streets
• IDC FutureScape – Two-thirds of Global 2000 enterprise CEOs will
centre their corporate strategy on digital transformation, including
machine learning (ML) solutions.
• Harvard Business Review – Data Scientist: The Sexiest Job of the
21st Century
• McKinsey Report – 45 percent of work activities could potentially
be automated by currently demonstrated technologies; machine
learning can be an enabling technology for the automation of
80 percent of those activities.
• Microsoft CEO Satya Nadella – called out machine learning, and
the big data that powers it, as a key development in his memo to
Microsoft last July.
Benefits of Data Science and Machine Learning
✓ Faster decisions
✓ Develop insights that are beyond human
capabilities
✓ Act at the right time and take advantage of
opportunities, converting them into closed
deals.
Types of Analytics
• Descriptive Analytics: What happened? (Sales are up/down; customer left/leaving)
• Diagnostic Analytics: Why did it happen? (Poor customer service; cheaper alternative)
• Predictive Analytics: What will happen? (Will this customer go?)
• Prescriptive Analytics: How can we make/prevent it? (What's the best method to retain the customer?)
Difficulty level rises from descriptive to prescriptive analytics, moving from hindsight through insight to foresight.
Courtesy – Gartner Report and analysis
What is Data Science?
Data Science sits at the intersection of:
• Mathematics, Statistics
• Programming, Data Preparation, Machine Learning
• Domain Knowledge, Subject Matter Expertise
Data science lifecycle (cycle diagram):
• Business Case and Discovery
• Data Processing
• Model Planning
• Model Building and Selection
• Present Result
• Deploy
Business Case and Discovery
What’s the End Goal?
Stakeholder Discussions
How much time and budget do we have?
Past attempts
What kind of data is available?
Data Processing
• Data Mapping
• Data Cleaning: Data Quality, Missing Data, Noisy Data, Outlier Treatment
• Data Transformation: Format conversion, Data Normalization, Statistical imputation, Feature Engineering
• Sample the Data: Data Sampling, Data Split, Data Binning
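A minimal sketch of a few of the steps above (missing-data imputation, outlier treatment, normalization, and a data split), using only the Python standard library. The sample values, function names, and thresholds are illustrative assumptions and not part of the course material; in practice a library such as pandas or scikit-learn is commonly used for these steps.

```python
import random
import statistics


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def clip_outliers(values, k=1.5):
    """Clip values lying outside the IQR fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [min(max(v, low), high) for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical customer ages with one missing value and one implausible entry.
ages = [23, 31, None, 45, 38, 29, 250, 41]
cleaned = clip_outliers(impute_missing(ages))
scaled = min_max_normalize(cleaned)
train, test = train_test_split(scaled)
```

The order matters in this sketch: imputing before clipping keeps the quartiles computable, and normalizing last means the clipped maximum, not the raw outlier, defines the top of the scale.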
Exploratory Data Analysis
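Exploratory data analysis usually begins with summary statistics for each feature and a look at the relationships between features. The sketch below assumes two hypothetical numeric features (`spend` and `calls`) and uses only the standard library; the helper names are illustrative.

```python
import math
import statistics


def summarize(values):
    """Return basic descriptive statistics for one numeric feature."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: monthly spend and number of support calls per customer.
spend = [20, 35, 50, 65, 80]
calls = [9, 7, 6, 3, 1]

print(summarize(spend))
print(round(pearson(spend, calls), 3))
```

A coefficient close to -1 or +1 suggests a strong linear relationship worth investigating further; it does not by itself establish a cause.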
Algorithm cheat sheet (flowchart, grouped by task):

CLUSTERING
• K-Means

ANOMALY DETECTION
• One-Class SVM: >100 features
• PCA-Based Anomaly Detection: fast training

MULTI-CLASS CLASSIFICATION
• Multi-Class Logistic Regression: fast training, linear model
• Multi-Class Neural Network: accuracy, long training times
• Multi-Class Decision Forest: accuracy, fast training
• Multi-Class Decision Jungle: accuracy, small memory footprint
• One-V-All Multiclass: depends on the two-class classifier

REGRESSION
• Ordinal Regression: data in rank-ordered categories
• Poisson Regression: predicting event counts
• Fast Forest Quantile Regression: predicting a distribution
• Linear Regression: fast training, linear model
• Bayesian Linear Regression: linear model, small datasets
• Neural Network Regression: accuracy, long training time
• Decision Forest Regression: accuracy, fast training
• Boosted Decision Tree Regression: accuracy, fast training, large memory

TWO-CLASS CLASSIFICATION
• Two-Class SVM: >100 features, linear model
• Two-Class Averaged Perceptron: fast training, linear model
• Two-Class Logistic Regression: fast training, linear model
• Two-Class Bayes Point Machine: fast training, linear model
• Two-Class Decision Forest: accuracy, fast training
• Two-Class Boosted Decision Tree: accuracy, fast training, large memory
• Two-Class Decision Jungle: accuracy, small memory
• Two-Class Locally Deep SVM: >100 features
• Two-Class Neural Network: accuracy, long training times
What to consider while choosing an algorithm?
Predicting Categories
Predicting Continuous Value
Finding Unusual Data Points
Discovering Structure
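The four considerations above correspond to the task groups used in the algorithm cheat sheet: classification, regression, anomaly detection, and clustering. A minimal sketch of that lookup follows; the goal keys and the `suggest_family` helper are hypothetical names chosen for illustration.

```python
# Mapping from the prediction goal to the family of algorithms to consider.
ALGORITHM_FAMILY = {
    "predict_category": "classification",
    "predict_continuous_value": "regression",
    "find_unusual_data_points": "anomaly detection",
    "discover_structure": "clustering",
}


def suggest_family(goal, n_categories=None):
    """Return the algorithm family for a goal.

    For classification, n_categories (if given) distinguishes
    two-class from multi-class problems.
    """
    if goal not in ALGORITHM_FAMILY:
        raise ValueError(f"unknown goal: {goal!r}")
    family = ALGORITHM_FAMILY[goal]
    if family == "classification" and n_categories is not None:
        if n_categories < 2:
            raise ValueError("classification needs at least two categories")
        prefix = "two-class" if n_categories == 2 else "multi-class"
        return f"{prefix} classification"
    return family


print(suggest_family("predict_category", n_categories=2))
print(suggest_family("predict_continuous_value"))
```

Within a family, the remaining choice depends on the trade-offs the cheat sheet lists, such as training time, accuracy, memory use, and the number of features.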
Model Building and Selection
Train Model
Cross Validation
Parameter Tuning
Select Model
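The four steps above can be sketched end to end with a deliberately simple model. The example below uses a one-feature k-nearest-neighbours classifier, chosen here only because it fits in a few lines and not because the course prescribes it. It scores each candidate value of `k` with k-fold cross validation and selects the best one. The dataset, labels, and function names are illustrative assumptions.

```python
import random
from collections import Counter


def knn_predict(train, point, k):
    """Predict the majority label among the k training points nearest to point."""
    nearest = sorted(train, key=lambda row: abs(row[0] - point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def k_fold_indices(n, folds, seed=0):
    """Shuffle range(n) and deal the indices into `folds` roughly equal parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::folds] for i in range(folds)]


def cross_val_accuracy(data, k, folds=5):
    """Mean held-out accuracy of k-NN over the folds."""
    scores = []
    for held_out in k_fold_indices(len(data), folds):
        held = set(held_out)
        train = [row for i, row in enumerate(data) if i not in held]
        test = [data[i] for i in held_out]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


def select_k(data, candidates, folds=5):
    """Parameter tuning: pick the candidate k with the best cross-validated accuracy."""
    return max(candidates, key=lambda k: cross_val_accuracy(data, k, folds))


# Hypothetical one-feature dataset: two well-separated classes.
rng = random.Random(42)
data = [(rng.gauss(0.0, 1.0), "stay") for _ in range(30)]
data += [(rng.gauss(10.0, 1.0), "leave") for _ in range(30)]

best_k = select_k(data, candidates=[1, 3, 5, 7])
print(best_k, cross_val_accuracy(data, best_k))
```

Every candidate is evaluated on the same folds because `k_fold_indices` uses a fixed seed, so differences in score come from the parameter and not from the split. The same loop can compare different model types by swapping `knn_predict` for another predictor.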
Present the results
• Explain the process of model planning and selection
• Explain the findings: correlations, causes, variable
selection
• Communicate the results
• Explain the process of operationalization
Deployment
[Diagram: deployment architecture centred on the Data Science Space. Labels in the diagram:]
• Back Office Systems: Transactional Data, Operations Data, ERP/CRM Data, DWH, MIS/Reporting
• Data Science Space: ML Engine, Processed Data, Algorithms, Decisions
• Social Media Activities
• Web and Mobile Logs: Website and Online Apps, Mobile Apps
• Marketing: Marketing DWH, Marketing Campaigns
• Enterprise BI and Reporting
• External Systems: 3rd Party, Regulatory
• Customer Service Operations: Customer CRM
Skills Required to be a Data Scientist
Soft Skills
• Domain knowledge
• Communication
• Analytical skills
• Curiosity
• Common Sense
Technical Skills
• Mathematics
• Statistics
• File handling or database
• Machine Learning
• Python or similar
• Tableau or similar visualization
[Diagram: data science lifecycle: Business Case and Discovery, Data Processing, Model Planning, Model Building, Present Result, Deploy]
Soft Skills
• Domain knowledge: Understanding of the data elements based on domain expertise.
• Communication: Needed in the discovery phase as well as when presenting findings to the stakeholders.
• Analytical Skills: Analyse various relationships among data features.
• Curiosity: Asking the right questions to gain deeper understanding.
• Common Sense: Does it make sense against normal beliefs?
Technical Skills
• Mathematics: Math is the basis for algorithms. Helps when writing your own implementations.
• Statistics: Helps in data imputation as well as in validating the results of an experiment.
• Data Wrangling: Helps in dealing with the imperfections of data as well as data transformation.
• Machine Learning: Heart of Data Science. Various algorithms for predictions of the outcome.
• Programming Languages: Build models using Python, R, SAS, or Azure ML.
• Data Visualisation: Visual understanding of data as well as communication of findings.
Complete Data Science and Machine Learning Using Python
Thank You!