Python, Data science, and Unsupervised learning
Disclaimer
Presentations are intended for educational
purposes only and do not replace
independent professional judgment.
Statements of fact and opinions expressed
are those of the participants individually
and don’t necessarily reflect those of
blibli.com.
Blibli.com does not endorse or approve,
and assumes no responsibility for, the
content, accuracy or completeness of the
information presented.
Python, data science, and
unsupervised learning
Hendri Karisma
hendri.karisma@gdn-commerce.com / situkangsayur@gmail.com
Hendri Karisma
• Sr. Research and Development
Engineer at blibli.com (PT. Global
Digital Niaga)
• R&D team for Machine Learning
• Working on the Fraud Detection System.
Currently working on a dynamic
recommendation system project.
Definition of Informatics
“Automation of Information” –
Prof. Dr. Ing. Iping Supriana
Solution Approaches
• Analytical (Exact)
Example :
– Analytical solution :
– Numerical solution :
– Error = |7.25 – 22/3| = |7.25 – 7.3333| = 0.0833
• Numerical (Approx.)
– Are numerical methods just the ML methods that we know from the
book?
– Newton-Raphson, Gauss Elimination, Gauss-Jordan, Jacobi method,
Gauss-Seidel, Lagrange, Newton-Gregory, Richardson Extrapolation,
etc.
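The formulas for this example did not survive on the slide. As an illustration only, one integral that reproduces the quoted numbers is the integral of 4 − x² over [−1, 1]: its exact value is 22/3, and the composite trapezoidal rule with four strips gives 7.25. The integrand, the interval, and the strip count are assumptions chosen to match the error line above; they are not recovered from the slide.

```python
def f(x):
    # Assumed integrand (hypothetical, chosen to match the slide's numbers).
    return 4.0 - x ** 2


def trapezoid(func, a, b, n):
    """Composite trapezoidal rule with n equal-width strips."""
    h = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for i in range(1, n):
        total += func(a + i * h)
    return h * total


# Analytical (exact) value: antiderivative 4x - x^3/3 evaluated from -1 to 1.
exact = (4 * 1 - 1 ** 3 / 3) - (4 * (-1) - (-1) ** 3 / 3)

# Numerical (approximate) value with 4 strips of width 0.5.
numerical = trapezoid(f, -1.0, 1.0, 4)

error = abs(numerical - exact)
print(numerical, exact, error)
```

Increasing the number of strips shrinks the error, which is the usual trade-off between cost and accuracy in a numerical approach.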
Machine Learning Definition
“A computer program is said to learn
from experience E with respect to
some class of tasks T and performance
measure P, if its performance at tasks
in T, as measured by P, improves with
experience E.” – Prof. Tom Mitchell
How it works
Machine Learning Perspective
● Information Theory (Decision Tree :
ID3, C4.5, etc)
● Probability (Bayesian : Naive
Bayes, Belief Network, etc)
● Graphical Model (Belief network, HMM,
CRF, Neural Network, etc)
● Numerical Method or Regression
(Stochastic Gradient Descent/Ascent:
Linear Regression, Multiple Linear
Regression, Neural Network, E-M
Algorithm, HMM)
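To make the "numerical method" perspective concrete, here is a minimal sketch of stochastic gradient descent fitting a one-variable linear regression. The data is synthetic, and the learning rate and epoch count are arbitrary assumptions for illustration; nothing here comes from the talk itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up for illustration): y = 3x + 2 plus small noise.
X = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * X + 2.0 + rng.normal(0.0, 0.1, size=200)

w, b = 0.0, 0.0          # parameters to learn
learning_rate = 0.05     # assumed step size

for epoch in range(50):
    # Visit the samples in a fresh random order each epoch.
    for i in rng.permutation(len(X)):
        # Gradient of the per-sample loss 0.5 * (prediction - y)^2.
        residual = (w * X[i] + b) - y[i]
        w -= learning_rate * residual * X[i]
        b -= learning_rate * residual

print(w, b)
```

The learned `w` and `b` should land close to the slope and intercept used to generate the data, without ever solving the normal equations exactly.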
Machine Learning
• Supervised
• Unsupervised
• Reinforcement Learning
• Semi-Supervised
• Deep Learning
The four layers of data mining
Tools/libs in Python
● NumPy
● SciPy
● pandas
● scikit-learn
● Matplotlib
● seaborn
● TensorFlow
*pydata.org
*anaconda
● Other Tech (to
support ML) :
– Apache Kafka
– Apache Spark
– DB : MongoDB, PostgreSQL
– Elasticsearch
– CUDA/OpenCL
NumPy, SciPy, pandas, and scikit-learn
● NumPy & SciPy : Arrays, Indexing, Slicing,
and Iterating, Reshaping, Shallow vs deep
copy, Broadcasting, Indexing (advanced),
Matrices, Matrix decompositions, SciPy is built
on top of NumPy
● pandas : Reading data, Selecting columns
and rows, Filtering, Vectorized string
operations, Missing values, Handling time,
Time series, Built on top of NumPy.
● scikit-learn : Feature extraction, Classification,
Regression, Clustering, Dimension reduction,
Model selection
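A few of the topics listed above can be shown in a handful of lines. This is a small sketch, not material from the talk; the DataFrame columns and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: reshaping, views (shallow copies) vs deep copies, broadcasting.
a = np.arange(6).reshape(2, 3)      # [[0, 1, 2], [3, 4, 5]]
view = a[:, :2]                     # basic slicing returns a view of a's data
deep = a.copy()                     # independent (deep) copy
view[0, 0] = 99                     # also changes a[0, 0], but not deep[0, 0]
centered = a - a.mean(axis=0)       # shape (2, 3) minus shape (3,) broadcasts

# pandas: vectorized strings, missing values, boolean filtering.
# Column names and values are hypothetical.
df = pd.DataFrame({
    "city": ["jakarta", "bandung", "surabaya"],
    "amount": [120.0, np.nan, 75.5],
})
df["city"] = df["city"].str.title()                       # vectorized string op
df["amount"] = df["amount"].fillna(df["amount"].mean())   # fill missing value
large = df[df["amount"] > 90]                             # filter rows

print(a[0, 0], deep[0, 0], large["city"].tolist())
```

The view/copy distinction matters in practice: writing through a slice silently modifies the original array, while `copy()` does not.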
What we do at Blibli using Python
● Data flow
● Data pooling
● Data preprocessing
● Machine Learning Service/app
Our systems that use Python for ML
● Personalized recommendation system
● Data engineering (especially the
data flow for ML engine)
● Machine learning engine
● Fraud detection experiments
EM Algorithms
Repeat until convergence{
}
What??
EM Algorithms
There are 3 key ingredients that (as far as I know) are
almost always used in the EM algorithm :
● Data Distribution
● Maximum Likelihood Estimation (MLE)
● Expectation-Maximization (EM)
*Today we will use the Gaussian distribution as the
sample case
EM Algorithms
The algorithm has 2 main steps, just like the name
of the algorithm :
– Expectation :
– Maximization :
*Repeat until the likelihood converges to a (local) maximum :
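The formulas for the two steps were lost from the slide. For reference, the general EM iteration is usually written as below. The notation (latent variable z, parameters θ, auxiliary distribution Q_i) follows the common textbook form and is an assumption about what the slide showed.

```latex
\begin{aligned}
&\text{Repeat until convergence:}\\
&\quad \text{E-step: for each } i,\quad
  Q_i\big(z^{(i)}\big) := p\big(z^{(i)} \mid x^{(i)}; \theta\big)\\
&\quad \text{M-step: }\quad
  \theta := \arg\max_{\theta} \sum_{i} \sum_{z^{(i)}}
  Q_i\big(z^{(i)}\big)
  \log \frac{p\big(x^{(i)}, z^{(i)}; \theta\big)}{Q_i\big(z^{(i)}\big)}\\
&\text{Log likelihood being maximized: }\quad
  \ell(\theta) = \sum_{i} \log \sum_{z^{(i)}} p\big(x^{(i)}, z^{(i)}; \theta\big)
\end{aligned}
```

Each iteration never decreases the log likelihood, which is why the loop can stop once the improvement falls below a tolerance.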
Gaussian Distribution
Multivariate Gaussian
● Gaussian distribution :
● Multivariate Gaussian distribution :
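The formulas themselves are missing from the slide. The standard forms of the two densities named above are given here for reference; the symbols (μ, σ², Σ, dimension d) are the usual ones and are an assumption about the slide's notation.

```latex
\mathcal{N}(x \mid \mu, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma)
  = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
    \exp\!\left(-\tfrac{1}{2}
      (\mathbf{x}-\boldsymbol{\mu})^{\top} \Sigma^{-1}
      (\mathbf{x}-\boldsymbol{\mu})\right)
```

With d = 1 and Σ = σ², the multivariate form reduces to the univariate one.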
Gaussian Mixture
EM Algorithm for Gaussian Mixtures
● Expectation :
● Maximization :
*Log likelihood :
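The update formulas did not survive on this slide. The sketch below implements the standard EM updates for a one-dimensional Gaussian mixture with NumPy. It is a simplified illustration under stated assumptions (1-D data, quantile-based initialization, synthetic samples), not the code used in the talk; the function name `em_gmm_1d` is hypothetical.

```python
import numpy as np


def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def em_gmm_1d(x, k, n_iter=200, tol=1e-8):
    """Fit a k-component 1-D Gaussian mixture with EM."""
    weights = np.full(k, 1.0 / k)
    # Assumed initialization: spread the means over quantiles of the data.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of component j for point i,
        # r_ij = w_j N(x_i | mu_j, var_j) / sum_l w_l N(x_i | mu_l, var_l).
        dens = weights * gaussian_pdf(x[:, None], means, variances)  # (n, k)
        total = dens.sum(axis=1, keepdims=True)
        resp = dens / total
        # Log likelihood of the data under the parameters used in this E-step.
        ll = np.log(total).sum()
        # M-step: re-estimate weights, means and variances from responsibilities.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return weights, means, variances, ll


# Synthetic data (made up): two groups centered at 0 and 6.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(6.0, 1.5, 200)])
weights, means, variances, ll = em_gmm_1d(x, k=2)
print(weights, means, variances, ll)
```

A production implementation would work with log densities for numerical stability and guard against a component's variance collapsing to zero; both are left out here to keep the two steps easy to read.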
Fraud – without target class/labels
● These are anomalous data
● Anomalies usually form one or
a few small groups of data
● A lot of features without labels
------------------------------------------
● We need an unsupervised algorithm
(EM algorithm)
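One common way to turn this idea into code is to fit a Gaussian model to the unlabeled data and flag the points with the lowest likelihood. The sketch below uses scikit-learn's `GaussianMixture`, which is fitted with EM. The data is synthetic, and the single component and the 1% threshold are assumptions made for illustration; the slides do not say how anomalies were actually scored.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)

# Synthetic stand-in for unlabeled transaction features (not real data):
# 990 ordinary points plus a small, distant group of 10 unusual points.
ordinary = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(990, 2))
unusual = rng.normal(loc=[8.0, 8.0], scale=0.5, size=(10, 2))
X = np.vstack([ordinary, unusual])

# Fit the density model without using any labels.
gmm = GaussianMixture(n_components=1, covariance_type="full", random_state=0)
gmm.fit(X)

# Per-sample log likelihood under the fitted model; low values are suspicious.
log_density = gmm.score_samples(X)

# Assumed rule: flag the lowest 1% of log-likelihood scores as anomalies.
threshold = np.quantile(log_density, 0.01)
is_anomaly = log_density < threshold

print(int(is_anomaly.sum()), "points flagged")
```

On real data the threshold has to be chosen from domain knowledge or a small labeled sample, since no labels are available to tune it automatically.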
Case Anomaly Detection
● Credit card data with fraudulent data.
Performance Problem
Distributed System/Scale Out
Python script
Persistence
Computation
Supervisor/Service
Using python
THANK YOU
Any questions?
*we are hiring*
