Intro to Machine Learning for non-Data Scientists
Dr. Parinaz Ameri
Intro to Machine Learning
for non-Data Scientists
Agenda
● 1.5 hours: Introduction to ML algorithms
● 1.5 hours: Implementing algorithms for different use-cases
● 1 hour: Working on a recommendation mini-project
Machine Learning in Daily Life
Source:
[xkcd_1838]
Machine Learning Definition
Arthur Samuel (1959):
“Field of study that gives computers the ability to learn without being explicitly
programmed.” [ML_Awad]
Source: [fortune]
Email Spam Filter
A Machine Learning Model
Machine Learning Definition
Tom Mitchell (1997):
“A computer program is said to learn from experience E with respect to some class of tasks
T and performance measure P if its performance at tasks in T, as measured by P, improves
with experience E.” [ML_Mitchell]
E, T and P in a Spam Filter Example
● Task T:
○ Classify emails as Spam or Ham.
● Experience E:
○ Watching you label emails as Spam or Ham.
● Performance measure P:
○ The number (or fraction) of emails that are correctly classified as Spam or Ham.
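A minimal sketch of the performance measure P from this example, assuming two hypothetical lists: the labels the user gave (experience E) and the labels the filter produced (task T). The data and the function name `fraction_correct` are invented for illustration.

```python
# Hypothetical labels: what the user marked vs. what the filter predicted.
user_labels = ["spam", "ham", "ham", "spam", "ham"]
filter_output = ["spam", "ham", "spam", "spam", "ham"]


def fraction_correct(true_labels, predicted_labels):
    """Performance measure P: share of emails classified correctly."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


p = fraction_correct(user_labels, filter_output)
print(p)  # 4 of the 5 emails match
```

If the filter learns, P computed on new emails should rise as more labeled emails are observed.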
Machine Learning Definition
Peter Flach (2012):
“Machine learning is the systematic study of algorithms and systems that improve their
knowledge or performance with experience.” [ML_Flach]
Source:
[towardsdatascience]
Machine Learning Main Ingredients
1. Tasks:
○ An abstract representation of a problem we want to solve regarding the domain objects.
2. Models:
○ Many tasks can be represented as a mapping (the model) from data points to outputs.
○ The model is produced as the output of a machine learning algorithm applied to training data.
3. Features:
○ The language in which we describe the relevant objects in our domain.
Source: [ML_Flach]
Machine Learning Main Ingredients
Machine Learning Pipeline
Stages: Data Preparation, Training Data / Test Data, Feature Selection
Source: [Medium]
Machine Learning Pipeline
Stages: Data Preparation, Training Data / Test Data, Feature Selection, ML Algorithm Selection
Tasks & Learning Algorithms
● Supervised Learning
○ Regression
○ Classification
● Unsupervised Learning
○ Clustering
● Reinforcement Learning
● Recommendation systems
Supervised Learning Algorithms
Data is Labeled = Right Answers are Given
Housing Price Prediction
Regression: Predict a continuous-valued output
Breast Cancer (Malignant, Benign)
Classification: Predict a discrete-valued output (0, 1)
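The difference between the two task types can be sketched as follows. All numbers are invented toy data, and the fixed threshold in `classify` is a hypothetical stand-in for a learned decision rule, not a medical criterion.

```python
# Hypothetical toy data, invented for illustration only.
sizes = [50, 80, 100, 120]      # house size in square metres
prices = [150, 240, 300, 360]   # price in thousands

# Regression: fit price = slope * size + intercept by least squares.
n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
         / sum((x - mean_x) ** 2 for x in sizes))
intercept = mean_y - slope * mean_x
predicted_price = slope * 90 + intercept   # a continuous-valued output


# Classification: map a tumor size to a discrete label.
def classify(tumor_size, threshold=3.0):
    """Return 1 (malignant) or 0 (benign); the threshold is an assumption."""
    return 1 if tumor_size > threshold else 0


print(predicted_price, classify(1.5), classify(4.2))
```

The regression output can take any value on a scale, while the classifier can only return one of the two labels.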
Features in Classification
Other Features:
- Clump thickness
- Uniformity of cell size
- Uniformity of cell shape
- ...
Exercise 1
Should you treat the following problems with regression or classification?
Problem 1: You want to develop a learning algorithm to examine individual customer accounts
and determine if each account has been hacked.
Problem 2: You have a huge list of identical items and want to predict how many of
them will be sold over the next 3 months.
Unsupervised Learning Algorithms
Data is Not Labeled
[Figure: Supervised Learning vs. Unsupervised Learning, each plotted over features X1 and X2]
Clustering
Clustering in Biology
Source: [researchgate]
More Clustering Applications
Social Network Analysis
Organizing Computing Clusters
Market Segmentation
Exercise 2
Which of the following problems would you address with Unsupervised Learning
algorithms?
1. Given a dataset of patients diagnosed as either having diabetes or not, learn
to classify new patients as having diabetes or not.
2. Given a database of customer data, automatically discover market segments
and group customers into different market segments.
3. Given a dataset of news articles found on the web, group them into sets of
articles about the same story.
4. Given emails labeled as spam/ham, learn a spam filter.
Example of Supervised learning
Source: [radimrehurek]
Machine Learning Pipeline
Stages: Data Preparation, Training Data / Test Data, Feature Selection, ML Algorithm Selection, Building a model
Models
 | Predictive model | Descriptive model
Supervised learning | Classification, Regression | Subgrouping
Unsupervised learning | Predictive clustering | Clustering, Association Rule discovery
Model Types
● Geometric
● Probabilistic
● Logical
Building a Linear Regression Model
Mean Squared Error (MSE): Measures the average of the squares of the errors
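As a small worked sketch of that definition: the error for each observation is the observed value minus the fitted value, and MSE is the mean of the squared errors. The numbers below are invented for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """MSE = (1/n) * sum of (observed - predicted)^2 over all n points."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


observed = [3.0, 5.0, 7.0]   # hypothetical target values
fitted = [2.0, 5.0, 9.0]     # hypothetical model outputs
mse = mean_squared_error(observed, fitted)   # (1 + 0 + 4) / 3
print(mse)
```

Squaring makes every error count as positive and penalizes large errors more heavily than small ones.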
Machine Learning Pipeline
Stages: Data Preparation, Training Data / Test Data, Feature Selection, ML Algorithm Selection, Building a model, Model Evaluation
Model Validation
● Goodness of fit (fit error)
● Goodness of prediction (prediction error): generalization error
Overfitting:
an unnecessary increase in model complexity
Underfitting:
a model that is too simple will not fit the data properly
k-Fold Cross Validation
k=4 Cross Validation
Source: [wiki]
Mean Squared Prediction Error: computed on q data points that were not used in estimating the model
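A minimal sketch of k-fold cross-validation with k=4, assuming a deliberately trivial stand-in model that always predicts the mean of its training data. The helper names and the data are hypothetical, and the folds are taken in order without shuffling, which real workflows usually add.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k consecutive folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_mspe(y, k):
    """Mean squared prediction error, averaged over the k held-out folds."""
    fold_errors = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train = [y[i] for i in range(len(y)) if i not in held_out]
        prediction = sum(train) / len(train)   # "model" fitted on training folds only
        errors = [(y[i] - prediction) ** 2 for i in test_idx]   # the q unseen points
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
folds = k_fold_indices(len(y), 4)
score = cross_validated_mspe(y, 4)
print(folds, score)
```

Every data point is used for testing exactly once and for training k-1 times, so the averaged error estimates how the model behaves on data it has not seen.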
Machine Learning Pipeline
Stages: Data Preparation, Training Data / Test Data, Feature Selection, ML Algorithm Selection, Building a model, Model Evaluation, New Data, Prediction Result
Get your hands dirty
Source: [karlstratos]
Installing Docker with the Anaconda image
1. Install Docker:
> sudo apt install docker.io
2. Add your current user to the docker group:
> sudo usermod -a -G docker $USER
3. Restart your computer
4. Register and proceed at https://hub.docker.com/_/anaconda
5. Download the Anaconda Docker image:
> docker pull continuumio/anaconda
6. Run the container:
> docker run -i -t continuumio/anaconda /bin/bash
7. Test your conda environment:
(base) root@9b9e483ba80e:/opt/conda# conda info
Running Jupyter Notebook
Run the following command in one line from the host machine:
> docker run -i -t -p 8888:8888 continuumio/miniconda /bin/bash -c "/opt/conda/bin/conda install jupyter -y --quiet && mkdir /opt/notebooks && /opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks --ip=0.0.0.0 --port=8888 --no-browser --allow-root"
- Open your Notebook in the browser
- Open a terminal and install: numpy, pandas, matplotlib, scipy and scikit-learn
Local Download server
172.90.0.161
Python Libraries for Machine Learning
● NumPy (http://www.numpy.org/):
○ Introduces objects for multidimensional arrays and matrices
○ Provides vectorization of mathematical operations on arrays and matrices
● SciPy (https://www.scipy.org/scipylib/):
○ Collection of algorithms for linear algebra, statistics, optimization, etc.
○ Built on NumPy
● Pandas (http://pandas.pydata.org/):
○ Provides tools for data manipulation and handling missing data
● scikit-learn (https://scikit-learn.org/stable/):
○ Provides machine learning algorithms: classification, regression, clustering, model validation, etc.
● Matplotlib (https://matplotlib.org/):
○ Python 2D plotting library
Pandas DataFrame Data Types
Pandas type | Python native type | Description
object | string | The most general dtype. Will be assigned to your column if it contains mixed types (numbers and strings).
int64 | int | Whole numbers. 64 refers to the number of bits allocated to hold each value.
float64 | float | Numbers with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, because NaN is itself a floating-point value.
datetime64, timedelta[ns] | N/A (but see the datetime module in Python’s standard library) | Values meant to hold time data. Look into these for time series experiments.
DataFrame Attributes
df.attribute | description
dtypes | list the types of the columns
columns | list the column names
axes | list the row labels and column names
ndim | number of dimensions
size | number of elements
shape | return a tuple representing the dimensionality
values | NumPy representation of the data
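The attributes above can be tried on a small data frame. This is only a sketch: the frame below is invented toy data, not the workshop dataset, and it assumes pandas is installed.

```python
import pandas as pd

# Hypothetical toy data frame, invented for illustration.
df = pd.DataFrame({
    "name": ["Ada", "Ben", "Cleo"],
    "age": [34, 29, 41],
    "salary": [5200.0, 4100.5, 6300.0],
})

print(df.dtypes)    # type of each column
print(df.columns)   # column names
print(df.axes)      # [row labels, column names]
print(df.ndim)      # number of dimensions (2 for a DataFrame)
print(df.size)      # number of elements: rows * columns
print(df.shape)     # (rows, columns)
print(df.values)    # the data as a NumPy array
```

Note that these are attributes, not methods, so they are read without parentheses.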
Exercise with DataFrame Attributes
1. How many records does this data frame have?
2. How many elements are there?
3. What are the column names?
4. What types of columns do we have in this data frame?
DataFrame Methods
df.method() | description
head([n]), tail([n]) | first/last n rows
describe() | generate descriptive statistics (for numeric columns only)
max(), min() | return max/min values for all numeric columns
mean(), median() | return mean/median values for all numeric columns
std() | standard deviation
sample([n]) | returns a random sample of the data frame
dropna() | drop all the records with missing values
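A short sketch of these methods on invented toy data (the column names and values are hypothetical, and pandas is assumed to be installed). One value is deliberately missing so that `dropna()` has something to remove.

```python
import pandas as pd

# Hypothetical toy data with one missing weight.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, None, 80.0],
})

first_two = df.head(2)          # first 2 rows
last_row = df.tail(1)           # last row
summary = df.describe()         # count, mean, std, min, quartiles, max
column_means = df.mean()        # one mean per numeric column
column_std = df.std()           # sample standard deviation per column
one_random_row = df.sample(1)   # a random row
complete_rows = df.dropna()     # only rows without missing values

print(summary)
print(column_means)
```

The missing weight is skipped when the statistics are computed, so the `count` row of `describe()` differs between the two columns.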
Exercise with DataFrame Methods
1. Give the summary for the numeric columns in the dataset
2. Calculate standard deviation for all numeric columns
3. What are the mean values of the first 50 records in the dataset?
Hint: use head() method to subset the first 50 records and then calculate the mean
Handling Missing Values
● NaN (Not a Number) marks missing values
● Often replaced by arbitrarily chosen values such as -1 (in a feature with only positive numbers), 0, or
the median (most common)
● Be aware that replacing values changes the data
● Alternatively, ignore the samples or features that have missing values
Missing Values in Pandas
● Missing values are excluded by the GroupBy method
● Many descriptive statistics methods have a ‘skipna’ option to control whether missing data should
be excluded. This option is set to True by default.
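Both behaviours can be checked on a tiny frame. This is a sketch on invented data, assuming pandas is installed; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one missing group key and one missing score.
df = pd.DataFrame({
    "team": ["a", "a", "b", None],
    "score": [1.0, None, 3.0, 4.0],
})

mean_skipping_nan = df["score"].mean()           # skipna=True by default
mean_with_nan = df["score"].mean(skipna=False)   # NaN, because a value is missing
group_sums = df.groupby("team")["score"].sum()   # the row without a team is left out

print(mean_skipping_nan)
print(mean_with_nan)
print(group_sums)
```

With the default settings the missing score is ignored in the mean, and the row whose group key is missing does not appear in any group.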
Dealing with Missing Values in DF
df.method() | description
dropna() | Drop observations with missing values
dropna(how='all') | Drop observations where all cells are NA
dropna(axis=1, how='all') | Drop a column if all its values are missing
dropna(thresh=5) | Drop rows that contain fewer than 5 non-missing values
fillna(0) | Replace missing values with zeros
isnull() | Returns True if the value is missing
notnull() | Returns True for non-missing values
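The effect of each call is easiest to see on a small frame with a known pattern of gaps. The frame below is invented for illustration (pandas assumed installed), and `thresh=2` is used instead of 5 only because the toy frame has three columns.

```python
import pandas as pd

nan = float("nan")
# Hypothetical toy data: row 2 and column "c" are entirely missing.
df = pd.DataFrame({
    "a": [1.0, nan, nan],
    "b": [2.0, 5.0, nan],
    "c": [nan, nan, nan],
})

no_missing = df.dropna()                            # no row is complete here
not_all_missing = df.dropna(how="all")              # drops only the all-NaN row
without_empty_cols = df.dropna(axis=1, how="all")   # drops column "c"
at_least_two = df.dropna(thresh=2)                  # keeps rows with >= 2 values
zeros = df.fillna(0)                                # NaN replaced by 0
mask = df.isnull()                                  # True where a value is missing

print(not_all_missing)
print(without_empty_cols)
```

None of these calls modifies `df` itself; each returns a new data frame.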
Source: [Print_Lego]
Building a Linear Regression Model
Mean Squared Error (MSE): Measures the average of the squares of the errors
R-Squared
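$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$
where $SS_{res} = \sum_i (y_i - \hat{y}_i)^2$ and $SS_{tot} = \sum_i (y_i - \bar{y})^2$.
Here, $\hat{y}_i$ is the fitted value for observation i and $\bar{y}$ is the mean of Y.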
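A small sketch of R-squared, assuming the usual definition: one minus the ratio of the squared residuals (observed minus fitted) to the squared deviations from the mean. The function name and the numbers are invented for illustration.

```python
def r_squared(y_true, y_fitted):
    """R^2 = 1 - sum((y_i - fitted_i)^2) / sum((y_i - mean_y)^2)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0, 4.0]   # hypothetical observations
fitted = [1.5, 2.0, 3.0, 3.5]     # hypothetical fitted values
r2 = r_squared(observed, fitted)  # 1 - 0.5 / 5.0
print(r2)
```

A value close to 1 means the fitted values explain most of the variation in Y, while a model that always predicts the mean scores 0.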
k-Nearest Neighbors
Distance Measurements
KNN Algorithm
Accuracy
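The slides for this part are figures, so what follows is only a generic sketch of k-nearest neighbors, not the workshop's own code. It assumes Euclidean distance, a majority vote among the k closest training points, and accuracy as the fraction of correct predictions; all data points and labels are invented.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_points, train_labels),
                     key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical two-feature training data with two classes.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 5.5), (6.5, 7.0)]
train_labels = ["red", "red", "red", "blue", "blue", "blue"]
test_points = [(1.2, 1.4), (6.8, 6.1)]
test_labels = ["red", "blue"]

predictions = [knn_predict(train_points, train_labels, p, k=3) for p in test_points]
score = accuracy(test_labels, predictions)
print(predictions, score)
```

There is no separate training step: the "model" is the stored training data, and all the work happens when a prediction is requested.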
K-Means Clustering
K-Means Clustering Algorithm
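As with the previous section, the slides here are figures, so this is a generic sketch of the standard k-means loop rather than the workshop's code: assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. The points are invented, and the starting centroids are fixed by hand for reproducibility (they are usually chosen at random). `math.dist` requires Python 3.8 or later.

```python
import math


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:   # an empty cluster keeps its previous centroid
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centroids


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical unlabeled data forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels, centroids)
```

No labels are given to the algorithm; the cluster numbers it returns are arbitrary identifiers, not class names.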
Future Plans?
Further Learning
● Kaggle: a place to do data science projects
● Seeing Theory: a visual introduction to probability and statistics
● KDnuggets: Machine Learning, Data Science, Data Mining, Big Data, Analytics, AI.
Software
Reading Recommendations
● Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Peter Flach
● Python for Data Analysis by Wes McKinney
● https://www.kdnuggets.com/2018/12/feature-engineering-explained.html
References
[ML_Awad] Awad M., Khanna R. (2015) Machine Learning. In: Efficient Learning Machines. Apress, Berkeley, CA
[xkcd_1838] https://xkcd.com/1838/
[fortune] http://fortune.com/2018/06/25/ai-business-breakthrough/
[ML_Flach] Flach, P. (2012). Machine Learning: The art and science of algorithms that make sense of data. Cambridge University Press.
[ML_Mitchell] Mitchell, T. M. (1997). Machine Learning. New York: McGraw-Hill. ISBN: 978-0-07-042807-2
[Medium_Sharma] https://medium.com/datadriveninvestor/how-to-built-a-recommender-system-rs-616c988d64b2
[karlstratos] http://karlstratos.com/drawings/drawings.html
[Print_Lego] https://www.pinterest.com/pin/422071796300372061/
[Medium] https://medium.com/@mehulved1503/feature-selection-and-feature-extraction-in-machine-learning-an-overview-57891c595e96
[researchgate] https://www.researchgate.net/figure/Hierarchical-clustering-of-the-181-genes-corresponding-to-zinc-biology-related-functional_fig6_26688269
References (2)
[radimrehurek] https://radimrehurek.com/data_science_python/
[wiki] https://en.wikipedia.org/wiki/Cross-validation_(statistics)
Icon References
● Icons made by: Freepik from www.flaticon.com is licensed by CC 3.0 BY
● Icons made by: Pixel perfect from www.flaticon.com is licensed by CC 3.0 BY
● Icons made by: Vectors Market from www.flaticon.com is licensed by CC 3.0 BY
● Icons made by: Smashicons from www.flaticon.com is licensed by CC 3.0 BY
We organize IT
24.04.2019
Your Contact
Dr. Hamzeh Alavirad
Founder, oranIT GmbH
alavirad@oranit.de
0049-176-8080-7585
Dr. Parinaz Ameri
Co-Founder, oranIT GmbH
ameri@oranit.de
0049-176-3497-0683
