📈
Machine Learning Notes
Machine Learning Lifecycle:
1. Problem Definition: Defining the project requirements and business requirements.
Defining data requirements and modules.
2. Data Selection: Collect and prepare all of the relevant data from the dataset used in
machine learning.
3. Descriptive Statistics: Descriptive statistics are used to describe or summarize the
characteristics of a sample or data set.
4. Exploratory Data Analysis: Analysis of data. Find hidden patterns in the dataset.
5. Data Preprocessing: Data cleaning, imputing (filling in missing data) and getting
more useful and relevant data.
6. Data Transformation: Transforming the relevant data into an appropriate form.
Encoding techniques (e.g. one-hot), scaling, etc.
7. Feature Selection: Selecting useful and informative features (attributes) and
eliminating irrelevant features; optimizing the features. Filtering out the best
features so that only the required subset of the data is used.
8. Model Selection: Selecting the model based on the variables. Selecting the right
algorithm.
9. Model Training: 80-20 rule (80% training data, 20% test data); working on getting maximum
accuracy in the training stage.
10. Model Evaluation: Model evaluation aims to estimate the generalization accuracy
of a model on future (unseen/out-of-sample) data.
11. Model Deployment: The process of taking a trained ML model and making its
predictions available to users or other systems is known as deployment.
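A minimal sketch of the 80-20 split from step 9 and an accuracy check for step 10, using only the standard library. The function names, the fixed seed and the toy data are assumptions made for illustration, not part of the original notes.

```python
import random

def train_test_split_80_20(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows and split them into (train, test) lists."""
    shuffled = list(rows)                      # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)      # seeded shuffle for repeatable splits
    cut = int(len(shuffled) * train_fraction)  # index where the test part starts
    return shuffled[:cut], shuffled[cut:]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

rows = list(range(10))                         # toy dataset of 10 data points
train, test = train_test_split_80_20(rows)
print(len(train), len(test))
```

Accuracy should be measured on the test part, since evaluation is about unseen data.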
Basic Terminologies:
Feature matrix/Data Matrix:
Matrix of all features
Features/Attributes:
Columns in a dataset
N-dimensional array/Data points:
Rows in a dataset
Dataset:
Set of data used for training model
Dependent/Output (y-axis) variable:
Variable which is the output of, or is predicted by, a trained model
Independent/Input (x-axis) variable:
Variable which is used as input to a model
Target:
The variable to be predicted (the dependent variable)
Types of Data:
Continuous variables- Always numeric, continuous and infinite, eg: height, score
Discrete variables- Numeric or categorical, countable and finite, eg: number of
fruits, gender, pincode, etc.
VLOOKUP() in Excel:
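VLOOKUP() - looks up a value in the first column of a table and returns a value from
another column of the matching row; used for merging tables and fetching data from multiple tables.
VLOOKUP(search criterion; array; index; sort)
index - number of the column in the array to return
sort - whether the first column of the array is sorted in ascending order; use 0 for an exact match
eg: VLOOKUP(State_ID; userState.A2:An; index; sort)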
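The same lookup idea can be sketched in Python with a dictionary acting as the lookup table. The table contents, the `vlookup` helper and the field names below are hypothetical and only mirror the State_ID example above.

```python
# Hypothetical lookup table: State_ID -> state name (plays the role of the userState sheet).
user_state = {1: "Kerala", 2: "Goa", 3: "Punjab"}

# Hypothetical main table whose rows only carry the State_ID.
users = [
    {"name": "Asha", "State_ID": 2},
    {"name": "Ravi", "State_ID": 1},
    {"name": "Meera", "State_ID": 9},  # no matching row in the lookup table
]

def vlookup(key, table, default=None):
    """Exact-match lookup: return the value stored for key, or default if absent."""
    return table.get(key, default)

# Merge the two tables by adding the looked-up state name to every user row.
merged = [{**user, "state": vlookup(user["State_ID"], user_state)} for user in users]
for row in merged:
    print(row)
```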
Types of Data Analysis:
UNIVARIATE ANALYSIS:
only using one feature
BIVARIATE ANALYSIS:
numeric vs numeric
categoric vs categoric
numeric vs categoric
MULTIVARIATE ANALYSIS:
using multiple features for doing analysis
min() - returns the minimum value in a particular dataset
An outlier is any data point that lies outside the expected range of your dataset. Anything
below the lower limit or above the upper limit is an outlier.
Upper limit = Q3 + 1.5*IQR
Lower limit = Q1 - 1.5*IQR
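A small sketch of the outlier limits using Python's `statistics` module. The "inclusive" quartile method is an assumption; other quartile conventions give slightly different limits. The sample data is made up.

```python
import statistics

def iqr_limits(values):
    """Return (lower limit, upper limit) as Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values):
    """Return the values that fall below the lower limit or above the upper limit."""
    lower, upper = iqr_limits(values)
    return [v for v in values if v < lower or v > upper]

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]   # made-up sample with one extreme value
print(iqr_limits(data))
print(find_outliers(data))
```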
AVERAGE() - used for calculating the mean
MEDIAN() - used for calculating the median
Coefficient of dispersion based on range: (max-min)/(max+min)
Coefficient of dispersion based on mean deviation: mean deviation/mean
Coefficient of dispersion based on quartile deviation: (Q3-Q1)/(Q3+Q1)
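The three coefficients can be sketched as follows: (max-min)/(max+min) for the range, mean deviation/mean, and (Q3-Q1)/(Q3+Q1) for the quartiles. The function names and sample data are assumptions, and the quartiles use the "inclusive" method as one possible convention.

```python
import statistics

def coeff_range(values):
    """(max - min) / (max + min)"""
    return (max(values) - min(values)) / (max(values) + min(values))

def coeff_mean_deviation(values):
    """Mean absolute deviation from the mean, divided by the mean."""
    mean = statistics.fmean(values)
    mean_deviation = sum(abs(v - mean) for v in values) / len(values)
    return mean_deviation / mean

def coeff_quartile_deviation(values):
    """(Q3 - Q1) / (Q3 + Q1)"""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / (q3 + q1)

data = [2, 4, 6, 8, 10]   # made-up sample
print(coeff_range(data), coeff_mean_deviation(data), coeff_quartile_deviation(data))
```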
Quartiles divide the data into 4 parts:
Q2=median
Q1=25%, Q2=50%, Q3=75%, Q4=100%
QUARTILE()
IQR (INTERQUARTILE RANGE)
Q3-Q1=IQR
QUARTILE DEVIATION=IQR/2
Frequency table
-Divides the data into particular ranges (classes)
-FREQUENCY(data; classes)
-returns an array of counts
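A rough Python equivalent of a frequency table, assuming each class value is the upper bound of a range and that one extra count is kept for values above the last class. The `frequency` helper and sample values are hypothetical.

```python
from bisect import bisect_left

def frequency(data, classes):
    """Count how many values fall into each range ending at a class upper bound.

    The returned list has one more entry than classes; the last entry counts
    the values greater than the largest class bound.
    """
    bounds = sorted(classes)
    counts = [0] * (len(bounds) + 1)
    for value in data:
        # bisect_left finds the first bound that is >= value
        counts[bisect_left(bounds, value)] += 1
    return counts

print(frequency([1, 5, 10, 12, 20, 25], [10, 20]))   # [3, 2, 1]
```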
Pivot table for univariate categorical
Pie chart is used to show how 100% of the data is split into parts
Bivariate Numeric vs Numeric
Correlation is how two variables are related
Correlation ranges from 1 to -1
1=two variables highly positively correlated
-1=highly negatively correlated(inversely)
0=no correlation
R-square is the square of correlation
Trendline is line of best fit
f(x) is the line equation (y=mx+c) in graph
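A minimal sketch of the correlation, R-square and the trendline y=mx+c, written out with plain sums so each formula is visible. It assumes the Pearson correlation and a least-squares line of best fit; the sample x and y values are made up.

```python
import math

def _sums(xs, ys):
    """Return the centred sums Sxx, Syy, Sxy and the two means."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, syy, sxy, mx, my

def pearson_r(xs, ys):
    """Correlation coefficient, between -1 and 1."""
    sxx, syy, sxy, _, _ = _sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)

def trendline(xs, ys):
    """Slope m and intercept c of the least-squares line y = m*x + c."""
    sxx, _, sxy, mx, my = _sums(xs, ys)
    m = sxy / sxx
    return m, my - m * mx

xs, ys = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]   # made-up, perfectly linear data
r = pearson_r(xs, ys)
print(r, r ** 2, trendline(xs, ys))          # correlation, R-square, (m, c)
```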
Bivariate categorical vs categorical
Eg gender and state
Bivariate numeric vs categorical
eg: weight and gender
Multivariate: analysis on multiple variables
eg: average height and weight for each state and each gender
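The multivariate example can be sketched as a group-by average in plain Python. The rows, column names and the `group_mean` helper are hypothetical.

```python
from collections import defaultdict

def group_mean(rows, keys, value):
    """Average the value column for every combination of the key columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[value])
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}

# Made-up rows: one person per row.
rows = [
    {"state": "Goa", "gender": "F", "height": 160, "weight": 55},
    {"state": "Goa", "gender": "F", "height": 170, "weight": 65},
    {"state": "Goa", "gender": "M", "height": 175, "weight": 70},
]
print(group_mean(rows, ["state", "gender"], "height"))
print(group_mean(rows, ["state", "gender"], "weight"))
```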
CONCATENATE(col1;" ";col2;...;coln) - concatenates columns, eg: names having more
than 1 word
Removing inconsistencies from tables: PROPER(TRIM(text)) - converts to proper case and
removes extra spaces
UPPER() - uppercase and LOWER() - lowercase
Combine TRIM() with other functions to remove extra spaces
Removing duplicates: use advanced filters > no duplication check
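The same cleaning steps can be sketched in Python: collapsing extra spaces (like TRIM), proper case (like PROPER), joining columns, and dropping duplicates. The helper names and sample names are assumptions for illustration.

```python
def clean_name(text):
    """Remove leading, trailing and repeated spaces, then convert to proper case."""
    return " ".join(text.split()).title()

def concatenate(*parts, sep=" "):
    """Join several column values into one string."""
    return sep.join(parts)

def remove_duplicates(items):
    """Drop repeated items while keeping the first occurrence and the original order."""
    return list(dict.fromkeys(items))

raw = ["  john   SMITH ", "John Smith", "asha  rao"]   # made-up messy names
cleaned = remove_duplicates(clean_name(name) for name in raw)
print(cleaned)
print(concatenate("John", "Smith"))
```

Cleaning before removing duplicates matters here: the first two names only become identical once the case and spacing are fixed.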
Imputation: filling in missing data, using the average (mean), median or mode of the column;
if a column has 70 to 80% NA, don't use it for the model, even if you fill in the data
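A short sketch of filling missing values with the mean, median or mode of a column, plus a helper that reports the share of missing values. Missing entries are represented as `None`, which is an assumption; the function names are hypothetical.

```python
import statistics

def missing_fraction(values):
    """Share of entries in the column that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def impute(values, strategy="mean"):
    """Replace None with the mean, median or mode of the values that are present."""
    present = [v for v in values if v is not None]
    fill_functions = {
        "mean": statistics.fmean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill = fill_functions[strategy](present)
    return [fill if v is None else v for v in values]

column = [1, None, 3]              # made-up column with one missing value
print(missing_fraction(column))
print(impute(column, "mean"))
```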
Outliers: Anything below the lower limit or above the upper limit; UL=Q3+1.5*IQR, LL=Q1-1.5*IQR
Normalization: rescaling the data to a common range of 0 to 1
(X-min)/(max-min)
X - value to be normalized
min - minimum of X's column
max - maximum of X's column
max-min >= X-min, so the result is always between 0 and 1
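A minimal sketch of min-max normalization, using the standard form (X-min)/(max-min) so that the smallest value maps to 0 and the largest to 1. Returning 0.0 for a column where all values are equal is an assumption made to avoid dividing by zero.

```python
def min_max_normalize(values):
    """Rescale a column of numbers to the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # every value is the same, so there is no spread to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10, 20, 30]))   # [0.0, 0.5, 1.0]
```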
Standardization:
Regression, Linear regression, correlation
📈 Machine learning using scikit learn
📈 Machine Learning Axioms
📈 Deep Learning-Chorale Prelude + Ingression to DL
📈 Neural Networks and Deep Learning
📈 Convolutional Neural Network
📈 Machine Learning -Exploring the model
📈 Understanding Conversational Systems