Edureka Machine Learning Ebook
IN: 9606058406
sales@edureka.co
US: 18338555775
MASTERING MACHINE LEARNING WITH EDUREKA
TABLE OF CONTENTS
8. CAREER GUIDANCE
How to become an ML Professional?
Edureka's Structured Training Programs
WWW.EDUREKA.CO/MACHINE-LEARNING
Chapter 1
INTRODUCTION TO MACHINE LEARNING
Machine Learning is one of the most in-demand technologies in today’s
market. Its applications range from self-driving cars to predicting deadly
diseases such as ALS. The term Machine Learning was coined by Arthur
Samuel in 1959. If you search the web for 'What is Machine Learning',
you’ll find many different definitions. One of the most widely cited formal
definitions is:
"A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P if its performance at
tasks in T, as measured by P, improves with experience E."
Tom M. Mitchell
1.1 What is Machine Learning?
In simple terms, Machine Learning is a subset of
Artificial Intelligence (AI) that gives
machines the ability to learn automatically and
improve from experience without being explicitly
programmed to do so. In this sense, it is the
practice of getting machines to solve problems by
learning from data. The Machine Learning
process involves building a predictive model that
can be used to find a solution for a problem
statement.
[Figure: steps in the Machine Learning process]
MACHINE LEARNING APPLICATIONS
Chapter 2
2 MODEL
A model is the main component of Machine Learning and is trained by using an algorithm. It maps the
given input to an output, encoding the decisions needed to arrive at the correct result.
3 PREDICTOR VARIABLE
It is one or more features of the data that can be used to predict the output.
4 RESPONSE VARIABLE
It is the feature or output variable that needs to be predicted by using the predictor variable(s).
5 TRAINING DATA
The Machine Learning model is built using the training data. The training data helps the model identify
key trends and patterns essential to predict the output.
6 TESTING DATA
After the model is trained, it must be tested to evaluate how accurately it can predict an outcome. This is
done using the testing data set.
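To make these terms concrete, here is a minimal sketch of separating a dataset into training and testing data, and into predictor and response variables. The dataset, the column meanings, and the helper `train_test_split` are all hypothetical and written only for illustration; they are not taken from any particular library.

```python
import random

# Hypothetical toy dataset: each row is (hours_studied, exam_score).
# hours_studied is the predictor variable, exam_score the response variable.
rows = [(h, 10 * h + 5) for h in range(1, 11)]

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle a copy of the data and split it into training and testing sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(rows)

# Separate the predictor (X) and the response (y) for each split.
X_train = [x for x, _ in train]
y_train = [y for _, y in train]
X_test = [x for x, _ in test]
y_test = [y for _, y in test]

print(len(train), "training rows,", len(test), "testing rows")
```

The model would be fitted on `X_train` and `y_train` only, and its predictions for `X_test` compared against `y_test`.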
Chapter 3
3.1 NumPy
NumPy is a Python package whose name stands for ‘Numerical Python’. It is the core library for scientific
computing in Python and contains a powerful n-dimensional array object. NumPy also provides tools
for integrating code written in C, C++, etc., along with linear algebra and random number capabilities.
NumPy arrays can also be used as an efficient multi-dimensional container for generic data.
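A short sketch of the features mentioned above, assuming NumPy is installed; the array values are made up for illustration.

```python
import numpy as np

# A 2-dimensional array (2 rows, 3 columns) of made-up values.
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(a.shape)          # dimensions of the array
print(a.mean(axis=0))   # column-wise mean

# Linear algebra: multiply the array by its transpose (2x3 @ 3x2 -> 2x2).
gram = a @ a.T

# Random number capability: a seeded generator gives reproducible draws.
rng = np.random.default_rng(0)
noise = rng.normal(size=a.shape)
```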
3.2 Pandas
Pandas is an open-source software library that is built on top of NumPy. It is used for data manipulation,
analysis and cleaning. Python pandas is well suited for different kinds of data, such as:
1. Tabular data with heterogeneously-typed columns
2. Ordered and unordered time series data
3. Arbitrary matrix data with row & column labels
4. Unlabelled data
5. Any other form of observational or statistical data sets
3.3 Scikit-learn
Scikit-learn is a library used to perform Machine Learning in Python. It is an open-source
library licensed under BSD and is reusable in various contexts, encouraging academic and
commercial use. It provides a range of popular supervised and unsupervised learning algorithms and
builds on other scientific Python packages.
To use Scikit-learn, we first need to install and import these packages. You can download them
using the command line or, if you are using PyCharm, install them directly from
your settings in the same way you do for other packages.
Chapter 4
MACHINE LEARNING CLASSIFICATION
A machine can learn to solve a problem by following any one of the three approaches covered in this chapter.
1 REGRESSION
2 CLASSIFICATION
4.1.1 Regression
Regression is the kind of Supervised Learning that learns from labeled datasets and is then able to
predict a continuous-valued output for new data given to the algorithm. It is used whenever the
required output is a number, such as an amount of money or a height. Some popular Supervised Learning
algorithms are discussed below:
1 LINEAR REGRESSION
This algorithm assumes that there is a linear relationship
between the two variables, Input (X) and Output (Y), of the
data it has learnt from. The Input variable is called the
Independent Variable and the Output variable is called the
Dependent Variable. When unseen data is passed to the
algorithm, it applies the learned linear function to map the input
to a continuous value for the output.
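As an illustration, a straight line can be fitted with ordinary least squares in a few lines of plain Python. This is a minimal sketch for a single input variable; the data and the function names `fit_line` and `predict` are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training data that follows y = 2x + 1 exactly.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)

def predict(x):
    """Map an unseen input to a continuous output using the fitted line."""
    return slope * x + intercept

print(predict(6))
```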
2 LOGISTIC REGRESSION
Despite its name, this algorithm predicts discrete values (classes) for the set of
Independent variables that have been passed to it. It makes
the prediction by passing a weighted sum of the unseen data through the logistic
(sigmoid) function. The function returns the probability that the
new data belongs to a class, so its output
lies between 0 and 1.
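The prediction step can be sketched as follows. The `weight` and `bias` values are hypothetical stand-ins for parameters that training would normally produce; only the logistic function itself is standard.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical, already-trained parameters for a single input feature.
weight, bias = 1.5, -3.0

def predict_proba(x):
    """Probability that x belongs to class 1."""
    return sigmoid(weight * x + bias)

def predict_class(x, threshold=0.5):
    """Turn the probability into a discrete class label (0 or 1)."""
    return 1 if predict_proba(x) >= threshold else 0

for x in (0.0, 2.0, 4.0):
    print(x, round(predict_proba(x), 3), predict_class(x))
```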
3 POLYNOMIAL REGRESSION
Polynomial Regression is a method used to handle non-linear data,
that is, data for which a straight line cannot describe the relationship
between the dependent and independent variables. It fits a
polynomial in the independent variable instead of a straight line.
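A small sketch of the difference, assuming NumPy is installed. It uses `np.polyfit` to fit both a straight line and a degree-2 polynomial to made-up data that follows a parabola, then compares the squared errors.

```python
import numpy as np

# Made-up non-linear data: y = x^2 + 1.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2 + 1.0

# Fit a straight line (degree 1) and a parabola (degree 2).
line = np.polyfit(x, y, deg=1)
curve = np.polyfit(x, y, deg=2)

# Sum of squared errors for each fit on the same data.
line_error = np.sum((np.polyval(line, x) - y) ** 2)
curve_error = np.sum((np.polyval(curve, x) - y) ** 2)

print("line error:", line_error)
print("curve error:", curve_error)
```

The degree of the polynomial is a choice made by the modeller; a degree that is too high can fit noise rather than the underlying relationship.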
4.1.2 Classification
Classification, on the other hand, is the kind of learning where the algorithm needs to map new data
to one of the classes in our dataset. In the binary case there are 2 classes, mapped
to either 1 or 0, which in real life translates to ‘Yes’ or ‘No’, ‘Rains’ or ‘Does Not Rain’ and so forth. The
output will be one of the classes and not a number as it was in Regression. Some of the most
well-known algorithms are discussed below:
2 DECISION TREE
Decision Trees classify based on the feature values.
They use Information Gain to find out
which feature of the dataset gives the most
information, make that feature the root node, and repeat
the process on each branch until
they are able to classify each instance of the dataset.
Every internal node in the Decision Tree tests a feature
of the dataset, and every branch represents an outcome
of that test. They are one of the most widely used
algorithms for classification.
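Choosing the root node by Information Gain can be sketched as below. The tiny weather dataset and the function names are invented for the example; entropy is computed in bits.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on `feature`."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical dataset: the class depends only on "outlook".
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

gains = {f: information_gain(rows, labels, f) for f in ("outlook", "windy")}
root = max(gains, key=gains.get)  # the feature with the highest gain becomes the root
print(gains, "-> root:", root)
```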
1 CLUSTERING
2 ASSOCIATION
4.2.1 Clustering
Clustering is the type of Unsupervised Learning where you find patterns in the data that you are
working on. These patterns may be based on shape, size, color, etc., and can be used to group data items
into clusters. Some popular algorithms in Clustering are discussed below:
1 HIERARCHICAL CLUSTERING
This algorithm builds clusters based on the similarity between different data points in the dataset. It goes
over the various features of the data points and looks for the similarity between them. Data points that are
found to be similar are grouped together, and similar groups are in turn merged. This continues until the
whole dataset has been grouped, which creates a hierarchy of clusters.
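The bottom-up (agglomerative) form of this idea can be sketched in plain Python. The example below assumes 1-D points and single linkage (the distance between two clusters is the distance between their closest members); both are simplifying choices made for illustration.

```python
def agglomerative(points):
    """Single-linkage agglomerative clustering on 1-D points.

    Returns the list of merges as (cluster_a, cluster_b, distance),
    in the order in which they happen.
    """
    clusters = [[p] for p in points]  # start with every point on its own
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

points = [1.0, 1.5, 5.0, 5.2, 9.0]
for a, b, d in agglomerative(points):
    print(a, "+", b, "at distance", round(d, 2))
```

The sequence of merges is the hierarchy: cutting it at a chosen distance yields a flat set of clusters.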
2 K-MEANS CLUSTERING
This algorithm works iteratively, and its main goal is to produce K clusters that can then be labeled.
The algorithm creates clusters of data points that are as homogeneous as possible by
calculating the centroid of each cluster and making sure that the distance between this centroid and the
data points in the cluster is as small as possible. The smallest distance between a data point and a centroid
determines which cluster the point belongs to, so the clusters do not overlap.
The centroid acts like the heart of the cluster. This ultimately gives us clusters which can be labeled as
needed.
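The assign-then-update loop can be sketched as follows. This is a minimal version for 1-D points with hand-picked starting centroids; real implementations usually choose the starting centroids randomly and work in several dimensions.

```python
def kmeans_1d(points, centroids, iterations=10):
    """K-means on 1-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```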
3 K-NN (K-NEAREST NEIGHBORS)
Although it is listed here, K-NN is a supervised
classification algorithm rather than a clustering
method. It is probably the simplest of the Machine
Learning algorithms, as it does not
really learn a model but rather classifies a new data
point based on the labeled data points that it has stored.
This algorithm is also called a lazy learner
because it does its work only when it is given
a new data point. It works well with smaller
datasets, as with huge datasets each prediction takes longer.
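A minimal sketch of the idea: store the labeled points, and at prediction time take a majority vote among the k nearest ones. The data, labels, and function name are invented for the example, and squared Euclidean distance is one common choice among several.

```python
from collections import Counter

def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest stored points."""
    by_distance = sorted(
        training,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled points: (features, label).
training = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((2.0, 1.5), "small"),
    ((8.0, 8.0), "large"),
    ((8.5, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

print(knn_predict(training, (2.0, 2.0)))
print(knn_predict(training, (8.0, 9.0)))
```

Note that every prediction scans the whole stored dataset, which is why the method slows down as the dataset grows.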
4.2.2 Association
Association is the kind of Unsupervised Learning where you find the dependencies of one data item on
another data item and map them so that they can be put to use, for example to increase profits. Some
popular algorithms in Association Rule Mining are discussed below:
1 APRIORI ALGORITHM
The Apriori Algorithm is a breadth-first search based algorithm
which calculates the support of itemsets. This
support measures how often data items occur together,
which can help us understand which
data item influences the likelihood of another
data item appearing. For example, bread
influences the buyer to buy milk and eggs, so that
mapping helps increase profits for the store. That sort
of mapping can be learnt using this algorithm, which
yields rules as its output.
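The level-by-level search can be sketched as below. The transactions are made up, and the candidate generation is deliberately simplified (it combines all items that survive the previous level), so treat it as an illustration of support and confidence rather than an optimized Apriori implementation.

```python
from itertools import combinations

# Made-up shopping baskets.
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(min_support):
    """Level-wise (breadth-first) search for frequent itemsets."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if support({i}) >= min_support]
    frequent = list(level)
    size = 2
    while level:
        # Candidates of the next size use only items from frequent itemsets.
        pool = sorted({i for s in level for i in s})
        level = [
            frozenset(c) for c in combinations(pool, size)
            if support(set(c)) >= min_support
        ]
        frequent.extend(level)
        size += 1
    return frequent

frequent = apriori(min_support=0.5)

# Confidence of the rule bread -> milk: support(bread and milk) / support(bread).
confidence = support({"bread", "milk"}) / support({"bread"})
print(len(frequent), "frequent itemsets; confidence(bread -> milk) =", confidence)
```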
2 FP-GROWTH ALGORITHM
The Frequent Pattern (FP) Growth algorithm counts how often each item occurs, adds the counts to a
table and then places the most frequent item nearest the root of the tree. Other data items are then
added into the tree and the support is calculated. If a particular branch fails to meet the support
threshold, it is pruned. Once all the iterations are completed, the resulting tree
is used to make the association rules. This algorithm is faster than Apriori as the support is
calculated and checked while the tree grows rather than creating a rule and checking the support against the
dataset.
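The counting and tree-building stage can be sketched as below. This covers only the construction of the prefix tree, not the later mining of rules from it; the transactions, the `min_count` threshold, and the nested-dictionary tree layout are all assumptions made for illustration.

```python
from collections import Counter

# Made-up shopping baskets.
transactions = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["bread", "eggs"],
    ["milk", "jam"],
    ["bread", "milk", "eggs"],
]
min_count = 2  # hypothetical support threshold, as an absolute count

# Pass 1: count each item and drop those below the threshold.
counts = Counter(item for t in transactions for item in t)
frequent = {item: c for item, c in counts.items() if c >= min_count}

def ordered(transaction):
    """Keep frequent items, most frequent first (ties broken alphabetically)."""
    kept = [i for i in transaction if i in frequent]
    return sorted(kept, key=lambda i: (-frequent[i], i))

# Pass 2: insert each ordered transaction into a prefix tree with counts.
# Each node is a dict mapping item -> [count, children].
tree = {}
for t in transactions:
    node = tree
    for item in ordered(t):
        entry = node.setdefault(item, [0, {}])
        entry[0] += 1
        node = entry[1]

print(tree)
```

Because transactions that share a prefix share a path, the counts on each path give the support of that combination without rescanning the dataset.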
Chapter 5
ADVANCED MACHINE LEARNING CONCEPTS
This chapter will introduce you to some of the advanced concepts of Machine Learning.
REGULARIZATION TECHNIQUES
Building a Machine Learning model is not just about feeding it data. Several deficiencies
can affect the accuracy of a model; Overfitting and Underfitting are two such
deficiencies that hinder both the accuracy and the performance of the model. Regularization
techniques counter overfitting by penalizing model complexity.
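As one illustration of a regularization technique, the sketch below adds an L2 (ridge) penalty to a one-variable least-squares fit without an intercept. Minimizing sum((y - w*x)^2) + penalty * w^2 gives w = sum(x*y) / (sum(x^2) + penalty), so a larger penalty shrinks the slope towards zero. The data and the penalty value are made up.

```python
def fit_slope(xs, ys, penalty=0.0):
    """Least-squares slope for y ~ w * x (no intercept) with an L2 penalty.

    Minimizes sum((y - w*x)^2) + penalty * w^2.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]  # made-up noisy data, roughly y = 2x

plain = fit_slope(xs, ys)                # no regularization
ridge = fit_slope(xs, ys, penalty=5.0)   # slope is shrunk towards zero

print("plain:", plain, "ridge:", ridge)
```

In practice the penalty strength is tuned on held-out data: too small and the model may still overfit, too large and it underfits.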
Chapter 6
WINE QUALITY
Properties of red and white Vinho Verde wine samples from the north
of Portugal. The goal here is to model wine quality based on some
physicochemical tests.
AMAZON REVIEWS
It contains approximately 35 million reviews from Amazon spanning
18 years. Data includes user information, product information,
ratings, and text review.
QUANDL
A great source of economic and financial data that is useful to build
models to predict stock prices or economic indicators.
IMF DATA
The International Monetary Fund (IMF) publishes data on
international finances, foreign exchange reserves, debt rates,
commodity prices, and investments.
IMAGENET
This de facto standard image dataset for new algorithms is organized
according to the WordNet hierarchy, where each node is depicted
by hundreds or thousands of images.
IMDB REVIEWS
Dataset for binary sentiment classification. It features 25,000
movie reviews for training and 25,000 for testing.
SENTIMENT140
Uses 1,600,000 tweets with emoticons pre-removed.
Chapter 7
FREQUENTLY ASKED INTERVIEW QUESTIONS
Machine Learning is a buzzword in the technology world right
now, and Machine Learning Professionals are in high demand.
This surge is due to evolving
technology and the generation of huge amounts of data, aka
Big Data. This chapter covers questions that will help
you in your Machine Learning interviews and open up various
career opportunities as a Machine Learning aspirant.
1. What are the different types of Machine Learning?
2. How would you explain Machine Learning to a school-going kid?
3. How does Deep Learning differ from Machine Learning?
4. Explain Classification and Regression.
5. What do you understand by Selection Bias?
6. What do you understand by Precision and Recall?
7. Explain false negative, false positive, true negative and true positive with a simple example.
8. What is a Confusion Matrix?
9. What is the difference between Inductive and Deductive learning?
10. How is KNN different from K-means clustering?
11. What is ROC curve and what does it represent?
12. What’s the difference between Type I and Type II error?
13. Is it better to have too many false positives or too many false negatives? Explain.
14. Which is more important to you – Model Accuracy or Model Performance?
15. What is the difference between Gini Impurity and Entropy in a Decision Tree?
16. What is the difference between Entropy and Information Gain?
17. What is Overfitting? And how do you ensure you’re not overfitting with a model?
18. Explain Ensemble learning technique in Machine Learning.
19. What is bagging and boosting in Machine Learning?
20. How would you screen for outliers and what should you do if you find one?
21. What are Collinearity and Multicollinearity?
22. What do you understand by Eigenvectors and Eigenvalues?
23. What is A/B Testing?
24. What is Cluster Sampling?
25. Running a binary classification tree algorithm is quite easy. But do you know how the tree decides on which variable to split at the root node and its succeeding child nodes?
26. Name a few libraries in Python used for Data Analysis and Scientific Computations.
27. Which library would you prefer for plotting in Python language: Seaborn or Matplotlib or Bokeh?
28. How are NumPy and SciPy related?
100+ MACHINE LEARNING INTERVIEW QUESTIONS & ANSWERS
Machine Learning Engineers
Machine Learning Engineers work in close collaboration with Data Scientists.
While Data Scientists extract meaningful insights from large datasets
and communicate the information to business stakeholders, Machine
Learning Engineers ensure that the models used by Data Scientists can
ingest vast amounts of real-time data for generating more accurate results.
Computer Vision Engineer
As a Computer Vision Engineer, you use software to handle the processing
and analysis of large data populations, and your efforts support the
automation of predictive decision-making efforts.
Software Developer/Engineer
www.edureka.co/masters-program/machine-learning-engineer-training
www.edureka.co/executive-programs/machine-learning-and-ai
www.edureka.co/machine-learning-certification-training
www.edureka.co/python-programming-certification-training
LEARNER'S REVIEWS
Edureka’s PGP helped me get on the right path. Live lectures by
experts working in the field helped us understand real time
application areas and scopes while learning. Overall, I am happy that
Edureka was part of my learning journey.

Awesome the way of teaching and support. And best part is
Edureka most focus on practice, practice makes a man perfect.

Its very simple in learning and interesting too. The way our
instructor is teaching us is simply awesome. The thing
which I like the most about Edureka is its Support
service, as I have got all my queries answered by them on
time. Thank you Edureka :)
Free Resources
2500+ Technical Blogs
3000+ Video Tutorials on YouTube
30+ Free Monthly Webinars
Active Community
About Us
There are countless online education marketplaces on the internet. And there’s us. We
are not the biggest. We are not the cheapest. But we are the fastest growing. We have
the highest course completion rate in the industry. We aim to become the largest
online learning ecosystem for continuing education, in partnership with corporates
and academia. To achieve that we remain ridiculously committed to our students. Be it
constant reminders, relentless masters or 24 x 7 online technical support - we will
absolutely make sure that you run out of excuses to not complete the course.