ML Module 01
Module-1
Introduction: Need for Machine Learning, Machine Learning Explained, Machine Learning
in Relation to other Fields, Types of Machine Learning, Challenges of Machine Learning,
Machine Learning Process, Machine Learning Applications.
Business organizations generate massive amounts of data daily, but previously struggled to
utilize it fully due to data being scattered across disparate systems and a lack of appropriate
analytical tools.
1. High volume of available data to manage: Big companies such as Facebook, Twitter, and YouTube generate huge amounts of data that grow at a phenomenal rate. It is estimated that this data approximately doubles every year.
2. The second reason is that the cost of storage has fallen. Hardware costs have also dropped. Therefore, it is now easier to capture, process, store, distribute, and transmit digital information.
3. The third reason for the popularity of machine learning is the availability of complex algorithms. Especially with the advent of deep learning, many algorithms are now available for machine learning.
What is data? All facts are data. Data can be numbers or text that can be processed by a
computer. Processed data is called information. This includes patterns, associations, or
relationships among data.
The objective of machine learning is to process this archival data so that organizations can make better decisions, design new products, improve business processes, and develop effective decision support systems.
Just as humans make decisions based on experience, computers build models based on patterns extracted from input data, and then use these models for prediction and decision-making. Essentially, for computers, the learned model serves as the equivalent of human experience. This concept is illustrated in Figure 1.2.
Fig. 1.2: (a) A Learning system for Humans (b) A Learning system for Machine Learning
The learning program summarizes the raw data in a model. Formally stated, a model is an
explicit description of patterns within the data in the form of:
1. Mathematical equations,
2. Relational diagrams like trees/graphs,
3. Logical if/else rules, or
4. Groupings called clusters.
Models in computer systems are the equivalent of human experience, and experience is based on data. Humans gain experience by various means: they gain knowledge by rote learning, they observe others and imitate them, and they learn a great deal from teachers and books. We learn many things by trial and error. Once knowledge is gained, when a new problem is encountered, humans search for similar past situations, formulate heuristics, and use them for prediction. In systems, by contrast, experience is gathered through these steps:
❖ Collection of data.
❖ Once data is gathered, abstract concepts are formed from it. This abstraction is used to generate concepts and is equivalent to the human idea of objects.
❖ Generalization converts the abstraction into an actionable form of intelligence. It can be viewed as an ordering of all possible concepts, so generalization involves ranking concepts, making inferences from them, and forming heuristics, which are the actionable aspect of intelligence. Heuristics are educated guesses for all tasks.
❖ Heuristics normally work, but occasionally they may fail. This is not the fault of the heuristic, as it is just a ‘rule of thumb’. Course correction is done by taking evaluation measures: evaluation checks the soundness of the models and applies corrections, if necessary, to generate better formulations.
Machine learning primarily draws on concepts from Artificial Intelligence, Data Science, and Statistics; it is the result of combining ideas from these diverse fields.
3. Deep Learning (DL) is a specialized area within Machine Learning that utilizes
neural networks to build models. Neural networks, inspired by the human brain,
consist of interconnected neurons that process information through activation functions.
Deep learning has driven significant advancements in areas like image recognition and
natural language processing.
In essence, AI is the broad concept, ML is a method within AI, and DL is a technique within
ML.
1.3.2 Machine Learning, Data Science, Data Mining, and Data Analytics
• Data Science: Data science is an interdisciplinary field that encompasses various areas
focused on gathering, analyzing, and extracting knowledge from data.
• Big Data: Big data is a field within data science that deals with massive datasets
characterized by volume, variety, and velocity, often used in machine learning
applications.
• Data Mining: Data mining is the process of discovering patterns and insights from
large datasets, often considered closely related to machine learning with a focus on
pattern extraction.
• Data Analytics: Data analytics focuses on examining raw data to draw useful
conclusions and includes various types like predictive analytics, which is closely related
to machine learning.
What does the word ‘learn’ mean? Learning, like adaptation, occurs as a result of the program’s interaction with its environment. It can be compared with the interaction between a teacher and a student. There are four types of machine learning:
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
4. Reinforcement learning
Supervised Learning:
• Labelled Data: Relies on training data that includes both input features and the corresponding correct output values (labels).
• Prediction Focus: Aims to learn a mapping function to predict outputs for new, unseen
inputs.
o Classification: A supervised learning task where the goal is to assign data points
to specific categories or classes (e.g., classifying emails as spam or not spam).
Unsupervised Learning:
• Unlabelled Data: Works with data that only contains input features, with no output labels.
Semi-supervised Learning:
• Combined Data: Uses a mix of labelled and unlabelled data for training.
• Bridging the Gap: Acts as a bridge between supervised and unsupervised learning.
Reinforcement Learning:
• Reward System: The agent learns through trial and error by taking actions and receiving rewards (or penalties) for those actions.
• Optimal Policy: The goal is to learn a policy (set of rules) that maximizes the
cumulative reward the agent receives over time.
❖ Ill-Posed Problems: Machine learning systems struggle with problems that are not
clearly defined or have incomplete specifications. They require "well-posed" problems
with sufficient information for a solution.
❖ Data Dependency: Machine learning heavily relies on the availability of large amounts
of high-quality data. Insufficient, missing, or incorrect data can significantly hinder the
performance and accuracy of machine learning models.
❖ Computational Requirements: Specialized hardware like GPUs or TPUs is often necessary to handle the complexity and time requirements of machine learning algorithms.
❖ Bias/Variance Tradeoff: Machine learning models face the challenge of balancing bias
and variance. Overfitting (high variance) occurs when a model performs well on
training data but poorly on test data due to memorizing noise. Underfitting (high bias)
happens when a model fails to capture the underlying patterns in the data. Finding the
right balance is essential for good generalization.
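As a rough illustration of this tradeoff, the sketch below (assuming NumPy and scikit-learn are available, and using synthetic data) compares training and test errors of polynomial models of increasing complexity: a very low degree tends to underfit (high bias), while a very high degree tends to overfit (high variance).

```python
# Sketch: observing underfitting vs. overfitting with polynomial models of
# different degrees. The data here is synthetic and purely illustrative.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.2, size=100)  # noisy target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):  # low, moderate, and high model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(X_tr))
    test_err = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

A small training error combined with a much larger test error is the typical signature of overfitting.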
The emerging process model for data mining solutions in business organizations is CRISP-DM. Since machine learning is similar to data mining, except for its aim, this process can also be used for machine learning. CRISP-DM stands for Cross Industry Standard Process for Data Mining. This process involves six steps.
1. Business Understanding:
2. Data Understanding:
3. Data Preparation:
4. Modelling:
5. Evaluation:
6. Deployment:
Machine learning technologies are now widely used in different domains. Machine learning applications are everywhere! One encounters many machine learning applications in day-to-day life. Some applications are listed below.
2. Voice Assistants: Virtual assistants like Siri and Alexa use machine learning to
understand and respond to voice commands, performing tasks or providing information.
3. Recommendation Systems: These systems use machine learning to predict what items
a user might like based on their past behavior or preferences, as seen in Amazon's book
suggestions or Netflix's movie recommendations.
UNDERSTANDING DATA
1.8 WHAT IS DATA? Data, facts encoded as bits in computer systems, can be human-
interpretable or require computer processing. Organizations accumulate massive datasets from
various sources, categorized as operational (for daily processes) or non-operational (for
decision-making). Raw data is meaningless until processed and labelled, transforming it into
information that reveals patterns and insights. This processed data enables analysis and
knowledge extraction, such as identifying top-selling products from sales data.
Data whose volume is small enough to be stored and processed by a small-scale computer is called ‘small data’. These data are collected from several sources, and integrated and processed by a small-scale computer. Big data, on the other hand, has a volume much larger than that of ‘small data’ and is characterized as follows.
1. Volume: Big Data involves massive amounts of data, far beyond what conventional small-scale systems can store and process.
2. Velocity: Data in Big Data flows in at high speeds from various sources like IoT
devices, requiring rapid processing.
3. Variety: Big Data comes in diverse formats, including structured, semi-structured, and
unstructured data like text, images, and videos.
• Form: Big Data encompasses diverse data types like text, images, audio,
video, and more, including complex combinations.
• Function: Big Data serves various purposes, from recording human
conversations to tracking transactions and preserving archival information.
• Source: Big Data originates from numerous sources, including public data,
social media, and multimodal platforms.
4. Veracity: Big Data has challenges with trustworthiness and quality due to potential
errors and inconsistencies in the data sources.
5. Validity: The accuracy and relevance of Big Data for decision-making or problem-
solving is crucial for its usefulness.
6. Value: The ultimate importance of Big Data lies in its capacity to provide valuable
insights that drive decisions and create positive impacts.
In Big Data, there are three kinds of data. They are structured data, unstructured data, and semi-
structured data.
Structured Data: Organized data stored in a structured format like a database table, easily
accessed with tools like SQL.
• Record Data: Data organized as a collection of records (rows) with attributes (columns),
often represented as a matrix.
• Data Matrix: A type of record data with numeric attributes, enabling standard matrix
operations.
• Graph Data: Data representing relationships between objects, like web pages linked by
hyperlinks.
• Ordered Data: Data with an inherent order among attributes, including temporal,
sequential, and spatial data.
• Temporal Data: Data associated with time, like customer purchase patterns over time.
• Sequence Data: Data with a sequence but no explicit timestamps, like DNA sequences.
Unstructured Data: This includes formats like video, images, audio, textual documents, programs, and blog data. It is estimated that 80% of all data is unstructured.
Semi-structured Data: Semi-structured data are partially structured and partially unstructured. These include data like XML/JSON data, RSS feeds, and hierarchical data.
Flat Files:
• Simple and Accessible: Easy to create and read, using plain text format (ASCII).
• Limited Scalability: Not suitable for large datasets; minor data changes can significantly impact analysis results.
• Common Formats: CSV (comma-separated) and TSV (tab-separated) are widely used
for data exchange between applications.
• Easy to Process: Can be readily opened and manipulated by spreadsheet software like
Excel and Google Sheets.
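As a small sketch of working with flat files, a CSV or TSV file can be loaded with a library such as pandas; the file names used below are hypothetical.

```python
# Sketch: reading flat files (CSV/TSV) with pandas. File names are hypothetical.
import pandas as pd

csv_df = pd.read_csv("student_marks.csv")             # comma-separated values
tsv_df = pd.read_csv("student_marks.tsv", sep="\t")   # tab-separated values
print(csv_df.head())   # inspect the first few records
```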
Databases (DBMS):
• Organized Structure: Data is stored in tables with rows (records) and columns (attributes), enabling efficient querying and management.
• Data Integrity and Management: DBMS provides tools for data consistency, security,
access control, and concurrent updates.
o Time-Series Database: Organizes data points collected over time, like sensor
readings or stock prices, to track changes and trends.
World Wide Web (WWW): A vast, global repository of information that data mining
algorithms aim to analyse for interesting patterns.
XML (eXtensible Markup Language): A human and machine-readable format used for
representing and sharing data across different systems.
Data Stream: Continuously flowing, dynamic data characterized by high volume, fixed-order
movement, and real-time processing needs.
RSS (Really Simple Syndication): A format for easily sharing real-time updates and feeds
across various platforms.
JSON (JavaScript Object Notation): A popular data interchange format frequently used in
machine learning applications.
Data analysis helps businesses make informed decisions, like identifying top-selling products
to improve marketing. While often used interchangeably, data analytics is a broader term
encompassing data collection and preprocessing, while data analysis focuses specifically on
analyzing historical data to generate insights.
1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics
Descriptive Analytics:
Diagnostic Analytics:
Predictive Analytics:
Prescriptive Analytics:
Many frameworks exist for data analytics, sharing common factors and often using a layered architecture for benefits such as generality. A Big Data framework is typically a four-layer architecture.
Presentation Layer:
Thus, the Big Data processing cycle involves data management that consists of the following
steps.
• Data Collection
• Data preprocessing
• Application of machine learning algorithms
• Interpretation and visualization of the results of the machine learning algorithms
Data collection is important for machine learning; the following outlines the characteristics of "good data" as well as the different data sources.
• Impact on Results: The quality of the data directly affects the quality of the results.
"Good data" leads to better, more reliable outcomes.
• Timeliness: Data should be up-to-date and relevant to the current task. Old or obsolete
data can lead to inaccurate conclusions.
• Relevancy: Data should contain the necessary information for the machine learning task
and be free from biases that could skew the results.
• Understandability: Data should be interpretable and self-explanatory, allowing domain
experts to understand its meaning and use it effectively.
Data Sources:
• Open/Public Data: This type of data is freely available and has no copyright restrictions,
making it accessible for various purposes. Examples include:
o Government census data
o Digital libraries
o Scientific datasets (genomic, biological)
o Healthcare data (patient records, insurance info)
• Social Media Data: Data generated by social media platforms like Twitter, Facebook,
YouTube, and Instagram. This is a massive source of information but can be noisy and
require cleaning.
• Multimodal Data: Data that combines multiple formats like text, video, audio, and
images. Examples include:
o Image archives with associated text and numerical data
o The World Wide Web, which contains a mix of various data types.
Key takeaway: Collecting high-quality data is the foundation of successful machine learning. The attributes of good data and the diverse nature of the data sources described above highlight the complexities and challenges of data collection in machine learning.
Real-world data is often "dirty": it may be incomplete, inaccurate, or inconsistent, and may contain outliers, missing values, or duplicates. Data preprocessing is crucial to improve the quality of
data mining results by cleaning errors and transforming data into a processable format for
machine learning algorithms. This involves detecting and removing errors (data cleaning), and
making the data suitable for analysis (data wrangling). Common data errors include human
mistakes, structural issues, omissions, duplications, noise (random distortions), and artifacts
(deterministic distortions).
Missing values in the data can be handled in one of the following ways:
1. Ignore the tuple: Remove the entire record if it has missing values, but this is inefficient when there is a lot of missing data.
2. Fill in the values manually: Experts analyze and manually input correct values, but this
is time-consuming and impractical for large datasets.
3. Use a global constant: Replace missing values with a fixed value like "Unknown," but
this can skew analysis.
4. Fill with the attribute mean: Replace missing values with the average of that attribute
for all records.
5. Use the attribute mean for the same class: Replace missing values with the average for
that attribute within the same category or group.
6. Use the most probable value: Employ methods like classification or decision trees to
predict and fill in missing values.
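A minimal pandas sketch of a few of these strategies is shown below; the DataFrame, its 'Class' and 'Age' columns, and the chosen constant are hypothetical.

```python
# Sketch: common missing-value strategies with pandas (column names are hypothetical).
import numpy as np
import pandas as pd

df = pd.DataFrame({"Class": ["A", "A", "B", "B"],
                   "Age":   [25, np.nan, 40, np.nan]})

dropped  = df.dropna()                           # 1. ignore the tuple
constant = df.fillna({"Age": -1})                # 3. fill with a global constant
by_mean  = df.fillna({"Age": df["Age"].mean()})  # 4. fill with the attribute mean

by_class = df.copy()                             # 5. attribute mean within the same class
by_class["Age"] = df.groupby("Class")["Age"].transform(lambda s: s.fillna(s.mean()))
```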
Noise, random errors in data, can be reduced by binning, a method that sorts data into equal-
frequency bins (buckets) and smooths values using neighbors. Common binning techniques
include smoothing by means, medians, or boundaries, where values are replaced by the bin's
average, middle value, or nearest edge, respectively.
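A short sketch of equal-frequency binning and smoothing by bin means or medians, using pandas with illustrative sample values:

```python
# Sketch: equal-frequency binning and smoothing (sample values are illustrative).
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = pd.qcut(values, q=3)                                    # 3 equal-frequency bins

smoothed_by_mean   = values.groupby(bins).transform("mean")    # replace by bin mean
smoothed_by_median = values.groupby(bins).transform("median")  # replace by bin median

print(pd.DataFrame({"value": values, "bin": bins, "by_mean": smoothed_by_mean}))
```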
• Data Integration: Combines data from various sources into a single, unified dataset,
addressing potential redundancies.
• Data Transformation: Modifies data to improve its suitability for analysis and enhance
the performance of data mining algorithms.
• Normalization: Scales attribute values to a specific range (e.g., 0-1) to prevent
dominance by larger values and improve algorithm convergence.
• Min-Max Normalization: Scales data linearly based on the minimum and maximum
values of the attribute.
V* = ((V − min) / (max − min)) × (new max − new min) + new min
• z-Score Normalization: Standardizes data based on the mean and standard deviation of
the attribute.
V* = (V − µ) / σ
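Both normalizations can be written directly with NumPy; the sample values and the target range below are illustrative.

```python
# Sketch: min-max and z-score normalization (sample values are illustrative).
import numpy as np

v = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

new_min, new_max = 0.0, 1.0                                     # target range
min_max = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

z_score = (v - v.mean()) / v.std()                              # (V - mu) / sigma

print(min_max)   # [0.   0.25 0.5  0.75 1.  ]
print(z_score)
```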
Data Reduction
Data reduction shrinks dataset size while preserving key information for similar results.
Methods like aggregation, feature selection, and dimensionality reduction achieve this by
summarizing, selecting the most relevant features, or transforming data into a lower-
dimensional space.
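As one possible illustration of dimensionality reduction, the sketch below applies PCA from scikit-learn to random numeric data; the data and the number of components kept are arbitrary choices.

```python
# Sketch: dimensionality reduction with PCA (data and settings are illustrative).
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(1).normal(size=(100, 10))  # 100 samples, 10 features

pca = PCA(n_components=2)             # keep two principal components
X_reduced = pca.fit_transform(X)      # lower-dimensional representation
print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance retained by each component
```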
Descriptive statistics summarizes and describes data without delving into machine learning
algorithms. Data visualization uses plots to explain and present data, often to clients. Together,
these techniques form Exploratory Data Analysis (EDA), which focuses on understanding and
preparing data for machine learning by revealing its nature and informing the selection of
appropriate tasks. EDA encompasses both descriptive statistics and data visualization.
A dataset can be assumed to be a collection of data objects. The data objects may be records, points, vectors, patterns, events, cases, samples, or observations. These records contain many attributes. An attribute can be defined as a property of an object.
Every attribute should be associated with a value; this process is called measurement. The type of attribute determines the data type, often referred to as the measurement scale type. The data types are shown in the figure below.
1. Categorical (or Qualitative) Data: This type of data represents qualities or characteristics.
• Nominal Data: This data consists of categories with no inherent order. Think of them
as labels.
o Example: Patient IDs are nominal. You can tell if two IDs are the same or
different, but you can't say one is "greater" than the other.
o Key Feature: Only equality comparisons (=, ≠) make sense.
• Ordinal Data: This data has categories with a meaningful order.
o Example: Fever levels (Low, Medium, High) are ordinal. You know High is a
higher level than Medium.
o Key Feature: Order matters, but the intervals between values aren't necessarily
equal. Transformations can be applied while preserving the order.
2. Numeric (or Quantitative) Data: This type of data represents quantities and can be
measured.
• Interval Data: This data has meaningful intervals between values, but the zero point is
arbitrary.
o Example: Temperature in Celsius or Fahrenheit is interval data. The difference between 20°C and 30°C is the same as the difference between 30°C and 40°C, but 0°C doesn't mean "no temperature."
o Key Feature: Addition and subtraction are meaningful, but ratios are not (e.g.,
20°C is not "twice as hot" as 10°C).
• Ratio Data: This data has meaningful intervals and a true zero point.
o Example: Weight, height, or income are ratio data. 0 kg means "no weight," and
100 kg is twice as heavy as 50 kg.
o Key Feature: All arithmetic operations are meaningful, including ratios.
Discrete Data: This type of data is countable and consists of whole numbers. Think of it as data you can list out (1, 2, 3, ...).
o Examples: Multiple-choice survey responses, counts of items, employee ID numbers.
Continuous Data: This type of data can take on any value within a range, including decimals
and fractions.
A third way of classifying data is based on the number of variables in the dataset. On that basis, data can be classified as univariate, bivariate, or multivariate, as shown in the figure below.
Univariate Data: Deals with a single variable, i.e., looking at one characteristic at a time. For example, analyzing only the heights of students in a class.
Bivariate Data: Deals with two variables, examining the relationship between two characteristics. For example, studying the relationship between the height and weight of students.
Multivariate Data: Deals with three or more variables, exploring the relationships among multiple characteristics. For example, analyzing the relationship between the height, weight, and age of students.
Univariate analysis is the simplest way to analyse data since it only looks at one variable at a
time. It aims to describe the data and find patterns within that single variable, without exploring
relationships with other variables.
Graph visualization is essential for understanding data and presenting it effectively, especially
to clients. Common univariate analysis graphs include bar charts, histograms, frequency
polygons, and pie charts. These graphs help in presenting, summarizing, describing, exploring,
and comparing data.
1. Bar chart
Bar charts display the frequency (or count) of different categories or values of a variable. They
are useful for showing how many times each value occurs.
The bar chart for student marks {45, 60, 60, 80, 85} with Student ID = {1, 2, 3, 4, 5} is shown in the figure below.
2. Pie chart
Pie charts show the relative proportions of different categories or values that make up a whole.
Each slice of the pie represents a category, and the size of the slice is proportional to its
percentage frequency. The percentage frequency distribution of student marks {22, 22, 40, 40, 70, 70, 70, 85, 90, 90} is shown in the figure below.
It can be observed that the number of students with a mark of 22 is 2, and the total number of students is 10. So, 2/10 × 100 = 20% of the pie of 100% is allotted to the mark 22 in the figure above.
3. Histogram
Histograms visually display the distribution of data by grouping data points into bins or ranges
(intervals) and showing how many data points fall into each bin. The vertical axis represents
the frequency (count) of data points within each bin.
The histogram for student marks {45, 60, 60, 80, 85} with group ranges 0-25, 26-50, 51-75, and 76-100 is given in the figure below. One can visually verify from the figure that the number of students in the range 76-100 is 2.
4. Dot Plots
Dot plots display the frequency or value of data points for different categories or values. They
show how many times each value occurs in a dataset. The dot plot of English marks for five students with IDs {1, 2, 3, 4, 5} and marks {45, 60, 60, 80, 85} is given in the figure below.
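A matplotlib sketch of the bar chart, pie chart, and histogram described above, using the marks data from these examples (the figure layout is an arbitrary choice):

```python
# Sketch: basic univariate plots with matplotlib, using the marks from the text.
from collections import Counter
import matplotlib.pyplot as plt

ids = [1, 2, 3, 4, 5]
marks = [45, 60, 60, 80, 85]
pie_marks = [22, 22, 40, 40, 70, 70, 70, 85, 90, 90]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].bar(ids, marks)                         # bar chart: marks per student ID
axes[0].set_title("Bar chart")

counts = Counter(pie_marks)                     # pie chart: share of each mark
axes[1].pie(list(counts.values()), labels=[str(m) for m in counts], autopct="%1.0f%%")
axes[1].set_title("Pie chart")

axes[2].hist(marks, bins=[0, 25, 50, 75, 100])  # histogram with the given ranges
axes[2].set_title("Histogram")

plt.tight_layout()
plt.show()
```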
1. Mean:
▪ Arithmetic Mean
o Simple and Widely Used: It's the most common type of average and easy to
calculate.
o Sensitive to Outliers: Extreme values can significantly distort the mean, making
it less representative of the data's center in some cases.
▪ Weighted Mean
o Accounts for Importance: Allows you to assign different levels of importance
(weights) to data points, useful when some values are more influential than
others.
o More Complex Calculation: Requires determining and applying appropriate
weights, which can add complexity compared to the basic arithmetic mean.
▪ Geometric Mean
o Suitable for Growth Rates: Ideal for data that exhibits exponential growth or
multiplicative relationships, like compound interest or population growth.
o Less Affected by Outliers (than arithmetic mean): While not completely
immune, it's generally less sensitive to extreme values compared to the
arithmetic mean, especially when dealing with proportional changes.
In large datasets, computing the geometric mean directly is difficult. Hence, it is usually calculated via logarithms as:
GM = antilog((log x₁ + log x₂ + … + log xₙ) / n), i.e., the antilog of the mean of the logarithms of the values.
2. Median: The median is the middle value in a dataset when the data is ordered. If there's an
odd number of values, it's the exact middle one. If even, it's the average of the two central
values. It splits the data into two halves, with half the values below and half above it.
3. Mode: The mode is the value that appears most frequently in a dataset. It's the data point
with the highest frequency. Datasets can have one mode (unimodal), two modes (bimodal), or
three modes (trimodal). It's primarily used for discrete data.
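A short sketch computing these measures of central tendency for the sample marks used in this chapter; the weights used for the weighted mean are hypothetical.

```python
# Sketch: mean (arithmetic, weighted, geometric), median, and mode.
import numpy as np
from statistics import mode

marks = np.array([45.0, 60.0, 60.0, 80.0, 85.0])

arithmetic_mean = marks.mean()
weighted_mean   = np.average(marks, weights=[1, 1, 2, 2, 4])  # hypothetical weights
geometric_mean  = np.exp(np.log(marks).mean())                # antilog of mean of logs
median_value    = np.median(marks)
mode_value      = mode(marks.tolist())                        # most frequent value: 60.0

print(arithmetic_mean, weighted_mean, geometric_mean, median_value, mode_value)
```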
2.3.3 Dispersion
Range: The range is the simplest measure of dispersion, calculated as the difference between the maximum and minimum values in a dataset. While easy to compute, it's highly sensitive to outliers and doesn't reflect the distribution of the data between these extremes.
Standard Deviation: Standard deviation measures the average distance of data points from
the mean. It provides a more comprehensive picture of spread than the range, as it considers all
data points. A higher standard deviation indicates greater dispersion, while a lower one
suggests data points are clustered closer to the mean. It's a crucial measure in statistical
analysis.
Quartiles and Interquartile Range (IQR): Quartiles divide a dataset into four equal parts.
The interquartile range (IQR) is the difference between the third quartile (Q3, 75th percentile)
and the first quartile (Q1, 25th percentile). The IQR represents the spread of the middle 50%
of the data and is less sensitive to outliers than the range, making it a robust measure of spread.
Outliers: Outliers are data points that fall significantly far from the other values in a dataset.
A common rule defines outliers as those falling below Q1 - 1.5IQR or above Q3 + 1.5IQR.
Identifying outliers is important as they can skew statistical analyses and may warrant special
attention.
Five point summary and Box Plots: A five-point summary consists of the minimum value,
first quartile (Q1), median (Q2), third quartile (Q3), and maximum value, arranged in that
order. Box plots, also known as box-and-whisker plots, visually represent this summary for
continuous variables, showing the data's distribution and spread, including the interquartile
range (IQR) within the box and potential outliers beyond the whiskers. The position of the
median line within the box indicates skewness.
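A NumPy sketch of these dispersion measures, including the five-point summary and the 1.5 × IQR outlier rule; the sample values (with one extreme point) are illustrative.

```python
# Sketch: range, standard deviation, quartiles, IQR, outliers, five-point summary.
import numpy as np

data = np.array([2, 3, 4, 8, 9, 11, 13, 45])     # illustrative, with one extreme value

data_range = data.max() - data.min()
std_dev    = data.std(ddof=1)                    # sample standard deviation

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr                     # below this => potential outlier
upper_fence = q3 + 1.5 * iqr                     # above this => potential outlier

five_point_summary = (data.min(), q1, q2, q3, data.max())
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(five_point_summary, iqr, outliers)         # 45 is flagged as an outlier
```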
2.3.4 Shape
Skewness and kurtosis, known as moment-based measures, describe the asymmetry and peakedness of a dataset's distribution.
Skewness
Skewness measures the asymmetry of a data distribution: a positively skewed distribution has a longer right tail, while a negatively skewed distribution has a longer left tail.
Fig. 2.4: (a) Positive Skewed and (b) Negative Skewed Data
Kurtosis
Kurtosis measures the "peakedness" and tail heaviness of a data distribution relative to a normal
distribution. High kurtosis indicates a sharper peak and heavier tails (more outliers), while low
kurtosis implies a flatter peak and lighter tails (fewer outliers). It essentially quantifies the
presence of extreme values.
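A short sketch using scipy.stats to compute skewness and (excess) kurtosis for an illustrative sample:

```python
# Sketch: skewness and kurtosis with scipy.stats (sample data is illustrative).
import numpy as np
from scipy import stats

data = np.array([2, 3, 4, 4, 5, 5, 5, 6, 6, 7, 12, 15])

print(stats.skew(data))      # > 0 indicates a longer right tail (positive skew)
print(stats.kurtosis(data))  # excess kurtosis; > 0 means heavier tails than normal
```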
Mean Absolute Deviation (MAD)
• Measures Average Absolute Difference: MAD calculates the average of the absolute differences between each data point and the mean, providing a measure of spread.
• Robust to Outliers: Unlike standard deviation, MAD is less sensitive to extreme values
because it doesn't square the deviations.
• Used for Outlier Detection: MAD can be used to identify potential outliers by
comparing the deviation from the median to MAD.
The coefficient of variation (CV) shows how spread out data is compared to its average,
especially useful for comparing datasets with different units.
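A minimal sketch of MAD and CV with NumPy; the sample values, which include one extreme point, are illustrative.

```python
# Sketch: mean absolute deviation (MAD) and coefficient of variation (CV).
import numpy as np

data = np.array([10.0, 12.0, 15.0, 18.0, 45.0])

mad = np.mean(np.abs(data - data.mean()))   # average absolute distance from the mean
cv  = data.std(ddof=1) / data.mean() * 100  # spread relative to the mean, in percent

print(mad, cv)
```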
Stem and leaf plots are a simple way to visualize data distribution by splitting each number
into a "stem" (usually the tens digit) and a "leaf" (usually the units digit). Ideally, data should
follow a bell-shaped curve (normal distribution) for many statistical tests to be valid. Q-Q plots
are used to assess normality by comparing data quartiles against a theoretical normal
distribution.
The stem and leaf plot for the English subject marks, say {45, 60, 60, 80, 85}, is given in the figure below.
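A small sketch that builds a stem-and-leaf listing of these marks and draws a Q-Q plot against a normal distribution (scipy and matplotlib assumed):

```python
# Sketch: stem-and-leaf listing and a Q-Q plot for the English marks.
from collections import defaultdict
import matplotlib.pyplot as plt
from scipy import stats

marks = [45, 60, 60, 80, 85]

stems = defaultdict(list)
for m in marks:
    stems[m // 10].append(m % 10)      # stem = tens digit, leaf = units digit
for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in sorted(stems[stem])))

stats.probplot(marks, dist="norm", plot=plt)   # Q-Q plot against a normal distribution
plt.show()
```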