
Machine Learning BCS602

Module-1
Introduction: Need for Machine Learning, Machine Learning Explained, Machine Learning
in Relation to other Fields, Types of Machine Learning, Challenges of Machine Learning,
Machine Learning Process, Machine Learning Applications.

Understanding Data-1: Introduction, Big Data Analysis Framework, Descriptive Statistics,


Univariate Data Analysis and Visualization.

1.1 NEED FOR MACHINE LEARNING

Business organizations generate massive amounts of data daily, but previously struggled to
utilize it fully due to data being scattered across disparate systems and a lack of appropriate
analytical tools.

Machine learning has become popular for three reasons:

1. High volume of available data: Big companies such as Facebook, Twitter, and YouTube generate huge amounts of data that grow at a phenomenal rate. It is estimated that this data approximately doubles every year.
2. The cost of storage has fallen, and hardware costs have also dropped. It is therefore easier now to capture, process, store, distribute, and transmit digital information.
3. Complex algorithms are now available. Especially with the advent of deep learning, many algorithms are available for machine learning.

Fig. 1.1: The Knowledge Pyramid


What is data? All facts are data. Data can be numbers or text that can be processed by a
computer. Processed data is called information. This includes patterns, associations, or
relationships among data.

The objective of machine learning is to process this archival data so that organizations can take better decisions, design new products, improve business processes, and develop effective decision support systems.

1.2 MACHINE LEARNING EXPLAINED

Machine learning is an important sub-branch of Artificial Intelligence (AI). A frequently quoted definition of machine learning comes from Arthur Samuel, one of the pioneers of Artificial Intelligence, who stated that "Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed."

Just as humans take decisions based on experience, computers build models from patterns extracted from input data, and then use these models for prediction and decision-making. Essentially, for computers, the learned model serves as the equivalent of human experience. This concept is illustrated in Figure 1.2.

Fig. 1.2: (a) A Learning system for Humans (b) A Learning system for Machine Learning

The learning program summarizes the raw data in a model. Formally stated, a model is an
explicit description of patterns within the data in the form of:

1. Mathematical equations,
2. Relational diagrams such as trees or graphs,
3. Logical if/else rules, or
4. Groupings called clusters.


Models of computer systems are equivalent to human experience, and experience is based on data. Humans gain experience by various means: they gain knowledge by rote learning, they observe others and imitate them, they learn a great deal from teachers and books, and they learn many things by trial and error. Once knowledge is gained, when a new problem is encountered, humans search for similar past situations, formulate heuristics, and use them for prediction. In systems, however, experience is gathered by these steps:

❖ Collection of data.
❖ Once data is gathered, abstract concepts are formed out of that data. Abstraction is used
to generate concepts. This is equivalent to humans’ idea of objects.
❖ Generalization converts the abstraction into an actionable form of intelligence. It can be viewed as an ordering of all possible concepts; generalization involves ranking of concepts, inferencing from them, formation of heuristics, and the actionable aspect of intelligence. Heuristics are educated guesses for all tasks.
❖ Heuristics normally work, but occasionally they fail. That is not the fault of the heuristic, as it is just a 'rule of thumb'. Course correction is done by taking evaluation measures. Evaluation checks the soundness of the models and performs course correction, if necessary, to generate better formulations.

1.3 MACHINE LEARNING IN RELATION TO OTHER FIELDS

Machine learning primarily draws on concepts from Artificial Intelligence, Data Science, and Statistics. It is the result of combining ideas from these diverse fields.

1.3.1 Machine Learning and Artificial Intelligence

Fig. 1.3: Relationship of AI with Machine Learning


1. Artificial Intelligence (AI) is the overarching field focused on creating intelligent agents, encompassing various approaches. Initially, AI aimed to replicate human intelligence through logic and reasoning. However, this approach faced setbacks ("AI winters"). The field saw a resurgence with the development of data-driven systems.

2. Machine Learning (ML) is a subfield of AI that concentrates on extracting patterns from data to make predictions. ML involves learning from examples and includes areas like reinforcement learning. It is the engine that powers many of the practical AI applications we see today.

3. Deep Learning (DL) is a specialized area within Machine Learning that utilizes
neural networks to build models. Neural networks, inspired by the human brain,
consist of interconnected neurons that process information through activation functions.
Deep learning has driven significant advancements in areas like image recognition and
natural language processing.

In essence, AI is the broad concept, ML is a method within AI, and DL is a technique within
ML.

1.3.2 Machine Learning, Data Science, Data Mining, and Data Analytics

• Data Science: Data science is an interdisciplinary field that encompasses various areas
focused on gathering, analyzing, and extracting knowledge from data.

• Machine Learning: Machine learning is a branch of data science that focuses on developing algorithms allowing computers to learn from data and make predictions or decisions.

• Big Data: Big data is a field within data science that deals with massive datasets
characterized by volume, variety, and velocity, often used in machine learning
applications.

• Data Mining: Data mining is the process of discovering patterns and insights from
large datasets, often considered closely related to machine learning with a focus on
pattern extraction.

• Data Analytics: Data analytics focuses on examining raw data to draw useful
conclusions and includes various types like predictive analytics, which is closely related
to machine learning.


• Pattern Recognition: Pattern recognition is a field that uses machine learning algorithms to identify and classify patterns in data, often seen as a specific application of machine learning.

Fig. 1.4: Relationship of machine learning with other major fields

1.3.3 Machine learning and Statistics

Statistics, a branch of mathematics with a strong theoretical foundation in statistical learning, shares with machine learning (ML) the ability to learn from data. However, while statistical methods focus on discovering inherent patterns by initially setting hypotheses and conducting experiments to validate them, machine learning often requires less statistical knowledge and relies more on computational tools to automate learning with fewer assumptions. Despite these differences, some argue that machine learning is essentially a modern iteration of statistical methods, highlighting the close relationship between the two fields.

1.4 TYPES OF MACHINE LEARNING

What does the word 'learn' mean? Learning, like adaptation, occurs as the result of interaction of the program with its environment. It can be compared with the interaction between a teacher and a student. There are four types of machine learning:

1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
4. Reinforcement learning


Fig. 1.5: Types of Machine Learning

1.4.1 Supervised Learning:

• Labelled Data: Relies on training data that includes both input features and
corresponding correct output values (labels).

• Prediction Focus: Aims to learn a mapping function to predict outputs for new, unseen
inputs.

• Examples: Classification (predicting categories) and Regression (predicting continuous values).

o Classification: A supervised learning task where the goal is to assign data points to specific categories or classes (e.g., classifying emails as spam or not spam).

o Regression: A supervised learning task where the goal is to predict a continuous value (e.g., predicting house prices based on features like size and location).
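As an illustrative sketch only (the notes do not prescribe any library), the two supervised tasks can be tried with scikit-learn's bundled toy datasets; the dataset and estimator choices below are assumptions:

from sklearn.datasets import load_iris, load_diabetes
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split

# Classification: predict a category (iris species) from labelled examples.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("classification accuracy:", clf.score(X_te, y_te))

# Regression: predict a continuous value (a disease-progression score).
X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
reg = LinearRegression().fit(X_tr, y_tr)
print("regression R^2:", reg.score(X_te, y_te))

Both estimators learn a mapping from input features to known output labels, which is the defining property of supervised learning.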

1.4.2 Unsupervised Learning:

• Unlabelled Data: Works with data that only contains input features, no output labels.

• Pattern Discovery: Focuses on finding hidden patterns, structures, or groupings within the data.

• Examples: Clustering (grouping similar data points), Association Mining (finding relationships between variables), and Dimensionality Reduction (reducing the number of features).


o Cluster Analysis: A method of grouping similar data points together (e.g., grouping customers based on purchasing behaviour).

o Association Mining: A technique for discovering relationships between variables in large datasets (e.g., finding items frequently bought together in a supermarket).

o Dimension Reduction: A method for reducing the number of features in a dataset while preserving important information (e.g., simplifying complex data for easier analysis).

1.4.3 Semi-supervised Learning:

• Combined Data: Uses a mix of labelled and unlabelled data for training.

• Leverages Unlabelled Data: Seeks to improve learning by incorporating information from the unlabelled data, especially when labelled data is scarce.

• Bridging the Gap: Acts as a bridge between supervised and unsupervised learning.

1.4.4 Reinforcement Learning:

• Agent and Environment: Involves an agent interacting with an environment.

• Reward System: The agent learns through trial and error by taking actions and receiving
rewards (or penalties) for those actions.

• Optimal Policy: The goal is to learn a policy (set of rules) that maximizes the
cumulative reward the agent receives over time.

1.5 CHALLENGES OF MACHINE LEARNING

❖ Ill-Posed Problems: Machine learning systems struggle with problems that are not
clearly defined or have incomplete specifications. They require "well-posed" problems
with sufficient information for a solution.

❖ Data Dependency: Machine learning heavily relies on the availability of large amounts
of high-quality data. Insufficient, missing, or incorrect data can significantly hinder the
performance and accuracy of machine learning models.

❖ High Computation Demand: Processing and analyzing large datasets, especially in "Big Data" scenarios, demands significant computational resources. Powerful hardware like GPUs or TPUs is often necessary to handle the complexity and time requirements of machine learning algorithms.

❖ Algorithm Complexity: Selecting, designing, applying, and evaluating appropriate algorithms is crucial and challenging. The complexity of algorithms and the need to compare them make it difficult for machine learning professionals to find optimal solutions.

❖ Bias/Variance Tradeoff: Machine learning models face the challenge of balancing bias
and variance. Overfitting (high variance) occurs when a model performs well on
training data but poorly on test data due to memorizing noise. Underfitting (high bias)
happens when a model fails to capture the underlying patterns in the data. Finding the
right balance is essential for good generalization.

1.6 MACHINE LEARNING PROCESS

The emerging process model for data mining solutions in business organizations is CRISP-DM. Since machine learning is similar to data mining, except for the aim, this process can also be used for machine learning. CRISP-DM stands for Cross Industry Standard Process for Data Mining. The process involves six steps.

Fig. 1.6: Machine learning process


1. Understanding the Business:

o Define clear business objectives and requirements.

o Formulate a problem statement that guides the data mining process.

2. Understanding the Data:

o Collect and explore the available data.

o Analyse data characteristics and form initial hypotheses.

3. Data Preparation:

o Clean and preprocess the raw data.

o Handle missing values and inconsistencies.

4. Modelling:

o Select and apply appropriate data mining algorithms.

o Build models to discover patterns or relationships.

5. Evaluation:

o Assess model performance using relevant metrics.

o Validate results and ensure they meet business needs.

6. Deployment:

o Integrate models into existing systems or processes.

o Use insights to improve decision-making and outcomes.

1.7 MACHINE LEARNING APPLICATIONS

Machine learning technologies are used widely now in different domains. Machine learning
applications are everywhere! One encounters many machine learning applications in the day-
to-day life. Some applications are listed below.

1. Sentiment Analysis: Machine learning algorithms analyse text to determine the emotional tone behind it, like classifying movie reviews as positive or negative.

2. Voice Assistants: Virtual assistants like Siri and Alexa use machine learning to
understand and respond to voice commands, performing tasks or providing information.


3. Recommendation Systems: These systems use machine learning to predict what items
a user might like based on their past behavior or preferences, as seen in Amazon's book
suggestions or Netflix's movie recommendations.

4. Navigation (like Google Maps): Machine learning powers route optimization by analysing traffic patterns and road conditions to suggest the fastest routes.

Table 1.1: Applications survey table


UNDERSTANDING DATA

1.8 WHAT IS DATA?

Data, facts encoded as bits in computer systems, can be human-interpretable or require computer processing. Organizations accumulate massive datasets from various sources, categorized as operational (for daily processes) or non-operational (for decision-making). Raw data is meaningless until processed and labelled, transforming it into information that reveals patterns and insights. This processed data enables analysis and knowledge extraction, such as identifying top-selling products from sales data.

Elements of Big Data

Data whose volume is small enough to be stored and processed by a small-scale computer is called 'small data'. Such data are collected from several sources, then integrated and processed by a small-scale computer. Big data, on the other hand, has a volume much larger than that of 'small data' and is characterized as follows.

1. Volume: Big Data is characterized by massive amounts of data, often measured in petabytes and exabytes, due to decreasing storage costs.

2. Velocity: Data in Big Data flows in at high speeds from various sources like IoT
devices, requiring rapid processing.

3. Variety: Big Data comes in diverse formats, including structured, semi-structured, and
unstructured data like text, images, and videos.
• Form: Big Data encompasses diverse data types like text, images, audio,
video, and more, including complex combinations.
• Function: Big Data serves various purposes, from recording human
conversations to tracking transactions and preserving archival information.
• Source: Big Data originates from numerous sources, including public data,
social media, and multimodal platforms.
4. Veracity: Big Data has challenges with trustworthiness and quality due to potential
errors and inconsistencies in the data sources.

5. Validity: The accuracy and relevance of Big Data for decision-making or problem-
solving is crucial for its usefulness.

6. Value: The ultimate importance of Big Data lies in its capacity to provide valuable
insights that drive decisions and create positive impacts.


1.8.1 Types of Data

In Big Data, there are three kinds of data. They are structured data, unstructured data, and semi-
structured data.

Structured Data: Organized data stored in a structured format like a database table, easily
accessed with tools like SQL.

• Record Data: Data organized as a collection of records (rows) with attributes (columns),
often represented as a matrix.

• Data Matrix: A type of record data with numeric attributes, enabling standard matrix
operations.

• Graph Data: Data representing relationships between objects, like web pages linked by
hyperlinks.

• Ordered Data: Data with an inherent order among attributes, including temporal,
sequential, and spatial data.

• Temporal Data: Data associated with time, like customer purchase patterns over time.

• Sequence Data: Data with a sequence but no explicit timestamps, like DNA sequences.

• Spatial Data: Data with positional attributes, like points on a map.

Unstructured Data: This includes formats like video, images, audio, textual documents,
programs, and blog data. It is estimated that 80% of the data are unstructured data.

Semi-structured Data: Semi-structured data is partially structured and partially unstructured. Examples include XML/JSON data, RSS feeds, and hierarchical data.

1.8.2 Data Storage and Representation

Flat Files:

• Simple and Accessible: Easy to create and read, using plain text format (ASCII).

• Limited Scalability: Not suitable for large datasets; minor data changes can significantly impact analysis results.

CSV and TSV Files (as types of Flat Files):

• Common Formats: CSV (comma-separated) and TSV (tab-separated) are widely used
for data exchange between applications.


• Easy to Process: Can be readily opened and manipulated by spreadsheet software like
Excel and Google Sheets.
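As a minimal sketch (assuming the pandas library and a hypothetical file named marks.csv), such flat files can be loaded programmatically in a single call:

import pandas as pd

# Hypothetical comma-separated file; pandas infers columns from the header row.
df = pd.read_csv("marks.csv")
# For a tab-separated file, only the separator changes:
# df = pd.read_csv("marks.tsv", sep="\t")
print(df.head())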

Database Systems (Relational):

• Organized Structure: Data is stored in tables with rows (records) and columns
(attributes), enabling efficient querying and management.

• Data Integrity and Management: DBMS provides tools for data consistency, security,
access control, and concurrent updates.

o Transactional Database: Stores records of individual transactions, often used for analysing relationships between items.

o Time-Series Database: Organizes data points collected over time, like sensor readings or stock prices, to track changes and trends.

o Spatial Database: Stores information related to locations and shapes, enabling analysis of geographic data like maps or urban planning layouts.

World Wide Web (WWW): A vast, global repository of information that data mining
algorithms aim to analyse for interesting patterns.
XML (eXtensible Markup Language): A human and machine-readable format used for
representing and sharing data across different systems.
Data Stream: Continuously flowing, dynamic data characterized by high volume, fixed-order
movement, and real-time processing needs.
RSS (Really Simple Syndication): A format for easily sharing real-time updates and feeds
across various platforms.
JSON (JavaScript Object Notation): A popular data interchange format frequently used in
machine learning applications.

1.9 BIG DATA ANALYTICS AND TYPES OF ANALYTICS

Data analysis helps businesses make informed decisions, like identifying top-selling products
to improve marketing. While often used interchangeably, data analytics is a broader term
encompassing data collection and preprocessing, while data analysis focuses specifically on
analyzing historical data to generate insights.


There are four types of data analytics:

1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics

Descriptive Analytics:

• Summarizes and describes the main features of a dataset.


• Focuses on quantifying collected data, often using statistical measures.

Diagnostic Analytics:

• Investigates why certain events or outcomes occurred.


• Aims to identify cause-and-effect relationships within the data.

Predictive Analytics:

• Uses algorithms and patterns to forecast future outcomes.


• Forms the core of machine learning, focusing on "what will happen?"

Prescriptive Analytics:

• Recommends actions to optimize outcomes and decision-making.


• Goes beyond prediction to suggest solutions and mitigate potential risks.

2.1 BIG DATA ANALYSIS FRAMEWORK

Many frameworks exist for data analytics; they share common factors and often use a layered architecture for benefits such as generality. A Big Data framework is typically a four-layer architecture:

❖ Data connection layer
❖ Data management layer
❖ Data analytics layer
❖ Presentation layer


Data Connection Layer:

• Handles data ingestion, importing raw data into suitable structures.


• Performs Extract, Transform, Load (ETL) operations for data preparation.

Data Management Layer:

• Preprocesses data, enabling parallel query execution.


• Implements data management schemes like data-in-place or data warehouses.

Data Analytic Layer:

• Offers functionalities like statistical tests and machine learning algorithms.


• Constructs and validates machine learning models.

Presentation Layer:

• Provides tools like dashboards and applications to display analysis results.


• Visualizes output from analytical engines and machine learning models.

Thus, the Big Data processing cycle involves data management that consists of the following
steps.

• Data Collection
• Data preprocessing
• Applications of machine learning algorithm
• Interpretation of results and visualization of machine learning algorithm

2.1.1 Data Collection

This subsection explains the importance of data collection for machine learning and outlines the characteristics of "good data" as well as the different data sources.

Importance of Data Collection:

• Time-Consuming: A significant portion of a data scientist's time is spent on collecting data, highlighting its crucial role in the machine learning process.


• Impact on Results: The quality of the data directly affects the quality of the results.
"Good data" leads to better, more reliable outcomes.

Characteristics of "Good Data":

• Timeliness: Data should be up-to-date and relevant to the current task. Old or obsolete
data can lead to inaccurate conclusions.
• Relevancy: Data should contain the necessary information for the machine learning task
and be free from biases that could skew the results.
• Understandability: Data should be interpretable and self-explanatory, allowing domain
experts to understand its meaning and use it effectively.

Data Sources:

• Open/Public Data: This type of data is freely available and has no copyright restrictions,
making it accessible for various purposes. Examples include:
o Government census data
o Digital libraries
o Scientific datasets (genomic, biological)
o Healthcare data (patient records, insurance info)
• Social Media Data: Data generated by social media platforms like Twitter, Facebook,
YouTube, and Instagram. This is a massive source of information but can be noisy and
require cleaning.
• Multimodal Data: Data that combines multiple formats like text, video, audio, and
images. Examples include:
o Image archives with associated text and numerical data
o The World Wide Web, which contains a mix of various data types.

Key takeaway: Collecting high-quality data is the foundation of successful machine learning. The attributes of good data and the diverse nature of data sources outlined above highlight the complexities and challenges of data collection in machine learning.


2.1.2 Data Preprocessing

Real-world data is often "dirty," meaning it's incomplete, inaccurate, inconsistent, contains
outliers, missing values, or duplicates. Data preprocessing is crucial to improve the quality of
data mining results by cleaning errors and transforming data into a processable format for
machine learning algorithms. This involves detecting and removing errors (data cleaning), and
making the data suitable for analysis (data wrangling). Common data errors include human
mistakes, structural issues, omissions, duplications, noise (random distortions), and artifacts
(deterministic distortions).

Missing Data Analysis

1. Ignore the tuple: Remove the entire record if it has missing values, but this is inefficient
with lots of missing data.
2. Fill in the values manually: Experts analyze and manually input correct values, but this
is time-consuming and impractical for large datasets.
3. Use a global constant: Replace missing values with a fixed value like "Unknown," but
this can skew analysis.
4. Fill with the attribute mean: Replace missing values with the average of that attribute
for all records.
5. Use the attribute mean for the same class: Replace missing values with the average for
that attribute within the same category or group.
6. Use the most probable value: Employ methods like classification or decision trees to
predict and fill in missing values.
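A small pandas sketch of strategies 3-5 above; the column names and sample values are hypothetical:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Disease": ["Flu", "Flu", "Cold", "Cold"],
    "Age": [21, np.nan, 36, np.nan],
})

# 3. Replace missing values with a global constant.
df["Age_const"] = df["Age"].astype(object).fillna("Unknown")
# 4. Replace missing values with the overall attribute mean.
df["Age_mean"] = df["Age"].fillna(df["Age"].mean())
# 5. Replace missing values with the mean of the same class (here, same Disease).
df["Age_class_mean"] = df["Age"].fillna(df.groupby("Disease")["Age"].transform("mean"))
print(df)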

Removal of Noisy or Outlier Data

Noise, random errors in data, can be reduced by binning, a method that sorts data into equal-
frequency bins (buckets) and smooths values using neighbors. Common binning techniques
include smoothing by means, medians, or boundaries, where values are replaced by the bin's
average, middle value, or nearest edge, respectively.
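An illustrative sketch of equal-frequency binning with smoothing by bin means (plain Python; the values are chosen only for illustration):

# Sort the values, split them into equal-frequency bins, and replace each value by its bin mean.
values = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3
smoothed = []
for i in range(0, len(values), bin_size):
    bin_vals = values[i:i + bin_size]
    bin_mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([round(bin_mean, 1)] * len(bin_vals))
print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]

Smoothing by bin medians or bin boundaries follows the same pattern, only the replacement value within each bin changes.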

Data Integration and Data Transformation

• Data Integration: Combines data from various sources into a single, unified dataset,
addressing potential redundancies.


• Data Transformation: Modifies data to improve its suitability for analysis and enhance
the performance of data mining algorithms.
• Normalization: Scales attribute values to a specific range (e.g., 0-1) to prevent
dominance by larger values and improve algorithm convergence.

Some of the normalization procedures used are:

• Min-Max Normalization: Scales data linearly based on the minimum and maximum
values of the attribute.

V* = ((V − min) / (max − min)) × (new_max − new_min) + new_min

• z-Score Normalization: Standardizes data based on the mean and standard deviation of
the attribute.

V* = (V − µ) / σ
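A small NumPy sketch of both procedures (the sample values are illustrative):

import numpy as np

v = np.array([10., 20., 30., 40., 50.])

# Min-max normalization to the new range [0, 1].
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization: subtract the mean, divide by the standard deviation.
v_z = (v - v.mean()) / v.std()

print(v_minmax)  # [0.   0.25 0.5  0.75 1.  ]
print(v_z)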

Data Reduction

Data reduction shrinks dataset size while preserving key information for similar results.
Methods like aggregation, feature selection, and dimensionality reduction achieve this by
summarizing, selecting the most relevant features, or transforming data into a lower-
dimensional space.
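As one possible sketch of dimensionality reduction (scikit-learn's PCA is assumed here; it is only one of several techniques):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)       # 150 samples, 4 features
pca = PCA(n_components=2)               # keep a 2-dimensional projection
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                  # (150, 2)
print(pca.explained_variance_ratio_)    # share of variance retained by each component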

2.2 DESCRIPTIVE STATISTICS

Descriptive statistics summarizes and describes data without delving into machine learning
algorithms. Data visualization uses plots to explain and present data, often to clients. Together,
these techniques form Exploratory Data Analysis (EDA), which focuses on understanding and
preparing data for machine learning by revealing its nature and informing the selection of
appropriate tasks. EDA encompasses both descriptive statistics and data visualization.

Dataset and Data Types

A dataset can be assumed to be collection of data objects. The data objects may be records,
points, vectors, patterns, events, cases, samples or observations. These records contain many
attributes. An attribute can be defined as the property of an object.


For example, consider the following database shown in below table

Patient ID Name Age Blood Test Fever Disease


1 John 21 Negative Low No
2 Andre 36 Positive High Yes

Table 1.2: Sample Patient table

Every attribute should be associated with a value. This process is called measurement. The type
of attribute determines the data types, often referred to as measurement scale types. The data
types are shown in below figure.

Fig. 1.7: Types of data

Broadly data can be classified into two types:

• Categorical or qualitative data


• Numerical or quantitative data

1. Categorical (or Qualitative) Data: This type of data represents qualities or characteristics.

• Nominal Data: This data consists of categories with no inherent order. Think of them
as labels.
o Example: Patient IDs are nominal. You can tell if two IDs are the same or
different, but you can't say one is "greater" than the other.
o Key Feature: Only equality comparisons (=, ≠) make sense.
• Ordinal Data: This data has categories with a meaningful order.
o Example: Fever levels (Low, Medium, High) are ordinal. You know High is a
higher level than Medium.


o Key Feature: Order matters, but the intervals between values aren't necessarily
equal. Transformations can be applied while preserving the order.

2. Numeric (or Quantitative) Data: This type of data represents quantities and can be
measured.

• Interval Data: This data has meaningful intervals between values, but the zero point is
arbitrary.
o Example: Temperature in Celsius or Fahrenheit is interval data. The difference
between 20°C and 30°C is the same as the difference between 30°C and 40°C,
but 0°C doesn't mean "no temperature."
o Key Feature: Addition and subtraction are meaningful, but ratios are not (e.g.,
20°C is not "twice as hot" as 10°C).
• Ratio Data: This data has meaningful intervals and a true zero point.
o Example: Weight, height, or income are ratio data. 0 kg means "no weight," and
100 kg is twice as heavy as 50 kg.
o Key Feature: All arithmetic operations are meaningful, including ratios.

Another way of classifying the data is to classify it as:

• Discrete value data


• Continuous data

Discrete Data: This type of data is countable and consists of whole numbers. Think of it as
data you can list out (1, 2, 3...).

o Examples: Survey responses (if they are, say, multiple choice or counting
things), employee ID numbers.

Continuous Data: This type of data can take on any value within a range, including decimals
and fractions.

o Examples: Age (you can be 12.5 years old), height, weight.

Third way of classifying the data is based on the number of variables used in the dataset. Based
on that, the data can be classified as univariate data, bivariate data, and multivariate data. This
is shown in below figure.


Fig. 1.8: Types of data based on variables

Univariate Data: Deals with a single variable, looking at one characteristic at a time. For example, analyzing only the heights of students in a class.

Bivariate Data: Deals with two variables, examining the relationship between two characteristics. For example, studying the relationship between the height and weight of students.

Multivariate Data: Deals with three or more variables, exploring the relationships among multiple characteristics. For example, analyzing the relationship between the height, weight, and age of students.

2.3 UNIVARIATE DATA ANALYSIS AND VISUALIZATION

Univariate analysis is the simplest way to analyse data since it only looks at one variable at a
time. It aims to describe the data and find patterns within that single variable, without exploring
relationships with other variables.

2.3.1 Data Visualization

Graph visualization is essential for understanding data and presenting it effectively, especially
to clients. Common univariate analysis graphs include bar charts, histograms, frequency
polygons, and pie charts. These graphs help in presenting, summarizing, describing, exploring,
and comparing data.

1. Bar chart

Bar charts display the frequency (or count) of different categories or values of a variable. They
are useful for showing how many times each value occurs.

The bar chart for student marks {45, 60, 60, 80, 85} with Student ID = {1, 2, 3, 4, 5} is shown in the figure below.


Fig. 1.9: Bar chart

2. Pie chart

Pie charts show the relative proportions of different categories or values that make up a whole.
Each slice of the pie represents a category, and the size of the slice is proportional to its
percentage frequency. The percentage frequency distribution of student marks {22, 22, 40, 40, 70, 70, 70, 85, 90, 90} is shown in the figure below.

Fig. 2.1: Pie chart

It can be observed that the number of students with 22 marks is 2 and the total number of students is 10. So, 2/10 × 100 = 20% of the pie (out of 100%) is allotted for marks 22 in the above figure.


3. Histogram

Histograms visually display the distribution of data by grouping data points into bins or ranges
(intervals) and showing how many data points fall into each bin. The vertical axis represents
the frequency (count) of data points within each bin.

The histogram for student marks {45, 60, 60, 80, 85} in the group ranges 0-25, 26-50, 51-75, and 76-100 is given in the figure below. One can visually inspect from the figure that the number of students in the range 76-100 is 2.

Fig. 2.2: Sample Histogram of English Marks

4. Dot Plots

Dot plots display the frequency or value of data points for different categories or values. They
show how many times each value occurs in a dataset. The dot plot of English marks for five
students with ID as {1, 2, 3, 4, 5} and marks {45, 60, 60, 80, 85} is given in below figure.

Fig. 2.3: Dot Plots
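A minimal matplotlib sketch (matplotlib is assumed to be installed) that reproduces the four plot types above for the sample marks:

import matplotlib.pyplot as plt

ids = [1, 2, 3, 4, 5]
marks = [45, 60, 60, 80, 85]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].bar(ids, marks)                                    # bar chart of marks per student
axes[0, 0].set_title("Bar chart")
axes[0, 1].pie([2, 2, 3, 1, 2],                               # pie chart of percentage frequencies
               labels=["22", "40", "70", "85", "90"], autopct="%1.0f%%")
axes[0, 1].set_title("Pie chart")
axes[1, 0].hist(marks, bins=[0, 25, 50, 75, 100])             # histogram over the four ranges
axes[1, 0].set_title("Histogram")
axes[1, 1].plot(ids, marks, "o")                              # dot plot of marks per student
axes[1, 1].set_title("Dot plot")
plt.tight_layout()
plt.show()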


2.3.2 Central Tendency

Central tendency is a crucial statistical measure that summarizes a dataset by identifying a typical or central value. It simplifies data analysis and facilitates comparisons by representing the overall distribution.

1. Mean:

▪ Arithmetic Mean
o Simple and Widely Used: It's the most common type of average and easy to
calculate.
o Sensitive to Outliers: Extreme values can significantly distort the mean, making
it less representative of the data's center in some cases.

▪ Weighted Mean
o Accounts for Importance: Allows you to assign different levels of importance
(weights) to data points, useful when some values are more influential than
others.
o More Complex Calculation: Requires determining and applying appropriate
weights, which can add complexity compared to the basic arithmetic mean.
▪ Geometric Mean
o Suitable for Growth Rates: Ideal for data that exhibits exponential growth or
multiplicative relationships, like compound interest or population growth.
o Less Affected by Outliers (than arithmetic mean): While not completely
immune, it's generally less sensitive to extreme values compared to the
arithmetic mean, especially when dealing with proportional changes.

When the number of values is large, computing the geometric mean directly is difficult, so it is usually calculated through logarithms: GM = antilog((log x₁ + log x₂ + … + log xₙ) / n).


2. Median: The median is the middle value in a dataset when the data is ordered. If there's an
odd number of values, it's the exact middle one. If even, it's the average of the two central
values. It splits the data into two halves, with half the values below and half above it.

3. Mode: The mode is the value that appears most frequently in a dataset. It's the data point
with the highest frequency. Datasets can have one mode (unimodal), two modes (bimodal), or
three modes (trimodal). It's primarily used for discrete data.
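A brief sketch of these measures using Python's built-in statistics module; the marks and weights below are illustrative sample values:

import statistics

marks = [45, 60, 60, 80, 85]

print(statistics.mean(marks))                        # arithmetic mean = 66
print(statistics.median(marks))                      # median = 60 (middle of the ordered data)
print(statistics.mode(marks))                        # mode = 60 (most frequent value)
print(round(statistics.geometric_mean(marks), 2))    # geometric mean
# Weighted mean with illustrative weights that sum to 1.
weights = [0.1, 0.2, 0.2, 0.2, 0.3]
print(sum(w * m for w, m in zip(weights, marks)))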

2.3.3 Dispersion

Range: The range is the simplest measure of dispersion, calculated as the difference between the maximum and minimum values in a dataset. While easy to compute, it is highly sensitive to outliers and does not reflect the distribution of the data between these extremes.

Standard Deviation: Standard deviation measures the average distance of data points from
the mean. It provides a more comprehensive picture of spread than the range, as it considers all
data points. A higher standard deviation indicates greater dispersion, while a lower one
suggests data points are clustered closer to the mean. It's a crucial measure in statistical
analysis.

Quartiles and Interquartile Range (IQR): Quartiles divide a dataset into four equal parts.
The interquartile range (IQR) is the difference between the third quartile (Q3, 75th percentile)
and the first quartile (Q1, 25th percentile). The IQR represents the spread of the middle 50%
of the data and is less sensitive to outliers than the range, making it a robust measure of spread.

Outliers: Outliers are data points that fall significantly far from the other values in a dataset. A common rule defines outliers as those falling below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. Identifying outliers is important as they can skew statistical analyses and may warrant special attention.
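An illustrative NumPy sketch of these dispersion measures and the 1.5 × IQR outlier rule; the sample data are chosen so that one outlier appears:

import numpy as np

data = np.array([22, 40, 40, 45, 60, 60, 70, 80, 85, 190])   # 190 is an artificial outlier

rng = data.max() - data.min()                  # range
std = data.std(ddof=1)                         # sample standard deviation
q1, q3 = np.percentile(data, [25, 75])         # first and third quartiles
iqr = q3 - q1                                  # interquartile range
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(rng, round(std, 2), q1, q3, iqr)
print("outliers:", outliers)                   # expected: [190]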


Five point summary and Box Plots: A five-point summary consists of the minimum value,
first quartile (Q1), median (Q2), third quartile (Q3), and maximum value, arranged in that
order. Box plots, also known as box-and-whisker plots, visually represent this summary for
continuous variables, showing the data's distribution and spread, including the interquartile
range (IQR) within the box and potential outliers beyond the whiskers. The position of the
median line within the box indicates skewness.
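A short sketch that computes the five-point summary and draws the corresponding box plot (NumPy and matplotlib assumed; sample data illustrative):

import numpy as np
import matplotlib.pyplot as plt

data = [22, 40, 40, 45, 60, 60, 70, 80, 85, 90]

five_point = [np.min(data),
              np.percentile(data, 25),   # Q1
              np.median(data),           # Q2 (median)
              np.percentile(data, 75),   # Q3
              np.max(data)]
print("min, Q1, median, Q3, max:", five_point)

plt.boxplot(data, vert=False)            # box spans the IQR; whiskers mark the non-outlier extent
plt.title("Box plot of marks")
plt.show()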

2.3.4 Shape

Skewness and kurtosis, known as moments, describe the symmetry/asymmetry and peak
location of a dataset.

Skewness

Skewness measures the asymmetry of a data distribution. A perfectly symmetrical distribution has zero skewness, while a skewed distribution has a longer tail on one side. Positive skew (right skew) occurs when the tail extends to the right, indicating more high values and a mean greater than the median. Negative skew (left skew) occurs when the tail extends to the left, indicating more low values and a mean less than the median. Skewness implies a higher chance of outliers and can affect the performance of data mining algorithms. Pearson's coefficient and a formula involving standardized values are used to quantify skewness.

Fig. 2.4: (a) Positive Skewed and (b) Negative Skewed Data

Kurtosis

Kurtosis measures the "peakedness" and tail heaviness of a data distribution relative to a normal distribution. High kurtosis indicates a sharper peak and heavier tails (more outliers), while low kurtosis implies a flatter peak and lighter tails (fewer outliers). It essentially quantifies the
presence of extreme values.
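A small sketch using SciPy (assumed available) together with Pearson's second coefficient of skewness, 3 × (mean − median) / σ; the sample data are illustrative:

import numpy as np
from scipy.stats import skew, kurtosis

data = np.array([22, 40, 40, 45, 60, 60, 70, 80, 85, 190])   # long right tail

pearson_skew = 3 * (data.mean() - np.median(data)) / data.std(ddof=1)
print(round(pearson_skew, 2))    # positive value -> right (positive) skew
print(round(skew(data), 2))      # moment-based skewness
print(round(kurtosis(data), 2))  # excess kurtosis (0 for a normal distribution)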

Mean Absolute Deviation (MAD)

• Measures Average Absolute Difference: MAD calculates the average of the absolute
differences between each data point and the mean, providing a measure of spread.
• Robust to Outliers: Unlike standard deviation, MAD is less sensitive to extreme values
because it doesn't square the deviations.
• Used for Outlier Detection: MAD can be used to identify potential outliers by
comparing the deviation from the median to MAD.

Coefficient of variation (CV)

The coefficient of variation (CV) shows how spread out data is compared to its average,
especially useful for comparing datasets with different units.
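A brief NumPy sketch of both measures (the values are illustrative):

import numpy as np

data = np.array([45., 60., 60., 80., 85.])

mad = np.mean(np.abs(data - data.mean()))     # mean absolute deviation from the mean
cv = data.std(ddof=1) / data.mean() * 100     # coefficient of variation, in percent

print(round(mad, 2), round(cv, 2))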

2.3.5 Special Univariate Plots

Stem and leaf plots are a simple way to visualize data distribution by splitting each number
into a "stem" (usually the tens digit) and a "leaf" (usually the units digit). Ideally, data should
follow a bell-shaped curve (normal distribution) for many statistical tests to be valid. Q-Q plots
are used to assess normality by comparing data quartiles against a theoretical normal
distribution.

The stem and leaf plot for the English subject marks, say, {45, 60, 60, 80, 85}, is given in the figure below.

Fig. 2.10: Stem and Leaf Plot for English Marks
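A small sketch (SciPy and matplotlib assumed) that groups the marks into stems and leaves and draws a normal Q-Q plot:

import matplotlib.pyplot as plt
from scipy import stats

marks = [45, 60, 60, 80, 85]

# Stem-and-leaf: stem = tens digit, leaf = units digit.
stems = {}
for m in sorted(marks):
    stems.setdefault(m // 10, []).append(m % 10)
for stem, leaves in stems.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
# 4 | 5
# 6 | 0 0
# 8 | 0 5

# Q-Q plot: compares sample quantiles against a theoretical normal distribution.
stats.probplot(marks, dist="norm", plot=plt)
plt.show()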


Fig. 2.11: Normal Q-Q Plot
