DATA SCIENCE
UNIT-1
INTRODUCTION TO DATA SCIENCE
1. Types of Data: structured and unstructured data,
2. Data Science Road Map: Frame the Problem; Understand the Data,
3. Data Wrangling, Exploratory Analysis, Extract Features, Model and Deploy Code.
4. Graphical Summaries of Data: Pie Chart, Bar Graph, Pareto Chart, Histogram.
5. Measures of central tendency of Quantitative Data: Mean, Median, Mode.
6. Measures of Variability of Quantitative Data: Range, Standard Deviation and Variance.
7. Probability: Introduction to Probability, Conditional Probability
Data science is the art and science of extracting meaningful insights from data using a mix of mathematics, statistics,
programming, and domain expertise. It’s like detective work—uncovering patterns, trends, and predictions that help
businesses and researchers make smarter decisions.
Key Components of Data Science
1. Data Collection & Cleaning: Gathering raw data from multiple sources and preprocessing it to remove
inconsistencies.
2. Exploratory Data Analysis (EDA): Understanding data patterns using visualization tools like Matplotlib and
Seaborn.
3. Machine Learning (ML) & Artificial Intelligence (AI): Creating predictive models using algorithms like decision
trees, neural networks, and clustering.
4. Big Data Technologies: Handling massive datasets using Apache Spark, Hadoop, and cloud computing.
5. Statistical Analysis: Applying statistical methods like hypothesis testing, regression, and probability theory.
6. Data Engineering: Designing pipelines to ensure data is stored, processed, and accessed efficiently.
7. Data Visualization: Representing data using dashboards and tools like Tableau, Power BI, and Python libraries.
Applications of Data Science
Healthcare: Predicting diseases, optimizing treatments, and analyzing patient data.
Finance: Fraud detection, risk assessment, and algorithmic trading.
E-commerce: Personalized recommendations, demand forecasting, and customer segmentation.
Social Media: Sentiment analysis, trend prediction, and content optimization.
Autonomous Systems: Self-driving cars, robotics, and AI-powered assistants.
Data Science Skills-
These data science activities are performed by data scientists. The essential skills required of a data scientist are:
Programming Languages: Python, R, SQL.
Mathematics: Linear Algebra, Statistics, Probability.
Machine Learning: Supervised and unsupervised learning, deep learning basics.
Data Manipulation: Pandas, NumPy, data wrangling techniques.
Data Visualization: Matplotlib, Seaborn, Tableau, Power BI.
Big Data Tools: Hadoop, Spark, Hive.
Databases: SQL, NoSQL, data querying and management.
Cloud Computing: AWS, Azure, Google Cloud.
Version Control: Git, GitHub, GitLab.
Domain Knowledge: Industry-specific expertise for problem-solving
Data Science Tools and Libraries:
There are various tools required to analyze data, build models, and derive insights. Here are some of the most important
tools in data science:
Jupyter Notebook: Interactive environment for coding and documentation.
Google Colab: Cloud-based Jupyter Notebook for collaborative coding.
TensorFlow: Deep learning framework for building neural networks.
PyTorch: Popular library for machine learning and deep learning.
Scikit-learn: Tools for predictive data analysis and machine learning.
Docker: Containerization for reproducible environments.
Kubernetes: Managing and scaling containerized applications.
Apache Kafka: Real-time data streaming and processing.
Tableau: A powerful tool for creating interactive and shareable data visualizations.
Power BI: A business intelligence tool for visualizing data and generating insights.
Keras: A user-friendly library for designing and training deep learning models.
Data Science Lifecycle- The Data Science Life Cycle is a structured approach that data scientists follow to
analyze and interpret data effectively. Here is a detailed breakdown of each stage:
1. Business Understanding
Objective: Clearly define the project goals.
Key Activities:
Engage stakeholders to gather requirements.
Identify the problem to solve or the opportunity to explore.
Outline success criteria and performance metrics.
2. Data Mining
Objective: Gather and understand available data.
Key Activities:
Identify data sources relevant to the business problem.
Collect both structured and unstructured data.
Assess data quality, volume, and relevance.
3. Data Cleaning
Objective: Prepare data for analysis.
Key Activities:
Identify and fix inconsistencies, errors, and missing values.
Normalize data formats and eliminate duplicates.
Ensure the dataset is suitable for analysis.
4. Data Exploration
Objective: Gain initial insights from the data.
Key Activities:
Use statistical techniques to summarize data.
Visualize data distributions and relationships.
Formulate initial hypotheses about data patterns and trends.
5. Feature Engineering
Objective: Create meaningful features to improve model performance.
Key Activities:
Select relevant features based on domain knowledge.
Create new features through transformation or aggregation.
Ensure features are relevant to the business problem.
6. Predictive Modeling
Objective: Build models to make predictions or classifications.
Key Activities:
Choose appropriate algorithms based on the problem.
Train models using prepared data.
Validate models to ensure accuracy and reliability.
7. Data Visualization
Objective: Communicate results effectively.
Key Activities:
Use visualization tools to present data insights.
Create dashboards to plot key metrics.
Ensure visualizations are understandable to stakeholders.
Types of Data in Data Science:
Data- Data refers to raw facts, figures, or symbols that represent information, observations, or measurements. On its
own, data doesn't have meaning until it's processed or analyzed.
1. Structured Data - Structured data is data that is organized in a predefined format, usually in rows and columns,
making it easy to enter, store, query, and analyze.
Key Characteristics of Structured Data:
Highly organized and formatted.
Stored in relational databases (like SQL).
Uses schemas (defined rules or tables).
Easy to search with tools like SQL queries.
Example:
ID   Name    Age   Email
1    Alice   25    alice@example.com
2    Bob     30    bob@example.com
Common Sources of Structured Data:
Customer databases
Spreadsheets (like Excel)
Sensor data (with defined formats)
Transaction records
Advantages:
Easy to store, search, and analyze
Compatible with traditional data tools
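The point about schemas and SQL queries can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name `people` and the query are assumptions, and the rows are the two example records shown above.

```python
import sqlite3

# In-memory database holding the example table from the text.
# The table name "people" is an assumption made for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, email TEXT)"
)
conn.executemany(
    "INSERT INTO people (id, name, age, email) VALUES (?, ?, ?, ?)",
    [(1, "Alice", 25, "alice@example.com"), (2, "Bob", 30, "bob@example.com")],
)

# Because every row follows the same schema, a query can filter on any column.
older_than_26 = conn.execute(
    "SELECT name, age FROM people WHERE age > ? ORDER BY id", (26,)
).fetchall()
print(older_than_26)  # [('Bob', 30)]
conn.close()
```

The same filter on unstructured text would first require extracting names and ages from free-form content.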
2. Unstructured Data - Unstructured data is information that doesn't follow a specific format or structure, making it
harder to organize, search, and analyze using traditional tools like relational databases.
Key Characteristics of Unstructured Data:
No predefined model or schema
Often text-heavy or media-rich
More difficult to process automatically
Requires advanced tools (like AI, machine learning, or natural language processing) for analysis
Examples of Unstructured Data:
Emails
Social media posts
Images and videos
Audio recordings
PDF documents
Chat messages
Website content
Tools for Handling Unstructured Data:
Text analysis tools (e.g., NLP)
AI/ML algorithms
NoSQL databases (e.g., MongoDB)
Big Data platforms (e.g., Hadoop, Spark)
3. Quantitative Data (Numerical Data)
Quantitative data refers to measurable numerical values. It’s further divided into:
Continuous Data: Can take any value within a range.
o Example: Temperature (22.5°C, 37.8°C), height (5.8 feet, 6.1 feet).
Discrete Data: Can only take distinct, countable values, usually whole numbers.
o Example: Number of students in a class (30, 45, 50), number of cars in a parking lot.
Count Data: Represents the number of occurrences.
Example: Website visits per day (1,000, 1,500, 2,000).
4. Categorical Data (Qualitative Data)
Categorical data is non-numerical and represents characteristics.
Binary Data: Two possible values.
o Example: Gender (Male/Female), Pass/Fail.
Nominal Data: Categories without a defined order.
o Example: Types of pets (dog, cat, rabbit), car brands (Toyota, Ford, BMW).
Ordinal Data: Categories with a ranking.
Example: Customer satisfaction (poor, average, good, excellent), education level (high school, bachelor's,
master's).
5. Other Data Types
Cross-Sectional Data: Collected from many subjects at a single point in time.
o Example: Revenue reported by several companies for the year 2024.
Time Series Data: Collected over time.
o Example: Daily stock prices, monthly rainfall data.
Longitudinal Data: Collected repeatedly over long periods.
o Example: A medical study tracking patients for 10 years.
Balanced Data: Each category is equally represented.
o Example: A survey where 50% of responses are from men and 50% from women.
Imbalanced Data: Unequal representation in categories.
Example: Fraud detection dataset where 99% are legitimate transactions and 1% are fraud.
6. Online vs. Offline Data
Offline Data (Batch Data): Processed in chunks.
o Example: Monthly electricity consumption reports.
Live Streaming (Online Data): Processed in real-time.
Example: Live stock market price updates.
7. Big Data vs. Non-Big Data
Big Data: Massive datasets that require specialized tools.
o Example: Social media activity logs, weather forecasting data from satellites.
Non-Big Data: Smaller, simpler datasets.
Example: A small retail store’s daily sales record.
8. Semi-Structured Data
Semi-structured data sits between structured and unstructured data: it has no fixed table schema, but tags or keys
give it some organization.
HTML Files: Example: Web pages containing embedded data.
XML Files: Example: Data exchange formats used in banking.
JSON Files: Example: API responses in mobile apps.
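As a small illustration of semi-structured data, the sketch below parses a JSON document with Python's built-in json module. The payload and its field names (`tags`, `address`, `city`) are hypothetical. Note that the two records do not share the same fields, which is what separates this from a fixed table.

```python
import json

# Hypothetical API response: keys give structure, but fields vary per record.
raw = """
[
  {"id": 1, "name": "Alice", "tags": ["new", "mobile"]},
  {"id": 2, "name": "Bob", "address": {"city": "Pune"}}
]
"""
records = json.loads(raw)

# Flatten into uniform rows; fields that a record lacks become None or 0.
rows = [
    {
        "id": r["id"],
        "name": r["name"],
        "city": r.get("address", {}).get("city"),
        "tag_count": len(r.get("tags", [])),
    }
    for r in records
]
for row in rows:
    print(row)
```

After flattening, the rows have a common set of columns and can be treated as structured data.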
Data Wrangling: Data wrangling, also known as data munging, is the process of transforming and mapping raw data into
a more understandable format. It is essential for preparing data for analysis, ensuring that it is clean, organized, and
usable.
Data wrangling techniques:
Data structuring: Organize data into a consistent format
Data integration: Combine data from different sources into a single dataset
Data enrichment: Create new variables or features to highlight other attributes of the data
Data wrangling tools:
Python libraries: Such as Pandas and NumPy, these libraries can help clean, organize, and analyze data
Google DataPrep: An intelligent data service that can clean and prepare structured and unstructured data
OpenRefine: An open-source tool that can clean and transform messy data.
Exploratory Data Analysis (EDA):
EDA is a critical approach in data analysis that helps in understanding the underlying structure, trends, and
relationships in the data. It involves visualizing and summarizing data to uncover insights and patterns.
Examine the data to identify patterns, trends, outliers, and relationships between variables
using visualization techniques like histograms, scatter plots, and box plots.
Key Objectives of EDA:
Data Understanding: Gain a clear insight into the nature of the data, including its types, distribution, and
potential anomalies.
Hypothesis Generation: Formulate hypotheses based on patterns observed in the data.
Identify Relationships: Discover relationships among different variables in the dataset.
Data Cleaning: Identify missing values, outliers, and erroneous data points that may need to be addressed
before further analysis.
Tools Used for EDA:
Programming Languages: Python (with libraries like Pandas, Matplotlib, Seaborn) and R (with packages like
ggplot2).
Software: Tableau, Microsoft Excel, RapidMiner, and other data visualization tools.
Feature engineering (extract features): Feature engineering is the process of transforming raw data into meaningful
features that enhance the performance of machine learning models. It's a crucial step in data science because well-
crafted features help algorithms make better predictions.
Why is Feature Engineering Important?
Raw data often contains information that isn't immediately useful for modeling. Feature engineering extracts the most
relevant information, improving accuracy and efficiency.
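A minimal sketch of feature engineering, assuming hypothetical transaction records with `order_date`, `price`, and `quantity` fields. It derives three features that a model could use more directly than the raw fields; the choice of features is only an illustration.

```python
from datetime import date

# Hypothetical raw transaction records.
raw_rows = [
    {"order_date": "2024-03-02", "price": 250.0, "quantity": 4},
    {"order_date": "2024-03-04", "price": 99.5, "quantity": 1},
]

def extract_features(row):
    """Turn one raw record into numeric features."""
    d = date.fromisoformat(row["order_date"])
    return {
        "total_amount": row["price"] * row["quantity"],  # combines two raw fields
        "day_of_week": d.weekday(),                      # 0 = Monday ... 6 = Sunday
        "is_weekend": int(d.weekday() >= 5),             # binary flag derived from the date
    }

features = [extract_features(r) for r in raw_rows]
for f in features:
    print(f)
```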
Model (Model Training) : Model training in data science refers to the process of teaching a machine learning algorithm
to recognize patterns and make predictions by learning from data. This step is essential for building predictive models
that can solve real-world problems
Deploy code :
Package the trained model and associated code into a deployable format (e.g., web API, standalone
application).
Integrate the model into a production environment where new data can be fed in to generate predictions.
Graphical Summary of Data: A graphical summary of data in data science refers to visual representations of data that
help in understanding patterns, relationships, and distributions. Instead of dealing with raw numbers, graphs and
charts make insights clearer and more intuitive.
Data Representations
Data representation involves the presentation of information in a meaningful and understandable manner. In
statistics, this is crucial for analyzing and interpreting data effectively. Common methods of data representation
include:
Graphical Representation of Data is where numbers and facts become lively pictures and colorful diagrams.
Instead of staring at boring lists of numbers, we use fun charts, cool graphs, and interesting visuals to understand
information better.
Graphical representation has many real-life applications, such as business analytics, demography, and
astrostatistics.
Graphical representation is a way of presenting data in pictorial form. It helps a reader understand a
large set of data easily because it shows the patterns in the data visually.
There are two ways of representing data,
Tables
Pictorial Representation through graphs.
Types of Graphical Representations:
Line Graphs: A Line Graph is used to show how the value of a particular variable changes with time. We plot this graph
by connecting the points at different values of the variable. It can be useful for analyzing the trends in the data and
predicting further trends.
Bar Graphs: A bar graph is a type of graphical representation of the data in which bars of uniform width are drawn
with equal spacing between them on one axis (x-axis usually), depicting the variable. The values of the variables are
represented by the height of the bars.
Histograms: This is similar to bar graphs, but it is based on frequency of numerical values rather than their actual
values. The data is organized into intervals and the bars represent the frequency of the values in that range. That is, it
counts how many values of the data lie in a particular range.
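The counting step behind a histogram can be sketched in plain Python. The helper `histogram_counts` and the bin settings (start 20, width 5, four bins) are assumptions for illustration; the scores are the observations reused from the median example later in these notes.

```python
def histogram_counts(values, start, width, n_bins):
    """Count values in each interval [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

scores = [20, 22, 23, 25, 26, 28, 31, 32, 36, 38]
counts = histogram_counts(scores, start=20, width=5, n_bins=4)

# Text rendering: one '#' per observation in the interval.
for i, c in enumerate(counts):
    low = 20 + 5 * i
    print(f"{low}-{low + 5}: {'#' * c}")
```

Unlike a bar graph, the bar heights here are frequencies of ranges, not the values themselves.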
Line Plot: A line plot (also called a dot plot) shows how often each value occurs by stacking marks above a number
line. For example, a line plot of the scores of the students in a class places one circle above each score for
every student who obtained that mark, so the number of circles above a score is the number of students with that score.
Stem and Leaf Plot: A stem and leaf plot is a graphical representation used to organize and display quantitative data in
a semi-tabular form. It helps in visualizing the distribution of the data set and retains the original data values, making
it easy to identify the shape, central tendency, and variability of the data.
A stem and leaf plot splits each data point into a "stem" and a "leaf." The "stem" represents the leading digits, while
the "leaf" represents the trailing digit. This separation makes it easy to organize data and see patterns.
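The stem/leaf split can be sketched as follows for whole numbers, taking the last digit as the leaf and the remaining leading digits as the stem. The function name `stem_and_leaf` is an assumption; the data are the observations from the median example in these notes.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group integers by leading digits (stem) and last digit (leaf)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

data = [25, 36, 31, 23, 22, 26, 38, 28, 20, 32]
plot = stem_and_leaf(data)
for stem in sorted(plot):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
```

Every original value can be read back from the plot, for example stem 2 with leaf 5 is 25.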
Box and Whisker Plot: A Box and Whisker Plot, also called a box plot, is a graphical representation of numerical data
that shows the distribution and variability. It's great for detecting outliers, identifying median values, and
understanding data spread.
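The numbers a box plot draws (minimum, quartiles, maximum) and a common outlier rule can be sketched with the standard statistics module. Two assumptions are made here: the quartiles use the "inclusive" method (other conventions give slightly different values), and outliers are flagged with the widely used 1.5 × IQR rule, which the notes do not specify. The value 70 is a hypothetical extreme added for illustration.

```python
import statistics

# Observations from the median example plus one hypothetical extreme value (70).
data = [25, 36, 31, 23, 22, 26, 38, 28, 20, 32, 70]

# Quartiles Q1, Q2 (median), Q3 using the "inclusive" convention.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the length of the box

# 1.5 * IQR rule for flagging outliers (an assumed, common convention).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("five-number summary:", min(data), q1, q2, q3, max(data))
print("outliers:", outliers)
```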
Pie Chart: A pie chart is a circular statistical graph that visually represents data as proportional slices of a whole. Each
slice corresponds to a category, and the size of the slice is determined by its percentage of the total
Pareto Chart: A Pareto chart combines a bar graph with a line graph. Each category is drawn as a bar whose length
represents its individual value, and the bars are arranged from the longest to the shortest. The line shows the
cumulative total, which makes it easy to see which few categories account for most of the total.
These charts can also be created in spreadsheet tools such as Excel. They give a statistical summary of a large
amount of information for each category.
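The two ingredients of a Pareto chart, bars sorted in descending order and a cumulative line, can be computed as in this sketch. The defect categories and counts are hypothetical.

```python
# Hypothetical defect counts per category.
defects = {"Scratches": 15, "Dents": 45, "Misalignment": 30, "Other": 10}

# Bars: categories sorted from largest to smallest count.
ordered = sorted(defects.items(), key=lambda kv: kv[1], reverse=True)

# Line: running (cumulative) percentage of the total.
total = sum(defects.values())
running = 0
pareto = []
for category, count in ordered:
    running += count
    pareto.append((category, count, 100 * running / total))

for category, count, cumulative_pct in pareto:
    print(f"{category:<13} {count:>3} {cumulative_pct:6.1f}%")
```

In this made-up data the first two categories already cover 75% of all defects, which is the kind of insight the chart is meant to surface.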
Advantages and Disadvantages of Using Graphical System
Advantages
It gives us a summary of the data which is easier to look at and analyze.
It saves time.
We can compare and study more than one variable at a time.
Disadvantages
It usually shows only one aspect of the data and ignores the others. For example, a bar graph does not
represent the mean, median, and other statistics of the data.
Interpretation of graphs can vary based on individual perspectives, leading to subjective conclusions.
Poorly constructed or misleading visuals can distort data interpretation and lead to incorrect conclusions.
Measures of central tendency of Quantitative Data:
Measures of central tendency are statistical metrics that summarize a dataset by identifying the center point or typical
value within that data. For quantitative data, there are three primary measures of central tendency: the mean,
median, and mode. Here’s a detailed explanation of each.
1. Mean
The mean is the average of a set of numbers. To find the mean, add up all the numbers in a dataset and then divide by
the total number of values.
Mean = Sum of all values / Total number of values
$\bar{x} = \frac{\sum x_i}{N}$
Where,
$\bar{x}$ is the mean,
$\sum x_i$ is the sum of all terms in the data set,
N is the total number of terms.
Mean for Ungrouped Data: The mean for ungrouped data, also known as the arithmetic mean ($\bar{x}$), is calculated by
summing all individual values in the dataset and dividing the sum by the total number of values. This provides a
single representative value that reflects the central tendency of the data. It is commonly used to understand the
average or typical value of a given set of observations. The formula is $\bar{x} = (x_1 + x_2 + \dots + x_N) / N$.
Geometric Mean
The geometric mean is a type of average that is especially useful for sets of numbers whose values are meant
to be multiplied together or are exponential in nature. It is calculated by multiplying all the values together
and then taking the n-th root (where n is the number of values).
Harmonic Mean
The harmonic mean is another type of average that is particularly useful for rates. It is calculated as the
reciprocal of the average of the reciprocals of a set of numbers.
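The three means described above can be compared with Python's standard statistics module. The data values are hypothetical and chosen so the results are easy to check by hand.

```python
import statistics

values = [2, 4, 8]  # hypothetical positive data

arithmetic = statistics.mean(values)            # (2 + 4 + 8) / 3
geometric = statistics.geometric_mean(values)   # cube root of 2 * 4 * 8 = 64
harmonic = statistics.harmonic_mean(values)     # 3 / (1/2 + 1/4 + 1/8)

print(arithmetic, geometric, harmonic)
```

For positive values that are not all equal, the harmonic mean is the smallest of the three and the arithmetic mean is the largest.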
2. Median
The median is the middle value in a dataset when it's arranged in ascending or descending order. If there's an even
number of values, the median is the average of the two middle numbers.
Characteristics:
Not affected by extreme values (more robust against outliers).
Can be used with ordinal data, as well as with interval and ratio data.
Median of Ungrouped Data: To calculate the Median, the observations must be arranged in ascending or descending
order. If the total number of observations is N then there are two cases
Case 1: N is Odd
Median = Value of observation at [(N + 1) ÷ 2]th Position
Case 2: N is Even
Median = Arithmetic mean of Values of observations at (N ÷ 2)th and [(N ÷ 2) + 1]th Position
Example 1: If the observations are 25, 36, 31, 23, 22, 26, 38, 28, 20, 32 then the Median is given by
Arranging data in ascending order: 20, 22, 23, 25, 26, 28, 31, 32, 36, 38
N = 10 which is even then
Median = Arithmetic mean of values at (10 ÷ 2)th and [(10 ÷ 2) + 1]th position
⇒ Median = (Value at 5th position + Value at 6th position) ÷ 2
⇒ Median = (26 + 28) ÷ 2
⇒ Median = 27
Example 2: If the observations are 25, 36, 31, 23, 22, 26, 38, 28, 20 then the Median is given by
Arranging the data in ascending order: 20, 22, 23, 25, 26, 28, 31, 36, 38
N = 9 which is odd then
Median = Value at [(9 + 1) ÷ 2]th position
⇒ Median = Value at 5th position
⇒ Median = 26
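The two cases can be written as a short function and checked against both examples. The function below is a sketch of the rule as stated in these notes; Python's own statistics.median is called alongside it as a cross-check.

```python
import statistics

def median(values):
    """Median following the two cases above (odd N, even N)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # [(N + 1) / 2]-th value, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of (N/2)-th and (N/2 + 1)-th values

example_1 = [25, 36, 31, 23, 22, 26, 38, 28, 20, 32]  # N = 10 (even)
example_2 = [25, 36, 31, 23, 22, 26, 38, 28, 20]      # N = 9 (odd)

print(median(example_1), statistics.median(example_1))  # 27.0 27.0
print(median(example_2), statistics.median(example_2))  # 26 26
```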
Median of Grouped Data
Median of Grouped Data is given as follows:
Median = l + ((N/2 - cf) / f) × h
where,
l is the lower limit of the median class,
N is the total number of observations,
cf is the cumulative frequency of the class preceding the median class,
f is the frequency of the median class,
h is the class width of the median class.
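A minimal sketch of the grouped-data calculation, using the standard form Median = l + ((N/2 - cf) / f) × h, where f is the frequency of the median class and h is its width. The frequency table is hypothetical, and the function name `grouped_median` is an assumption.

```python
def grouped_median(classes):
    """classes: list of (lower_limit, upper_limit, frequency) in ascending order."""
    n = sum(freq for _, _, freq in classes)
    if n == 0:
        raise ValueError("no observations")
    cumulative = 0  # cumulative frequency of the classes seen so far
    for lower, upper, freq in classes:
        if cumulative + freq >= n / 2:
            # Median class found; `cumulative` is cf of the preceding class.
            return lower + ((n / 2 - cumulative) / freq) * (upper - lower)
        cumulative += freq

# Hypothetical frequency table: (lower limit, upper limit, frequency).
table = [(0, 10, 5), (10, 20, 8), (20, 30, 12), (30, 40, 10), (40, 50, 5)]
print(grouped_median(table))
```

Here N = 40, so N/2 = 20 falls in the class 20 to 30, giving l = 20, cf = 13, f = 12, h = 10 and a median of 20 + (7/12) × 10, about 25.83.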
3. Mode
Definition: The mode is the value that appears most frequently in a dataset.
Calculation:
Identify the value that appears most often.
A dataset can have one mode (unimodal), multiple modes (bimodal or multimodal), or no mode at all.
Characteristics:
Useful for categorical data as well as quantitative data.
Indicates the most common value in a dataset.
Mode of Ungrouped Data:
Mode of Ungrouped Data can be simply calculated by observing the observation with the
highest frequency. Let’s see an example of the calculation of the mode of ungrouped data.
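As an illustration with hypothetical data, the sketch below counts frequencies and picks the most frequent value. It uses `statistics.mode` for a single mode and `statistics.multimode` when several values tie.

```python
import statistics
from collections import Counter

marks = [4, 7, 7, 9, 4, 7, 2]   # hypothetical data: 7 occurs most often
two_modes = [1, 1, 2, 2, 3]     # hypothetical data: 1 and 2 tie
colours = ["red", "blue", "red"]  # the mode also works for categorical data

frequency = Counter(marks)                 # value -> number of occurrences
mode_value = statistics.mode(marks)        # single most frequent value
modes = statistics.multimode(two_modes)    # all values sharing the top frequency
colour_mode = statistics.mode(colours)

print(frequency.most_common(1), mode_value, modes, colour_mode)
```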
Measures of Variability of Quantitative Data
Understanding the variability of quantitative data is essential for data analysis in data science.
Variability measures help assess how spread out or dispersed the data points are within a dataset.
Here’s a summary of the three key measures: Range, Standard Deviation, and Variance.
1. Range
Definition: The range is the difference between the maximum and minimum values in a
dataset.
Formula: Range = Maximum Value − Minimum Value
Usefulness:
It provides a simple measure of variability.
However, it is sensitive to outliers, which can skew the result.
2. Standard Deviation
Standard deviation is the square root of the variance. It provides a more interpretable measure of
how spread out the values are in comparison to the mean.
$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N}}$
Where,
$x_i$ represents each term in the data set,
$\sigma^2$ is the variance,
$\sqrt{\sigma^2}$ is the standard deviation,
$\bar{x}$ is the mean,
N is the total number of terms.
3. Variance
Variance measures how spread out the values in a dataset are. It's calculated by finding the average of
the squared differences between each value and the mean.
$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{N}$
Where,
$\sigma^2$ is the variance,
$\sum (x_i - \bar{x})^2$ is the sum of squared differences between each term and the mean,
N is the total number of terms.
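The three measures can be computed directly from their definitions, as in this sketch with hypothetical data. The variance here divides by N (the population form used in these notes); the sample form, which divides by N − 1, is a different convention and is not shown.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

# Range: maximum minus minimum.
data_range = max(data) - min(data)

# Variance: average of squared differences from the mean (divide by N).
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(data_range, variance, std_dev)
# The standard library's population functions give the same results.
print(statistics.pvariance(data), statistics.pstdev(data))
```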
Probability-
Probability is a measure of the likelihood or chance of an event occurring. It is expressed as a number
between 0 and 1, where 0 indicates an impossible event, and 1 signifies a sure event. The probability of
an event is calculated by dividing the number of favorable outcomes by the total number of possible
outcomes. In simple terms, it quantifies the likelihood of an outcome in a given set of circumstances,
providing a basis for making informed predictions and decisions in various fields, including mathematics,
statistics, and everyday life.
Statistics-
Statistics is the branch of mathematics that involves the collection, analysis, interpretation, presentation,
and organization of data. It provides methods for making inferences about populations based on
samples. In a broader sense, statistics helps to quantify uncertainty and variation in data, enabling
researchers, analysts, and decision-makers to draw meaningful conclusions and make informed decisions.
It encompasses various techniques, including descriptive statistics to summarize data and inferential
statistics to make predictions or test hypotheses about larger populations.
Terms Related to Probability and Statistics-
Random Experiment: An experiment is a set of steps that gives clear results. A random
experiment is one where the exact outcome cannot be predicted.
Outcome: Outcome means any possible result in a group of results, called a sample space, noted
as S. For example, when you flip a fair coin, the sample space is {heads, tails}.
Sample Space: Sample space is the collection of all possible outcomes in an experiment. Like in a
coin flip, the sample space is {heads, tails}.
Event: An event is any subset of the sample space. If an event A occurs, it means one of the
outcomes belongs to A. For instance, if event A is rolling an even number on a fair six-sided die,
getting 2, 4, or 6 means event A occurred. If you get 1, 3, or 5, event A did not happen.
Trial: A trial is a single performance of the experiment. In the coin-flipping experiment,
each flip of the coin is a trial.
Mean: The mean of a random variable is the average of all possible values it can take, weighted
by their probabilities.
Expected Value: The expected value is the mean of a random variable. For instance, if we roll a
six-sided die, the expected value is the average of all possible outcomes, which is 3.5.
Probability Formulas
Probability is the likelihood of an event occurring. When all outcomes are equally likely, it is calculated using
the following formula:
P(A) = Number of Favorable Outcomes / Total Number of Possible Outcomes
Where:
P(A) is the probability of event A.
Number of Favorable Outcomes is the count of outcomes where event A occurs.
Total Number of Possible Outcomes is the count of all possible outcomes.
Addition Rule Formula-
The addition rule of probability is used to find the probability that at least one of two events occurs.
If events A and B are mutually exclusive (they cannot happen at the same time), then the
probability of either event A or event B occurring is:
P(A or B) = P(A ∪ B) = P(A) + P(B)
If A and B are not mutually exclusive events, then:
P(A or B) = P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
where P(A ∩ B) is the probability of both A and B occurring.
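Both cases of the addition rule can be checked on one roll of a fair six-sided die. The events chosen here are illustrative, and the helper `prob` assumes equally likely outcomes. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair six-sided die

def prob(event):
    """P(event) = favorable outcomes / total outcomes (equally likely outcomes)."""
    return Fraction(len(event & sample_space), len(sample_space))

even = {2, 4, 6}
greater_than_4 = {5, 6}
one = {1}

# Not mutually exclusive: 6 is in both events, so the overlap is subtracted once.
p_even_or_gt4 = prob(even) + prob(greater_than_4) - prob(even & greater_than_4)

# Mutually exclusive: no shared outcomes, so the probabilities simply add.
p_even_or_one = prob(even) + prob(one)

print(p_even_or_gt4, p_even_or_one)  # 2/3 2/3
```

Counting the union directly gives the same answers, since {2, 4, 5, 6} and {1, 2, 4, 6} each contain 4 of the 6 outcomes.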
Multiplication Rule Formula
The multiplication rule of probability is used to find the probability of two events occurring together.
If events A and B are independent (they do not affect each other), then:
P(A ∩ B) = P(A) × P(B)
If events A and B are dependent (the occurrence of A affects the occurrence of B), then:
P(A ∩ B) = P(A) × P(B∣A)
Here, P(B∣A) is the likelihood of event B happening when event A has already occurred.
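A short sketch of the multiplication rule in both situations. The coin and card scenarios are illustrative examples, not taken from the notes.

```python
from fractions import Fraction

# Independent events: two flips of a fair coin.
p_heads = Fraction(1, 2)
p_two_heads = p_heads * p_heads  # P(A ∩ B) = P(A) × P(B)

# Dependent events: drawing two aces from a 52-card deck without replacement.
p_first_ace = Fraction(4, 52)               # P(A)
p_second_ace_given_first = Fraction(3, 51)  # P(B | A): one ace is already removed
p_two_aces = p_first_ace * p_second_ace_given_first  # P(A ∩ B) = P(A) × P(B | A)

print(p_two_heads, p_two_aces)  # 1/4 1/221
```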
Bayes' Rule
Bayes' Rule is a formula used to update probabilities based on new evidence. It calculates the probability
of an event A happening given the occurrence of another event B. The formula is as follows:
P(A∣B) = P(B∣A) × P(A) / P(B)
Here:
P(A∣B) is the probability of event A occurring given that event B has occurred.
P(B∣A) is the probability of event B occurring given that event A has occurred.
P(A) and P(B) are the probabilities of events A and B occurring, respectively.
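A worked sketch of Bayes' Rule with hypothetical numbers: A is "a message is spam" and B is "the message contains a particular word". P(B) is not given directly, so it is computed with the law of total probability, a step added here that the notes do not cover.

```python
from fractions import Fraction

# Hypothetical probabilities (assumptions for illustration only).
p_a = Fraction(1, 5)               # P(A): message is spam
p_b_given_a = Fraction(1, 2)       # P(B | A): word appears in a spam message
p_b_given_not_a = Fraction(1, 20)  # P(B | not A): word appears in a non-spam message

# P(B) via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' Rule: P(A | B) = P(B | A) × P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(p_b, p_a_given_b)  # 7/50 5/7
```

Seeing the word raises the probability of spam from the prior 1/5 to 5/7, which is the "updating on new evidence" that the rule describes.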
Conditional Probability
Conditional probability is the likelihood of an event occurring, given that another event has already
occurred. It is denoted P(A | B), read as "A given B", and indicates the probability of event A when
event B has already occurred.
P(A | B) = P(A ∩ B) / P(B)
Where,
P(A ∩ B) represents the probability of both events A and B occurring simultaneously.
P(B) represents the probability of event B occurring, which must be greater than 0.
Steps to Find Probability of One Event Given Another Has Already Occurred
To calculate the conditional probability, we can use the following step-by-step method:
Step 1: Identify the Events. Let's call them Event A and Event B.
Step 2: Determine the Probability of Event A i.e., P(A)
Step 3: Determine the Probability of Event B i.e., P(B)
Step 4: Determine the Probability of Event A and B i.e., P(A ∩ B).
Step 5: Apply the Conditional Probability Formula and calculate the required probability.
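The steps above can be sketched for one roll of a fair six-sided die, using P(A | B) = P(A ∩ B) / P(B). The events (A: even number, B: number greater than 3) and the helper names are assumptions for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair six-sided die

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

def conditional(a, b):
    """P(A | B) = P(A ∩ B) / P(B); undefined when P(B) = 0."""
    if prob(b) == 0:
        raise ValueError("P(B) must be greater than 0")
    return prob(a & b) / prob(b)

# Step 1: identify the events.
event_a = {2, 4, 6}  # A: the roll is even
event_b = {4, 5, 6}  # B: the roll is greater than 3

# Steps 2 to 4: P(A), P(B), and P(A ∩ B).
p_a = prob(event_a)
p_b = prob(event_b)
p_a_and_b = prob(event_a & event_b)

# Step 5: apply the formula.
p_a_given_b = conditional(event_a, event_b)
print(p_a, p_b, p_a_and_b, p_a_given_b)  # 1/2 1/2 1/3 2/3
```

Knowing that the roll is greater than 3 leaves three possible outcomes (4, 5, 6), two of which are even, which matches 2/3.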