Introduction to Data Science
(23CSH-283)
ALL UNITS - NOTES & QUESTIONS
Compiled by : Subhayu
Contents :
(Click on a unit below to skip to that particular unit)
Unit 1
Unit 2
Unit 3
MST 1 and 2 solutions
Sample Questions
UNIT-1 : Data Science - An Overview
Contact Hours: 10
Chapter 1 : Introduction
Definition and Description
Data Science is an interdisciplinary field that uses scientific methods, processes,
algorithms, and systems to extract knowledge and insights from structured and
unstructured data. It combines elements of mathematics, statistics, computer
science, and domain expertise to analyze and solve real-world problems.
● Key Components of Data Science:
○ Data Collection: Gathering data from various sources (databases,
web scraping, sensors, etc.).
○ Data Processing: Cleaning, transforming, and organizing data to
make it usable.
○ Data Analysis: Using statistical methods and algorithms to
understand data patterns.
○ Modeling: Building predictive or descriptive models using Machine
Learning (ML) and Artificial Intelligence (AI).
○ Visualization: Presenting data insights in a human-readable format
using charts, graphs, and dashboards.
Important Terminologies in Data Science
1. Data: Information in raw form that can be structured (tables, databases)
or unstructured (text, images, videos).
○ Big Data: Extremely large datasets that traditional data-processing
tools cannot handle efficiently.
○ Metadata: Data about data, providing information like structure,
format, and origin.
2. Data Science Pipeline:
○ Data Acquisition: Gathering raw data.
○ Data Cleaning: Removing inaccuracies or inconsistencies.
○ Exploratory Data Analysis (EDA): Gaining preliminary insights into
the dataset.
○ Model Building: Using statistical or machine learning models.
○ Model Evaluation: Testing the accuracy and performance of models.
○ Deployment: Applying the model to real-world data.
3. Machine Learning (ML): A subset of AI focused on building models that
enable computers to learn from and make decisions based on data.
4. Artificial Intelligence (AI): Broader than ML, it involves machines mimicking
human intelligence to perform tasks.
5. Feature Engineering: The process of selecting, transforming, or creating
variables (features) to improve model performance.
6. Overfitting and Underfitting:
○ Overfitting: Model performs well on training data but poorly on new
data.
○ Underfitting: Model is too simple and performs poorly on both
training and test data.
Overview of Data Science Techniques
1. Data Wrangling: The process of cleaning and structuring raw data into a
desired format.
2. Exploratory Data Analysis (EDA):
○ Understanding data characteristics using visualization (e.g.,
histograms, scatter plots).
○ Summarizing data using descriptive statistics like mean, median,
mode, and variance.
3. Statistical Modeling:
○ Regression Analysis: Understanding the relationship between
variables.
○ Hypothesis Testing: Checking if assumptions about data are valid.
4. Machine Learning:
○ Supervised Learning: Predicting outcomes using labeled data (e.g.,
Linear Regression, Decision Trees).
○ Unsupervised Learning: Finding patterns in unlabeled data (e.g.,
Clustering, PCA).
○ Reinforcement Learning: Learning through trial and error to
maximize rewards.
5. Data Visualization: Creating visual representations of data using tools like
Matplotlib, Seaborn, Tableau, and Power BI.
6. Big Data Analytics: Using frameworks like Hadoop, Spark to process and
analyze massive datasets.
7. Natural Language Processing (NLP): Techniques for analyzing and
processing text data (e.g., sentiment analysis, text summarization).
Challenges in Data Science
1. Data Quality Issues:
○ Missing, inconsistent, or inaccurate data can hinder analysis.
2. High Dimensionality:
○ Handling datasets with many features or variables is computationally
challenging.
3. Data Privacy and Security:
○ Ensuring the ethical and secure use of sensitive data.
4. Scalability:
○ Managing and processing massive datasets efficiently.
5. Model Interpretability:
○ Making complex machine learning models understandable to
stakeholders.
6. Domain Knowledge:
○ Lack of subject expertise can lead to incorrect assumptions or
interpretations.
7. Evolving Tools and Techniques:
○ Rapidly changing technologies make it challenging to keep up.
Opportunities in Data Science
1. Business Insights:
○ Providing actionable insights to improve business strategies.
2. Personalization:
○ Enhancing customer experiences through recommendation systems
(e.g., Netflix, Amazon).
3. Healthcare:
○ Using predictive analytics for early diagnosis and personalized
treatments.
4. Automation:
○ Automating repetitive tasks with AI-driven systems.
5. Fraud Detection:
○ Identifying anomalies in financial transactions to prevent fraud.
6. Environmental Monitoring:
○ Using data to track climate change, predict natural disasters, etc.
Chapter 2 : Data Science and Business Analytics
1. Difference between Data Science and Business Analytics

| Feature | Data Science | Business Analytics |
|---------|--------------|--------------------|
| Focus | Technical, scientific approach to data | Business-driven insights and strategies |
| Techniques | Machine Learning, AI, Big Data | Descriptive & Predictive Analytics |
| Scope | Broader; includes data engineering, ML | Narrower; focused on decision-making |
| Tools | Python, R, TensorFlow | Excel, Power BI, Tableau, SQL |
| Objective | Discover patterns, build models | Understand business problems and solve them |
2. Importance of Data Science
● Improved Decision-Making: Provides actionable insights for better business
strategies.
● Automation: Enables the development of AI systems to automate routine
tasks.
● Trend Identification: Helps detect patterns and trends for proactive
business strategies.
● Customer Understanding: Personalizes user experiences by analyzing
consumer behavior.
● Optimization: Optimizes processes, resources, and operations to increase
efficiency.
● Competitive Advantage: Offers deeper insights than traditional analytics,
helping businesses stay ahead.
3. Primary Components of Data Science
1. Data Collection: Gathering data from various sources such as databases,
APIs, sensors, or web scraping.
○ Tools: SQL, NoSQL, APIs.
2. Data Processing: Cleaning and transforming raw data into usable formats.
○ Techniques: Handling missing values, normalization, encoding
categorical data.
3. Data Analysis: Analyzing data to identify trends, correlations, and
patterns.
○ Methods: Statistical analysis, exploratory data analysis (EDA).
4. Data Visualization: Representing data visually for better understanding.
○ Tools: Tableau, Power BI, Matplotlib, Seaborn.
5. Modeling and Algorithms: Using machine learning or statistical models for
predictions and solutions.
○ Examples: Regression, Classification, Clustering.
6. Deployment and Communication: Deploying models in production and
communicating results to stakeholders.
○ Tools: Flask, Streamlit, Dash, Excel for reports.
4. Users of Data Science
● Business Analysts: Use insights for strategic planning and decision-making.
● Data Scientists: Build predictive models and develop machine learning
algorithms.
● Marketing Professionals: Analyze customer behavior and create targeted
campaigns.
● Healthcare Professionals: Predict diseases and improve patient care.
● Engineers: Use data science for predictive maintenance and system
optimization.
● Government: Leverage data for policy-making and citizen services.
5. Data Science Hierarchy
The Data Science hierarchy describes the step-by-step process involved in data
science workflows:
1. Data Collection
○ Collecting structured and unstructured data from multiple sources.
2. Data Cleaning and Preprocessing
○ Removing errors, handling missing values, and transforming data.
3. Data Exploration (EDA)
○ Understanding data distributions, patterns, and anomalies.
4. Feature Engineering
○ Creating new features and selecting the most relevant ones.
5. Model Building
○ Training predictive models using machine learning techniques.
6. Model Evaluation
○ Testing model performance using metrics like accuracy, precision,
recall, etc.
7. Deployment and Monitoring
○ Deploying the model in production and monitoring its performance
over time.
Chapter 3 : Linear Algebra in Data Science
Sample Questions :
UNIT-2: Mathematics &
Statistics in Data Science
Contact Hours: 10
Chapter 4 : Mathematics in Data
Science
1. Role of Mathematics in Data Science
Mathematics is the backbone of Data Science, enabling modeling, analysis, and
decision-making. The key mathematical areas used in Data Science include:
● Linear Algebra → Used for handling datasets (e.g., matrices in machine
learning).
● Probability & Statistics → Helps in predictions, measuring uncertainty, and
hypothesis testing.
● Calculus → Used in optimization (e.g., gradient descent for training ML
models).
● Discrete Mathematics → Important for algorithms and data structures in
Data Science.
🔹 Example:
● Predicting stock prices → Uses probability & statistics.
● Image recognition → Uses linear algebra for processing pixel data.
2. Importance of Probability & Statistics in Data
Science
Probability:
● Measures the likelihood of an event happening.
● Used in Bayesian inference, Markov Chains, and Machine Learning models.
● Helps in making predictions based on past data.
🔹 Example:
● Weather forecasting → Predicts rain probability based on past data.
Statistics:
● Helps in analyzing, summarizing, and visualizing data.
● Provides insights into data trends, variability, and patterns.
● Used in feature selection, anomaly detection, and model evaluation.
🔹 Example:
● Medical studies → Analyzing patient recovery data using statistical tests.
3. Important Types of Statistical Measures in Data
Science
(i) Descriptive Statistics
● Summarizes data to provide insights.
● Includes: Mean, Median, Mode, Standard Deviation, Variance.
● Used in: Exploratory Data Analysis (EDA).
🔹 Example:
● A company wants to analyze employee salaries. They compute average
salary, salary distribution, and standard deviation to understand
disparities.
(ii) Predictive Statistics
● Helps in making future predictions based on patterns in data.
● Includes: Regression, Time Series Analysis.
● Used in: Forecasting trends and future outcomes.
🔹 Example:
● Predicting house prices based on past sales data.
(iii) Prescriptive Statistics
● Provides actionable recommendations based on data analysis.
● Uses optimization techniques and decision models to suggest the best
actions.
🔹 Example:
● Amazon’s recommendation system suggests products based on user
preferences and past purchases.
4. Exploratory Data Analysis (EDA) and
Visualization Techniques
EDA is the process of analyzing datasets to summarize key characteristics using
visuals and statistics.
EDA Steps:
1. Understand the dataset (Columns, Data Types).
2. Check for missing values.
3. Detect outliers using boxplots.
4. Identify patterns & correlations using scatter plots, histograms, and
heatmaps.
5. Summarize statistics using mean, variance, standard deviation.
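🔹 A minimal pandas sketch of these EDA steps (illustrative only; it assumes a hypothetical file data.csv with some numeric columns):

```python
import pandas as pd

# Hypothetical dataset -- replace "data.csv" with a real file path
df = pd.read_csv("data.csv")

# Step 1: understand the dataset (columns, data types)
df.info()

# Step 2: check for missing values per column
print(df.isnull().sum())

# Step 3: detect outliers using the IQR rule (the logic behind a boxplot)
num = df.select_dtypes("number")
q1, q3 = num.quantile(0.25), num.quantile(0.75)
iqr = q3 - q1
print(((num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)).sum())

# Step 4: identify patterns & correlations between numeric columns
print(num.corr())

# Step 5: summary statistics (mean, std, quartiles)
print(df.describe())
```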
🔹 Common Visualization Techniques:
● Histograms → Show frequency distribution.
● Boxplots → Detect outliers.
● Scatter Plots → Identify relationships between two variables.
● Heatmaps → Show correlations between multiple variables.
5. Difference Between Exploratory and Descriptive
Statistics
| Feature | Exploratory Statistics (EDA) | Descriptive Statistics |
|---------|------------------------------|------------------------|
| Purpose | Finds patterns & relationships in data | Summarizes data in a meaningful way |
| Tools Used | Graphs, visualizations, hypothesis testing | Measures like mean, median, standard deviation |
| Example | Checking missing values, detecting trends in data | Computing average sales revenue of a company |
🔹 Example:
● Descriptive Statistics: "The average height of students in a class is 5.6 ft."
● Exploratory Data Analysis: "Let's check if height and weight are
correlated using a scatter plot."
✅ Conclusion:
● Mathematics is essential for data-driven decision-making.
● Probability & Statistics help in understanding data, predicting trends, and
making decisions.
● EDA & Visualization are crucial for analyzing datasets and finding hidden
patterns.
Chapter 5 : Statistics in Data Science
1. Statistical Modeling in Data Science
Statistical modeling is the process of applying statistical techniques to understand
data patterns, relationships, and trends. It helps in making predictions, estimating
probabilities, and optimizing decision-making processes.
Types of Statistical Models
1. Descriptive Models: Summarize past data patterns (e.g., mean, variance,
histograms).
2. Predictive Models: Forecast future trends using past data (e.g., regression
models).
3. Prescriptive Models: Suggest actions based on predictive insights (e.g.,
decision trees).
Statistical models play a crucial role in machine learning, data analysis, and
hypothesis testing by allowing us to quantify relationships between variables.
2. Descriptive Statistics
Descriptive statistics help summarize and organize data for easy interpretation.
2.1 Measures of Central Tendency
These describe the "center" of a dataset:
● Mean (Average):
Mean = ∑Xᵢ / n
It is sensitive to outliers.
● Median: The middle value when data is sorted. It is robust to outliers.
● Mode: The most frequently occurring value in the dataset. Useful for
categorical data.
2.2 Measures of Dispersion (Spread of Data)
● Range: Difference between the maximum and minimum values.
● Variance: Measures the spread of data around the mean. σ² = ∑(Xᵢ − x̄)² / n
● Standard Deviation: Square root of variance, providing a more
interpretable measure of data spread. σ = √variance
● Interquartile Range (IQR): Difference between the 75th and 25th
percentiles, useful for detecting outliers.
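🔹 A short Python sketch computing these measures on a small hypothetical sample (the values, including the outlier 95, are made up):

```python
import numpy as np
from statistics import mode

# Hypothetical sample; 95 is an outlier
data = np.array([12, 15, 15, 18, 20, 22, 95])

print("Mean:", np.mean(data))      # pulled upward by the outlier
print("Median:", np.median(data))  # robust to the outlier
print("Mode:", mode(data.tolist()))
print("Range:", np.ptp(data))      # max - min
print("Variance:", np.var(data))   # population variance (divides by n)
print("Std dev:", np.std(data))

q1, q3 = np.percentile(data, [25, 75])
print("IQR:", q3 - q1)
```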
2.3 Shape of Distribution
● Skewness:
○ Positive Skew: Tail on the right, data is concentrated on the left.
○ Negative Skew: Tail on the left, data is concentrated on the right.
● Kurtosis: Measures the "tailedness" of a distribution (high kurtosis = heavy
tails).
3. Notion of Probability
Probability quantifies the likelihood of an event occurring. It is essential in
statistics for modeling uncertainty.
3.1 Basic Probability Rules
● Probability of an event P(A): 0 ≤ P(A) ≤ 1
● Addition Rule: P(A∪B)=P(A)+P(B)−P(A∩B)
● Multiplication Rule (for independent events): P(A∩B)=P(A)×P(B)
3.2 Conditional Probability
P(A∣B)=P(A∩B)/P(B)
It describes the probability of event A occurring given that event B has already
occurred.
3.3 Bayes’ Theorem
P(A∣B) = P(B∣A) · P(A) / P(B)
Used in classification algorithms like Naïve Bayes in machine learning.
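🔹 A tiny worked illustration of Bayes' Theorem, using made-up spam-filter probabilities:

```python
# Hypothetical spam-filter numbers, purely for illustration:
# P(spam) = 0.2, P("free" | spam) = 0.5, P("free") = 0.25
p_spam = 0.2
p_free_given_spam = 0.5
p_free = 0.25

# Bayes' Theorem: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(p_spam_given_free)  # 0.4
```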
4. Probability Distributions
A probability distribution defines how values in a dataset are spread out.
4.1 Discrete Probability Distributions
1. Bernoulli Distribution: Models a single binary outcome (success/failure).
2. Binomial Distribution: Number of successes in multiple trials.
3. Poisson Distribution: Probability of a fixed number of events occurring in a
given time frame.
4.2 Continuous Probability Distributions
1. Normal (Gaussian) Distribution: Bell-shaped curve used in statistics and
machine learning.
○ Properties: Symmetric, mean = median = mode, 68%-95%-99.7%
rule.
2. Exponential Distribution: Models waiting times between independent
events.
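🔹 Illustrative scipy.stats computations for these distributions (the parameter values here are examples, not from the notes):

```python
from scipy import stats

# Binomial: P(exactly 3 successes in 10 trials with p = 0.5)
print(stats.binom.pmf(3, n=10, p=0.5))

# Poisson: P(2 events in an interval with average rate 4)
print(stats.poisson.pmf(2, mu=4))

# Normal: the "68%" part of the 68%-95%-99.7% rule
print(stats.norm.cdf(1) - stats.norm.cdf(-1))  # ~0.6827

# Exponential: P(waiting time <= 1) for unit rate
print(stats.expon.cdf(1))  # ~0.632
```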
5. Mean, Variance, and Covariance
5.1 Mean (Expected Value)
The mean represents the average value of a dataset:
E[X] = ∑ Xᵢ P(Xᵢ)
5.2 Variance
Variance measures the spread of data points from the mean:
Var(X) = E[X²] − (E[X])²
Higher variance means greater spread.
5.3 Covariance
Covariance measures the relationship between two variables:
Cov(X,Y)=E[(X−μX)(Y−μY)]
● Positive Covariance: Variables increase together.
● Negative Covariance: One increases while the other decreases.
6. Covariance Matrix
A covariance matrix summarizes the relationships between multiple variables. For two variables X and Y it takes the form:

Σ = [ Var(X)    Cov(X,Y) ]
    [ Cov(X,Y)  Var(Y)   ]

It is used in Principal Component Analysis (PCA) for dimensionality reduction.
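🔹 A small numpy sketch (hypothetical data) computing a covariance matrix and its eigen-decomposition, which is the core step of PCA:

```python
import numpy as np

# Hypothetical data: rows = observations, columns = variables
X = np.array([[2.0, 3.0],
              [4.0, 6.0],
              [6.0, 9.0],
              [8.0, 11.0]])

# Covariance matrix (rowvar=False means columns are the variables)
C = np.cov(X, rowvar=False)
print(C)

# PCA uses the eigenvectors of this matrix as the principal directions
eigvals, eigvecs = np.linalg.eigh(C)
print(eigvals)  # variance captured along each principal axis
```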
7. Understanding Univariate and Multivariate
Normal Distributions
7.1 Univariate Normal Distribution
A normal distribution with one variable, defined by the density:
f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
7.2 Multivariate Normal Distribution
An extension of the normal distribution for multiple variables (k = number of variables):
f(x) = (1 / ((2π)^(k/2) |Σ|^(1/2))) · exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))
where:
● X is a vector of variables.
● μ is the mean vector.
● Σ is the covariance matrix.
Applications:
● Used in machine learning, PCA, and Gaussian Mixture Models (GMMs).
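🔹 Illustrative density evaluations with scipy.stats (the mean vector and covariance matrix below are made up):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Univariate: density of N(mu=0, sigma=1) at x = 0
print(norm.pdf(0, loc=0, scale=1))

# Multivariate: density of a 2-D normal at the point (0, 0),
# with a hypothetical mean vector and covariance matrix
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(multivariate_normal.pdf([0.0, 0.0], mean=mu, cov=Sigma))
```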
UNIT-3: Machine Learning
in Data Science
Contact Hours: 10
Chapter 6 : Machine Learning
What is Machine Learning?
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that enables
systems to automatically learn and improve from experience without being
explicitly programmed.
The primary goal is to develop algorithms that can identify patterns in data and
make decisions or predictions based on it.
Role of Machine Learning in Data Science
Machine Learning plays a crucial role in extracting patterns, making predictions,
and automating decisions within the broader field of Data Science.
● Prediction: Predicting customer behavior, product sales, stock prices.
● Classification: Email spam detection, sentiment analysis.
● Clustering: Customer segmentation, anomaly detection.
● Recommendation: Netflix/movie recommendations, product suggestions.
● Automation: Fraud detection, self-driving cars, chatbots.
Types of Machine Learning Techniques
Machine Learning is broadly divided into four types:
1. Supervised Learning
● Definition: The algorithm is trained on labeled data, i.e., input-output pairs
are known.
● Objective: Learn a mapping from inputs to outputs to predict unseen data.
● Examples:
○ Email spam detection (Spam / Not Spam)
○ Loan approval prediction
○ Handwriting recognition
Types of Supervised Learning:
● Classification: Output is a category (e.g., 'yes' or 'no').
● Regression: Output is a continuous value (e.g., price of a house).
2. Unsupervised Learning
● Definition: The algorithm is trained on unlabeled data, and it tries to
discover hidden patterns or structures.
● Objective: Find groupings or relationships in the data.
● Examples:
○ Customer segmentation
○ Market basket analysis
○ Anomaly detection (fraud)
Types of Unsupervised Learning:
● Clustering: Grouping similar data points (e.g., K-means)
● Dimensionality Reduction: Reducing data features (e.g., PCA)
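🔹 A minimal scikit-learn sketch of both techniques on hypothetical unlabeled data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical unlabeled data: 6 points with 4 features each
X = np.array([[1, 2, 1, 0],
              [1, 1, 2, 0],
              [9, 8, 9, 8],
              [8, 9, 9, 9],
              [1, 2, 2, 1],
              [9, 9, 8, 8]], dtype=float)

# Clustering: group similar points into 2 clusters
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)

# Dimensionality reduction: project the 4-D data onto 2 principal components
X2 = PCA(n_components=2).fit_transform(X)
print(X2.shape)  # (6, 2)
```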
3. Reinforcement Learning
● Definition: The algorithm learns by interacting with an environment. It
receives rewards or penalties based on its actions and learns to maximize
cumulative reward.
● Objective: Learn a policy to take optimal actions.
● Examples:
○ Game playing (e.g., AlphaGo)
○ Robotics
○ Self-driving cars
Key components:
● Agent: Learner or decision-maker
● Environment: Where the agent operates
● Action: What the agent can do
● Reward: Feedback from the environment
4. Deep Learning
● Definition: A subset of machine learning that uses neural networks with
multiple layers to model complex patterns.
● Objective: Automatically extract features and solve problems where
traditional algorithms fail.
● Examples:
○ Image recognition
○ Speech recognition
○ Language translation
○ ChatGPT
Structure:
● Built with Artificial Neural Networks (ANNs).
● Works best with large amounts of data and computational power.
Comparison of Machine Learning Techniques
| Aspect | Supervised Learning | Unsupervised Learning | Reinforcement Learning | Deep Learning |
|--------|---------------------|-----------------------|------------------------|---------------|
| Data Requirement | Labeled | Unlabeled | Interactive feedback | Large labeled/unlabeled |
| Goal | Predict output | Find structure/patterns | Learn through trial/error | High-level feature learning |
| Example Algorithms | Linear regression, SVM | K-means, PCA | Q-learning, SARSA | CNN, RNN, LSTM |
| Application Areas | Fraud detection, diagnosis | Customer grouping | Game playing, robotics | Vision, NLP, audio |
| Feedback Mechanism | Direct (known outputs) | None | Reward-based | Through error backpropagation |
Scope of Machine Learning
Machine Learning has vast applications across almost every industry:
● Healthcare: Disease prediction, drug discovery
● Finance: Stock price prediction, credit scoring
● Retail: Customer behavior prediction, demand forecasting
● Manufacturing: Predictive maintenance, quality control
● Agriculture: Crop yield prediction, soil health monitoring
● Entertainment: Content recommendation
● Transportation: Route optimization, autonomous vehicles
Conclusion
Machine Learning is the engine driving intelligent systems today. Understanding its
various types and their applications helps in building efficient solutions tailored to
different kinds of data and problems. As data grows, so does the scope and power
of ML.
Chapter 7 : Classification & Prediction
1. Introduction to Machine Learning Algorithms
Machine Learning algorithms are the core tools used to analyze data, learn from
patterns, and make decisions or predictions without being explicitly programmed.
Common Machine Learning Algorithms Used in Classification and Prediction
| Algorithm | Type | Description |
|-----------|------|-------------|
| Linear Regression | Prediction | Models linear relationships between inputs and continuous outputs. |
| Logistic Regression | Classification | Estimates the probability that a data point belongs to a certain class. |
| Decision Trees | Both | Splits the dataset based on features to predict class or value. |
| K-Nearest Neighbors (KNN) | Classification | Classifies based on the most common class among neighbors. |
| Support Vector Machine | Classification | Finds a hyperplane that best separates classes. |
| Naive Bayes | Classification | Uses probability and Bayes' Theorem for text classification and spam filtering. |
| Random Forest | Both | An ensemble of decision trees to increase accuracy and prevent overfitting. |
| Neural Networks | Both | Mimics the human brain to learn from large datasets. |
2. Importance of Machine Learning in Today’s Business
Machine Learning provides data-driven decision-making capabilities that enhance
efficiency, customer experience, and profits.
Key Business Benefits:
1. Customer Insights:
○ Predict purchasing behavior.
○ Personalize marketing campaigns.
2. Fraud Detection:
○ Identify unusual patterns in transactions.
3. Product Recommendations:
○ Netflix and Amazon recommend content/products using classification
algorithms.
4. Forecasting and Demand Prediction:
○ Predict product demand, inventory requirements.
5. Process Automation:
○ Chatbots, automated support, document classification.
6. Healthcare:
○ Disease prediction and diagnosis models.
7. Finance:
○ Credit scoring, stock trend prediction.
3. Classification vs Prediction
Both classification and prediction are part of Supervised Learning, but they serve
different purposes.
| Feature | Classification | Prediction (Regression) |
|---------|----------------|-------------------------|
| Output Type | Categorical (discrete labels) | Numerical (continuous values) |
| Goal | Assign input to a category | Estimate a numeric value |
| Examples | Spam or Not Spam, Yes or No, Class A/B/C | House price, temperature, sales forecast |
| Algorithms | Decision Trees, SVM, Naive Bayes, KNN | Linear Regression, Decision Trees, ANN |
Example of Classification:
Given customer data, predict whether they will buy a product (Yes/No).
Example of Prediction:
Given features like square footage, number of bedrooms, and location, predict the
price of a house.
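🔹 A minimal scikit-learn sketch contrasting the two tasks (all data values below are made up for illustration):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: will the customer buy? (1 = Yes, 0 = No)
# Hypothetical features: [age, monthly_site_visits]
X_cls = [[25, 2], [40, 8], [35, 1], [50, 9]]
y_cls = [0, 1, 0, 1]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[45, 7]]))  # outputs a category, e.g. [1]

# Prediction (regression): estimate a house price
# Hypothetical features: [square_feet, bedrooms]
X_reg = [[1000, 2], [1500, 3], [2000, 3], [2500, 4]]
y_reg = [200000, 280000, 350000, 430000]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[1800, 3]]))  # outputs a continuous value
```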
4. Use Cases to Understand the Difference
| Scenario | Task Type |
|----------|-----------|
| Predict if a loan applicant will default | Classification |
| Forecast next quarter's revenue | Prediction |
| Detect a fraudulent transaction | Classification |
| Estimate fuel consumption | Prediction |
| Classify handwritten digits | Classification |
5. Conclusion
● Classification is about identifying what category a data point belongs to.
● Prediction is about estimating a value based on input data.
● Both techniques are critical in solving real-world business problems and
contribute significantly to data-driven strategies across industries.
Solutions of Mid Semester Tests
Mid Semester Test 1
Section A (2x5= 10)
1. Identify the advantages of Data Science.
2. Describe the characteristics of overdetermined equation systems.
3. Explain the difference between Data Science and Business Analytics.
4. Differentiate between structured and unstructured data.
5. Describe a scenario where the responsibilities of a data scientist directly
impact decision-making.
Section B (5x2= 10)
6. Explain three types of data with examples. Differentiate among them.
7. Demonstrate how the pseudo-inverse helps in solving an overdetermined
system of linear equations, and why it is important.
Solutions/Answers :
1. Identify the advantages of Data Science.
Answer:
● Better Decision Making: Enables data-driven decisions by uncovering
hidden patterns and trends.
● Business Intelligence: Helps businesses improve strategies, marketing,
operations, and financial planning.
● Automation: Machine learning and AI reduce human effort by automating
tasks.
● Customer Insights: Improves customer understanding through behavior
analysis, leading to better personalization.
● Innovation: Enables development of new products and services by analyzing
large datasets.
● Competitive Advantage: Gives organizations a strategic edge over
competitors by utilizing data efficiently.
2. Describe the characteristics of overdetermined equation
systems.
Answer:
An overdetermined system has more equations than unknowns. This often arises
in real-world data, where there are more measurements or constraints than
variables.
Characteristics:
● Typically no exact solution exists (inconsistent system).
● Often used in regression analysis to find the best approximate solution
using least squares.
● Represented as: Ax = b where A is an m × n matrix with m > n.
Example: If we have 3 equations and 2 variables:
x + y = 2
2x + 3y = 5
4x + 5y = 6
→ Overdetermined system (3 equations, 2 unknowns)
3. Explain the difference between Data Science and Business
Analytics.
Answer:
| Feature | Data Science | Business Analytics |
|---------|--------------|--------------------|
| Focus | Technical, scientific approach to data | Business-driven insights and strategies |
| Techniques | Machine Learning, AI, Big Data | Descriptive & Predictive Analytics |
| Scope | Broader; includes data engineering, ML | Narrower; focused on decision-making |
| Tools | Python, R, TensorFlow | Excel, Power BI, Tableau, SQL |
| Objective | Discover patterns, build models | Understand business problems and solve them |
4. Differentiate between structured and unstructured data.
Answer:
| Feature | Structured Data | Unstructured Data |
|---------|-----------------|-------------------|
| Format | Organized in rows and columns (tables) | No predefined format |
| Storage | Relational Databases (SQL) | Data lakes, NoSQL, cloud storage |
| Examples | Excel sheets, SQL tables | Images, videos, audio, emails, PDFs |
| Ease of Analysis | Easy to analyze using traditional tools | Requires advanced processing (NLP, ML) |
5. Describe a scenario where the responsibilities of a data scientist
directly impact decision-making.
Answer:
Scenario: A retail company wants to optimize its inventory and avoid
overstocking.
Data Scientist’s Role:
● Analyze historical sales, seasonal demand, and customer preferences.
● Use predictive models to forecast future demand.
● Recommend inventory levels and reorder times.
Impact on Decision-Making:
● Helps management decide how much stock to keep.
● Reduces storage cost and wastage.
● Improves customer satisfaction by ensuring availability.
This shows how a data scientist contributes quantitatively to a critical business
function, impacting profits and operations directly.
6. Explain three types of data with examples. Differentiate among them.
Answer:
The three main types of data are:
1. Structured Data
● Definition: Data that is organized in a predefined format (rows and
columns), usually stored in relational databases.
Example:
| Name | Age | Salary |
|--------|-----|--------|
| Alice | 30 | 50000 |
| Bob | 28 | 60000 |
● Tools: SQL, Excel
2. Semi-Structured Data
● Definition: Data that doesn't reside in a traditional database but still has
some organizational properties (tags or markers).
Example:
{
"Name": "Alice",
"Age": 30,
"Salary": 50000
}
● Format: XML, JSON
3. Unstructured Data
● Definition: Data without a fixed structure. It’s typically raw, unorganized,
and requires preprocessing before analysis.
● Example: Images, videos, emails, PDFs, social media posts
| Feature | Structured | Semi-Structured | Unstructured |
|---------|------------|-----------------|--------------|
| Organization | Highly organized | Partially organized | Not organized |
| Storage | Relational Databases | NoSQL, JSON/XML files | Data lakes, file systems |
| Ease of Analysis | Easy | Moderate | Complex |
| Examples | SQL Tables | JSON/XML | Images, Audio, Video |
7. Demonstrate how the pseudo-inverse helps in solving an
overdetermined system of linear equations, and why it is
important.
Answer:
An overdetermined system has more equations than unknowns (m > n). Such
systems often have no exact solution, especially when inconsistent.
Let the system be:
Ax = b
Where:
● A is an m×n matrix (with m > n)
● x is an n×1 vector of variables
● b is an m×1 vector of constants
Since exact solutions often don’t exist, we approximate x such that the error
between Ax and b is minimized. This leads to least squares approximation.
Pseudo-Inverse Approach:
x* = A⁺b = (AᵀA)⁻¹Aᵀb (when AᵀA is invertible)
where A⁺ is the Moore–Penrose pseudo-inverse of A. This x* minimizes the squared error ‖Ax − b‖².
Why it is important:
● Stability: Works even when exact solutions are not possible.
● Efficiency: Widely used in machine learning for fitting linear regression
models.
● Generalization: Helps deal with inconsistent systems arising from
real-world data.
Example:
Solve using pseudo-inverse:
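🔹 A minimal numpy sketch of the approach, reusing the overdetermined system from Question 2 (np.linalg.pinv computes the Moore–Penrose pseudo-inverse):

```python
import numpy as np

# Overdetermined system from Question 2 (3 equations, 2 unknowns):
#    x +  y = 2
#   2x + 3y = 5
#   4x + 5y = 6
A = np.array([[1.0, 1.0],
              [2.0, 3.0],
              [4.0, 5.0]])
b = np.array([2.0, 5.0, 6.0])

# Least-squares solution x* = A+ b, which minimizes ||Ax - b||^2
x = np.linalg.pinv(A) @ b
print(x)      # best-fit values for x and y
print(A @ x)  # close to b; the residual is the least-squares error
```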
Mid Semester Test 2
Section A (2x5= 10)
1. State the purpose of the gradient descent algorithm in optimization.
2. Define data visualization and discuss its significance in data analysis.
3. Differentiate between predictive and prescriptive statistics with examples.
4. Determine the covariance between two datasets : X={2,4,6} and Y={3,6,9}.
5. Identify the role of Learning Rate in convergence of Gradient Descent
algorithm.
Section B (5x2= 10)
6. Apply hypothesis testing techniques to solve data analysis problems and
demonstrate the step-by-step process of conducting a hypothesis test.
7. Compare the three types of statistical measures and analyze their
applications in data analysis.
Solutions/Answers :
1. State the purpose of the gradient descent algorithm in
optimization.
Answer:
The purpose of the Gradient Descent algorithm is to find the minimum of a
function, commonly used in optimization problems in machine learning and deep
learning.
● In ML, it minimizes loss functions to improve model accuracy.
● It does this by iteratively adjusting parameters (weights) in the direction
of the steepest decrease of the function (negative gradient).
Example: In linear regression, gradient descent finds the optimal slope and
intercept that minimize the error between predicted and actual values.
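🔹 A minimal gradient-descent sketch for this linear-regression case (the data values are hypothetical):

```python
import numpy as np

# Hypothetical 1-feature data, roughly following y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 9.0])

w, b = 0.0, 0.0   # slope and intercept, initialized at zero
lr = 0.01         # learning rate (step size)

for _ in range(5000):
    y_hat = w * x + b
    # Gradients of the mean squared error with respect to w and b
    grad_w = (2 / len(x)) * np.sum((y_hat - y) * x)
    grad_b = (2 / len(x)) * np.sum(y_hat - y)
    # Step in the direction of the negative gradient
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # approaches the least-squares slope and intercept
```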
2. Define data visualization and discuss its significance in data
analysis.
Answer:
Data Visualization is the graphical representation of information and data using
visual elements like charts, graphs, maps, and plots.
Significance:
● Simplifies complex data and patterns
● Enhances understanding and communication of insights
● Speeds up decision-making
● Helps identify outliers, trends, and correlations
Example: A line graph showing the rise in global temperature over the years
quickly communicates climate trends.
3. Differentiate between predictive and prescriptive statistics with
examples.
Answer:

| Aspect | Predictive Statistics | Prescriptive Statistics |
|--------|-----------------------|-------------------------|
| Purpose | Forecast future outcomes | Suggest actions to achieve a desired outcome |
| Based On | Historical data and statistical models | Predictive models + optimization and simulation methods |
| Question It Answers | "What is likely to happen?" | "What should we do about it?" |
| Example | Predicting next month's sales based on trends | Recommending how many items to produce to maximize profit |
4. Determine the covariance between two datasets: X = {2, 4, 6}, Y
= {3, 6, 9}.
Answer:
Mean of X: X̄ = (2 + 4 + 6) / 3 = 4; Mean of Y: Ȳ = (3 + 6 + 9) / 3 = 6
Cov(X, Y) = ∑(Xᵢ − X̄)(Yᵢ − Ȳ) / n
= [(2 − 4)(3 − 6) + (4 − 4)(6 − 6) + (6 − 4)(9 − 6)] / 3
= (6 + 0 + 6) / 3 = 4
(With the sample divisor n − 1 instead, Cov(X, Y) = 12 / 2 = 6.)
5. Identify the role of Learning Rate in convergence of Gradient
Descent algorithm.
Answer:
The Learning Rate (η) controls how big a step the gradient descent algorithm
takes toward the minimum during each iteration.
Roles:
● A small learning rate ensures smooth convergence but takes longer.
● A large learning rate converges faster but risks overshooting the minimum
or diverging.
Balance is key: An ideal learning rate ensures the algorithm converges efficiently
without oscillation or divergence.
Illustration:
● If η is too small: The process is slow.
● If η is too large: The updates might jump over the minimum repeatedly or
never converge.
6. Apply hypothesis testing techniques to solve data analysis
problems and demonstrate the step-by-step process of conducting
a hypothesis test.
Answer:
Let’s say a company claims their product has an average weight of 500g, but a
competitor suspects it’s less.
We take a sample of 10 products and find the average weight = 490g, with
standard deviation = 15g.
We want to test at 5% significance level if the mean is less than 500g.
Step-by-step Hypothesis Testing:
Step 1: State the Hypotheses
● Null Hypothesis (H₀): μ = 500 (the mean is 500g)
● Alternative Hypothesis (H₁): μ < 500 (the mean is less than 500g)
Step 2: Select the Significance Level (α)
● α = 0.05 (5%)
Step 3: Calculate the Test Statistic
Use a t-test (since the population standard deviation is unknown and n < 30):
t = (x̄ − μ) / (s / √n) = (490 − 500) / (15 / √10) ≈ −2.11
Step 4: Determine the Critical Value
Degrees of freedom (df) = 10 - 1 = 9
From the t-distribution table, critical value at 5% (one-tailed) for df = 9 ≈ -1.833
Step 5: Make a Decision
● Since -2.11 < -1.833, we reject the null hypothesis.
Step 6: Conclusion
There is enough evidence at 5% level to conclude that the average weight is less
than 500g.
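🔹 The same test reproduced as a short Python sketch from the summary statistics above (scipy.stats is used only for the t-distribution):

```python
import math
from scipy import stats

n, xbar, s, mu0 = 10, 490.0, 15.0, 500.0

# Test statistic: t = (sample mean - hypothesized mean) / (s / sqrt(n))
t = (xbar - mu0) / (s / math.sqrt(n))
print(round(t, 2))  # -2.11

df = n - 1
p = stats.t.cdf(t, df)        # left-tailed p-value
crit = stats.t.ppf(0.05, df)  # critical value, ~ -1.833
print(round(p, 4), round(crit, 3))
# t < critical value (and p < 0.05), so H0 is rejected
```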
7. Compare the three types of statistical measures and analyze
their applications in data analysis.
Answer:
The three types of statistical measures are:
1. Measures of Central Tendency
● Includes Mean, Median, and Mode.
● Purpose: Describe the center or average of the data.
● Application: Used in summarizing salary data, average scores, etc.
Example: Average marks of students in a class.
2. Measures of Dispersion
● Includes Range, Variance, Standard Deviation, Interquartile Range (IQR).
● Purpose: Measure spread or variability in data.
● Application: Used to analyze risk, consistency in quality control, and
investment volatility.
Example: A low standard deviation in product weight shows consistent
manufacturing.
3. Measures of Shape
● Includes Skewness and Kurtosis.
● Purpose: Describe the symmetry and peakedness of data distribution.
● Application: Crucial in understanding distribution properties before
applying statistical models (especially in finance and machine learning).
Example: Positive skew indicates long tail on the right (income distributions often
show this).
Summary Table:
| Measure Type | Key Metrics | Purpose | Application Example |
|--------------|-------------|---------|---------------------|
| Central Tendency | Mean, Median, Mode | Central value | Average salary of employees |
| Dispersion | Std Dev, Variance, Range | Spread of data | Risk in stock returns |
| Shape | Skewness, Kurtosis | Shape of distribution | Assessing normality in predictive models |
Question Bank for UNIT-3: Machine Learning in Data Science
2-Marks Questions (12 Questions)
1. Define Machine Learning in the context of Data Science.
2. State any two real-life applications of Supervised Learning.
3. What is the role of training data in Machine Learning?
4. List two differences between supervised and unsupervised learning.
5. Mention any two commonly used Machine Learning algorithms.
6. What is the difference between classification and prediction in ML?
7. Define reinforcement learning with a simple example.
8. What is a labeled dataset? How is it used in supervised learning?
9. State any two differences between deep learning and traditional ML.
10. Why is machine learning important in today’s business environment?
11. Give an example where prediction is preferred over classification.
12. What does “learning from data” mean in ML?
5-Marks Questions (6 Questions)
1. Compare and contrast supervised, unsupervised, and reinforcement learning
using suitable examples.
2. Explain the significance of machine learning in modern business
decision-making with a relevant scenario.
3. Illustrate how classification is performed using any one ML algorithm (e.g.,
Decision Tree or KNN).
4. Differentiate between deep learning and machine learning in terms of
architecture, data requirements, and performance.
5. Discuss three key types of machine learning algorithms and their areas of
application.
6. Explain how machine learning supports personalization in online platforms
(e.g., Netflix or Amazon).
10-Marks Questions (6 Questions)
1. Explain in detail the various types of Machine Learning techniques
(Supervised, Unsupervised, Reinforcement, and Deep Learning).
2. Assume a dataset with customer transactions for a bank. Design a machine
learning approach to classify customers as ‘high risk’ or ‘low risk’.
3. Discuss the process of building a classification model using logistic
regression.
4. Elaborate on how machine learning is revolutionizing predictive analytics in
industries.
5. Design a use-case that combines supervised and unsupervised learning for a
real-world business scenario.
6. Differentiate classification and prediction using appropriate datasets.