AIML MOD1
2 Marks
1. Define Artificial Intelligence (AI).
Artificial Intelligence is the simulation of human intelligence processes by machines, especially
computer systems, to perform tasks such as learning, reasoning, problem-solving, and
understanding language.
2. What are the key goals of AI?
The key goals of AI are to create systems that can:
• Learn from experience
• Understand natural language
• Reason and solve problems
• Perceive and interact with the environment
• Mimic human intelligence and behavior
3. Name any three subfields of AI.
• Machine Learning
• Natural Language Processing
• Computer Vision
4. What is the Turing Test?
The Turing Test, proposed by Alan Turing, is a method to determine a machine’s ability to exhibit
intelligent behavior equivalent to, or indistinguishable from, that of a human.
5. Mention two ethical concerns in AI.
• Job displacement due to automation
• Bias and discrimination in AI decision-making
6. What is heuristic search?
Heuristic search is a search strategy that uses a heuristic (an estimate or rule of thumb) to make
decisions and find solutions more efficiently than blind search methods.
7. Differentiate between uninformed and informed search.
• Uninformed search uses no domain-specific knowledge (e.g., Breadth-First Search).
• Informed search uses problem-specific knowledge (heuristics) to find solutions faster (e.g.,
A* Search).
8. Define optimization in AI.
Optimization in AI refers to the process of finding the best solution from all feasible solutions, often
by maximizing or minimizing a particular objective function.
9. What is game theory in AI?
Game theory in AI studies strategies and decision-making in competitive environments where
multiple agents interact and each agent's outcome depends on the actions of others.
10. State the difference between propositional logic and predicate logic.
• Propositional logic deals with statements that are either true or false.
• Predicate logic (First-Order Logic) includes variables and quantifiers, allowing more
expressive statements involving objects and their properties.
11. What is a knowledge base in AI?
A knowledge base is a collection of facts, rules, and information used by an AI system to make
decisions and infer new information.
12. Give an example of a propositional logic statement.
Example: “If it rains, then the ground is wet.” (Symbolically: R → W)
13. What is backward chaining in logic?
Backward chaining is a reasoning method that starts with a goal and works backward to find facts or
rules that support the goal, commonly used in expert systems.
14. Define first-order logic (FOL).
First-order logic is a formal logical system that expresses relationships between objects using
quantifiers, variables, predicates, and logical connectives.
Five-Mark (Short Explanatory) questions:
1. Briefly discuss the history and evolution of AI.
The concept of AI dates back to ancient myths, but the formal field began in 1956 at the Dartmouth
Conference. Early AI focused on symbolic methods and problem-solving. In the 1980s, expert
systems became popular. The 1990s saw improvements in machine learning and robotics. From the
2010s onward, AI experienced rapid growth due to advances in deep learning, big data, and powerful
computing, leading to applications in image recognition, language translation, and autonomous
systems.
2. Explain the role of AI in modern problem-solving.
AI assists in solving complex problems by simulating human thinking. It is used in healthcare for
diagnosis, in finance for fraud detection, in transportation for route planning, and in customer service
through chatbots. AI can analyze large datasets, identify patterns, and make predictions, helping in
decision-making across various domains.
3. What are the ethical challenges of AI? Give two examples.
Ethical challenges in AI include:
• Bias and Discrimination: AI systems trained on biased data can lead to unfair treatment,
such as biased hiring algorithms.
• Privacy Concerns: AI can collect and misuse personal data, such as facial recognition
systems used without consent.
4. Explain the difference between weak AI and strong AI with examples.
• Weak AI is designed for specific tasks and cannot think or understand like humans (e.g., Siri,
Google Assistant).
• Strong AI aims to have general intelligence like humans, capable of reasoning, learning, and
understanding any task (e.g., a hypothetical AI with human-level cognition, which doesn't exist
yet).
5. Describe the components of an AI system.
An AI system typically includes:
• Perception: Sensing the environment (e.g., cameras, sensors)
• Knowledge Representation: Storing and managing data
• Reasoning and Inference: Drawing conclusions from facts
• Learning: Improving performance from data
• Planning and Decision-making: Choosing actions to achieve goals
• Action: Executing tasks through actuators or software
6. Compare depth-first search and breadth-first search.
• Depth-First Search (DFS): Explores as far as possible along one path before backtracking.
Uses a stack.
• Breadth-First Search (BFS): Explores all nodes at the current depth before going deeper. Uses
a queue.
DFS is memory-efficient but may not find the shortest path. BFS finds the shortest path but uses
more memory.
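A minimal Python sketch (on a small, hypothetical adjacency-list graph) contrasting the queue-based BFS with the stack-based DFS described above:

from collections import deque

def bfs(graph, start, goal):
    # Breadth-first search: expand level by level using a FIFO queue of paths.
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path                      # shallowest solution found first
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append(path + [neighbour])
    return None

def dfs(graph, start, goal):
    # Depth-first search: follow one branch to the end using a LIFO stack of paths.
    frontier = [[start]]
    visited = set()
    while frontier:
        path = frontier.pop()
        node = path[-1]
        if node == goal:
            return path                      # a solution, not necessarily the shortest
        if node in visited:
            continue
        visited.add(node)
        for neighbour in graph.get(node, []):
            frontier.append(path + [neighbour])
    return None

graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
print(bfs(graph, 'A', 'D'))                  # ['A', 'B', 'D']
print(dfs(graph, 'A', 'D'))                  # e.g. ['A', 'C', 'D']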
7. What is the significance of heuristic functions in search algorithms?
Heuristic functions estimate the cost to reach the goal from a node. They guide search algorithms
like A*, making them faster and more efficient by prioritizing paths likely to lead to the goal. Good
heuristics reduce computation and improve performance.
8. Explain the Minimax algorithm in game theory.
The Minimax algorithm is used in two-player games to minimize the possible loss for a worst-case
scenario. One player aims to maximize their score, while the other tries to minimize it. It evaluates
possible moves by simulating all potential outcomes and chooses the optimal strategy based on
assumed rationality of opponents.
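A minimal recursive Minimax sketch; the game object (with moves, result, is_terminal and evaluate) is an assumed interface for illustration, not a specific library:

def minimax(state, depth, maximizing, game):
    # Returns the best achievable score for the maximizing player, assuming the
    # opponent plays optimally. `game` is an assumed interface providing
    # is_terminal(state), evaluate(state), moves(state) and result(state, move).
    if depth == 0 or game.is_terminal(state):
        return game.evaluate(state)
    scores = (minimax(game.result(state, m), depth - 1, not maximizing, game)
              for m in game.moves(state))
    return max(scores) if maximizing else min(scores)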
9. How does optimization contribute to AI problem-solving?
Optimization helps AI find the best solution from many possibilities, improving accuracy and
efficiency. It’s used in training machine learning models, scheduling, resource allocation, and
pathfinding. AI systems often use optimization techniques to maximize performance and minimize
error or cost.
10. Differentiate between propositional logic and predicate logic with examples.
• Propositional Logic: Deals with simple statements like "It is raining" (R). Limited in expressing
relationships.
• Predicate Logic: Expresses statements involving objects and their properties, like "All humans
are mortal" → ∀x(Human(x) → Mortal(x)). It is more expressive and suitable for real-world
knowledge representation.
11. Explain the importance of knowledge representation in AI.
Knowledge representation is how AI stores and organizes information. It enables reasoning,
inference, and decision-making. Proper representation allows AI to mimic human understanding and
solve problems effectively using rules, facts, or structured data.
12. What are the limitations of propositional logic?
Propositional logic:
• Cannot express relationships between objects
• Lacks quantifiers like "all" or "some"
• Becomes inefficient for large knowledge bases
• Unsuitable for representing complex real-world scenarios
13. Describe the role of inference in logical reasoning.
Inference is the process of deriving new facts or conclusions from known information. It allows AI
systems to apply logical rules to knowledge bases and deduce outcomes, make decisions, and solve
problems. For example, using Modus Ponens: If A → B and A is true, then B must be true.
14. How does predicate logic help in representing real-world knowledge?
Predicate logic allows representation of facts involving objects, attributes, and relationships using
variables and quantifiers. For example, "All students study" can be written as ∀x(Student(x) →
Studies(x)). This makes it suitable for modeling complex domains like expert systems.
15. Discuss the significance of planning in AI with an example.
Planning enables AI to decide a sequence of actions to achieve a goal. It is crucial in robotics,
logistics, and game playing. For example, a robot vacuum plans a path to clean a room efficiently
while avoiding obstacles. AI planning involves defining initial states, goals, and available actions.
Ten-Mark Questions -
1. Discuss the history and evolution of AI, highlighting major milestones.
Artificial Intelligence (AI) has evolved significantly since its inception:
• 1940s–1950s: Theoretical groundwork laid by Alan Turing, who proposed the concept of a
"universal machine" (Turing Machine) and introduced the Turing Test in 1950.
• 1956 (Dartmouth Conference): The term "Artificial Intelligence" was coined. Early pioneers
like John McCarthy, Marvin Minsky, and Allen Newell laid foundational theories.
• 1950s–1970s (Early AI): Development of symbolic reasoning, problem-solving systems like
Logic Theorist and General Problem Solver (GPS). AI was rule-based and domain-specific.
• 1980s (Expert Systems): AI flourished through systems like MYCIN for medical diagnosis.
These systems used a knowledge base and inference engine.
• 1990s: Emphasis shifted to statistical methods and machine learning. IBM’s Deep Blue
defeated world chess champion Garry Kasparov in 1997.
• 2000s–2010s: Rise of big data and deep learning. Speech recognition (e.g., Siri, Google Now)
and computer vision systems became feasible.
• 2010s–present: Development of transformer-based models (like GPT), autonomous
vehicles, and AI applications in healthcare, finance, and robotics.
2. Explain the different subfields of AI and their real-world applications.
1. Machine Learning (ML):
Enables machines to learn from data.
Example: Spam email detection, recommendation systems.
2. Natural Language Processing (NLP):
Allows computers to understand and generate human language.
Example: Chatbots, translation services.
3. Computer Vision:
Enables machines to interpret visual information.
Example: Facial recognition, medical image analysis.
4. Robotics:
AI-driven machines capable of performing tasks.
Example: Industrial robots, autonomous drones.
5. Expert Systems:
Rule-based systems that simulate human expertise.
Example: Medical diagnosis systems.
6. Planning and Scheduling:
AI makes decisions about future actions.
Example: Autonomous vehicle route planning.
7. Reinforcement Learning:
Learning via trial-and-error to maximize rewards.
Example: AlphaGo, robot navigation.
3. What are the ethical concerns in AI? Discuss their impact on society.
1. Bias and Discrimination:
AI systems can inherit biases from training data, leading to unfair outcomes (e.g., in hiring or lending).
2. Privacy Violations:
AI in surveillance or data analysis can infringe on individual privacy.
3. Job Displacement:
Automation through AI threatens many traditional jobs, especially in manufacturing and customer
service.
4. Accountability and Transparency:
It is difficult to trace decisions made by complex AI models (black-box issue).
5. Weaponization:
Use of AI in autonomous weapons raises serious ethical concerns.
Impact on Society:
While AI improves efficiency and innovation, unchecked development could worsen inequality,
reduce trust in systems, and pose risks to democratic values.
4. Compare and contrast uninformed search and informed search strategies,
giving examples.
Feature | Uninformed Search | Informed Search
Uses domain knowledge | No | Yes (uses heuristics)
Efficiency | Slower, explores all options | Faster, explores promising paths
Examples | BFS, DFS, Uniform Cost Search | Greedy Search, A* Search
Optimality | Depends on the algorithm | Often more optimal with a good heuristic
Memory usage | Can be high (e.g., BFS) | Can be optimized depending on heuristic
Example:
• BFS finds the shortest path in an unweighted graph but checks all neighbors.
• A* search uses estimated costs to focus only on promising paths.
5. Explain A* search algorithm with an example.
A* Search combines cost from start node to current node (g(n)) and estimated cost from current
node to goal (h(n)) using:
f(n) = g(n) + h(n)
Example:
In a graph:
• Start = A, Goal = G
• A → B (cost 2), B → G (cost 3)
• h(A) = 5, h(B) = 2, h(G) = 0
Then:
• f(A) = 0 + 5 = 5
• f(B) = 2 + 2 = 4
• f(G) = 5 + 0 = 5
A* selects the path with the lowest f(n). It’s complete and optimal if the heuristic is admissible (never
overestimates).
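A minimal A* sketch in Python using a priority queue ordered by f(n) = g(n) + h(n), run on the small example above:

import heapq

def a_star(graph, h, start, goal):
    # graph: {node: [(neighbour, step_cost), ...]}, h: heuristic estimate per node.
    frontier = [(h[start], 0, start, [start])]       # entries are (f, g, node, path)
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for neighbour, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(neighbour, float('inf')):
                best_g[neighbour] = new_g
                heapq.heappush(frontier, (new_g + h[neighbour], new_g,
                                          neighbour, path + [neighbour]))
    return None, float('inf')

# The small example above: A -> B (cost 2), B -> G (cost 3)
graph = {'A': [('B', 2)], 'B': [('G', 3)], 'G': []}
h = {'A': 5, 'B': 2, 'G': 0}
print(a_star(graph, h, 'A', 'G'))                    # (['A', 'B', 'G'], 5)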
6. Describe the role of optimization in AI and explain different optimization
techniques.
Role of Optimization:
AI uses optimization to find the best solution (e.g., lowest cost, highest accuracy) among many
possible choices.
Applications:
• Training ML models (minimize loss)
• Path planning in robotics
• Resource allocation
Techniques:
1. Gradient Descent:
Iteratively adjusts parameters to minimize a loss function in ML (see the sketch after this list).
2. Genetic Algorithms:
Uses natural selection concepts to evolve solutions.
3. Simulated Annealing:
Mimics cooling process of metals to escape local minima.
4. Linear Programming:
Solves problems where objective and constraints are linear.
5. Constraint Satisfaction:
Optimizes solutions under given constraints.
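A minimal sketch of the first technique above, gradient descent, on an illustrative one-parameter loss; the loss function and learning rate are assumptions made for the example:

def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Repeatedly step against the gradient to reduce the loss.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize the illustrative loss f(x) = (x - 3)^2, whose gradient is 2(x - 3).
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))   # converges to ~3.0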
7. Discuss game theory in AI and explain its significance with examples.
Game Theory studies strategies in competitive situations involving two or more rational agents.
Applications in AI:
• Multi-agent systems
• Economic modeling
• Strategic decision-making
Key Concepts:
• Minimax Algorithm: Used in turn-based games (e.g., chess) to minimize opponent’s
maximum gain.
• Nash Equilibrium: Stable state where no player can benefit by changing strategy.
Example: In a two-player game, AI evaluates each move and its opponent’s responses. Minimax
ensures the AI selects the move with the least possible loss, assuming the opponent plays optimally.
8. Explain propositional logic and predicate logic, giving their syntax, semantics,
and applications.
Aspect | Propositional Logic | Predicate Logic (FOL)
Syntax | Propositions (P, Q), operators (∧, ∨, ¬, →) | Predicates, functions, variables, quantifiers
Semantics | Assigns true/false to propositions | Assigns truth based on objects and relations
Expressiveness | Limited | Higher (can represent relationships)
Example | P → Q | ∀x(Student(x) → Studies(x))
Applications | Simple rule-based systems | Knowledge representation, expert systems
9. What are knowledge representation techniques in AI? Explain with examples.
Knowledge Representation (KR) allows AI to understand, reason, and act using information.
Techniques:
1. Semantic Networks:
Graph of nodes (concepts) and edges (relationships).
Example: "Cat → is-a → Animal"
2. Frames:
Structured templates representing stereotyped situations.
Example: A "Car" frame with slots like "Wheels: 4", "Engine: petrol"
3. Rules (IF-THEN):
Used in expert systems.
Example: IF patient has fever AND rash → THEN suspect measles
4. Ontologies:
Formal representation of knowledge with categories and relations.
Used in the Semantic Web.
5. Logic-based KR:
Uses propositional or predicate logic to represent facts and rules.
10. Discuss the importance of logical reasoning in AI and compare forward and
backward chaining.
Logical Reasoning enables AI to infer new information from existing knowledge. It forms the basis of
decision-making in expert systems and automated reasoning.
Importance:
• Enables inference
• Allows AI to explain conclusions
• Supports planning and diagnostics
Forward Chaining:
• Starts from known facts, applies rules to infer new facts
• Example: IF A → B, and A is true → infer B
• Data-driven
Backward Chaining:
• Starts from a goal, works backward to verify conditions
• Example: To prove B, check if A is true and A → B
• Goal-driven
Comparison:
Feature | Forward Chaining | Backward Chaining
Direction | From facts to conclusion | From goal to facts
Use case | Diagnosis systems | Expert systems, theorem proving
Control flow | Data-driven | Goal-driven
MOD 2
2.1
Show that a rational agent’s action may depend not just on the state of the
environment but also on the time step it has reached.
A rational agent chooses actions that maximize performance. If the performance
measure is concerned with only the first T steps, then the time step becomes crucial.
For example, if an action is beneficial in early time steps but detrimental later, a
rational agent should act accordingly. Hence, the action chosen must consider when it
is being executed, not just the current state. For example, a vacuum-cleaner agent
might skip a slightly dirty spot near the end of the time frame if cleaning it takes too
long and lowers overall performance.
2.2
a.
Show that the simple vacuum-cleaner agent function described in Figure 2.3 is
rational under the assumptions on page 38.
The simple vacuum cleaner agent:
• Cleans if the current square is dirty.
• Else, moves left or right arbitrarily.
Under assumptions (fully observable, deterministic, static, clean-only goal), this
behavior maximizes performance by always cleaning a dirty square. It’s rational
because it always takes an action that improves its environment according to the goal.
b.
Describe a rational agent function for the case in which each movement costs one
point.
A rational agent must now minimize movement while maximizing cleaning. The agent
should:
• Clean when on a dirty square.
• Avoid unnecessary moves.
• Possibly keep track of cleaned/visited areas.
This requires internal state to remember where it has been and avoid returning.
Without internal state, it may waste moves revisiting clean areas, reducing its
performance score.
c.
Discuss agent designs for unknown geography and dirty squares. Should the
agent learn?
Yes, learning is necessary. In an unknown environment:
• The agent should explore to build a map.
• It must learn where dirt tends to accumulate.
• It can apply reinforcement learning to adapt over time.
If it does not learn, it may repeatedly revisit areas or miss parts of the environment. A
learning agent improves coverage and efficiency over time.
2.3
For each assertion, mark True or False with reasoning:
a. False.
An agent with partial information can still be rational if it makes the best decision
based on what it knows and maximizes expected performance (e.g., partially
observable environments with probabilistic reasoning).
b. True.
In some environments, reflex agents (acting only on current percept) can’t behave
rationally because past history or unobservable state is needed for correct decisions
(e.g., maze navigation).
c. False.
Not every agent in a task environment is rational. Some may behave irrationally due to
poor design, incomplete information, or randomness.
d. False.
The input to the agent program is a percept, but the agent function maps percept
histories to actions. Thus, the agent program may do more (e.g., maintain state), while
the agent function is an abstract mapping.
e. True.
By the Church-Turing thesis, every agent function can be implemented by some
program on a machine, although some may be inefficient or impractical.
f. True.
Even a random agent can be rational in a deterministic environment if randomness
happens to yield optimal actions (though it’s rare). The definition of rationality is based
on expected performance, not process.
g. True.
It’s possible for the same agent to be perfectly rational in two different environments if
its behavior matches the optimal policy in both (e.g., when task goals and
environments align well).
h. False.
In an unobservable environment, an agent can't be truly rational without
assumptions or probabilities, because it can’t distinguish states, leading to potentially
suboptimal decisions.
i. False.
A perfectly rational poker agent may still lose due to chance and opponent
randomness, but it will maximize its expected utility over time, not win every game.
2.4 – PEAS Descriptions (5 marks each)
Activity | Performance Measure | Environment | Actuators | Sensors
Playing soccer | Goals scored, teamwork, penalties avoided | Field, players, ball | Legs, head, eyes | Vision, GPS, microphone, ball detector
Exploring the subsurface oceans of Titan | Data collected, area mapped, time efficiency | Water, rocks, ocean floor, pressure | Propellers, arms, lights | Sonar, camera, pressure, temperature
Shopping for AI books online | Price, relevance, speed, reviews | Websites, catalogues, databases | Mouse, keyboard | Screen, user profile, cookies
Playing a tennis match | Points won, faults avoided, endurance | Court, opponent, weather | Arm, legs, eyes | Vision, ball tracking, audio
Practicing tennis against a wall | Number of returns, speed, accuracy | Court, wall | Arm, legs | Ball sensor, camera
Performing a high jump | Height cleared, form, take-off efficiency | Track, bar | Legs, body | Vision, balance sensor
Knitting a sweater | Pattern accuracy, efficiency, neatness | Needles, yarn | Arms, fingers | Vision, touch
Bidding at an auction | Items won, money spent, timing | Auction environment | Mouse/keyboard, speech | Auction interface, price display, clock
2.5 – Definitions (2 marks each)
• Agent: An entity that perceives its environment through sensors and acts upon it
through actuators.
• Agent Function: A mapping from percept sequences to actions.
• Agent Program: The software that implements the agent function.
• Rationality: Doing the "right thing" based on performance measure, percepts,
knowledge, and actions.
• Autonomy: Ability to operate without human intervention and learn from
experiences.
• Reflex Agent: Responds to current percepts without considering percept history.
• Model-Based Agent: Uses internal models to track aspects of the world not
directly observable.
• Goal-Based Agent: Takes actions to achieve specific goals.
• Utility-Based Agent: Chooses actions based on a utility function (degree of
happiness or success).
• Learning Agent: Improves its performance through learning from past actions
and outcomes.
2.6 – Agent Functions vs Agent Programs (5 marks each)
a.
Yes, more than one agent program can implement the same agent function.
Example: Two programs may use different data structures (array vs linked list) but
implement the same input-output mapping. The function (behavior) is the same, but
internal implementation differs.
b.
Yes, some agent functions cannot be implemented.
There are infinitely many agent functions, but only countably many agent programs
(due to finite machines and code). So, some functions cannot be implemented—this is
a limitation of computational resources.
c.
Yes, each agent program on a fixed architecture implements exactly one agent
function.
Given the same percept sequence, the program will always produce the same action.
Thus, it defines one function uniquely.
e.
No, speeding up the machine does not change the agent function.
The output for a given input remains the same—it just happens faster. Since the
function is defined by input-output behavior, time does not affect it unless time is part
of the percept.
2.7 – Pseudocode for Goal-Based and Utility-Based Agent Programs (10
marks)
Goal-Based Agent Pseudocode (Vacuum Cleaner Example):

def GoalBasedAgent(percept):
    # percept = (location, status) in the two-square vacuum world
    location, status = percept
    if status == 'Dirty':
        return 'Suck'          # goal: make every square clean
    elif location == 'A':
        return 'Right'
    else:
        return 'Left'

Utility-Based Agent Pseudocode:

def evaluate(direction):
    # Placeholder utility estimate for moving in the given direction,
    # e.g., expected dirtiness of the destination minus the cost of moving.
    return 1

def UtilityBasedAgent(percept):
    location, status = percept
    if status == 'Dirty':
        return 'Suck'          # cleaning yields the highest utility
    # Evaluate utility of moving based on expected dirtiness
    if location == 'A':
        return 'Right' if evaluate('Right') > 0 else 'NoOp'
    else:
        return 'Left' if evaluate('Left') > 0 else 'NoOp'
1. Differentiate between BFS and DFS. (2 Marks)
• BFS (Breadth-First Search) explores nodes level by level, starting from the root
and moving outward.
• DFS (Depth-First Search) goes as deep as possible along a branch before
backtracking.
Key Difference: BFS uses a queue (FIFO) and is complete and optimal for unweighted
graphs, while DFS uses a stack (LIFO) and may not be complete or optimal.
2. Can BFS be used for solving puzzles like the 8-puzzle problem? Why or why not?
(5 Marks)
Yes, BFS can be used for the 8-puzzle as it systematically explores all possible moves
level by level.
Pros: It guarantees the shortest path to the solution.
Cons: It consumes a lot of memory and time due to the large branching factor in such
puzzles.
Hence, it’s theoretically usable but not practical for complex puzzles due to high
space complexity.
3. Differentiate between Uninformed and Informed search. (2 Marks)
• Uninformed Search (Blind): No additional information; examples include BFS
and DFS.
• Informed Search (Heuristic): Uses problem-specific knowledge; examples
include A* and Greedy Best-First Search.
Key Point: Informed search is generally more efficient due to heuristic guidance.
4. Differentiate between the following: (2 Marks each × 5 = 10 Marks)
a. Partially observable vs. Fully observable:
• Fully observable: Agent can see the complete state (e.g., chess).
• Partially observable: Some information is hidden (e.g., poker).
b. Static vs. Dynamic environment:
• Static: Environment does not change while thinking (e.g., crossword puzzle).
• Dynamic: Environment can change (e.g., driving a car).
c. Cooperative vs. Competitive task environment:
• Cooperative: Multiple agents working together (e.g., robot soccer team).
• Competitive: Agents compete (e.g., chess, auctions).
d. Single agent vs. Multiagent environment:
• Single agent: Only one agent is involved (e.g., pathfinding robot).
• Multiagent: Multiple agents interact (e.g., autonomous cars on a road).
e. Agent function vs. Agent program:
• Agent function: Maps percepts to actions.
• Agent program: Concrete implementation (code) of the agent function.
5. What is the difference between a state space and a search tree? (2 Marks)
• State Space: The set of all possible states reachable from the initial state.
• Search Tree: A tree representation of how the algorithm explores the state
space. Nodes may be repeated in a search tree.
6. What does it mean for a search algorithm to be complete? (2 Marks)
A search algorithm is complete if it guarantees finding a solution when one exists.
7. Calculate the time complexity of breadth-first search. (2 Marks)
Time complexity of BFS = O(b^d)
where b = branching factor, d = depth of the shallowest solution.
8. Explain the difference between breadth-first search and depth-first search. (5
Marks)
• BFS:
o Explores level-wise.
o Uses a queue.
o Complete and optimal for unweighted graphs.
o Time & space complexity: O(b^d).
• DFS:
o Explores depth-wise.
o Uses a stack or recursion.
o Not complete (in infinite depth), not optimal.
o Time complexity: O(b^m), m = max depth.
9. What is the role of the evaluation function in A* search? (5 Marks)
The evaluation function in A* is f(n) = g(n) + h(n), where:
• g(n) = cost to reach node n.
• h(n) = heuristic estimate of cost to goal from n.
Role: It guides the search toward the goal efficiently by balancing path cost and
heuristic.
10. Mention some of the challenges and limitations of search algorithms? (5
Marks)
• High time and space complexity (e.g., BFS).
• Choosing an effective heuristic (for informed search).
• Handling dynamic or partially observable environments.
• Risk of infinite loops (e.g., DFS in cyclic graphs).
• Scalability to large problem spaces.
11. Explore the potential advantages of using bidirectional search? (5 Marks)
• Searches from both start and goal simultaneously.
• Reduces search time from O(b^d) to O(b^(d/2)).
• More efficient in memory and processing for large search spaces.
Example: Useful in pathfinding applications like GPS.
12. Write down the factors based on which a search algorithm’s performance is
measured. (5 Marks)
• Completeness – Guarantees solution?
• Optimality – Is the found solution the best one?
• Time Complexity – Time taken to find solution.
• Space Complexity – Memory usage.
• Scalability – Performance on large problems.
13. Uninformed search is called blind search – Explain. (5 Marks)
Uninformed search lacks domain knowledge; it explores blindly without guidance.
Examples: BFS, DFS.
Called "blind" because it does not use problem-specific information like a heuristic to
decide which path to explore.
14. Explain how the A* algorithm balances completeness and optimality while searching
for the shortest path in a graph. (10 Marks)
A* search uses f(n) = g(n) + h(n) to balance:
• g(n): Actual cost so far (ensures completeness).
• h(n): Heuristic estimate (guides search to goal efficiently).
If h(n) is admissible and consistent, A* is both complete and optimal.
It avoids unnecessary paths and finds the least-cost solution.
15. Describe the role of the heuristic function in the A* algorithm. What are the
properties of an admissible heuristic? (10 Marks)
• Role of h(n):
Estimates cost from node n to the goal. Helps prioritize promising paths.
• Admissible heuristic:
Never overestimates the actual cost to reach the goal.
i.e., h(n) ≤ h*(n) where h*(n) is the true cost.
• Properties:
o Non-negative.
o Admissibility ensures optimality in A*.
o Should be as close as possible to h* for efficiency.
16. Given a graph with specified edge weights and heuristic values, apply the A*
algorithm to find the shortest path from a given start node to a goal node. (10 Marks)
To apply A* algorithm:
• Use f(n) = g(n) + h(n) for each node.
• g(n): Cost from start to current node.
• h(n): Heuristic estimate to goal.
• Expand the node with the lowest f(n) value.
• Continue until the goal node is reached.
Example:
Given graph nodes A, B, C, D with edge costs and heuristics:
• A → B (cost 1), A → C (cost 4), B → D (cost 5), C → D (cost 1)
• h(B) = 3, h(C) = 2, h(D) = 0
Start: A, Goal: D
• A* first expands B (f(B) = 1 + 3 = 4), but the full path A → B → D costs 1 + 5 = 6; expanding C (f(C) = 4 + 2 = 6) reaches D with total cost 4 + 1 = 5, so the optimal path is A → C → D with cost 5.
17. Discuss the effect of overestimating or underestimating the heuristic function in the
A* algorithm. Provide examples to support your explanation. (10 Marks)
• Underestimating (Admissible heuristic):
A* guarantees optimality.
Example: h(n) = 0 → becomes Dijkstra’s algorithm.
• Overestimating:
A* may miss the optimal path.
Example: If h(n) > actual cost, A* may pick a suboptimal route to the goal.
Conclusion: Heuristics must be admissible and consistent for A* to be optimal.
18. Explain the concept of Hill Climbing in artificial intelligence. What are its
advantages and disadvantages? (10 Marks)
• Hill Climbing: Local search algorithm that moves in the direction of increasing
value (better states).
Advantages:
o Simple to implement.
o Requires less memory.
Disadvantages:
o May get stuck in local maxima, plateaus, or ridges.
o No backtracking or memory of previous states.
19. Differentiate between Simple Hill Climbing, Steepest-Ascent Hill Climbing,
and Stochastic Hill Climbing with examples. (10 Marks)
• Simple Hill Climbing: Evaluates neighboring states and selects the first better
one.
• Steepest-Ascent Hill Climbing: Examines all neighbors and picks the best.
• Stochastic Hill Climbing: Picks randomly among better neighbors.
Example:
Climbing a mountain with fog:
• Simple: Take first upward step.
• Steepest-Ascent: Scan around and pick steepest.
• Stochastic: Choose one of the better upward paths randomly.
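A minimal sketch combining the three variants on a toy one-dimensional objective; the neighbour and value functions below are illustrative assumptions:

import random

def hill_climb(state, neighbours, value, variant='steepest'):
    # Keep moving to a better neighbour until no neighbour improves the value.
    while True:
        better = [n for n in neighbours(state) if value(n) > value(state)]
        if not better:
            return state                          # local (possibly global) maximum
        if variant == 'simple':
            state = better[0]                     # first improving neighbour
        elif variant == 'stochastic':
            state = random.choice(better)         # random improving neighbour
        else:
            state = max(better, key=value)        # steepest-ascent: best neighbour

# Toy objective: maximize f(x) = -(x - 4)^2 over integers, moving by +/- 1.
print(hill_climb(0, lambda x: [x - 1, x + 1], lambda x: -(x - 4) ** 2))   # 4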
20. Compare Hill Climbing with A* in terms of their approach, efficiency, and limitations
in solving optimization problems. (10 Marks)
Aspect | Hill Climbing | A* Search
Uses heuristics | Yes | Yes
Path tracking | No | Yes
Memory usage | Low | High
Completeness | No | Yes (if admissible h(n))
Optimality | No | Yes
Efficiency | Fast but risky | Slower but reliable
Hill climbing is efficient but can get stuck in local optima, while A* is more robust and
optimal with the right heuristic.
21. Explain the concept of Bidirectional Search and how it improves efficiency
compared to Unidirectional Search. (10 Marks)
• Bidirectional Search: Simultaneously searches from start and goal.
• Halts when frontiers meet.
• Reduces time complexity from O(b^d) to O(b^(d/2)).
Efficiency: Less space and time than unidirectional BFS.
Example: Pathfinding in graphs.
22. Describe the conditions under which Bidirectional Search is most effective.
Provide examples to illustrate your answer. (10 Marks)
Effective when:
• The goal state is known.
• The search space is symmetric.
• Both directions have similar branching factors.
Example:
Finding shortest path in undirected graphs like road maps.
Not suitable when the goal is not clearly defined or in asymmetric environments.
23. Compare and contrast Bidirectional Search with A* Search in terms of space
complexity and efficiency. (10 Marks)
Criteria | Bidirectional Search | A* Search
Direction | Two (start & goal) | One
Space | O(b^(d/2)) | O(b^d)
Heuristics | Not required | Required
Optimality | Yes | Yes (if h is admissible)
Use case | Symmetric problems | Heuristic-driven problems
Conclusion: A* is better for heuristic-informed problems; Bidirectional is space-efficient for known goals.
24. What are the challenges of implementing Bidirectional Search in real-world
applications? How can they be addressed? (10 Marks)
Challenges:
• Identifying the meeting point of frontiers.
• Different state representations in forward/backward search.
• Goal state may not be known in advance.
Solutions:
• Use hash tables for checking visited nodes.
• Normalize state representations.
• Apply in environments where goal is clearly defined.
25. Given a sample graph, demonstrate how Bidirectional Search finds the
shortest path between two nodes. (10 Marks)
Steps:
1. Start BFS from both start and goal nodes.
2. Expand nodes level-wise.
3. When a node appears in both frontiers, stop.
4. Combine the two paths.
Example:
Start: A, Goal: G
Forward search: A → B → D
Backward search: G → F → D
Meeting at D gives path A → B → D → F → G.
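A minimal bidirectional BFS sketch on an undirected adjacency-list graph, reproducing the example above (the edges used are only those mentioned in the example):

from collections import deque

def bidirectional_search(graph, start, goal):
    # Bidirectional BFS on an undirected graph {node: [neighbours]}.
    if start == goal:
        return [start]
    parents_fwd, parents_bwd = {start: None}, {goal: None}
    queue_fwd, queue_bwd = deque([start]), deque([goal])

    def join(meet):
        # Path from start to the meeting node, then onward to the goal.
        path, node = [], meet
        while node is not None:
            path.append(node)
            node = parents_fwd[node]
        path.reverse()
        node = parents_bwd[meet]
        while node is not None:
            path.append(node)
            node = parents_bwd[node]
        return path

    while queue_fwd and queue_bwd:
        # Expand one node from each frontier per iteration.
        for queue, parents, other in ((queue_fwd, parents_fwd, parents_bwd),
                                      (queue_bwd, parents_bwd, parents_fwd)):
            node = queue.popleft()
            for nb in graph.get(node, []):
                if nb not in parents:
                    parents[nb] = node
                    if nb in other:               # the two frontiers meet here
                        return join(nb)
                    queue.append(nb)
    return None

# The example above: the frontiers meet at D.
graph = {'A': ['B'], 'B': ['A', 'D'], 'D': ['B', 'F'], 'F': ['D', 'G'], 'G': ['F']}
print(bidirectional_search(graph, 'A', 'G'))      # ['A', 'B', 'D', 'F', 'G']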
26. Compare Breadth-First Search (BFS) with Bidirectional Search in terms of
memory usage and time complexity. (10 Marks)
Aspect | BFS | Bidirectional Search
Time Complexity | O(b^d) | O(b^(d/2))
Space Complexity | O(b^d) | O(b^(d/2))
Frontier Size | Large | Smaller
Application | General | Suitable when goal is known
Conclusion: Bidirectional search is more efficient in space and time but needs more control logic and goal knowledge.
1. What is the role of regression model in exploratory data analysis? (2 marks)
Answer:
Regression models help identify and quantify relationships between variables during
Exploratory Data Analysis (EDA). They allow us to understand how changes in independent
variables influence the dependent variable, enabling data-driven insights and trend
detection.
2. Distinguish between the terms: Classification, Regression and Estimation. (5 marks)
Answer:
Feature | Classification | Regression | Estimation
Purpose | Categorize data into classes | Predict continuous values | Predict parameters or missing values
Output | Discrete (labels/classes) | Continuous (real numbers) | Often single-valued or derived stats
Examples | Spam/Not Spam, Disease/No disease | Predicting prices, temperatures | Estimating mean, missing income
Model Type | Logistic Regression, Decision Tree | Linear Regression, Polynomial Reg. | Statistical inference methods
Evaluation | Accuracy, Precision, Recall | RMSE, MAE | Confidence Intervals, Bias
3. What is the difference between Classification and Regression models? (2 marks)
Answer:
Feature Classification Regression
Output Discrete categories Continuous numerical values
Example Predicting email as spam/not spam Predicting house price
4. What is the principle of ordinary least square in linear regression? What are the pros
and cons of regression models? (5 marks)
Answer:
• Principle of OLS:
Ordinary Least Squares (OLS) minimizes the sum of squared differences between
observed values and predicted values to find the best-fitting line.
• Pros:
1. Simple and easy to interpret.
2. Fast computation.
3. Good for linear relationships.
4. Can identify key predictors.
5. Widely used and understood.
• Cons:
1. Poor with non-linear data.
2. Sensitive to outliers.
3. Assumes constant variance (homoscedasticity).
4. Assumes independence of errors.
5. Requires normally distributed errors for inference.
5. Based on Table 5.11 (Week vs. Hours spent in library), predict hours for the 7th and 9th
week using linear regression. (10 marks)
Table 5.11:
Week (x) 1 2 3 4 5
Hours (y) 12 18 22 28 35
Step-by-step solution:
1. Calculate means:
Mean of x = (1 + 2 + 3 + 4 + 5) / 5 = 3
Mean of y = (12 + 18 + 22 + 28 + 35) / 5 = 23
2. Find slope (b1):
Formula:
b1 = Σ[(xi - x̄)(yi - ȳ)] / Σ[(xi - x̄)²]
Create a table:
x | y | x−3 | y−23 | (x−3)(y−23) | (x−3)²
1 | 12 | −2 | −11 | 22 | 4
2 | 18 | −1 | −5 | 5 | 1
3 | 22 | 0 | −1 | 0 | 0
4 | 28 | 1 | 5 | 5 | 1
5 | 35 | 2 | 12 | 24 | 4
Σ(x−3)(y−23) = 22 + 5 + 0 + 5 + 24 = 56 and Σ(x−3)² = 4 + 1 + 0 + 1 + 4 = 10, so
b1 = 56 / 10 = 5.6
3. Find intercept (b0):
b0 = ȳ − b1 × x̄
b0 = 23 − 5.6 × 3 = 23 − 16.8 = 6.2
4. Regression line equation:
y = 6.2 + 5.6x
5. Predictions:
o For week 7:
y = 6.2 + 5.6 × 7 = 6.2 + 39.2 = 45.4 hours
o For week 9:
y = 6.2 + 5.6 × 9 = 6.2 + 50.4 = 56.6 hours
Final Answers:
• Predicted hours for week 7: 45.4
• Predicted hours for week 9: 56.6
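A short NumPy check of the fitted line (np.polyfit performs the same least-squares fit):

import numpy as np

weeks = np.array([1, 2, 3, 4, 5], dtype=float)
hours = np.array([12, 18, 22, 28, 35], dtype=float)

slope, intercept = np.polyfit(weeks, hours, 1)    # degree-1 least-squares fit
print(slope, intercept)                           # ~5.6 and ~6.2
for week in (7, 9):
    print(week, intercept + slope * week)         # ~45.4 and ~56.6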
6. Fit a line of best fit for the height data of boys (x) and girls (y) in Table 5.12. (10 marks)
Table 5.12:
Boys (x) 65 70 75 78
Girls (y) 63 67 70 73
Step-by-step solution:
1. Calculate means:
• Mean of x = (65 + 70 + 75 + 78) / 4 = 72
• Mean of y = (63 + 67 + 70 + 73) / 4 = 68.25
2. Create the table:
x | y | x−72 | y−68.25 | (x−72)(y−68.25) | (x−72)²
65 | 63 | −7 | −5.25 | 36.75 | 49
70 | 67 | −2 | −1.25 | 2.50 | 4
75 | 70 | 3 | 1.75 | 5.25 | 9
78 | 73 | 6 | 4.75 | 28.50 | 36
Sums: Σ(x−72)(y−68.25) = 73.00 and Σ(x−72)² = 98
3. Calculate slope and intercept:
• Slope (b1) = 73 / 98 ≈ 0.7449
• Intercept (b0) = 68.25 − (0.7449 × 72) = 14.62
4. Regression equation:
• y = 14.62 + 0.7449x
7. Using multiple regression, fit a model for Table 5.13 where z (equity) depends on x (net
sales) and y (assets). (10 marks)
Table 5.13:
x (Sales) 4 6 7 8 11
y (Assets) 12 18 22 28 35
z (Equity) 8 12 16 36 42
Multiple Regression Formula:
z = a + b1·x + b2·y
Solving the least-squares normal equations (with regression software or by hand):
• Intercept (a) ≈ −10.64
• Coefficient for x (b1) ≈ −7.65
• Coefficient for y (b2) ≈ 3.85
Final Regression Equation:
z = −10.64 − 7.65x + 3.85y
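A short NumPy sketch that solves the same least-squares fit for this table (np.linalg.lstsq on a design matrix with an intercept column):

import numpy as np

x = np.array([4, 6, 7, 8, 11], dtype=float)       # net sales
y = np.array([12, 18, 22, 28, 35], dtype=float)   # assets
z = np.array([8, 12, 16, 36, 42], dtype=float)    # equity

# Design matrix with an intercept column: z = a + b1*x + b2*y
X = np.column_stack([np.ones_like(x), x, y])
(a, b1, b2), *_ = np.linalg.lstsq(X, z, rcond=None)
print(a, b1, b2)                                  # ~ -10.64, -7.65, 3.85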
8. How does the polynomial regression model work? (5 marks)
Answer (in 5 points):
1. Polynomial regression models non-linear relationships by using powers of x (like x²,
x³).
2. It extends linear regression to fit curved trends.
3. The model has the form: y = a + b1x + b2x² + b3x³ + ... + bn·xⁿ.
4. It improves prediction accuracy when the data curve isn't straight.
5. Higher degree polynomials can overfit, so model degree must be chosen wisely.
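A minimal polynomial-regression sketch with NumPy; the curved dataset here is made up purely for illustration:

import numpy as np

# Illustrative data that follows a curve rather than a straight line
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.9, 10.2, 17.1, 25.8])

coeffs = np.polyfit(x, y, 2)        # degree-2 fit: y = b2*x^2 + b1*x + b0
print(coeffs)
print(np.polyval(coeffs, 6))        # prediction for x = 6 on the fitted curve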
9. How is logistic regression model different from linear regression? (2 marks)
Feature Linear Regression Logistic Regression
Output Type Continuous numeric value Probability between 0 and 1
Purpose Predicts quantities (e.g., salary) Classifies outcomes (e.g., pass/fail)
Graph Shape Straight line S-curve (sigmoid)
Dependent Variable Numeric Categorical (binary/multiclass)
1. How does the structure of a decision tree help in classifying a data instance? (2 marks)
A decision tree uses a hierarchical, tree-like structure where each internal node represents a
test on an attribute, each branch represents an outcome of the test, and each leaf node
represents a class label. It classifies a data instance by following the path from the root to a
leaf based on attribute values.
2. What are the different metrics used in deciding the splitting attribute? (2 marks)
The common metrics used for splitting attributes in decision trees include:
• Information Gain
• Gain Ratio
• Gini Index
3. Define Entropy. (2 marks)
Entropy is a measure of impurity or randomness in a dataset. It quantifies the uncertainty in
predicting the class of a data point and is calculated as:
Entropy(S) = - Σ p(i) * log₂ p(i)
where p(i) is the proportion of instances belonging to class i.
4. Relate Entropy and Information Gain. (5 marks)
1. Entropy measures the impurity in the data.
2. Information Gain quantifies the reduction in entropy after a dataset is split on an
attribute.
3. Information Gain = Entropy(parent) − Weighted Sum of Entropies(children)
4. High Information Gain means better split—more pure subsets.
5. Decision trees use Information Gain to choose the best splitting attribute.
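A small Python sketch of these two quantities; the class counts in the example split are hypothetical:

from math import log2

def entropy(labels):
    # Entropy(S) = -sum p(i) * log2 p(i) over the class proportions in S
    total = len(labels)
    proportions = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in proportions if p > 0)

def information_gain(parent, children):
    # Gain = Entropy(parent) - weighted sum of the child-subset entropies
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

# Hypothetical split: a 9 'Pass' / 5 'Fail' parent divided into two subsets
parent = ['Pass'] * 9 + ['Fail'] * 5
children = [['Pass'] * 6 + ['Fail'] * 1, ['Pass'] * 3 + ['Fail'] * 4]
print(round(entropy(parent), 3))                 # ~0.940
print(round(information_gain(parent, children), 3))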
5. How does a C4.5 algorithm perform better than ID3? What metric is used in the
algorithm? (5 marks)
Feature ID3 C4.5
Splitting Metric Information Gain Gain Ratio
Handles Continuous Data No Yes
Handles Missing Values No Yes
Tree Pruning Not available Post-pruning supported
Performance Lower on noisy/incomplete data Better due to advanced features
C4.5 uses Gain Ratio as the splitting metric, which normalizes Information Gain and avoids
bias toward attributes with many values.
6. What is CART? (2 marks)
CART (Classification and Regression Trees) is a decision tree algorithm that builds binary
trees for both classification and regression tasks. It uses the Gini index for classification and
variance reduction for regression.
7. How does CART solve regression problems? (2 marks)
CART solves regression problems by splitting the dataset into subsets such that the variance
of the target variable within each subset is minimized. The leaf nodes contain the average of
target values in that region, providing numeric predictions.
8. What is meant by pre-pruning and post-pruning? Compare both the methods. (5 marks)
Feature | Pre-Pruning | Post-Pruning
Timing | Done during tree construction | Done after full tree is built
Strategy | Stops tree growth early | Trims back a large tree
Criteria | Uses thresholds (depth, info gain, etc.) | Uses error rates, validation sets
Risk | May stop before capturing patterns | May overfit first, then reduce
Efficiency | Faster and less memory usage | More accurate but computationally heavy
9. How are continuous attributes discretized? (2 marks)
Continuous attributes are discretized by splitting them into intervals using thresholds. The
decision tree algorithm chooses split points (like x ≤ 5.7) that best reduce entropy or Gini
index, based on the target class distribution.
10. Discretize the continuous attribute "Percentage". (5 marks)
Table 6.42: Training Dataset
S.No. Percentage Averted
1 95 Yes
2 80 Yes
3 72 No
4 65 Yes
5 95 Yes
6 32 No
7 66 No
8 54 No
9 89 Yes
10 72 Yes
Step-by-step Discretization:
1. Sort by Percentage: 32, 54, 65, 66, 72, 72, 80, 89, 95, 95
2. Observe class changes:
o Between 66 (No) and 72 (Yes) → change: No → Yes
o Between 54 (No) and 65 (Yes) → change: No → Yes
o So candidate thresholds: (54+65)/2 = 59.5, (66+72)/2 = 69.0
3. Choose thresholds based on purity:
o Discretize into intervals:
▪ ≤59.5 → Mostly No
▪ 59.5–69.0 → Mixed
▪ >69.0 → Mostly Yes
Final Discretized Intervals:
Percentage Range Label
≤ 59.5 Low
59.6 – 69.0 Medium
> 69.0 High
11. Construct Decision Trees using ID3, C4.5, and CART (10 marks)
Table 6.43: Training Dataset
S.No. Assessment Assignment Project Seminar Result
1 Good Yes No Good Pass
2 Average Yes No Poor Fail
3 Good No No Good Pass
4 Poor No No Poor Fail
5 Good Yes No Good Pass
6 Average No No Good Pass
7 Good No No Fair Pass
8 Poor Yes No Good Fail
9 Average No No Poor Fail
10 Good Yes Yes Fair Pass
A. ID3 Algorithm
• Splitting Criterion: Information Gain
• Root Attribute: Assessment (provides highest gain)
• Branches:
o Good → Mostly Pass → Pass
o Poor → Mostly Fail → Fail
o Average → Further split on Seminar
B. C4.5 Algorithm
• Splitting Criterion: Gain Ratio
• Handles continuous data (if any) and missing values
• Similar tree to ID3, but:
o Prefers attributes with better normalized information gain
o May choose Seminar as the root due to balanced distribution
C. CART Algorithm
• Splitting Criterion: Gini Index
• Binary Tree Structure
• First split: Seminar
o Good → Majority: Pass
o Poor → Majority: Fail
o Fair → Further split on Assessment
Summary Table:
Algorithm Split Criterion Tree Type Notes
ID3 Information Gain Multi-way May overfit, no pruning
C4.5 Gain Ratio Multi-way Handles missing/continuous data
CART Gini Index Binary Always binary splits, uses pruning
1. What is meant by Probabilistic-based learning? (2 marks)
Probabilistic-based learning refers to a class of machine learning methods that use
probability theory to handle uncertainty in the data and models. It involves the use of
probabilistic models to make predictions, where outcomes are represented as probability
distributions, and the model updates its belief as new evidence is observed.
2. Differentiate between probabilistic model and deterministic model. (2 marks)
Aspect | Probabilistic Model | Deterministic Model
Nature | Deals with uncertainty and randomness | Predicts outcomes with certainty
Output | Gives probabilities for different outcomes | Provides a single, definite outcome
Model Type | Uses probability distributions and randomness | Fixed set of rules or equations
Example | Naïve Bayes, Hidden Markov Models | Linear Regression, Decision Trees
3. What is meant by Bayesian learning? (2 marks)
Bayesian learning is a method of machine learning that applies Bayes' theorem to update
the probability of a hypothesis based on prior knowledge and observed evidence. It involves
updating beliefs about the model parameters in a probabilistic manner as new data is
observed.
4. Define the following: (10 marks)
• Conditional probability: The probability of an event A occurring given that event B has already occurred, denoted P(A|B).
P(A|B) = P(A ∩ B) / P(B)
• Joint probability: The probability of two events A and B occurring together, denoted P(A ∩ B).
P(A ∩ B) = P(A|B) × P(B) = P(B|A) × P(A)
• Bayesian probability: The probability interpretation used in Bayesian learning, where the posterior probability of a hypothesis is updated based on the prior probability and the likelihood of the observed data.
P(H|D) = P(D|H) × P(H) / P(D)
• Marginal probability: The probability of an event occurring regardless of the outcomes of other events, obtained by summing over all possible values of the other variables.
P(A) = Σ_B P(A ∩ B)
5. What is the difference between prior, posterior, and likelihood probabilities? (5 marks)
Term | Definition | Example
Prior | The probability of a hypothesis before any evidence is observed | P(H): probability of having a disease before knowing the result of a medical test
Posterior | The updated probability of a hypothesis after observing new evidence | P(H|D): probability of the disease given the observed test result
Likelihood | The probability of observing the data given a specific hypothesis | P(D|H): probability of the test result given the disease
6. Define Maximum A Posteriori (MAP) Hypothesis (hMAP) and Maximum Likelihood (ML) Hypothesis (hML). (5 marks)
• MAP Hypothesis (hMAP):
The Maximum A Posteriori hypothesis is the one that is most probable after considering both the prior probability of the hypothesis and the likelihood of the observed data.
It uses Bayes' theorem to find the hypothesis that best explains the data, given what we already believe.
Formula:
hMAP = argmax_h P(h | D) = argmax_h [P(D | h) × P(h)]
This means we choose the hypothesis that has the highest product of the likelihood of the data and the prior probability.
• ML Hypothesis (hML):
The Maximum Likelihood hypothesis is the one that makes the observed data most likely, without considering any prior knowledge.
It focuses only on the likelihood, not on any pre-existing belief about the hypothesis.
Formula:
hML = argmax_h P(D | h)
This means we choose the hypothesis that best explains the data, based only on the data itself.
1. Discrete Attributes Problem: Naïve Bayes with Laplace Correction (10 marks)
Given:
Test data: Assessment = Average, Assignment = Yes, Project = No, Seminar = Good
Training Dataset:
S.No. Assessment Assignment Project Seminar Result
1 Good Yes Yes Good Pass
2 Average Yes No Poor Fail
3 Good No Yes Good Pass
4 Average No No Poor Fail
5 Average No Yes Good Pass
6 Good No No Poor Pass
7 Average Yes Yes Good Fail
8 Good Yes Yes Poor Pass
Step 1: Calculate the prior probabilities
• P(Pass) = 5/8
• P(Fail) = 3/8
Step 2: Calculate the likelihoods for each attribute given the result
For Pass:
• P(Assessment = Average | Pass) = 2/5
• P(Assignment = Yes | Pass) = 3/5
• P(Project = No | Pass) = 3/5
• P(Seminar = Good | Pass) = 2/5
For Fail:
• P(Assessment = Average | Fail) = 2/3
• P(Assignment = Yes | Fail) = 1/3
• P(Project = No | Fail) = 2/3
• P(Seminar = Good | Fail) = 0/3
Step 3: Apply Laplace Correction to calculate probabilities
For Pass:
• P(Assessment = Average | Pass) = (2+1)/(5+3) = 3/8
• P(Assignment = Yes | Pass) = (3+1)/(5+3) = 4/8
• P(Project = No | Pass) = (3+1)/(5+3) = 4/8
• P(Seminar = Good | Pass) = (2+1)/(5+3) = 3/8
For Fail:
• P(Assessment = Average | Fail) = (2+1)/(3+3) = 3/6
• P(Assignment = Yes | Fail) = (1+1)/(3+3) = 2/6
• P(Project = No | Fail) = (2+1)/(3+3) = 3/6
• P(Seminar = Good | Fail) = (0+1)/(3+3) = 1/6
Step 4: Calculate the posterior probabilities using Bayes' theorem
For Pass:
P(Pass | D) ∝ P(Pass) × P(Assessment=Average | Pass) × P(Assignment=Yes | Pass) × P(Project=No | Pass) × P(Seminar=Good | Pass)
P(Pass | D) ∝ (5/8) × (3/8) × (4/8) × (4/8) × (3/8) = (5×3×4×4×3)/8⁵ ≈ 0.022
For Fail:
P(Fail | D) ∝ P(Fail) × P(Assessment=Average | Fail) × P(Assignment=Yes | Fail) × P(Project=No | Fail) × P(Seminar=Good | Fail)
P(Fail | D) ∝ (3/8) × (3/6) × (2/6) × (3/6) × (1/6) = (3×3×2×3×1)/(8×6⁴) ≈ 0.005
Step 5: Compare the posterior probabilities
• If P(Pass | D) > P(Fail | D), predict "Pass"; if P(Fail | D) > P(Pass | D), predict "Fail".
• Here P(Pass | D) ≈ 0.022 > P(Fail | D) ≈ 0.005, so the predicted Result is "Pass".
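A small Python sketch reproducing Steps 3 to 5 from the counts above (the Laplace correction adds 1 to each count and 3 to each class total, as in this answer):

# Counts from Step 2 above: (matching instances, class total) for each attribute value
pass_counts = [(2, 5), (3, 5), (3, 5), (2, 5)]   # Average, Yes, No, Good given Pass
fail_counts = [(2, 3), (1, 3), (2, 3), (0, 3)]   # Average, Yes, No, Good given Fail
priors = {'Pass': 5 / 8, 'Fail': 3 / 8}

def smoothed_score(prior, counts, k=3):
    # Laplace correction: (count + 1) / (class total + k), multiplied with the prior
    score = prior
    for count, total in counts:
        score *= (count + 1) / (total + k)
    return score

scores = {'Pass': smoothed_score(priors['Pass'], pass_counts),
          'Fail': smoothed_score(priors['Fail'], fail_counts)}
print(scores)                        # Pass ~0.022, Fail ~0.005
print(max(scores, key=scores.get))   # 'Pass'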
2. Continuous Attributes Problem: Gaussian Naïve Bayes (10 marks)
Given:
Test data: Assessment Marks = 75, Assignment Marks = 6, Seminar Done = Poor
Training Dataset:
S.No. Assessment Marks Assignment Marks Seminar Done Result
1 95 8 Good Pass
2 71 5 Poor Fail
3 93 9 Good Pass
4 62 4 Poor Fail
5 81 9 Good Pass
6 93 8.5 Poor Pass
7 65 9 Good Pass
8 45 3 Poor Fail
9 78 8.5 Good Pass
10 56 4 Poor Fail
Step 1: Calculate the mean and variance for each class and feature
For Pass:
• Mean and variance of Assessment Marks over the rows labelled "Pass"
• Mean and variance of Assignment Marks over the rows labelled "Pass"
• Frequency of each Seminar Done value (Good/Poor) over the rows labelled "Pass" (a categorical attribute, so counts are used instead of a Gaussian)
For Fail:
• Mean and variance of Assessment Marks over the rows labelled "Fail"
• Mean and variance of Assignment Marks over the rows labelled "Fail"
• Frequency of each Seminar Done value over the rows labelled "Fail"
Step 2: Apply Gaussian Naïve Bayes to calculate the likelihood for each class
Use the Gaussian probability density function (PDF) to compute the likelihoods of the test
data given each feature for both classes.
Step 3: Calculate the posterior probability for each class using Bayes' Theorem
P(Pass | D) ∝ P(Pass) × P(Assessment Marks | Pass) × P(Assignment Marks | Pass) × P(Seminar Done | Pass)
P(Fail | D) ∝ P(Fail) × P(Assessment Marks | Fail) × P(Assignment Marks | Fail) × P(Seminar Done | Fail)
Step 4: Compare the posterior probabilities
• If P(Pass | D) > P(Fail | D), predict "Pass".
• If P(Fail | D) > P(Pass | D), predict "Fail".
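A minimal Gaussian Naïve Bayes sketch for this table; it assumes population variance for the two numeric attributes and simple frequency counting for the categorical Seminar attribute, which is one reasonable convention:

from math import exp, pi, sqrt
from statistics import mean, pvariance

# Training rows from the table above: (assessment marks, assignment marks, seminar, result)
data = [(95, 8, 'Good', 'Pass'), (71, 5, 'Poor', 'Fail'), (93, 9, 'Good', 'Pass'),
        (62, 4, 'Poor', 'Fail'), (81, 9, 'Good', 'Pass'), (93, 8.5, 'Poor', 'Pass'),
        (65, 9, 'Good', 'Pass'), (45, 3, 'Poor', 'Fail'), (78, 8.5, 'Good', 'Pass'),
        (56, 4, 'Poor', 'Fail')]
test = (75, 6, 'Poor')

def gaussian(x, mu, var):
    # Gaussian probability density used for the continuous attributes
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

scores = {}
for label in ('Pass', 'Fail'):
    rows = [r for r in data if r[3] == label]
    score = len(rows) / len(data)                        # prior P(class)
    for i in range(2):                                   # the two numeric attributes
        values = [r[i] for r in rows]
        score *= gaussian(test[i], mean(values), pvariance(values))
    score *= sum(1 for r in rows if r[2] == test[2]) / len(rows)   # categorical Seminar
    scores[label] = score

print(scores, max(scores, key=scores.get))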
1. What is a margin in SVM? (2 marks)
Answer: The margin is the distance between the decision boundary (hyperplane) and
the closest data points (support vectors) from each class. A larger margin indicates
better model generalization.
2. List the advantages of SVM. (2 marks)
Answer:
o Effective in high-dimensional spaces
o Robust to overfitting (especially with good margin)
o Versatile through kernel functions
o Memory efficient (uses only support vectors)
o Provides unique optimal solution
3. Formulate the optimization problem for maximal margin classification. (5 marks)
Answer:
The primal optimization problem is:
Minimize: ½||w||²
Subject to: yᵢ(wᵀxᵢ + b) ≥ 1 ∀i
Where:
o w = weight vector
o b = bias term
o xᵢ = data points
o yᵢ = class labels (-1 or +1)
4. List advantages and disadvantages of mapping functions in SVM. (5 marks)
Answer:
Advantages:
o Handles non-linear decision boundaries
o Flexible modeling through kernel trick
o Can capture complex patterns
Disadvantages:
o Computationally intensive for large datasets
o Kernel selection affects performance
o Risk of overfitting with improper parameters
5. List the types of kernels used in SVM. (2 marks)
Answer:
1. Linear: K(x,y) = xᵀy
2. Polynomial: K(x,y) = (γxᵀy + r)^d
3. RBF/Gaussian: K(x,y) = exp(-γ||x-y||²)
4. Sigmoid: K(x,y) = tanh(γxᵀy + r)
6. What is a kernel trick? (2 marks)
Answer: The kernel trick allows SVM to implicitly map input data into high-
dimensional feature spaces by replacing dot products with kernel functions, avoiding
explicit transformation and reducing computational cost.
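A tiny numerical illustration of the trick: for the degree-2 polynomial kernel K(x, y) = (x · y)², the kernel value equals the dot product of the explicitly mapped vectors φ(x) = (x1², √2·x1·x2, x2²), so the mapping never needs to be computed:

from math import sqrt

def phi(v):
    # Explicit degree-2 feature map for a 2-D input vector
    x1, x2 = v
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(u, v):
    # K(u, v) = (u . v)^2, evaluated without ever mapping to feature space
    return (u[0] * v[0] + u[1] * v[1]) ** 2

u, v = (2, 3), (6, 7)
mapped_dot = sum(a * b for a, b in zip(phi(u), phi(v)))
print(poly_kernel(u, v), mapped_dot)   # both equal 33^2 = 1089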
Long Answer Questions (5-10 marks each)
1. Explain the construction of kernel-based classifiers. (5 marks)
Answer:
Kernel-based classifiers are constructed by:
1. Selecting an appropriate kernel function that measures similarity between
data points
2. Formulating the dual optimization problem using the kernel matrix (Gram
matrix)
3. Solving for the Lagrange multipliers (α) that determine support vectors
4. The decision function becomes:
f(x) = sign(∑ αᵢyᵢK(xᵢ,x) + b)
5. Key properties:
▪ Never computes coordinates in feature space
▪ Relies only on pairwise similarities (kernel evaluations)
▪ Complexity depends on number of support vectors
2. Numerical Problems Solutions
1. Compute the norm of the vector (2, 8) (2 marks)
Solution:
Norm = √(2² + 8²) = √(4 + 64) = √68 = 2√17 ≈ 8.246
Answer:
The norm is 2√17 (≈ 8.246).
3. Hyperplane Comparison (5 marks)
Given:
• Classifier 1: 5 + 2x1 + 2x2 = 0
• Classifier 2: 3 + 20x1 + 20x2 = 0
Solution:
1. Margin Calculation:
o Margin (distance of the hyperplane from the origin) = |b| / ||w||.
o For Classifier 1: 5 / √(2² + 2²) = 5 / (2√2) ≈ 1.768.
o For Classifier 2: 3 / √(20² + 20²) = 3 / (20√2) ≈ 0.106.
2. Conclusion:
Classifier 1 has a larger margin, making it more robust to noise.
Answer:
Classifier 1 is better due to its larger margin (≈ 1.768 vs. ≈ 0.106).
4. Optimal Hyperplane for Given Points (5 marks)
Given:
• Class +1: (3,1), (3,-1), (4,0)
• Class -1: (1,0), (0,1), (0,-1)
Solution:
1. Plot Points:
o The closest points of opposite classes are (3,1) and (3,-1) from class +1 and (1,0) from class -1, so the maximum-margin hyperplane lies midway between x1 = 3 and x1 = 1.
2. Equation:
The optimal hyperplane is x1 − 2 = 0 (i.e., x1 = 2), with a margin of 1 on each side.
Answer:
The hyperplane x1 − 2 = 0 separates the classes with maximum margin.
5. Mapping Function Application (2 marks)
Given:
φ(x1, x2) = (x1², √2·x1·x2, x2²)
Solution:
• For (2, 3): (4, 6√2, 9)
• For (6, 7): (36, 42√2, 49)
Answer:
Mapped points are (4, 6√2, 9) and (36, 42√2, 49).
6. Transformations (5 marks)
Given Points: (1,4) and (2,7)
Solution:
1. Transformation (a): [x, y, x³ + y³]
o (1,4): [1, 4, 1 + 64] = [1, 4, 65]
o (2,7): [2, 7, 8 + 343] = [2, 7, 351]
2. Transformation (b): [x, y, √2·xy]
o (1,4): [1, 4, 4√2]
o (2,7): [2, 7, 14√2]
Answer:
• (a): [1, 4, 65] and [2, 7, 351]
• (b): [1, 4, 4√2] and [2, 7, 14√2]
1. What is similarity-based learning? (2 marks)
Answer:
A machine learning approach where predictions are made by comparing new instances
to stored examples using similarity measures (e.g., Euclidean distance). Examples include
k-NN and case-based reasoning.
2. Instance-based vs. Model-based Learning (2 marks)
Answer:
Aspect Instance-based Model-based
Approach Stores raw data, no explicit model Builds a generalized model
Prediction Compares to stored instances Uses learned parameters
Example k-NN Linear Regression
3. Why are instance-based learners called lazy? (2 marks)
Answer:
They defer computation until prediction time (no training phase), simply storing the
training data instead of learning a model.
4. Lazy vs. Eager Learning (2 marks)
Answer:
• Lazy: No training (e.g., k-NN). Fast training but slow predictions.
• Eager: Builds model during training (e.g., SVM). Slow training but fast predictions.
5. Why is k-NN memory-based? (2 marks)
Answer:
It stores the entire training dataset in memory to compare new instances during
prediction, rather than discarding data after training.
6. Why normalize data in k-NN? (2 marks)
Answer:
To prevent features with larger scales (e.g., age 0-100) from dominating distance
calculations over smaller-scale features (e.g., salary 0-1,000,000).
7. Benefits & Limitations of k-NN (2 marks)
Answer:
• Benefits: Simple, no training phase, handles non-linear data.
• Limitations: Slow predictions, sensitive to noise/irrelevant features, curse of
dimensionality.
8. k-NN Classification Solution (10 marks)
Given:
• Test Instance: GPA = 7.8, Projects = 4
• k=3
• Distance Metric: Euclidean distance
Step 1: Calculate Euclidean Distances
Distance = √[(GPA1 − GPA2)² + (Projects1 − Projects2)²]
S.No. | GPA | Projects | Award | Distance Calculation | Distance
1 | 9.5 | 5 | Yes | √[(7.8−9.5)² + (4−5)²] = √[2.89 + 1] | ≈ 1.97
2 | 8.0 | 4 | Yes | √[(7.8−8.0)² + (4−4)²] = √[0.04 + 0] | = 0.20
3 | 7.2 | 1 | No | √[(7.8−7.2)² + (4−1)²] = √[0.36 + 9] | ≈ 3.06
4 | 6.5 | 5 | Yes | √[(7.8−6.5)² + (4−5)²] = √[1.69 + 1] | ≈ 1.64
5 | 9.5 | 4 | Yes | √[(7.8−9.5)² + (4−4)²] = √[2.89 + 0] | ≈ 1.70
6 | 3.2 | 1 | No | √[(7.8−3.2)² + (4−1)²] = √[21.16 + 9] | ≈ 5.49
7 | 6.6 | 1 | No | √[(7.8−6.6)² + (4−1)²] = √[1.44 + 9] | ≈ 3.23
8 | 5.4 | 1 | No | √[(7.8−5.4)² + (4−1)²] = √[5.76 + 9] | ≈ 3.84
9 | 8.9 | 3 | Yes | √[(7.8−8.9)² + (4−3)²] = √[1.21 + 1] | ≈ 1.49
10 | 7.2 | 4 | Yes | √[(7.8−7.2)² + (4−4)²] = √[0.36 + 0] | = 0.60
Step 2: Identify 3 Nearest Neighbors
1. Instance 2: Distance = 0.20 (Award = Yes)
2. Instance 10: Distance = 0.60 (Award = Yes)
3. Instance 9: Distance = 1.49 (Award = Yes)
Step 3: Majority Vote
• All 3 nearest neighbors have Award = Yes.
Final Prediction:
The test instance (GPA = 7.8, Projects = 4) is classified as "Yes" for the award.
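The same 3-NN vote can be scripted; a minimal NumPy sketch (the arrays simply restate the table above) is:
import numpy as np

data = np.array([
    [9.5, 5], [8.0, 4], [7.2, 1], [6.5, 5], [9.5, 4],
    [3.2, 1], [6.6, 1], [5.4, 1], [8.9, 3], [7.2, 4],
])
labels = np.array(["Yes", "Yes", "No", "Yes", "Yes",
                   "No", "No", "No", "Yes", "Yes"])
test = np.array([7.8, 4])

# Euclidean distances from the test instance to every training instance
distances = np.linalg.norm(data - test, axis=1)

# Indices of the 3 nearest neighbours, then a majority vote on their labels
nearest = np.argsort(distances)[:3]
values, counts = np.unique(labels[nearest], return_counts=True)
print(values[np.argmax(counts)])   # -> "Yes"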
Question 1: Metrics from Confusion Matrix
Given Confusion Matrix:
Predicted Positive Predicted Negative
Actual Positive 5 5
Actual Negative 7 12
From this:
• True Positives (TP) = 5
• False Negatives (FN) = 5
• False Positives (FP) = 7
• True Negatives (TN) = 12
Now calculate the metrics:
• Accuracy = (TP + TN) / Total = (5 + 12) / (5 + 5 + 7 + 12) = 17 / 29 ≈ 0.5862
• Precision = TP / (TP + FP) = 5 / (5 + 7) = 5 / 12 ≈ 0.4167
• Recall = TP / (TP + FN) = 5 / (5 + 5) = 5 / 10 = 0.5
• F1-score = 2 × (Precision × Recall) / (Precision + Recall) =
2 × (0.4167 × 0.5) / (0.4167 + 0.5) ≈ 0.4545
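A short sketch that reproduces these four metrics directly from the confusion-matrix counts:
# Counts taken from the confusion matrix above
TP, FN, FP, TN = 5, 5, 7, 12

accuracy  = (TP + TN) / (TP + TN + FP + FN)                  # 17/29 ≈ 0.5862
precision = TP / (TP + FP)                                   # 5/12  ≈ 0.4167
recall    = TP / (TP + FN)                                   # 5/10  = 0.5
f1        = 2 * precision * recall / (precision + recall)    # ≈ 0.4545
print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))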
Question 2: From Dataset – Confusion Matrix, Precision, Recall
ID Actual Predicted
1 Yes Yes
2 No No
3 Yes Yes
4 No No
5 Yes No
6 No No
7 Yes Yes
8 No No
9 No No
Confusion matrix:
Predicted Yes Predicted No
Actual Yes 3 (TP) 1 (FN)
Actual No 0 (FP) 5 (TN)
Now calculate:
• Precision = TP / (TP + FP) = 3 / (3 + 0) = 1.0
• Recall = TP / (TP + FN) = 3 / (3 + 1) = 0.75
• F1-score = 2 × (1.0 × 0.75) / (1.0 + 0.75) = 0.8571
• Accuracy = (TP + TN) / Total = (3 + 5) / 9 = 0.8889
3. What is K-Fold Cross Validation? Explain. In what specific case Leave-One-Out
(LOO) Cross-Validation is costlier than K-Fold Cross Validation? (5 marks)
K-Fold Cross Validation:
• K-Fold Cross Validation is a technique used to evaluate the performance of a machine
learning model.
• The dataset is divided into k equal-sized subsets or “folds.”
• The model is trained on k−1 folds and tested on the remaining one fold.
• This process is repeated k times, each time using a different fold as the test set.
• The final performance metric is the average of the results from the k iterations.
Leave-One-Out Cross Validation (LOOCV):
• LOOCV is a special case of K-Fold Cross Validation where k = n (number of samples).
• Each iteration uses one instance for testing and the remaining n-1 for training.
• This means the model is trained n times.
When is LOOCV costlier?
• When the dataset is large, LOOCV requires significantly more computation compared
to k-fold (e.g., 10-fold) validation.
• Thus, LOOCV becomes computationally expensive for large datasets due to the high
number of iterations.
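A small sketch (assuming scikit-learn is available) makes the cost difference concrete: for 1,000 samples, 10-fold CV trains 10 models while LOOCV trains 1,000.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(1000).reshape(-1, 1)   # 1,000 dummy samples

kfold = KFold(n_splits=10)
loo = LeaveOneOut()
print(kfold.get_n_splits(X))   # 10 training runs
print(loo.get_n_splits(X))     # 1000 training runs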
4. Derive the Mathematical Expressions for Coefficients of a Linear Regression
Model Using Least Squares Method (10 marks)
Derivation of Linear Regression Coefficients Using Least Squares Method
Given:
A dataset with n observations (xi, yi) and the linear model:
y = β0 + β1x + ε
where:
• β0 = y-intercept
• β1 = slope
• ε = error term
Objective:
Minimize the Sum of Squared Errors (SSE):
SSE = Σ(i=1 to n) (yi − (β0 + β1xi))²
Step 1: Compute Partial Derivatives
Take partial derivatives of SSE with respect to β0 and β1 and set them to zero.
1. Partial derivative w.r.t. β0:
∂SSE/∂β0 = −2 Σ(yi − β0 − β1xi) = 0
Simplifies to:
Σyi − nβ0 − β1Σxi = 0    (Equation 1)
2. Partial derivative w.r.t. β1:
∂SSE/∂β1 = −2 Σ xi(yi − β0 − β1xi) = 0
Simplifies to:
Σxiyi − β0Σxi − β1Σxi² = 0    (Equation 2)
Step 2: Solve the Normal Equations
From Equation 1:
β0 = (Σyi − β1Σxi) / n    (A)
Substitute β0 into Equation 2:
Σxiyi − [(Σyi − β1Σxi) / n]·Σxi − β1Σxi² = 0
Multiply through by n:
nΣxiyi − (Σyi − β1Σxi)Σxi − nβ1Σxi² = 0
Simplify to solve for β1:
β1 = [nΣxiyi − ΣxiΣyi] / [nΣxi² − (Σxi)²]
From (A), compute β0:
β0 = ȳ − β1x̄
where x̄ = Σxi / n and ȳ = Σyi / n.
Final Coefficients:
β1 = [nΣxiyi − ΣxiΣyi] / [nΣxi² − (Σxi)²]
β0 = ȳ − β1x̄
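The closed-form formulas above can be applied directly; a sketch on a small illustrative dataset (the x/y values here are made up):
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])
n = len(x)

# beta1 = [n*Σxy − Σx*Σy] / [n*Σx² − (Σx)²], beta0 = ȳ − beta1*x̄
beta1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
beta0 = y.mean() - beta1 * x.mean()
print(beta1, beta0)   # slope and intercept of the fitted line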
5. Naïve Bayes Classification (10 marks)
Given Test Instance:
X = (age = youth, income = medium, student = yes, credit_rating = fair)
Step 1: Calculate Priors
• Total records = 14
• Yes = 9
• No = 5
P(Yes) = 9/14, P(No) = 5/14
Step 2: Calculate Likelihoods
Feature Value P(Value | Yes) P(Value | No)
----------------------------------------------------------
Age Youth 2/9 3/5
Income Medium 4/9 2/5
Student Yes 6/9 1/5
Credit Rating Fair 6/9 2/5
Step 3: Apply Naïve Bayes Formula
P(Class | X) ∝ P(Class) × ∏ P(feature | Class)
For "Yes":
P(Yes | X) ∝ (9/14) × (2/9) × (4/9) × (6/9) × (6/9) = (9/14) × (288/6561) ≈ 0.028
For "No":
P(No | X) ∝ (5/14) × (3/5) × (2/5) × (1/5) × (2/5) = (5/14) × (12/625) ≈ 0.007
Step 4: Conclusion
Since:
P(Yes | X) > P(No | X)
Final Classification: Yes
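A quick sketch that reproduces the two class scores from the likelihood table above:
p_yes = (9/14) * (2/9) * (4/9) * (6/9) * (6/9)   # ≈ 0.028
p_no  = (5/14) * (3/5) * (2/5) * (1/5) * (2/5)   # ≈ 0.007
prediction = "Yes" if p_yes > p_no else "No"
print(round(p_yes, 4), round(p_no, 4), prediction)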
Question 6: K-Nearest Neighbors (KNN) Classification
Task: For the given dataset, classify the test point (4.9, 1.75) using KNN
with k = 1, 2, 3. Use Euclidean distance.
Dataset:
Attribute_1 Attribute_2 Class
4.9 1.5 1
5.1 1.6 1
5.0 1.4 1
6.0 2.0 2
6.1 1.9 2
6.2 2.1 2
Solution:
1. Calculate Euclidean distances from (4.9, 1.75) to all points:
o D1 = √[(4.9−4.9)² + (1.75−1.5)²] = 0.25
o D2 = √[(4.9−5.1)² + (1.75−1.6)²] = 0.25
o D3 = √[(4.9−5.0)² + (1.75−1.4)²] ≈ 0.36
o D4 = √[(4.9−6.0)² + (1.75−2.0)²] ≈ 1.13
o D5 = √[(4.9−6.1)² + (1.75−1.9)²] ≈ 1.21
o D6 = √[(4.9−6.2)² + (1.75−2.1)²] ≈ 1.35
2. Rank distances in ascending order:
D1, D2, D3, D4, D5, D6
3. Classify for different k:
o k = 1: Nearest neighbor is D1 (Class 1).
o k = 2: Two nearest are D1, D2 (both Class 1).
o k = 3: Three nearest are D1, D2, D3 (all Class 1).
Final Answer:
For all k = 1, 2, 3, the test point (4.9, 1.75) is classified as Class 1.
8. How does regularization solve overfitting? (2 marks)
Answer: Regularization adds penalty terms (L1/L2) to the loss function, reducing
model complexity by shrinking coefficients, thus preventing over-reliance on noisy
training data.
9. How are internal nodes split in Decision Trees? (2 marks)
Answer: By selecting the feature and threshold that maximize information gain (or
minimize impurity) using metrics like Gini or Entropy.
10. How is impurity measured in Decision Trees? (2 marks)
Answer: Using:
o Gini Index: 1 − Σ pi²
o Entropy: −Σ pi log2(pi)
where pi is the probability of class i.
11. "Entropy = 0.5" significance? (2 marks)
Answer: The split is moderately impure (max entropy = 1 for equal class distribution;
0.5 indicates partial separation).
12. What is underfitting? (2 marks)
Answer: When a model is too simple to capture patterns in the data, exhibiting high
bias and poor performance on both training and test sets.
13. What is overfitting? (2 marks)
Answer: When a model memorizes training data (including noise), performing well
on training but poorly on unseen data due to high variance.
14. What is a confusion matrix? (2 marks)
Answer: A table showing:
Predicted + Predicted -
Actual + TP FN
Actual - FP TN
15. What is an ROC curve? (2 marks)
Answer: A plot of True Positive Rate (TPR) vs. False Positive Rate (FPR) at various
thresholds, measuring classifier performance.
16. Class conditional independence in Naïve Bayes? (2 marks)
Answer: Assumes features are independent given the class label, simplifying joint
probability to P(x1, …, xn | y) = ∏ P(xi | y).
17. MLE for univariate Gaussian parameters (5 marks)
Answer: For data {x1, …, xn}:
o Mean (μ): μ̂ = (1/n) Σ xi
o Variance (σ²): σ̂² = (1/n) Σ (xi − μ̂)²
18. Maximum a Posteriori (MAP) estimate? (2 marks)
Answer: An estimate that maximizes the posterior distribution P(θ | X),
combining the likelihood P(X | θ) and the prior P(θ).
MOD5
2 marks
Correlation vs. Covariance
Correlation measures both the strength and direction of the linear relationship
between two variables, standardized to a range of -1 to 1. Covariance measures only
the direction of the linear relationship (positive or negative) but its magnitude is not
standardized and depends on the variables' scales.
PCA Analysis
Need for PCA: PCA (Principal Component Analysis) is used for dimensionality
reduction by transforming correlated variables into uncorrelated principal components
while retaining most of the original variance. It helps visualize high-dimensional data,
remove noise, and improve algorithm performance.
Scree plots in PCA display the eigenvalues of principal components in descending
order. They help determine the optimal number of components to retain by showing
where the eigenvalues level off (the "elbow point"), indicating components that explain
most variance before the drop-off.
Classification vs. Clustering
Classification is a supervised learning method where predefined labels are assigned
to instances based on training data. Clustering is an unsupervised learning method
that groups similar instances without predefined labels.
Clustering Schemes
Advantages:
• Discovers natural groupings in data
• Works without labeled data
• Useful for exploratory data analysis
• Can handle various data types
Disadvantages:
• Results can be subjective to interpret
• Sensitive to initial parameters
• May struggle with high-dimensional data
• Cluster quality depends on distance metric choice
Applications:
• Customer segmentation
• Image segmentation
• Anomaly detection
• Document clustering
• Social network analysis
Challenges:
• Determining optimal number of clusters
• Handling noisy data
• Scalability with large datasets
• Choosing appropriate distance metrics
• Dealing with clusters of varying shapes/densities
Problems with Clustering Large Data
• High computational complexity with distance calculations
• Memory constraints
• Difficulty visualizing results
• Increased sensitivity to noise/outliers
• Convergence issues with iterative algorithms
Agglomerative Algorithm Procedure
1. Start with each point as its own cluster
2. Compute pairwise distance matrix between all clusters
3. Merge the two closest clusters
4. Update distance matrix using linkage criterion (single, complete, average, etc.)
5. Repeat steps 3-4 until all points are in one cluster
6. Use dendrogram to determine final clustering
k-means Algorithm
k in k-means: k represents the number of clusters to form. It's typically selected using:
• Elbow method (plotting WCSS against k)
• Silhouette analysis
• Gap statistic
• Domain knowledge
Different initializations: k-means converges to local minima. Random initial centroids
may lead to different final clusters. Solutions include k-means++ initialization or
multiple runs.
Core, Border, Noise Points in DBSCAN
• Core point: Has at least minPts within its ε-neighborhood
• Border point: Has fewer than minPts but is reachable from a core point
• Noise point: Neither core nor border point (outlier)
Density in DBSCAN
Density is measured by counting points within a specified ε radius (neighborhood). A
region is dense if it contains at least minPts within this radius.
Grid in Clustering
A grid divides the data space into rectangular cells for grid-based clustering methods
(like STING or CLIQUE). It enables efficient processing by operating on cells rather than
individual points.
Monotonicity Property
In subspace clustering, the monotonicity property states that if a cluster exists in a
higher-dimensional space, its projections in lower-dimensional subspaces must also
be clusters.
Subspace Clustering
A subspace is a subset of dimensions/features. It's important because clusters may
exist only in specific subspaces in high-dimensional data, not in the full feature space.
Model in Clustering
A model represents the assumed structure of clusters (e.g., centroid-based, density-
based, distribution-based). Different algorithms make different model assumptions.
FCM (Fuzzy C-Means)
Advantages:
• Allows partial membership in clusters
• Handles overlapping clusters well
• More flexible than hard clustering
Disadvantages:
• Computationally intensive
• Sensitive to initial conditions
• Requires predefining number of clusters
• May converge to local optima
Mixture Model
A mixture model assumes data comes from a mixture of probability distributions
(often Gaussian), with each component representing a cluster. Parameters are
typically estimated via EM algorithm.
Silhouette Coefficient
Measures how similar an object is to its own cluster compared to other clusters.
Ranges from -1 (poor fit) to +1 (excellent fit). Calculated as (b-a)/max(a,b), where a is
average intra-cluster distance, b is average nearest-cluster distance.
Cluster Validation Measures
Internal: Silhouette index, Dunn index, Davies-Bouldin index (use only the data)
External: Purity, Rand index, F-measure, Jaccard index (compare to ground truth)
Detailed k-means Algorithm
1. Choose k initial centroids (random or k-means++)
2. Assign each point to nearest centroid (using Euclidean distance typically)
3. Recalculate centroids as mean of all points in cluster
4. Repeat steps 2-3 until:
a. Centroids stabilize (no change)
b. Maximum iterations reached
c. Objective function (WCSS) changes minimally
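A compact NumPy sketch of these steps (illustrative only; it assumes no cluster ever becomes empty):
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch following the steps listed above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial centroids at random from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when centroids stabilize
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[3, 5], [7, 8], [12, 5], [16, 9]], dtype=float)
print(kmeans(X, k=2))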
DBSCAN Algorithm
1. For each unvisited point:
a. Find its ε-neighborhood
b. If it has ≥ minPts neighbors, start a new cluster
c. Expand cluster by adding all density-reachable points
2. Mark points not in any cluster as noise.
Key features: Handles arbitrary cluster shapes, robust to noise, doesn't need a predefined k.
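An illustrative usage sketch (assumes scikit-learn; the eps and min_samples values are arbitrary examples, not recommendations):
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [1.1, 2.1], [0.9, 1.9],   # a dense group
              [8, 8], [8.1, 7.9],                # another dense group
              [25, 80]])                         # an isolated outlier

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # points labelled -1 are noise (outliers)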
Fuzzy Logic and FCM
Fuzzy logic allows partial truth values between 0 and 1. FCM (Fuzzy C-Means) extends
k-means by allowing points to belong to multiple clusters with membership degrees. It
minimizes weighted within-cluster variance.
EM Algorithm
Expectation-Maximization algorithm for finding maximum likelihood estimates with
latent variables:
1. E-step: Compute expected values of latent variables given current parameters
2. M-step: Update parameters to maximize likelihood given current expectations
3. Repeat until convergence Used in Gaussian mixture models and other
applications with incomplete data.
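A sketch of EM in practice (assumes scikit-learn): a Gaussian mixture model fitted to synthetic two-cluster data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(6, 1, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # fitted via EM
print(gmm.means_)            # estimated component means (roughly 0 and 6)
print(gmm.predict(X[:5]))    # soft memberships turned into hard labels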
Here are the detailed answers to the 5-mark questions:
26. Procedure for SVD (Singular Value Decomposition)
Applications:
• Dimensionality reduction (like PCA).
• Data compression.
• Solving linear systems (pseudo-inverse).
27. Obtaining Principal Components & Relevance in Feature Reduction
Relevance in Feature Reduction:
• Retains most variance with fewer dimensions.
• Removes multicollinearity (PCs are orthogonal).
• Improves computational efficiency and visualization.
• Reduces noise by discarding low-variance components.
28. Chebyshev Distance for (0, 3) and (5, 8)
Chebyshev distance = max(|5 − 0|, |8 − 3|) = max(5, 5) = 5.
29. Cosine Similarity for Vectors A = {1, 1, 0} and B = {0, 1, 1}
Cosine similarity = (A·B) / (||A|| ||B||) = 1 / (√2 × √2) = 0.5
Answer: The cosine similarity is 0.5.
30. Euclidean, Manhattan, and Chebyshev Distances
Answers:
• (a) Euclidean=3, Manhattan=5, Chebyshev=2.
• (b) Euclidean≈7.81, Manhattan=11, Chebyshev=6.
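A short sketch of the three distance measures, applied here to the points from question 28 above:
import numpy as np

a = np.array([0, 3])
b = np.array([5, 8])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # sum of coordinate differences
chebyshev = np.max(np.abs(a - b))           # largest coordinate difference
print(euclidean, manhattan, chebyshev)      # ≈ 7.07, 10, 5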
31. Cosine Similarity, SMC, and Jaccard Coefficients
(b) Vectors: (1, 0, 0, 0, 1) and (1, 0, 0, 0, 1)
• Cosine Similarity: 1 (identical vectors).
• SMC: 1 (all positions match).
• Jaccard: 1 (all non-zero positions match).
Answers:
• (a) Cosine≈0.41, SMC=0.5, Jaccard≈0.33.
• (b) Cosine=1, SMC=1, Jaccard=1.
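A sketch computing cosine similarity, SMC, and the Jaccard coefficient for binary vectors, using the vectors from part (b) above:
import numpy as np

x = np.array([1, 0, 0, 0, 1])
y = np.array([1, 0, 0, 0, 1])

cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
matches = np.sum(x == y)                     # positions that agree (0-0 or 1-1)
smc = matches / len(x)
both_ones = np.sum((x == 1) & (y == 1))
either_one = np.sum((x == 1) | (y == 1))
jaccard = both_ones / either_one             # ignores 0-0 matches
print(cosine, smc, jaccard)                  # 1.0, 1.0, 1.0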
32. Hamming Distance
(a) (1, 1, 1) and (1, 0, 0)
{Differences} = 2 {(positions 2 and 3)}
(b) (1, 1, 1, 0, 0) and (0, 0, 1, 1, 1)
{Differences} = 4 {(positions 1, 2, 4, 5)}
Answers:
• (a) Hamming distance = 2.
• (b) Hamming distance = 4.
33. Distance Between Categorical Data
(a) Employee ID: 1000 and 1001
• Distance: 1 (treat as ordinal; difference = |1001 - 1000|).
(b) Employee Names
• John & John: 0 (identical).
• John & Joan:
o Edit Distance: 1 (replace 'h' with 'a').
Answers:
• (a) Distance = 1.
• (b) John & John = 0, John & Joan = 1.
34. Distance Between Nominal Data
Answers:
• (a) Distance = 0.
• (b) Distance = 0.8.
Here are the step-by-step solutions to the clustering problems:
35. k-means Algorithm (k=2) for Table 13.15 Data
Given Data:
S.No. X Y
1 3 5
2 7 8
3 12 5
4 16 9
Step 1: Initialize Centroids
Randomly choose 2 centroids (let’s pick points 1 and 4):
• C1 = (3, 5)
• C2 = (16, 9)
Step 2: Assign Points to Clusters
Calculate Euclidean distances from each point to centroids:
Point    Dist to C1   Dist to C2   Cluster
(3,5)    0            13.60        1
(7,8)    5            9.06         1
(12,5)   9            5.66         2
(16,9)   13.60        0            2
Step 3: Update Centroids
Recalculate centroids based on current clusters:
• C1 = mean of (3,5) and (7,8) = (5, 6.5)
• C2 = mean of (12,5) and (16,9) = (14, 7)
Step 4: Reassign Points
Recalculate distances to new centroids:
Point    Dist to C1   Dist to C2   Cluster
(3,5)    2.5          11.18        1
(7,8)    2.5          7.07         1
(12,5)   7.16         2.83         2
(16,9)   11.28        2.83         2
Convergence: Clusters remain the same as in Step 2.
Final Clusters:
• Cluster 1: (3,5), (7,8)
• Cluster 2: (12,5), (16,9)
36. Hierarchical Clustering (Single Linkage) for Table 13.14 Data
Given Data:
S.No. X Y
1 3 5
2 7 8
3 12 5
4 16 9
5 20 8
Seed Points: (7,8) and (16,9)
Step 1: Compute Distance Matrix (Euclidean)
Point (7,8) (16,9)
(3,5) 5 13.34
(7,8) 0 9.43
(12,5) 5.83 5.66
(16,9) 9.43 0
(20,8) 13 4.12
Step 2: Merge Closest Points
• Iteration 1: Merge (16,9) and (20,8) (distance = 4.12).
• Iteration 2: Merge (7,8) and (12,5) (distance = 5.83).
• Iteration 3: Merge (3,5) with (7,8)-(12,5) cluster (distance = 5).
• Iteration 4: Merge all clusters (max distance = 13.34).
Dendrogram:
Height
13.34 | ┌─────────────┐
9.43 | ┌─┘ │
5.83 | ┌─┘ │
4.12 | ┌─┘ │
-------------------------
(3,5) (7,8) (12,5) (16,9) (20,8)
37. Single Linkage Algorithm for Table 13.4 Data
Given Data:
Objects X Y
0 1 4
1 2 8
2 5 10
3 12 18
4 14 28
Step 1: Compute Distance Matrix (Euclidean)
Point 0 1 2 3 4
0 0 4.12 7.21 17.46 26.93
1 4.12 0 3.61 13.45 22.47
2 7.21 3.61 0 10.63 19.70
3 17.46 13.45 10.63 0 10.20
4 26.93 22.47 19.70 10.20 0
Step 2: Merge Closest Clusters
• Iteration 1: Merge points 1 and 2 (distance = 3.61).
• Iteration 2: Merge point 0 with cluster (1,2) (distance = 4.12 to point 1).
• Iteration 3: Merge points 3 and 4 (distance = 10.20).
• Iteration 4: Merge clusters (0,1,2) and (3,4) (distance = 10.63 between point 2
and 3).
Final Dendrogram:
Height
26.93 | ┌───────────────────┐
17.46 | ┌─┘ │
10.63 | ┌─┘ │
10.20 | ┌─┘ │
3.61 | └─────────────────────────┘
-------------------------------
(0) (1) (2) (3) (4)
Key: Single linkage uses the minimum distance between clusters for merging.
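An illustrative sketch (assumes SciPy) of single-linkage clustering on the five points from Table 13.4 above:
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1, 4], [2, 8], [5, 10], [12, 18], [14, 28]], dtype=float)

Z = linkage(X, method="single")   # each row: the two clusters merged and the merge distance
print(Z)
# Passing Z to scipy.cluster.hierarchy.dendrogram would draw the tree
# when a plotting backend is available.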
38. k-means Algorithm with Initial Seeds (4,6) and (12,4)
Given Data:
Objects X Y
1 2 4
2 4 6
3 6 8
4 10 4
5 12 4
Initial Seeds:
• C1 = (4, 6)
• C2 = (12, 4)
Step 1: Assign Points to Clusters
Calculate Euclidean distances:
Point    Dist to C1   Dist to C2   Cluster
(2,4)    2.83         10.00        1
(4,6)    0            8.25         1
(6,8)    2.83         7.21         1
(10,4)   6.32         2.00         2
(12,4)   8.25         0            2
Cluster Assignment:
• Cluster 1: (2,4), (4,6), (6,8)
• Cluster 2: (10,4), (12,4)
Step 2: Update Centroids
• C1 = mean of (2,4), (4,6), (6,8) = (4, 6)
• C2 = mean of (10,4), (12,4) = (11, 4)
Step 3: Reassign Points
Recalculate distances:
Point    Dist to C1   Dist to C2   Cluster
(2,4)    2.83         9.00         1
(4,6)    0            7.28         1
(6,8)    2.83         6.40         1
(10,4)   6.32         1.00         2
(12,4)   8.25         1.00         2
Convergence: Centroids stabilize.
Final Clusters:
• Cluster 1: (2,4), (4,6), (6,8)
• Cluster 2: (10,4), (12,4)
Key Takeaways
1. k-means: Converges when centroids stabilize. Initial seeds impact results.
2. PCA: Transforms data to orthogonal axes of max variance, invertible for
reconstruction.
3. SVD: Factorizes A into U Σ Vᵀ, useful for dimensionality reduction and matrix
approximations.
41. PCA for the Matrix A = [[4, 3], [1, 2]] (10 marks)
Step 1: Center the Data
Step 2: Covariance Matrix
Step 3: Eigen Decomposition
Verification: Perfect reconstruction (lossless for non-zero eigenvalues).
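The omitted working can be reproduced numerically; a small NumPy sketch (treating each row of A as an observation and using NumPy's default sample covariance):
import numpy as np

A = np.array([[4.0, 3.0], [1.0, 2.0]])

# Step 1: centre the data (subtract the column means)
centered = A - A.mean(axis=0)

# Step 2: covariance matrix of the centred data
cov = np.cov(centered, rowvar=False)

# Step 3: eigen decomposition of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
print(cov, eigvals, eigvecs, sep="\n")

# Verification: project onto all components and invert the transform
scores = centered @ eigvecs
reconstructed = scores @ eigvecs.T + A.mean(axis=0)
print(np.allclose(reconstructed, A))   # True -> lossless when all components are kept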
43. Matrix Decompositions and SVD Proof
Key Decompositions:
Numerical Verification:
Using Python (for precision):
import numpy as np
A = np.array([[4, 3], [1, 2]])
U, S, Vt = np.linalg.svd(A)
print("Reconstructed A:\n", U @ np.diag(S) @ Vt)
Output confirms ( A ) is reconstructed perfectly.
Summary
• PCA: Captures max variance directions; reversible for full-rank data.
• SVD: General factorization; handles any matrix via U Σ Vᵀ.
• Verification: Both methods reconstruct the original matrix when all components
are used.
Reinforcement Learning Q&A
1. Define the terms agent and environment in the context of reinforcement
learning.
Answer:
• Agent: The entity that learns and makes decisions by interacting with its
surroundings. It takes actions to maximize cumulative rewards.
• Environment: The external system the agent interacts with. It provides
feedback (states and rewards) based on the agent’s actions.
2. What is a reward function in RL, and why is it crucial?
Answer:
• The reward function assigns a numerical value (reward) to each action-
state pair, indicating immediate desirability.
• Crucial because: It defines the agent’s goal, enabling it to learn optimal
behavior by maximizing total rewards over time. Without rewards, the
agent lacks direction.
3. Differentiate between a policy and a value function in reinforcement
learning.
Answer:
Policy                                          Value Function
A strategy (rule) mapping states to actions.    Estimates expected long-term return (cumulative rewards).
Determines "what to do."                        Evaluates "how good" a state/action is.
Can be deterministic/stochastic.                V(s): state value; Q(s,a): action value.
4. Briefly explain the exploration-exploitation dilemma in RL.
Answer:
• Exploration: Trying new actions to discover potentially better rewards
(e.g., random choices).
• Exploitation: Choosing known high-reward actions to maximize
immediate gains.
• Dilemma: Over-exploiting may miss better long-term strategies, while
over-exploring reduces efficiency. Balancing both is key (e.g., using ε-
greedy methods).
5. What is Q-learning, and what does the "Q" represent?
Answer:
• Q-learning: A model-free RL algorithm where the agent learns the
optimal action-value function (Q-table) through trial and error.
• The "Q" stands for Quality, representing the expected future rewards of
taking action a in state s and following the optimal policy thereafter.
5 marks :
1. Key Components of a Reinforcement Learning System
Question: Explain the key components of a reinforcement learning system,
including the agent, environment, reward, policy, and value function.
Answer:
A reinforcement learning (RL) system is built around five fundamental
components:
• Agent: The learner or decision-maker that interacts with the environment
by taking actions to achieve a goal.
• Environment: The external system with which the agent interacts,
providing feedback in the form of states and rewards.
• Reward: A scalar feedback signal that indicates the immediate success or
failure of an action taken in a particular state.
• Policy (π): A strategy that defines the agent's behavior, mapping states to
actions. It can be deterministic (e.g., "always turn left in state X") or
stochastic (e.g., "turn left with 70% probability in state X").
• Value Function: Estimates the expected long-term return (cumulative
reward) of states or actions:
o State-value function V(s): Expected return starting from state *s*.
o Action-value function Q(s,a): Expected return for taking
action *a* in state *s*.
Importance: These components collectively enable the agent to learn optimal
behavior through interaction and feedback.
2. Exploration-Exploitation Trade-off in RL
Question: Describe the exploration-exploitation trade-off in RL. Why is it
important, and what are some common strategies for addressing it?
Answer:
• Exploration: The process of trying new actions to discover their potential
rewards, which may lead to better long-term outcomes.
• Exploitation: The process of choosing known actions that yield the
highest immediate rewards.
Importance: The trade-off arises because focusing solely on exploitation may
cause the agent to miss better strategies, while excessive exploration can lead to
inefficiency or suboptimal performance.
Strategies:
1. ε-Greedy: The agent exploits the best-known action most of the time but
explores randomly with a small probability ε.
2. Upper Confidence Bound (UCB): Prioritizes actions with high
uncertainty or potential for higher rewards.
3. Thompson Sampling: Uses probability distributions to model
uncertainty and guide exploration.
3. Q-Learning and Optimal Policy Learning
Question: Explain the basic idea behind Q-learning. How does it learn an
optimal policy?
Answer:
• Basic Idea: Q-learning is a model-free RL algorithm that learns the
optimal action-value function, Q(s,a), which represents the expected
cumulative reward of taking action *a* in state *s* and following the
optimal policy thereafter.
Learning Process:
1. Q-Table Initialization: Start with arbitrary values for all state-action
pairs.
2. Update Rule: Use the Bellman equation to iteratively improve Q-values:
Q(s,a) ← Q(s,a) + α[r + γ·max a′ Q(s′,a′) − Q(s,a)]
where:
o α is the learning rate.
o γ is the discount factor for future rewards.
o *r* is the immediate reward.
3. Policy Derivation: Once Q-values converge, the optimal policy is
derived by selecting the action with the highest Q-value in each state.
Key Point: Q-learning guarantees convergence to the optimal policy if all state-
action pairs are visited infinitely often and the learning rate decays
appropriately.
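A minimal sketch of the tabular update rule and ε-greedy selection described above (the environment here is hypothetical: 4 states, 2 actions, and a single hand-coded transition):
import numpy as np

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

def update(s, a, r, s_next):
    """One Bellman update: Q(s,a) <- Q(s,a) + alpha*[r + gamma*max_a' Q(s',a') - Q(s,a)]"""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def choose_action(s):
    """Epsilon-greedy action selection."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(Q[s].argmax())                 # exploit

update(s=0, a=1, r=1.0, s_next=2)             # example transition
print(Q)
print(choose_action(0))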
4. Convergence in RL Algorithms
Question: Discuss the concept of "convergence" in the context of RL
algorithms. What does it mean for an algorithm to converge?
Answer:
• Definition: An RL algorithm converges when its value function or policy
stabilizes to an optimal or near-optimal solution, meaning further updates
do not significantly change the learned behavior.
Requirements for Convergence:
• Sufficient Exploration: The agent must explore all relevant states and
actions repeatedly.
• Learning Rate Decay: The learning rate α should decrease over time to
ensure stability.
• Discount Factor: Future rewards should be discounted (γ < 1) to
guarantee finite returns.
Challenges: Non-convergence can occur due to:
• Partial observability of states.
• Non-stationary environments (e.g., changing dynamics).
• Function approximation errors (e.g., in deep RL).
5. Challenges in Real-World RL Applications
Question: What are some of the main challenges and issues faced in applying
reinforcement learning to real-world problems?
Answer:
1. Sample Inefficiency: RL often requires a large number of interactions
with the environment, making training slow and costly, especially in real-
world settings like robotics.
2. Credit Assignment: Determining which actions are responsible for long-
term rewards is difficult, particularly in tasks with delayed feedback (e.g.,
medical treatments).
3. Exploration Risks: Random exploration can be dangerous or impractical
in safety-critical domains like autonomous driving or healthcare.
4. Non-Stationarity: Real-world environments often change over time,
requiring adaptive algorithms.
5. Sim-to-Real Gap: Policies trained in simulations may fail when
deployed in the real world due to unrealistic assumptions or unmodeled
dynamics.
Solutions:
• Transfer Learning: Leverage knowledge from simulated environments
to real-world tasks.
• Imitation Learning: Use expert demonstrations to guide the agent's
behavior.
• Safe Exploration: Constrain exploration to avoid hazardous actions.
10 marks :
1. Q-Learning Algorithm (10 Marks)
Question: Explain in detail the Q-learning algorithm, including its
mathematical formulation, working procedure, parameter significance, and
practical limitations. Provide a complete analysis suitable for a 10-mark answer.
Answer:
1. Definition: Q-learning is a model-free, off-policy reinforcement learning
algorithm that learns action-value functions to determine optimal policies.
2. Mathematical Formulation:
o Bellman equation: Q(s,a) ← Q(s,a) + α[r + γmaxQ(s',a') - Q(s,a)]
o Components: State (s), Action (a), Reward (r), Next state (s')
3. Learning Rate (α):
o Controls update magnitude (0 ≤ α ≤ 1)
o High values cause rapid but unstable learning
o Low values lead to slow but stable convergence
4. Discount Factor (γ):
o Determines future reward importance (0 ≤ γ < 1)
o γ=0: myopic agent, γ→1: long-term planning
5. Update Process:
o Initialize Q-table randomly
o Observe current state
o Select action (ε-greedy policy)
o Receive reward and observe new state
o Update Q-value
6. Convergence Properties:
o Guaranteed to converge under specific conditions
o Requires infinite visits to all state-action pairs
o Needs decreasing learning rate
7. Off-Policy Nature:
o Learns optimal policy while following exploratory policy
o Uses max operator for next state
8. Implementation Example:
o Grid world navigation
o Game playing agents
9. Advantages:
o Simple implementation
o Guaranteed convergence
o No environment model needed
10.Limitations:
o Curse of dimensionality
o Discrete state-action requirement
o Overestimation bias
o Exploration dependency
2. Exploration-Exploitation Dilemma (10 Marks)
Question: Analyze the exploration-exploitation trade-off in reinforcement
learning, discussing its theoretical basis, practical significance, and various
balancing strategies with examples.
Answer:
1. Fundamental Concept:
o Exploration: Gathering new information
o Exploitation: Using known information
2. Theoretical Basis:
o Multi-armed bandit problem
o Regret minimization framework
3. Significance:
o Critical for optimal policy discovery
o Impacts learning efficiency
o Affects final performance
4. ε-Greedy Strategy:
o Simple implementation
o Fixed exploration probability
o Example: 90% exploit, 10% explore
5. Decaying ε-Greedy:
o Reduces exploration over time
o Balances early exploration with later exploitation
6. Upper Confidence Bound (UCB):
o Mathematical confidence intervals
o Example: Ad recommendation systems
7. Thompson Sampling:
o Bayesian probability approach
o Example: Clinical trial allocations
8. Boltzmann Exploration:
o Softmax action selection
o Temperature parameter control
9. Optimistic Initialization:
o Encourages early exploration
o Example: Q-learning with high initial values
10.Comparative Analysis:
o Contextual bandits vs MDPs
o Computational requirements
o Performance metrics
3. Real-World RL Applications (10 Marks)
Question: Examine major real-world applications of reinforcement learning,
detailing implementation challenges and innovative solutions across different
domains.
Answer:
1. Robotics:
o Challenge: Sample efficiency
o Solution: Sim-to-real transfer learning
2. Game Playing:
o Challenge: State space complexity
o Solution: Deep Q-networks (DQN)
3. Recommendation Systems:
o Challenge: Dynamic preferences
o Solution: Contextual bandits
4. Autonomous Vehicles:
o Challenge: Safety constraints
o Solution: Constrained RL
5. Healthcare:
o Challenge: Delayed rewards
o Solution: Temporal credit assignment
6. Finance:
o Challenge: Non-stationarity
o Solution: Meta-learning approaches
7. Manufacturing:
o Challenge: Continuous control
o Solution: Policy gradient methods
8. Energy Systems:
o Challenge: Partial observability
o Solution: Recurrent architectures
9. Education:
o Challenge: Personalization
o Solution: Multi-armed bandits
10.Comparative Analysis:
o Domain-specific adaptations
o Performance benchmarks
o Future potential
4. Value-Based vs Policy-Based Methods (10 Marks)
Question: Conduct a comprehensive comparison between value-based and
policy-based reinforcement learning approaches, highlighting their respective
advantages, limitations, and ideal use cases.
Answer:
Feature-by-feature comparison of Value-Based methods (e.g., Q-learning, DQN), Policy-Based methods (e.g., REINFORCE, PPO), and Hybrid methods (Actor-Critic):
• Core Idea:
o Value-Based: Learns a value function (Q/V) to derive the optimal policy.
o Policy-Based: Directly optimizes the policy parameters.
o Hybrid: Combines a value function with a policy.
• Output:
o Value-Based: Q-values for actions or V-values for states.
o Policy-Based: A probability distribution over actions.
o Hybrid: Both value estimates and a policy.
• Action Space:
o Value-Based: Best for discrete actions.
o Policy-Based: Handles continuous actions naturally.
o Hybrid: Supports both types.
• Convergence:
o Value-Based: More stable but may overestimate values.
o Policy-Based: High variance but better local optima.
o Hybrid: Balanced variance/stability.
• Exploration:
o Value-Based: Requires explicit strategies (e.g., ε-greedy).
o Policy-Based: Inherits exploration from the stochastic policy.
o Hybrid: Automatic via policy entropy.
• Algorithm Examples:
o Value-Based: Q-learning, Deep Q Networks (DQN).
o Policy-Based: REINFORCE, PPO, TRPO.
o Hybrid: A3C, SAC, TD3.
• Advantages:
o Value-Based: Simple implementation; provable convergence.
o Policy-Based: Works in continuous spaces; better convergence behaviour.
o Hybrid: Best of both worlds; sample efficient.
• Disadvantages:
o Value-Based: Struggles with continuous actions; overestimation bias.
o Policy-Based: High variance; slower convergence.
o Hybrid: Complex implementation.
• Best Use Cases:
o Value-Based: Discrete control (e.g., games); tabular problems.
o Policy-Based: Robotics; continuous control.
o Hybrid: Complex environments; real-world tasks.
• Key Parameters:
o Value-Based: Learning rate (α), discount factor (γ).
o Policy-Based: Learning rate, entropy coefficient.
o Hybrid: Both value and policy parameters.
5. RL Challenges and Future Directions (10 Marks)
Question: Investigate the critical challenges and open problems in
contemporary reinforcement learning research, proposing potential solutions
and future development pathways.
Answer:
1. Sample Efficiency:
o Challenge: Real-world data requirements
o Solution: Model-based approaches
2. Continuous Control:
o Challenge: Infinite action spaces
o Solution: Policy gradient methods
3. Partial Observability:
o Challenge: Incomplete state information
o Solution: Memory networks
4. Reward Design:
o Challenge: Sparse/delayed rewards
o Solution: Inverse RL
5. Transfer Learning:
o Challenge: Cross-domain adaptation
o Solution: Meta-RL frameworks
6. Safety Concerns:
o Challenge: Exploration risks
o Solution: Constrained optimization
7. Multi-Agent Systems:
o Challenge: Non-stationarity
o Solution: Centralized training
8. Scalability Issues:
o Challenge: High-dimensional spaces
o Solution: Hierarchical RL
9. Theoretical Foundations:
o Challenge: Convergence guarantees
o Solution: Improved analysis methods
10.Future Directions:
o Neurosymbolic integration
o World model learning
o Human-in-the-loop systems
o Energy-efficient algorithms
o Explainable RL frameworks
MODULE – 3
2 marks
1. Define Machine Learning and explain its scope in modern applications.
Machine Learning (ML) is a subset of Artificial Intelligence that enables computers to learn
patterns from data without being explicitly programmed. ML algorithms improve their
performance on a task over time with experience (data).
Scope in Modern Applications:
Healthcare: Disease prediction, medical imaging analysis.
Finance: Fraud detection, credit scoring, algorithmic trading.
Retail & E-commerce: Customer segmentation, recommendation engines.
Transportation: Self-driving cars, route optimization.
NLP: Voice assistants, chatbots, translation systems.
Cybersecurity: Threat detection and anomaly detection.
2. What is the difference between Feature Selection and Feature Extraction?
Aspect       Feature Selection                          Feature Extraction
Definition   Selecting a subset of existing features    Creating new features from existing ones
Purpose      Remove irrelevant or redundant features    Reduce dimensionality while retaining variance
Techniques   Filter, Wrapper, Embedded methods          PCA, LDA, Autoencoders
Output       Subset of original features                New features (combinations of original ones)
Example      Removing constant features                 Combining pixels into principal components
3. Define covariance and explain its significance in data representation.
Covariance measures how much two random variables change together. A positive
covariance indicates that variables increase together, while a negative covariance indicates
inverse relation.
Formula:
Cov(X, Y) = (1/n) Σ (xi − x̄)(yi − ȳ)
Significance:
Helps understand relationships between features.
Used in Principal Component Analysis (PCA) to identify directions of maximum
variance.
Important for feature reduction and decorrelation.
4. What is the difference between Normalization and Standardization?
Aspect         Normalization                                 Standardization
Definition     Rescaling data to a range (usually 0 to 1)    Transforming data to have mean 0, std 1
Formula        x' = (x − min) / (max − min)                  z = (x − μ) / σ
When to Use    When features have different ranges           When data follows a Gaussian distribution
Sensitive to   Outliers                                      Less sensitive to outliers
Example Use    Image pixel data                              Input to algorithms like SVM, Logistic Reg.
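A sketch applying both rescalings to a toy column of values (the numbers are illustrative only):
import numpy as np

x = np.array([20.0, 30.0, 40.0, 50.0, 100.0])

normalized   = (x - x.min()) / (x.max() - x.min())   # min-max: rescaled to [0, 1]
standardized = (x - x.mean()) / x.std()              # z-score: mean 0, std 1
print(normalized)
print(standardized)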
5. Explain the importance of training, validation, and testing in machine
learning models.
Training Set: Used to train the model by adjusting weights/parameters.
Validation Set: Used for hyperparameter tuning and to prevent overfitting.
Test Set: Used to evaluate the final model performance on unseen data.
This separation ensures the model generalizes well and is not just memorizing the training
data.
6. What is the Bias-Variance trade off in model selection?
The Bias-Variance Trade-off is a key concept for balancing model complexity:
Term Description
Bias Error due to overly simplistic assumptions (underfitting)
Variance Error due to model sensitivity to training data (overfitting)
Goal Minimize both for good generalization
7. Differentiate between Holdout and Cross-validation techniques.
Aspect          Holdout Validation                              Cross-validation
Definition      Single split into training/testing sets         Multiple splits; rotates test/train roles
Data Usage      Less efficient (some data unused in training)   More efficient (each point gets tested)
Bias/Variance   Higher variance                                 Lower variance due to averaging
Common Type     70-30 or 80-20 split                            k-Fold (e.g., 5-fold, 10-fold)
8. What is overfitting? How can it be avoided?
Overfitting occurs when a model learns the training data—including noise—too well and
fails to generalize to new data.
Symptoms: High accuracy on training set, low accuracy on test set.
Avoidance Strategies:
Use simpler models
Apply regularization (L1, L2)
Use cross-validation
Prune decision trees
Dropout in neural networks
Add more training data
9. Explain underfitting in machine learning models with an example.
Underfitting is when a model is too simple to capture the underlying pattern in the data.
Example: Using a linear model to fit non-linear data (like quadratic trends).
Results:
Poor performance on both training and testing sets.
Low model accuracy and high bias.
10. What is the role of an adversarial learning system?
Adversarial Learning involves training models to be robust against data that is intentionally
perturbed to fool them.
Applications:
GANs (Generative Adversarial Networks): Generator and discriminator compete.
Security: Training models that resist adversarial attacks.
Robustness testing in real-world ML deployments.
11. Define Meta-Learning and mention two of its types.
Meta-Learning is “learning to learn”—algorithms that learn how to adapt quickly to new
tasks with few training examples.
Types:
1. Model-based: Use models like RNNs to learn parameter updates.
2. Optimization-based: Modify optimization process (e.g., MAML – Model-Agnostic
Meta-Learning).
12. Differentiate between Bagging and Boosting in Ensemble Learning.
Aspect Bagging Boosting
Strategy Parallel training of models Sequential training of models
Goal Reduce variance Reduce bias
Weighting Equal weighting Weights adjusted based on errors
Examples Random Forest AdaBoost, XGBoost
Robustness Better for unstable models (e.g., trees) Good for weak learners
13. What is Transfer Learning, and why is it useful?
Transfer Learning involves reusing a pre-trained model on a new, but related task. Instead
of training from scratch, models leverage learned representations from a large dataset.
Usefulness:
Speeds up training
Requires less data
Improves accuracy for small or specialized datasets
E.g., using ImageNet-trained CNNs for medical image classification.
14. Explain the role of orthogonality in feature representation.
Orthogonality ensures that features are linearly independent and uncorrelated.
Benefits:
Reduces multicollinearity
Improves numerical stability
In PCA, principal components are orthogonal, capturing max variance in distinct
directions.
15. Why is handling missing data important in machine learning?
Missing data can:
Bias results
Reduce accuracy
Mislead training algorithms
Handling Techniques:
Deletion: Remove rows/columns with missing data
Imputation: Fill missing values using mean, median, KNN
Modelling: Use models that can handle missing values directly (e.g., XGBoost)
16. How does reinforcement learning differ from supervised and unsupervised
learning?
Aspect     Supervised Learning          Unsupervised Learning                  Reinforcement Learning
Input      Labelled data                Unlabelled data                        States, actions, and rewards
Goal       Predict output               Find structure (clusters, patterns)    Learn a policy to maximize reward
Feedback   Direct (target value)        No feedback                            Delayed, reward-based
Examples   Classification, regression   Clustering, dimensionality reduction   Game playing, robotics, recommendation
5 marks
17. Given a dataset with missing values, discuss three different strategies to
handle them.
1. Deletion (List wise or Column-wise): Remove rows (or sometimes columns) that
contain missing values. This is simple and effective if the proportion of missing data
is very small but can lead to information loss if applied to large datasets.
2. Imputation: Fill in missing values with statistical estimates such as:
o Mean/Median/Mode Imputation for numerical/categorical features.
o K-Nearest Neighbors (KNN) or Regression Imputation for more accurate
predictions based on other correlated features.
3. Predictive Modelling: Build a model to estimate the missing values using complete
records. For instance, missing income values can be predicted using variables like
age, education, and job type.
These techniques help preserve data volume and reduce bias introduced by missingness.
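A sketch of the first two strategies on a toy DataFrame (assumes pandas; the predictive-modelling strategy is only noted in a comment):
import pandas as pd

df = pd.DataFrame({"age": [25, 30, None, 45], "income": [50, None, 70, 80]})

dropped = df.dropna()                            # 1. deletion of incomplete rows
imputed = df.fillna(df.mean(numeric_only=True))  # 2. mean imputation per column
# 3. predictive modelling would fit a model (e.g., regression) on the complete
#    rows to estimate the missing entries; omitted here for brevity.
print(dropped)
print(imputed)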
18. How does noise in data affect machine learning models, and what
techniques can be used to reduce it?
Noise in data refers to random errors, irrelevant information, or outliers that obscure true
patterns. It can:
Lead to overfitting, where the model learns the noise as if it were a true pattern.
Reduce model accuracy and generalization on unseen data.
Techniques to reduce noise:
Data cleaning: Detecting and removing outliers, correcting errors, and resolving
inconsistencies.
Smoothing methods: Such as binning, moving averages, or clustering-based
smoothing.
Regularization: Penalizes overly complex models, making them less sensitive to
noise.
Ensemble methods: Combine multiple models to average out noise effects and
improve robustness.
These approaches lead to cleaner, more reliable input data and better model performance.
19. Compare and contrast k-fold cross-validation and random sampling in
model selection.
Criteria        k-Fold Cross-Validation                                   Random Sampling (Holdout)
Data Split      Divides data into k equal parts/folds                     Randomly splits into training and test sets
Usage           Each fold is used once for testing, k−1 for training      Data is split once, not reused
Efficiency      More computationally intensive                            Faster but less reliable
Reliability     More reliable and consistent, especially on small data    Can produce high variance in results
Bias/Variance   Reduces both bias and variance through averaging          More prone to variance due to single split
Conclusion: k-fold cross-validation gives a more stable estimate of model performance
compared to a single random split.
20. Explain how an imbalanced dataset can impact model performance.
Suggest two strategies to handle it.
Imbalanced datasets occur when one class significantly outnumbers others (e.g., 95% non-
fraud vs. 5% fraud). This causes:
The model to be biased toward the majority class, ignoring the minority.
Misleading metrics like high accuracy but poor recall or precision for the minority
class.
Strategies to handle imbalance:
1. Resampling techniques:
o Oversampling the minority class (e.g., SMOTE – Synthetic Minority
Oversampling Technique).
o Undersampling the majority class to equalize distribution.
2. Algorithmic solutions:
o Use cost-sensitive learning, where misclassifying the minority class carries a
higher penalty.
o Use classifiers that support class weighting, like decision trees or logistic
regression with class_weight='balanced'.
These methods help in improving performance metrics like F1-score, especially on the
minority class.
21. How does feature selection improve machine learning models? Provide an
example.
Feature selection involves identifying and retaining the most relevant features in a dataset. It
improves models by:
Reducing overfitting, as the model focuses only on informative variables.
Enhancing accuracy by eliminating noise and redundancy.
Speeding up training and inference, which is crucial for large datasets.
Improving interpretability, especially for models used in sensitive domains like
healthcare.
Example: In diagnosing diabetes, selecting features like glucose level, insulin, and BMI may
yield better results than using unrelated data like patient ID or name.
Effective feature selection can lead to simpler, faster, and more accurate models.
22. Impact of Outliers on Classification Models and Techniques to Handle
Them
Impact:
Skewed decision boundaries: Outliers can mislead classifiers like SVM or KNN,
shifting the boundaries.
Increased variance: Especially in models sensitive to distance (e.g., KNN), outliers
can cause overfitting.
Reduced accuracy: Misclassification of genuine data points due to outlier influence.
Handling Techniques:
1. Z-score or IQR method: Detect and remove extreme values.
2. Winsorization: Replace extreme values with nearest valid values.
3. Robust models: Use algorithms less sensitive to outliers (e.g., tree-based models).
4. Transformation: Log or Box-Cox transformations to reduce the effect of extreme
values.
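A sketch of the IQR rule mentioned above (values outside the 1.5×IQR fences are flagged; the toy data is illustrative):
import numpy as np

x = np.array([10, 12, 11, 13, 12, 11, 95])    # 95 is an obvious outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]
print(outliers)                                # -> [95]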
23. How Boosting Improves Model Accuracy Compared to Bagging
Aspect               Bagging                                       Boosting
Goal                 Reduce variance                               Reduce bias
Approach             Train models in parallel on random subsets    Train sequentially, focusing on errors
Examples             Random Forest                                 AdaBoost, Gradient Boosting, XGBoost
Impact on Accuracy   Good for high-variance models                 Better accuracy for weak learners
Boosting focuses on learning from past mistakes. Each new model corrects the errors of the
previous ones, leading to a strong ensemble that performs well, especially on complex
datasets.
24. PCA in Feature Extraction with Highly Correlated Features
Problem: Highly correlated features (e.g., height & weight) introduce redundancy, which
can:
Increase dimensionality unnecessarily.
Cause overfitting in models.
Solution via PCA:
PCA transforms correlated features into uncorrelated principal components.
These components maximize variance while reducing dimensions.
The first few principal components capture most of the information.
Outcome: Cleaner, compressed input data that improves model efficiency and
performance.
25. Using SMOTE with KNN in Imbalanced Email Classification
SMOTE (Synthetic Minority Over-sampling Technique):
Creates synthetic samples of the minority class (Spam) by interpolating between
existing ones.
Impact on KNN:
Balancing the dataset helps KNN assign more meaningful neighborhoods.
Improved recall for the minority class (Spam) by reducing bias toward the majority.
Effect on Decision Boundary:
SMOTE can smooth and expand the decision boundary around the minority class,
making KNN better at generalizing.
Potential Bias:
Synthetic samples may overlap with other classes, causing class ambiguity.
If noise exists in the minority class, SMOTE might amplify errors.
26. Advantages and Challenges of SMOTE with KNN
Advantages:
Balances class distribution, allowing KNN to give fair weight to minority (Spam)
class.
Enhances recall and F1-score by reducing bias toward Not-Spam.
Challenges:
KNN’s distance-based nature may treat synthetic points as overly influential,
especially in sparse regions.
Overfitting risk: Generated samples may be too similar, reducing diversity.
Computational cost: KNN is slow on large SMOTE-augmented datasets.
Best Practice: Use SMOTE + Edited Nearest Neighbor (ENN) to clean overlapping
regions after synthetic sampling.
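An illustrative pipeline sketch (assumes the imbalanced-learn and scikit-learn packages; the synthetic data and 10%/90% split are made up for demonstration):
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([1] * 10 + [0] * 90)          # 10% "Spam", 90% "Not-Spam"

# Oversample the minority class, then fit k-NN on the balanced data
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_res, y_res)
print(np.bincount(y_res))                   # classes are now balanced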
27. Feature Extraction in Correlated Height-Weight Dataset
Dataset Insight: Height and weight are highly correlated (multicollinearity issue).
Problem: Redundancy affects models like regression, which assume independence among
features.
Feature Extraction Solutions:
PCA: Converts Height and Weight into orthogonal components (e.g., Body Size 1
and Body Size 2).
Manual Feature Engineering: Create a new feature like BMI = Weight / (Height)^2.
Benefit: Reduced dimensionality, improved model interpretability, and less risk of overfitting
due to redundant data.
28. Role of Covariance in Feature Selection and Dimensionality Reduction
Covariance measures how two features vary together.
High covariance indicates multicollinearity, which can inflate model coefficients
and reduce stability.
In Dimensionality Reduction:
PCA uses the covariance matrix to identify directions (principal components) with
maximum variance and low redundancy.
Example:
In a housing price model, highly correlated features like square footage and number of
rooms can confuse a linear regression model. PCA or feature selection helps mitigate
this by retaining only independent, informative features.
29. Effect of Increasing Features on Bias-Variance Trade-off & Solutions
Effect:
Adding features usually reduces bias (more model flexibility).
But it increases variance (the model overfits on noisy/unnecessary features).
This leads to the curse of dimensionality, especially in distance-based models like
KNN.
Techniques to Handle:
1. Feature Selection: Filter, wrapper, or embedded methods to retain only relevant
features.
2. Regularization: Lasso (L1) to shrink less important coefficients to zero.
3. Dimensionality Reduction: PCA or t-SNE to capture most variance in fewer
features.
4. Cross-validation: To monitor overfitting during feature addition.
30. Challenges of Meta-Learning and Role of Zero-shot & One-shot Learning
Challenges of Meta-Learning:
Data diversity: Meta-learners need varied tasks for generalization.
Task formulation: Defining tasks and task distributions is complex.
Compute cost: Training over tasks is computationally intensive.
Transferability: Learned meta-knowledge may not generalize across domains.
Zero-shot Learning:
Classifies unseen classes using semantic embedding (e.g., word vectors).
Solves problems where labelled examples are unavailable.
One-shot Learning:
Learns from only one example per class using similarity metrics (e.g., Siamese
Networks).
Useful in medical diagnosis, fraud detection, etc., where data is limited.
Summary: Meta-learning aims to "learn to learn", and zero/one-shot learning help address
data scarcity and rapid adaptation, which are key in real-world AI deployment.
31. Tumour Prediction Model Evaluation
Given:
Predicted Malignant Predicted Benign Total
Actual Malignant 15 (TP) 3 (FN) 18
Actual Benign 7 (FP) 75 (TN) 82
Total 22 78 100
a) Error Rate:
(FP + FN) / Total = (7 + 3) / 100 = 0.10 (10%)
b) Kappa Statistic:
1. Observed Accuracy = (TP + TN) / Total = (15 + 75) / 100 = 0.90
2. Expected Accuracy = [(18 × 22) + (82 × 78)] / 100² = (396 + 6396) / 10000 = 0.6792
3. Kappa = (0.90 − 0.6792) / (1 − 0.6792) ≈ 0.688
c) Sensitivity (Recall for Malignant):
TP / (TP + FN) = 15 / 18 ≈ 0.833
d) Precision (for Malignant):
TP / (TP + FP) = 15 / 22 ≈ 0.682
e) F-Measure:
2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.682 × 0.833) / (0.682 + 0.833) ≈ 0.75
32. k-NN Classification for Sayan (Weight=56, Speed=10)
Use Euclidean Distance:
Name Weight Speed Class Distance
Nitesh 55 9 Average √(1² + 1²) = 1.41
Gurpreet 58 8 Poor √(2² + 2²) = 2.83
Goutam 60 7.5 Poor √(4² + 2.5²) ≈ 4.72
Gulshan 59 8.5 Average √(3² + 1.5²) ≈ 3.35
Mohit 57 10 Good √(1² + 0²) = 1.0
Sahil 53 10.5 Good √(3² + 0.5²) ≈ 3.04
Samyak 53 10 Good √(3² + 0²) = 3.0
Nearest 3 neighbors:
Mohit (Good) – 1.0
Nitesh (Average) – 1.41
Samyak (Good) – 3.0
Prediction by majority class: Good
33. a) Entropy of the Training Examples
Positive: Instances 1, 2, 4 → 3
Negative: Instances 3, 5, 6 → 3
Entropy = −(3/6)log2(3/6) − (3/6)log2(3/6) = 1.0
b) Reduced Error Pruning in Decision Trees
Purpose: Avoid overfitting by removing branches that don’t improve accuracy on a
validation set.
Effect: Simplifies the tree, improves generalization, reduces variance.
Method: Iteratively remove subtrees and replace with leaves if error on validation set does
not increase.
34. a) Drawbacks of K-NN and Corrections
Drawback                                Explanation                                    Solutions
Sensitive to scale                      Larger-scale features dominate                 Normalize/standardize data
High computation                        Requires calculating distance to all points    Use KD-Trees or Ball Trees
Affected by irrelevant/noisy features   Irrelevant features reduce accuracy            Use feature selection techniques
Choice of k is critical                 Low k overfits; high k underfits               Use cross-validation to tune k
b) Nearest Neighbor Classification using Euclidean Distance
Given:
Class A: (1, 2, 3)
Class B: (7, 4, 5)
Class C: (6, 2, 1)
Unknown: (3, 4, 5)
Distances:
To A: √[(3–1)² + (4–2)² + (5–3)²] = √(4+4+4) = √12 ≈ 3.46
To B: √[(3–7)² + (4–4)² + (5–5)²] = √(16) = 4
To C: √[(3–6)² + (4–2)² + (5–1)²] = √(9+4+16) = √29 ≈ 5.38
Nearest neighbor is Class A, so classification: Class A
35. Student Campus Attendance (13-day Dataset - ID3/Decision Tree)
General approach (the full 13-day dataset and a worked solution follow below):
Attributes: Weather (e.g., Sunny, Rainy), Wake-up Time (e.g., Early, Late), Seminar (Yes/No)
Target: Attend Campus (Yes/No)
Steps:
1. Calculate Entropy of the full dataset.
2. For each attribute, compute Information Gain.
3. Choose the attribute with the highest gain to split.
4. Recursively apply to subsets until:
o All samples in a subset belong to the same class.
o No more attributes left.
This forms a decision tree used to predict whether the student will attend campus.
Let's solve Question 35 step by step:
Dataset (13 Days)
Day Wake-up Have Talk Weather Go To College
D1 Normal No Sunny No
D2 Normal No Rain No
D3 Early No Sunny Yes
D4 Late No Sunny Yes
D5 Late Yes Sunny Yes
D6 Late Yes Rain No
D7 Early Yes Rain Yes
D8 Normal No Sunny No
D9 Normal Yes Sunny Yes
D10 Late Yes Sunny Yes
D11 Normal Yes Rain Yes
D12 Early No Rain Yes
D13 Early Yes Sunny Yes
Step 1: Calculate Root Entropy
We have:
Total: 13, Yes: 9, No: 4
Entropy(S) = −(9/13)log2(9/13) − (4/13)log2(4/13) ≈ 0.890
Step 2: Calculate Information Gain for Each Attribute
Attribute: Wake-up
Weighted child entropy = (4/13)×0 + (5/13)×0.971 + (4/13)×0.811 ≈ 0.623
Gain(Wake-up) = 0.890 − 0.623 ≈ 0.267
Attribute: Have Talk
Weighted child entropy = (7/13)×0.592 + (6/13)×1.0 ≈ 0.780
Gain(Have Talk) = 0.890 − 0.780 ≈ 0.110
Attribute: Weather
Weighted child entropy = (8/13)×0.811 + (5/13)×0.971 ≈ 0.873
Gain(Weather) = 0.890 − 0.873 ≈ 0.017
Step 3: Choose Attribute with Highest Gain
Wake-up: 0.267
Have Talk: 0.110
Weather: 0.017
So, split on Wake-up.
Step 4: Build the Decision Tree
Wake-up
/ | \
Early Normal Late
| | |
Yes Have Talk Weather
/ \ / \
No Yes Sunny Rain
No Yes Yes No
Early → All Yes
Normal → Split on Have Talk
o No → No
o Yes → Yes
Late → Split on Weather
o Sunny → Yes
o Rain → No
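The entropy and information-gain calculations in Steps 1 and 2 can be checked with a short script; a sketch using the Have Talk column and the Go To College target from the table above:
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(attribute, labels):
    # Root entropy minus the weighted entropy of each attribute-value subset
    total = entropy(labels)
    weighted = 0.0
    for value in np.unique(attribute):
        subset = labels[attribute == value]
        weighted += len(subset) / len(labels) * entropy(subset)
    return total - weighted

have_talk = np.array(["No","No","No","No","Yes","Yes","Yes","No","Yes","Yes","Yes","No","Yes"])
go        = np.array(["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes"])
print(round(information_gain(have_talk, go), 3))   # ≈ 0.11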