Module 3 and 4

The document discusses informed search strategies in artificial intelligence, focusing on Best-First Search, Greedy Best-First Search, and A* search algorithms. It explains how heuristics enhance search efficiency and outlines the characteristics, limitations, and optimality conditions of these algorithms. Additionally, it covers memory-efficient variants like Iterative Deepening A* and Recursive Best-First Search, highlighting their features and practical limitations.
Intelligent Systems and Machine Learning Algorithms (BEC515A) Prepared by: Chetan B V, Asst. Prof., ECE, GMIT, Davangere

Informed Search Strategies


Informed search strategies utilize problem-specific knowledge to enhance the efficiency of finding
solutions, contrasting with uninformed search methods that rely solely on the problem's structure.

Best-First Search
Best-first search is a key informed search strategy that employs an evaluation function, f(n), to
prioritize which node to expand next. The node with the lowest evaluation is expanded first, making
the search process more efficient.
• Evaluation Function: The function f(n) serves as a cost estimate. The best-first search can be
implemented similarly to uniform-cost search, but it uses f instead of g (the cost to reach node n)
to order the nodes in the priority queue.
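To make this concrete, here is a minimal Python sketch of best-first search with a pluggable evaluation function f. The goal_test and successors hooks are hypothetical problem-specific callbacks, not part of the text:

```python
import heapq
import itertools

def best_first_search(start, goal_test, successors, f):
    """Generic best-first search: always expand the node with the lowest f.
    successors(state) yields (next_state, step_cost) pairs (assumed interface)."""
    counter = itertools.count()            # tie-breaker so heapq never compares states
    frontier = [(f(start, 0), next(counter), start, 0)]
    best_g = {start: 0}                    # cheapest known cost to each state
    while frontier:
        _, _, state, g = heapq.heappop(frontier)
        if goal_test(state):
            return state, g
        for nxt, cost in successors(state):
            g2 = g + cost
            if nxt not in best_g or g2 < best_g[nxt]:
                best_g[nxt] = g2
                heapq.heappush(frontier, (f(nxt, g2), next(counter), nxt, g2))
    return None

# Uniform-cost search: f = lambda state, g: g
# Greedy best-first:   f = lambda state, g: h(state)
# A*:                  f = lambda state, g: g + h(state)
```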

Heuristic Function
A critical component of f(n) is the heuristic function h(n):
• Heuristic Definition: h(n) represents the estimated cost of the cheapest path from the current
state at node n to a goal state. It only depends on the state at node n, not on the path taken to
get there.
• Example: In the context of a route-finding problem, such as determining the shortest path from
Arad to Bucharest, h(n) could be represented by the straight-line distance between these two
cities.

Characteristics of Heuristics
Heuristics play a vital role in informing the search algorithm by guiding it toward promising paths:
1. Nonnegative: Heuristic values are generally nonnegative, since they estimate a remaining cost, which cannot be negative.
2. Goal State: If n is a goal node, then h(n) = 0.

Search Strategies Utilizing Heuristics


The choice of heuristic significantly influences the search strategy employed. Various strategies can
be derived from best-first search by modifying f(n):
• A*: Combines both the cost to reach the node (g(n)) and the heuristic estimate (h(n)) in its
evaluation function: f(n)=g(n)+h(n).
• Greedy Best-First Search: Uses only the heuristic for its evaluation: f(n)=h(n), focusing solely on
the estimated cost to the goal.


These strategies can lead to more efficient searches by effectively narrowing down the most promising
paths based on the additional knowledge provided by heuristics.

Greedy Best-First Search


Greedy best-first search is an informed search strategy that prioritizes nodes based on their proximity
to the goal. The evaluation function used is simply the heuristic function f(n)=h(n), meaning it selects
nodes that appear closest to the target based solely on heuristic estimates.

How It Works
In route-finding problems, such as navigating from Arad to Bucharest in Romania, the straight-line
distance heuristic (denoted as hSLD) is employed. For instance, hSLD(Arad) = 366 kilometers indicates
the direct distance to Bucharest.

Fig. Values of hSLD—straight-line distances to Bucharest.

Node Expansion: The algorithm expands the node that is closest to the goal based on the heuristic.
For example, starting from Arad:
o The first node expanded is Sibiu, as it is closer to Bucharest than Zerind or Timisoara.
o Next, it expands Fagaras, which leads directly to Bucharest.
In this case, the greedy best-first search finds a solution efficiently without exploring unnecessary
nodes. However, the chosen path (via Sibiu and Fagaras) is not optimal; an alternative route through
Rimnicu Vilcea and Pitesti is actually shorter by 32 kilometers.

Characteristics and Limitations


• Greediness: The term "greedy" reflects the strategy's tendency to make locally optimal choices at
each step, aiming for immediate proximity to the goal.
• Incompleteness: The algorithm can fail to find a solution in certain scenarios. For example, if
searching from Iasi to Fagaras:
o The heuristic might suggest expanding Neamt first, which is a dead end.


o The optimal route would require going to Vaslui first, but greedy best-first search would
loop back to Iasi, leading to an infinite cycle.
• Complexity:
o The worst-case time and space complexity for the tree version of greedy best-first search is O(b^m), where b is the branching factor and m is the maximum depth of the search space.
o However, a good heuristic can significantly reduce complexity, depending on its
effectiveness and the specific problem being solved.

Fig. Stages in a greedy best-first tree search for Bucharest with the straight-line distance heuristic hSLD.
Nodes are labeled with their h-values.


A* search: Minimizing the total estimated solution cost

Fig. Stages in an A∗ search for Bucharest. Nodes are labeled with f = g +h. The h values are the straight-
line distances to Bucharest.


A* search is a widely used informed search algorithm that combines the cost to reach a node (g(n))
and the estimated cost to the goal (h(n)) using:
f(n)=g(n)+h(n)

Key Features
1. Path Cost Estimation: A* prioritizes nodes with the lowest f(n), representing the estimated total
cost of the cheapest solution through that node.
2. Optimal and Complete: A* is complete (it finds a solution if one exists) and optimal (it finds the
least-cost solution) if the heuristic h(n) is admissible and consistent.
3. Comparison to Uniform-Cost Search: A* extends uniform-cost search by incorporating heuristic
information.

Conditions for Optimality: Admissibility and Consistency


Admissibility
An admissible heuristic h(n) never overestimates the cost to reach the goal, ensuring that f(n) = g(n) +
h(n) does not exceed the true solution cost. For example, the straight-line distance heuristic is
admissible as it reflects the shortest distance.
Consistency
A heuristic h(n) is consistent if, for every node n and successor n′:
h(n) ≤ c(n,a,n′) + h(n′), where c(n,a,n′) is the step cost of going from n to n′ via action a.
This is a form of the triangle inequality: the estimate at n can never exceed the cost of one step plus the estimate at the successor.
Key Relationship
All consistent heuristics are admissible, but not all admissible heuristics are consistent. Most
heuristics used in A* search are both, making them effective for optimal pathfinding.
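For small, fully specified problems, both conditions can be checked directly. A minimal sketch, assuming the heuristic and the graph are given as plain Python dictionaries and edge lists (illustrative names only):

```python
def is_admissible(h, true_cost):
    """h[s] is the heuristic value of state s; true_cost[s] is the actual
    cheapest cost from s to a goal. Admissible: h never overestimates."""
    return all(h[s] <= true_cost[s] for s in h)

def is_consistent(h, edges):
    """edges is a list of (n, step_cost, n2) triples.
    Consistent: h(n) <= c(n, a, n2) + h(n2) for every successor n2 of n."""
    return all(h[n] <= c + h[n2] for n, c, n2 in edges)
```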

Optimality of A*
A* search has two optimality properties:
1. Tree-Search Version: Optimal if the heuristic h(n) is admissible.
2. Graph-Search Version: Optimal if h(n) is consistent.

Key Concepts
• Nondecreasing f(n): If h(n) is consistent, then f(n) values along any path do not decrease. This
ensures that when A* selects a node for expansion, it has found the optimal path to that node.


• Expansion Order: A* expands nodes in nondecreasing order of f(n), meaning the first goal node
expanded is guaranteed to be optimal.
Search Behavior
• Nodes are organized in concentric "contours" based on f(n) values. A* expands all nodes with
f(n)<C∗ (optimal path cost) and may also expand some on the "goal contour" where f(n)=C∗.

Fig. Map of Romania showing contours at f = 380, f = 400, and f = 420, with Arad as the start state.
Nodes inside a given contour have f-costs less than or equal to the contour value.

Completeness and Efficiency


• A* is complete if there are finitely many nodes with cost ≤ C∗. Because A* never expands nodes with f(n) > C∗, those nodes are pruned without sacrificing optimality.
• A* is optimally efficient among algorithms using the same heuristic, meaning it expands the fewest
nodes possible under optimal conditions.
Complexity Considerations
• The time complexity of A* can be exponential in the maximum absolute error of the heuristic; with constant step costs it is O(b^Δ), or O(b^(ϵd)) where d is the solution depth.
The absolute error is defined as Δ ≡ h∗ − h, where h∗ is the actual cost of getting from the root to the goal, and the relative error is defined as ϵ ≡ (h∗ − h)/h∗.
• In graph scenarios, the number of states with f(n)<C∗ can grow exponentially, complicating
optimality.


Practical Limitations
• A* can be impractical for large-scale problems due to high memory usage; it often runs out of
space before time.
• Variants of A* can find suboptimal solutions faster, and more accurate but non-admissible
heuristics can be developed.
Despite its limitations, A* remains a powerful search algorithm, particularly when coupled with good
heuristics.

Memory-bounded heuristic search


Iterative-deepening A∗ (IDA∗)
Iterative Deepening A* (IDA*) combines the memory efficiency of iterative deepening with the
heuristic guidance of A*. Unlike A*, which stores all nodes in a priority queue, IDA* limits memory
usage by performing a series of depth-first searches, each with an increasing f-cost cutoff
f(n)=g(n)+h(n).

Key Features:
• Iterative Deepening: IDA* increases the f-cost cutoff after each iteration based on the minimum
f-cost found in the previous iteration.
• Memory Efficient: It doesn't require storing all nodes in memory, making it space-efficient
compared to A*.
• No Priority Queue: IDA* uses a depth-first search strategy without the need for a sorted open list.

Limitations:
• Real-Valued Costs: IDA* struggles with real-valued costs, leading to inefficient repeated searches
and slower performance.
• Redundant Work: The algorithm may revisit the same parts of the search space multiple times,
which can reduce its overall efficiency.
IDA* is useful for large state spaces with unit costs where memory is a concern but can be inefficient
with non-integer or real-valued costs.

Example Process of IDA*:


1. Start with an initial cutoff (say, the f-cost of the start node).
2. Run a depth-first search using this cutoff, expanding nodes with an f-cost smaller than or equal to
the cutoff.


3. If the goal node is found, return the solution.


4. If no solution is found, increase the cutoff to the smallest f-cost found during the current search
and repeat the process.
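The process above translates almost directly into code. A sketch, where successors and h are assumed problem-specific hooks; a cycle check along the current path is added so the depth-first search cannot loop:

```python
import math

def ida_star(start, goal_test, successors, h):
    """Iterative-deepening A*: repeated depth-first searches bounded by an
    increasing f-cost cutoff."""
    def dfs(state, g, bound, path):
        f = g + h(state)
        if f > bound:
            return f, None                  # report the smallest overshoot
        if goal_test(state):
            return f, list(path)
        minimum = math.inf
        for nxt, cost in successors(state):
            if nxt in path:                 # avoid cycles on the current path
                continue
            path.append(nxt)
            t, solution = dfs(nxt, g + cost, bound, path)
            path.pop()
            if solution is not None:
                return t, solution
            minimum = min(minimum, t)
        return minimum, None

    bound = h(start)                        # initial cutoff: f-cost of the root
    while True:
        t, solution = dfs(start, 0, bound, [start])
        if solution is not None:
            return solution
        if t == math.inf:
            return None                     # no solution exists
        bound = t                           # next cutoff: smallest f that exceeded it
```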

Recursive Best-First Search (RBFS)


Recursive Best-First Search (RBFS) is a memory-efficient variant of best-first search that uses linear
space. It mimics the structure of depth-first search but incorporates an f-limit to guide the search.
Nodes are expanded based on their f(n)=g(n)+h(n) values, and the algorithm backtracks when a
node’s f-value exceeds the current limit. As it backtracks, RBFS updates the backed-up values of nodes
along the path to store the best f-value of their children, helping avoid redundant re-expansions.
Key Features:
• Memory Efficiency: RBFS uses linear space, unlike A* which requires exponential space.
• Backtracking: If the current path's f-value becomes worse than the best alternative, RBFS backtracks and explores alternative paths.
• Optimality: RBFS is optimal if the heuristic is admissible (doesn’t overestimate the cost to the
goal).
• Time Complexity: It can suffer from excessive node regeneration and re-expansion, leading to
high time complexity, especially with poor heuristics or redundant paths.
Comparison with IDA*:
• Memory: Both use linear space, but RBFS retains more information (backed-up values).
• Time: Both face inefficiency due to re-expansion of nodes, but RBFS's backed-up values make it somewhat more efficient in practice.
Limitations:
• Node Regeneration: Re-expanding the same nodes multiple times increases time complexity.
• Redundant Paths: In graphs with many paths, RBFS may re-explore paths, leading to
inefficiency.
• Heuristic Dependence: Poor heuristics can degrade performance, making RBFS behave
similarly to uninformed search.
RBFS is useful for problems where memory is constrained, but it may still require significant time due
to repeated expansions.


Fig. The algorithm for recursive best-first search.


The steps of the Recursive Best-First Search (RBFS) algorithm:
1. Goal Test:
• If the current node is the goal, return the solution.
2. Generate Successors:
• Expand the current node by applying all possible actions to generate successor nodes.
3. Handle Empty Successors:
• If no successors are generated, return failure and set the f-limit to infinity.
4. Update f-values:
• For each successor, set its f-value to the maximum of g + h (path cost plus heuristic) and the parent node's f-value.
5. Find Best Node:
• Select the successor with the lowest f-value as the best node.
6. Check f-limit:
• If the best node’s f-value exceeds the current f-limit, return failure and the best f-value.
7. Find Alternative Node:
• Identify the second-lowest f-value among the successors as the alternative.
8. Recursive Call:
• Recursively call RBFS on the best node with the new f-limit (the minimum of the current f-
limit and the alternative f-value).
9. Return Result:
• If a solution is found, return the result. Otherwise, continue the search.
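The nine steps above map onto a short recursive implementation. A sketch under the same assumed successors/h interface used earlier; each successor record carries its backed-up f-value, state, and path cost g:

```python
import math

def rbfs(start, goal_test, successors, h):
    """Recursive best-first search (a sketch following the steps above)."""
    def search(state, g, f, f_limit):
        if goal_test(state):
            return state, f
        succs = []
        for nxt, cost in successors(state):
            g2 = g + cost
            # Step 4: child f-value is at least the parent's backed-up f
            succs.append([max(g2 + h(nxt), f), nxt, g2])
        if not succs:
            return None, math.inf
        while True:
            succs.sort(key=lambda s: s[0])
            best = succs[0]
            if best[0] > f_limit:
                return None, best[0]          # fail; report the best f-value
            alternative = succs[1][0] if len(succs) > 1 else math.inf
            result, best[0] = search(best[1], best[2], best[0],
                                     min(f_limit, alternative))
            if result is not None:
                return result, best[0]

    result, _ = search(start, 0, h(start), math.inf)
    return result
```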


Fig. Stages in an RBFS search for the shortest route to Bucharest. The f-limit value for each recursive
call is shown on top of each current node, and every node is labelled with its f-cost. (a) The path via
Rimnicu Vilcea is followed until the current best leaf (Pitesti) has a value that is worse than the best
alternative path (Fagaras). (b) The recursion unwinds and the best leaf value of the forgotten subtree
(417) is backed up to Rimnicu Vilcea; then Fagaras is expanded, revealing a best leaf value of 450. (c)
The recursion unwinds and the best leaf value of the forgotten subtree (450) is backed up to Fagaras;
then Rimnicu Vilcea is expanded. This time, because the best alternative path (through Timisoara)
costs at least 447, the expansion continues to Bucharest.

Simplified Memory-Bounded A*

SMA* (Simplified Memory-Bounded A*) is a memory-efficient variant of the A* algorithm, designed for situations where memory is limited. It works by expanding the best leaf node and discarding the worst leaf node when memory is full. The discarded node's value is backed up to its parent, allowing SMA* to retain useful information without storing the entire search tree.

Key points:
• Expansion and Deletion: Expands the best node, deletes the worst (highest f-value) node, and
backs up its value to its parent.
• Ties: If nodes have the same f-value, the newest best leaf is expanded, and the oldest worst leaf
is deleted.
• Completeness: SMA* is complete if the shallowest goal node’s depth is smaller than the memory
size.
• Optimality: SMA* returns the best solution within memory limits, but may not be optimal if the
entire path cannot fit in memory.
• Practical Use: Suitable for problems with limited memory, but can suffer from thrashing (repeated
regeneration of nodes), making it inefficient on harder problems.
SMA* offers a tradeoff between time and memory: it is less optimal than A*, but more feasible when
memory is the limiting factor.

Learning to search better


Metalevel learning improves search strategies.
• Metalevel State: Represents the internal state of a program (e.g., A*) performing a search, such
as the current search tree.
• Learning from Missteps: The agent learns from unhelpful steps (e.g., expanding irrelevant nodes)
and avoids exploring unpromising paths.
• Goal: Minimize the total problem-solving cost by optimizing the trade-off between computational
expense and path cost.
Metalevel learning allows an agent to refine its search process over time for greater efficiency.

Heuristic Functions

Fig. A typical instance of the 8-puzzle. The solution is 26 steps long.


In the 8-puzzle, the goal is to arrange tiles by sliding them into the empty space. The average solution
cost is about 22 steps, with a branching factor of 3. An exhaustive search would explore around
3.1×10¹⁰ states, but using a graph search reduces this to 181,440 reachable states.

To solve the puzzle efficiently using A*, a good heuristic is needed. Two common admissible heuristics
are:
• h₁ = Misplaced Tiles: Counts the number of tiles out of place. For the start state in Figure, h₁ = 8.
It’s admissible because each misplaced tile must be moved at least once.
• h₂ = Manhattan Distance: Sums the horizontal and vertical distances of each tile from its goal
position. For the start state, h₂ = 18. It’s admissible because each move brings tiles closer to the
goal.
h₂ = 3 + 1 + 2 + 2 + 2 + 3 + 3 + 2 = 18.
Both heuristics never overestimate the true solution cost 26, making them effective for guiding the
search algorithm.
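Both heuristics are straightforward to compute. A sketch that represents a puzzle state as a tuple of 9 integers read row by row, with 0 standing for the blank (a common convention, assumed here):

```python
def h1(state, goal):
    """Misplaced tiles: number of tiles (ignoring the blank) out of place."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def h2(state, goal):
    """Manhattan distance: sum of horizontal and vertical distances of each
    tile from its goal position (the blank is not counted)."""
    total = 0
    for idx, tile in enumerate(state):
        if tile == 0:
            continue
        goal_idx = goal.index(tile)
        total += abs(idx // 3 - goal_idx // 3) + abs(idx % 3 - goal_idx % 3)
    return total
```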

The effect of heuristic accuracy on performance

Fig. Comparison of the search costs and effective branching factors for the ITERATIVE-DEEPENING-
SEARCH and A∗ algorithms with h1, h2. Data are averaged over 100 instances of the 8-puzzle for each
of various solution lengths d.

The effective branching factor (b*) measures the quality of a heuristic by estimating the branching
factor of a uniform search tree that generates the same number of nodes as A* for a given problem. It
is calculated based on the total number of nodes (N) and solution depth (d).


Key Points:
• b* Formula: N + 1 = 1 + b* + (b*)^2 + ... + (b*)^d (a numerical solver for b* is sketched after this list).
• Heuristic Comparison:
o h₁ (Misplaced Tiles): A simple heuristic.
o h₂ (Manhattan Distance): The sum of horizontal and vertical distances from the goal.
Results:
• h₂ is much more efficient than h₁, with A* using h₂ expanding significantly fewer nodes (e.g., for
d=12, A* with h₂ expands 73 nodes, while A* with h₁ expands 227).
• Effective Branching Factor: A* with h₂ has a lower b* (e.g., 1.24 vs. 1.42 for h₁) and outperforms
iterative deepening search by orders of magnitude.
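The b* equation has no closed-form solution, so it is solved numerically. A minimal bisection sketch; note that the table entries are averages over 100 instances, so a single back-of-envelope computation will not match them exactly:

```python
def effective_branching_factor(N, d, tol=1e-6):
    """Solve N + 1 = 1 + b* + (b*)^2 + ... + (b*)^d for b* by bisection."""
    def total(b):
        return sum(b ** i for i in range(d + 1))   # 1 + b + ... + b^d
    lo, hi = 1.0, float(N)                         # b* lies between 1 and N
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total(mid) < N + 1:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# d = 12, N = 73 gives b* ≈ 1.26, in line with the ≈ 1.24 averaged in the table
print(round(effective_branching_factor(73, 12), 2))
```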

Domination of h₂:
• h₂ dominates h₁ because it consistently produces higher heuristic values, reducing the number of
nodes expanded by A*, making it more efficient.
In summary, h₂ (Manhattan distance) improves A* search efficiency by expanding fewer nodes, with a
lower effective branching factor, making it a superior heuristic compared to h₁ (misplaced tiles).

Generating Admissible Heuristics from Relaxed Problems


1. Relaxed Problems: A relaxed problem removes some constraints, making the problem easier to
solve and creating a supergraph of the original state space. For example:
o In the 8-puzzle, relaxing tile movement rules allows h₁ (misplaced tiles) and h₂ (Manhattan
distance) to give exact solutions for simplified versions of the puzzle.
2. Admissible Heuristics: The optimal solution of a relaxed problem provides an admissible heuristic
for the original problem, as it cannot overestimate the true solution cost.
3. Consistency: Relaxed problem heuristics are consistent because they obey the triangle inequality.
4. Automated Heuristic Generation: Programs like ABSOLVER can generate heuristics automatically
by relaxing problem constraints, creating better heuristics for problems like the 8-puzzle and
Rubik’s Cube.
5. Composite Heuristics: When multiple admissible heuristics (e.g., h₁ and h₂) are available, we can
combine them into a composite heuristic:
h(n) = max{h1(n), h2(n), …, hk(n)}
This composite heuristic is admissible, consistent, and dominates all individual heuristics,
ensuring better search efficiency.
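As a sketch, taking the pointwise maximum of the two 8-puzzle heuristics defined earlier (the goal layout below is hypothetical):

```python
def composite(heuristics):
    """h(n) = max of several admissible heuristics: still admissible and
    consistent (if each component is), and dominates each component."""
    return lambda n: max(h(n) for h in heuristics)

goal = (1, 2, 3, 4, 5, 6, 7, 8, 0)   # hypothetical goal layout
h = composite([lambda n: h1(n, goal), lambda n: h2(n, goal)])
```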


Generating admissible heuristics from subproblems: Pattern databases

Fig. A subproblem of the 8-puzzle instance. The task is to get tiles 1, 2, 3, and 4 into their correct
positions, without worrying about what happens to the other tiles.

Pattern databases (PDBs) generate admissible heuristics by storing exact solution costs for
subproblems of a larger problem, providing a lower bound for the full problem.
For example, in the 8-puzzle, a subproblem might involve solving only a subset of tiles (e.g., tiles 1, 2,
3, and 4). The cost of solving this subproblem is a lower bound for the entire puzzle, and using this
information can be more accurate than simpler heuristics like Manhattan distance.

PDB Construction: To create a PDB, perform a backward search from the goal for all possible
configurations of the relevant tiles, storing the solution cost for each configuration. This database is
then used to efficiently guide the search in future problem instances.
Combining PDBs: If two subproblems (e.g., tiles 1-2-3-4 and 5-6-7-8) don’t overlap, their heuristics can
be combined by summing the costs from their individual PDBs. This is the essence of disjoint pattern
databases (DPDBs), which are admissible because the subproblems are independent.
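A PDB is typically built once, offline, by breadth-first search backward from the goal over the abstracted state space. A minimal sketch, assuming unit move costs and an assumed predecessors hook that yields the abstract configurations one move away:

```python
from collections import deque

def build_pattern_db(goal_pattern, predecessors):
    """Backward breadth-first search from the abstract goal. With unit move
    costs, the first time a configuration is reached gives its exact
    (cheapest) solution cost."""
    db = {goal_pattern: 0}
    frontier = deque([goal_pattern])
    while frontier:
        s = frontier.popleft()
        for prev in predecessors(s):       # assumed problem-specific hook
            if prev not in db:
                db[prev] = db[s] + 1
                frontier.append(prev)
    return db

# During search, h(n) = db[abstract(n)] is an admissible lower bound.
```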

Benefits: DPDBs dramatically improve search efficiency. For instance, in the 15-puzzle, they can reduce
nodes explored by a factor of 1000, and for the 24-puzzle, by a factor of a million, compared to simpler
heuristics like Manhattan distance.

Limitations: DPDBs are highly effective for sliding-tile puzzles but less so for puzzles like Rubik's Cube,
where multiple pieces move simultaneously, making disjoint decomposition difficult.

Learning Heuristics from Experience


A heuristic function h(n) estimates the cost to reach the goal from state n.
One way to construct h(n) is by learning from experience—solving many instances of a problem (e.g.,
8-puzzles) and using the resulting data to predict solution costs for new states.


Learning from Optimal Solutions


By solving many optimal instances, you collect examples consisting of states from the solution path
and their actual solution costs. A learning algorithm can use these examples to build a heuristic
function that predicts the cost for new states. Techniques like neural networks, decision trees, and
reinforcement learning can be applied.

Features for Learning


Effective learning requires relevant features of a state. For example:
• Feature x1(n): Number of misplaced tiles, which helps predict how far a state is from the goal.
• Feature x2(n): Number of adjacent tile pairs that aren't adjacent in the goal state.
By analyzing data from multiple solved puzzles, you can observe how these features correlate with
solution costs. For instance, if x1(n)=5, the average solution cost might be 14 moves.

Combining Features
The features can be combined linearly to predict h(n):
h(n) = c1x1(n) + c2x2(n)
where c1 and c2 are constants adjusted to fit the data. These constants are usually positive, reflecting
the fact that more misplaced tiles or incorrect adjacent pairs make the puzzle harder to solve.
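The constants can be fit by least squares. A sketch with made-up training data; the feature values and costs below are purely illustrative:

```python
import numpy as np

# Hypothetical training data: feature vectors [x1, x2] for solved states,
# and the optimal solution costs observed for those states.
X = np.array([[5, 3], [8, 6], [2, 1], [6, 4]], dtype=float)
y = np.array([14, 20, 6, 16], dtype=float)

# Least-squares fit of h(n) = c1*x1(n) + c2*x2(n). There is no intercept,
# so h = 0 whenever both features are 0, as at a goal state.
c, *_ = np.linalg.lstsq(X, y, rcond=None)

def h(x1, x2):
    return c[0] * x1 + c[1] * x2
```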

Properties of the Learned Heuristic


• The heuristic satisfies h(n)=0 for goal states.
• It is not necessarily admissible or consistent because the learned function may overestimate
the actual solution cost.

Logical Agents
Knowledge-Based Agents
A knowledge-based agent (KBA) uses a knowledge base (KB) of sentences (assertions about the
world) to guide its actions. These sentences are written in a knowledge representation language and
can be axioms (taken as true without derivation).

Key Operations
• TELL: Adds knowledge to the KB, typically based on percepts (sensor input).
• ASK: Queries the KB to determine the best action based on current knowledge.


Both operations may involve inference, ensuring that answers logically follow from the KB without
fabricating information.

Fig. A generic knowledge-based agent. Given a percept, the agent adds the percept to its knowledge
base, asks the knowledge base for the best action, and tells the knowledge base that it has in fact taken
that action.

Agent Process
1. TELL: The agent adds its percept to the KB.
2. ASK: The agent queries the KB for the best action.
3. TELL: After acting, the agent updates the KB with the action taken.
This cycle repeats, with time t incremented each cycle.
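The cycle in the figure can be written as a small wrapper. A sketch assuming a kb object with tell and ask methods and three sentence-building helpers (all names are illustrative):

```python
def kb_agent(kb, make_percept_sentence, make_action_query, make_action_sentence):
    """Generic knowledge-based agent loop (a sketch of the figure above)."""
    t = 0
    def agent(percept):
        nonlocal t
        kb.tell(make_percept_sentence(percept, t))   # TELL the percept
        action = kb.ask(make_action_query(t))        # ASK for the best action
        kb.tell(make_action_sentence(action, t))     # TELL the action taken
        t += 1
        return action
    return agent
```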

Knowledge Level vs. Implementation Level


At the knowledge level, we specify what the agent knows and its goals. For example, an automated
taxi knows it must cross the Golden Gate Bridge to reach its destination. The implementation level
is concerned with how this knowledge is represented and processed.

Declarative vs. Procedural Knowledge


• Declarative approach: Knowledge is explicitly provided to the agent and it reasons from this
knowledge.
• Procedural approach: Desired behaviors are encoded directly in the program.
Successful agents often combine both approaches, with declarative knowledge compiled into
procedural code for efficiency.

Learning in Knowledge-Based Agents


A knowledge-based agent can also learn from experience, allowing it to improve over time and
become fully autonomous by generating new knowledge from percepts.


The Wumpus World


The Wumpus World is a 4x4 grid environment where an agent must navigate a cave filled with hazards
(Wumpus, pits) to find gold and escape. The agent starts in room [1,1] and has sensors that provide
information about its surroundings, like Stench (Wumpus nearby), Breeze (pit nearby), Glitter (gold in the current square), Bump (walked into a wall), and Scream (the Wumpus has been killed).

Fig. A typical wumpus world. The agent is in the bottom left corner, facing right.

PEAS description:
• Performance Measure:
o +1000 for exiting with gold.
o -1000 for dying (pit or Wumpus).
o -1 per action, -10 for shooting the arrow.
• Environment:
o 4x4 grid, agent starts at [1,1].
o Wumpus and gold placed randomly (except at [1,1]).
o Pits have 20% chance of being in each room (other than the start).
• Actuators:
o Move Forward, Turn Left/Right, Grab gold, Shoot arrow, Climb to exit.
• Sensors:
o Stench: the Wumpus is in the current square or an adjacent one.
o Breeze: a pit is in an adjacent square.
o Glitter: Gold in current room.
o Bump: Wall encountered.
o Scream: Wumpus killed.


Environment Characteristics:
• Discrete: The environment is made up of a finite number of states (rooms) and actions.
• Static: The environment does not change except through the agent’s actions (i.e., no external
events affecting the environment).
• Single-agent: The agent is the only one interacting with the environment.
• Sequential: The outcome depends on the sequence of actions the agent takes.
• Partially observable: The agent cannot directly observe the entire state (e.g., it doesn't know the
exact location of the Wumpus, pits, or gold initially).

Fig. The first step taken by the agent in the wumpus world. (a) The initial situation, after percept
[None, None, None, None, None]. (b) After one move, with percept [None, Breeze, None, None,
None].

Fig. Two later stages in the progress of the agent. (a) After the third move, with percept [Stench, None,
None, None, None]. (b) After the fifth move, with percept [Stench, Breeze, Glitter, None, None].


Agent’s Reasoning:
1. Initial knowledge: The agent starts knowing [1,1] is safe. Based on the percepts, it infers adjacent
rooms [1,2] and [2,1] are safe.
2. Move to [2,1]: The agent detects a Breeze and infers a pit nearby (in [2,2] or [3,1]).
3. Move to [1,2]: The agent detects a Stench and infers the Wumpus is in [1,3]. It concludes no pit
in [2,2].
4. Exploration: The agent continues exploring, avoiding dangerous rooms, using logical reasoning to
deduce safe paths and eventually finds gold.
5. Goal: After grabbing the gold, the agent returns to [1,1] and climbs out of the cave.

Knowledge Representation:
The agent builds a knowledge base (KB) with logical rules (e.g., a Breeze means a pit is adjacent). It
uses reasoning to update its knowledge and decide actions that maximize survival and success.
The Wumpus World illustrates logical agents using reasoning to navigate uncertain environments,
combining percepts to form conclusions and make decisions.

Logic
This section outlines key concepts in logical representation and reasoning, essential for knowledge-
based agents.
1. Syntax and Semantics:
o Syntax: Defines the structure of well-formed sentences (e.g., "x + y = 4").
o Semantics: Assigns meaning to sentences, specifying when they are true or false in a
model (a possible world or assignment of variable values).
2. Models and Satisfaction:
o A model fixes the truth value of sentences.
o A model satisfies a sentence if the sentence is true in that model.
3. Logical Entailment:
o Entailment (denoted α⊨β) means that if α is true, then β must also be true in every model
where α holds.
o This is formalized as M(α) ⊆ M(β), where M(α) is the set of models where α is true.
4. Logical Inference:
o Inference is the process of deriving new conclusions from existing knowledge (KB).
o Model checking involves enumerating possible models to verify whether a sentence holds
in all models where KB is true.


Fig. Possible models for the presence of pits in squares [1,2], [2,2], and [3,1]. The KB corresponding to
the observations of nothing in [1,1] and a breeze in [2,1] is shown by the solid line. (a) Dotted line
shows models of α1 (no pit in [1,2]). (b) Dotted line shows models of α2 (no pit in [2,2]).
5. Soundness and Completeness:
o Soundness: An inference algorithm is sound if it only derives sentences entailed by the KB.
o Completeness: An algorithm is complete if it can derive all sentences entailed by the KB.
6. Wumpus World Example:
o In the Wumpus world, the agent uses logical reasoning to infer the presence of pits based
on percepts (e.g., breeze) and KB.
o For instance, detecting a breeze means a pit is nearby, and the agent can use entailment
to confirm where the pit is located.

Fig. Sentences are physical configurations of the agent, and reasoning is a process of constructing new
physical configurations from old ones. Logical reasoning should ensure that the new configurations
represent aspects of the world that actually follow from the aspects that the old configurations
represent.
7. Grounding:
o Grounding connects logical sentences in the KB to real-world sensory experiences. The
agent’s sensors generate percepts that are translated into sentences in the KB, which are
then used for reasoning.
8. Learning:
o Learning allows the agent to update its KB based on new experiences, improving its ability
to make accurate inferences over time. However, learning is fallible, and the KB may not
always be accurate.


Propositional Logic: A Very Simple Logic


Propositional logic is a formal system that is used to represent logical statements and reason about
their truth values.

Syntax of Propositional Logic


The syntax of propositional logic defines how valid sentences are constructed. Sentences are made of
atomic sentences and complex sentences using logical connectives.

Atomic Sentences
Atomic sentences consist of a single proposition symbol, which represents a statement that can be
true or false. Examples include:
• P1: "There is a pit in square 1."
• W1,3: "The Wumpus is in square [1,3]."
Two special symbols:
• True: Always true.
• False: Always false.

Complex Sentences
Complex sentences are formed by combining simpler sentences with logical connectives:
1. Negation (¬): Reverses the truth value, e.g., ¬W1,3 means "The Wumpus is not in [1,3]."
2. Conjunction (∧): Both must be true, e.g., W1,3 ∧ P3,1 means "The Wumpus is in [1,3] and
there is a pit in [3,1]."
3. Disjunction (∨): At least one must be true, e.g., (W1,3 ∧ P3,1) ∨ W2,2 means "Either the Wumpus is in [1,3] and there is a pit in [3,1], or the Wumpus is in [2,2]."
4. Implication (⇒): If the premise is true, the conclusion must be true, e.g., (W1,3 ∧ P3,1) ⇒
¬W2,2 means "If the Wumpus is in [1,3] and there is a pit in [3,1], then the Wumpus is not in
[2,2]."
5. Biconditional (⇔): Both sides must have the same truth value, e.g., W1,3 ⇔ ¬W2,2 means "The Wumpus is in [1,3] if and only if the Wumpus is not in [2,2]."

Grammar and Operator Precedence


The grammar is defined using Backus-Naur Form (BNF):
• AtomicSentence → True | False | P,Q,R,…
• ComplexSentence → (Sentence) | ¬Sentence | Sentence ∧ Sentence | Sentence ∨ Sentence |
Sentence⇒Sentence | Sentence⇔Sentence


Operator Precedence (from highest to lowest):


1. Negation (¬)
2. Conjunction (∧)
3. Disjunction (∨)
4. Implication (⇒)
5. Biconditional (⇔)
Parentheses or square brackets are used to clarify complex sentences and avoid ambiguity.

Semantics of Propositional Logic


The semantics of propositional logic defines how to determine the truth of sentences in a model,
which assigns a truth value (true or false) to each proposition symbol. For example, a model
m1 = {P1,2 = false, P2,2 = false, P3,1 = true}.

Truth of Atomic Sentences:


• True is always true.
• False is always false.
• Other proposition symbols' truth values are given directly by the model.

Truth of Complex Sentences:


The truth value of complex sentences formed with connectives is computed recursively:
1. Negation (¬): ¬P is true if P is false.
2. Conjunction (∧): P∧Q is true if both P and Q are true.
3. Disjunction (∨): P∨Q is true if at least one of P or Q is true.
4. Implication (⇒): P⇒Q is true unless P is true and Q is false.
5. Biconditional (⇔): P⇔Q is true if P⇒Q and Q⇒P are both true (i.e., both true or both false).

Truth Tables for Connectives:

P      Q      ¬P     P∧Q    P∨Q    P⇒Q    P⇔Q
False  False  True   False  False  True   True
False  True   True   False  True   True   False
True   False  False  False  True   False  False
True   True   False  True   True   True   True

For example, ¬P1,2 ∧ (P2,2 ∨ P3,1) evaluates to true in model m1.
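These recursive rules translate directly into an evaluator. A sketch that encodes sentences as nested tuples (an illustrative encoding, not a standard library):

```python
def pl_true(sentence, model):
    """Evaluate a propositional sentence in a model (dict: symbol -> bool)."""
    if isinstance(sentence, str):
        return model[sentence]
    op, *args = sentence
    if op == 'not':
        return not pl_true(args[0], model)
    if op == 'and':
        return all(pl_true(a, model) for a in args)
    if op == 'or':
        return any(pl_true(a, model) for a in args)
    if op == 'implies':
        return (not pl_true(args[0], model)) or pl_true(args[1], model)
    if op == 'iff':
        return pl_true(args[0], model) == pl_true(args[1], model)
    raise ValueError(op)

m1 = {'P12': False, 'P22': False, 'P31': True}
print(pl_true(('and', ('not', 'P12'), ('or', 'P22', 'P31')), m1))  # True
```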


Key Points:
• Exclusive Or (XOR): A variation of disjunction that is true when exactly one of the operands is true.
• Implication: True when P is false or Q is true, regardless of causality.
• Biconditional (⇔): True when both sides imply each other, e.g., B1,1 ⇔ (P1,2 ∨ P2,1).
These rules define how to compute the truth value of any sentence in propositional logic, allowing
logical inference.

Simple Knowledge Base for the Wumpus World


In propositional logic, we can create a knowledge base for the Wumpus world using the following
symbols for each location [x,y]:
• Px,y: True if there is a pit in [x,y].
• Wx,y: True if there is a Wumpus in [x,y], alive or dead.
• Bx,y: True if the agent perceives a breeze in [x,y].
• Sx,y: True if the agent perceives a stench in [x,y].
Example Sentences:
1. No pit in [1,1]:
R1: ¬P1,1
2. A square is breezy if and only if a neighboring square has a pit:
o For square [1,1]: R2: B1,1⇔(P1,2 ∨ P2,1)
o For square [2,1]: R3: B2,1⇔(P1,1 ∨ P2,2 ∨ P3,1)
3. Agent's percepts in a specific world:
o No breeze in [1,1]: R4: ¬B1,1
o Breeze in [2,1]: R5: B2,1
These sentences form the knowledge base, which allows the agent to infer information about the
Wumpus world. For example, from the given percepts, the agent can deduce that there is no pit in
[1,1] and there is a breeze in [2,1], guiding further inferences about the world.

Simple Inference Procedure


The goal is to determine if a knowledge base (KB) entails a sentence α (i.e., KB⊨α). This is done using
a model-checking approach where all possible models are enumerated, and it is checked whether α
holds true in every model where KB is true.
Process:
• A model assigns true or false to every proposition symbol.
• In the Wumpus world, there are 7 proposition symbols, leading to 2⁷ = 128 possible models.


• KB is true in a model if all its sentences are satisfied, and we check if α is also true in those
models.
For example, if P1,2 is false in all valid models, we can infer that there is no pit in [1,2].

Fig. A truth table constructed for the knowledge base given in the text. KB is true if R1 through R5 are
true, which occurs in just 3 of the 128 rows (the ones underlined in the right-hand column). In all 3
rows, P1,2 is false, so there is no pit in [1,2]. On the other hand, there might (or might not) be a pit in
[2,2].

Truth-table Enumeration Algorithm:


1. TT-ENTAILS? recursively enumerates models and checks if α holds when KB is true.
2. TT-CHECK-ALL recursively tests truth values for each proposition symbol in the model and
evaluates both the KB and α.

Fig. A truth-table enumeration algorithm for deciding propositional entailment. (TT stands for truth
table.) PL-TRUE? returns true if a sentence holds within a model. The variable model represents a
partial model—an assignment to some of the symbols. The keyword “and” is used here as a logical
operation on its two arguments, returning true or false.


• PL-TRUE? checks if a sentence holds in a given model.


• The algorithm performs a depth-first search, recursively constructing all models and evaluating
the truth of both the KB and α.
Complexity:
• Time complexity: O(2ⁿ), where n is the number of symbols.
• Space complexity: O(n) due to depth-first search.
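A compact version of the same enumeration, reusing the pl_true evaluator sketched earlier: it recursively assigns each symbol both truth values and checks α wherever the KB holds.

```python
def tt_entails(kb, alpha, symbols):
    """Truth-table enumeration: KB entails alpha iff alpha holds in every
    model where KB is true. kb and alpha use the nested-tuple encoding."""
    def check_all(symbols, model):
        if not symbols:
            return pl_true(alpha, model) if pl_true(kb, model) else True
        first, rest = symbols[0], symbols[1:]
        return (check_all(rest, {**model, first: True}) and
                check_all(rest, {**model, first: False}))
    return check_all(list(symbols), {})
```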

Logical Equivalences:

Fig. Standard logical equivalences. The symbols α, β, and γ stand for arbitrary sentences of
propositional logic.
These equivalences are used to manipulate and simplify logical formulas during the inference process.

Propositional Theorem Proving


Theorem proving is a method of deriving conclusions from premises using logical inference rules,
without checking all possible models.
1. Theorem Proving vs Model Checking:
o Model Checking verifies a proposition by enumerating all possible models (truth
assignments), which can be inefficient when there are many models.
o Theorem Proving constructs a proof using logical rules, potentially more efficient when
the proof is short.
2. Logical Equivalence (≡):
o Two sentences α and β are logically equivalent if they are true in the same models, meaning α ⊨ β and β ⊨ α.
3. Validity (Tautology):
o A sentence is valid (or a tautology) if it is true in all models (e.g., P ∨ ¬P).
o Deduction Theorem: α ⊨ β if and only if (α ⇒ β) is valid.


4. Satisfiability:
o A sentence is satisfiable if there exists at least one model where it is true. The SAT
problem (determining satisfiability) is NP-complete, central to many computational
problems.
5. Reductio ad Absurdum (Proof by Contradiction):
o This technique assumes the negation of what you want to prove and shows it leads to a
contradiction, proving the original statement.
o α ⊨ β is equivalent to (α ∧ ¬β) being unsatisfiable.
6. Key Logical Results:
o Validity and Satisfiability are connected: α is valid iff ¬α is unsatisfiable.
o Entailment: α ⊨ β iff (α ∧ ¬β) is unsatisfiable, reflecting proof by contradiction.
These concepts form the foundation of automated reasoning, enabling efficient entailment and
validity checking without enumerating all models.

Inference and proofs


1. Inference Rules:
o Modus Ponens (the best-known inference rule): From α ⇒ β and α, infer β.

o And-Elimination: From α∧β, infer α (or β).

o Biconditional Elimination: For α⇔β, infer (α⇒β)∧(β⇒α) and vice versa.

2. Applying Inference: In the Wumpus World example, inference rules are used to prove ¬P1,2 (no
pit in [1,2]) by applying rules like biconditional elimination, And-Elimination, contraposition,
Modus Ponens, and De Morgan’s rule.
We start with the knowledge base containing R1 through R5 and show how to prove ¬P1,2, that
is, there is no pit in [1,2].
R2 : B1,1 ⇔ (P1,2 ∨ P2,1)
First, we apply biconditional elimination to R2 to obtain
R6: (B1,1 ⇒ (P1,2 ∨ P2,1)) ∧ ((P1,2 ∨ P2,1) ⇒ B1,1).
Then we apply And-Elimination to R6 to obtain


R7: ((P1,2 ∨ P2,1) ⇒ B1,1).


Logical equivalence for contrapositives gives
R8: (¬B1,1 ⇒ ¬(P1,2 ∨ P2,1)).
Now we can apply Modus Ponens with R8 and the percept R4 (i.e., ¬B1,1), to obtain
R9: ¬(P1,2 ∨ P2,1) .
Finally, we apply De Morgan’s rule, giving the conclusion
R10: ¬P1,2 ∧ ¬P2,1.
That is, neither [1,2] nor [2,1] contains a pit.

3. Proof Search:
o A proof problem is framed as a search problem: start with the knowledge base, apply
inference rules, and search for the desired conclusion. This is more efficient than model
checking as irrelevant propositions are ignored.

4. Monotonicity:
o Monotonicity ensures that adding new information to the knowledge base can only add
new conclusions, not invalidate existing ones. For example, adding facts about pits cannot
contradict previously inferred conclusions.
This method shows how inference rules and logical proof strategies can efficiently derive conclusions
without exhaustive model checking.

Proof by Resolution
Resolution is a complete inference rule used in propositional logic that combines two clauses with
complementary literals (one containing a literal p and the other containing ¬p) to derive a new clause
by removing these complementary literals.
Key Points:
1. Unit Resolution: When one clause is a unit clause (containing only one literal), resolution
simplifies to eliminating complementary literals. For example, resolving P1,1 ∨ P3,1 with ¬P1,1
results in P3,1.
2. Generalized Resolution: When both clauses contain multiple literals, resolution removes the
complementary literals and combines the remaining literals. For instance, resolving P1,1 ∨ P3,1
with ¬P1,1 ∨ ¬P2,2 results in P3,1 ∨ ¬P2,2.
3. Factoring: Multiple occurrences of a literal in a clause are eliminated to avoid redundancy,
ensuring only one copy of each literal in the result.


4. Soundness: The resolution rule is sound, meaning the derived clause is true if the original clauses
are true. The truth of the resulting clause follows logically from the premises.
5. Completeness: Resolution is complete, meaning it can derive a proof for any valid entailment in
propositional logic, ensuring that if α⊨β, resolution can find a proof that β is true given α.
6. Example from the Wumpus World: The agent uses resolution to deduce the locations of pits based
on its perceptions (like a stench) and prior knowledge, eventually concluding that the pit must be
in a specific location.

Conjunctive Normal Form (CNF)


To apply the resolution rule, propositional logic sentences must be in Conjunctive Normal Form (CNF),
which is a conjunction of clauses (disjunctions of literals).
Conversion Steps:
1. Eliminate Biconditional (⇔): Replace α⇔β with (α⇒β) ∧ (β⇒α).
o Example: B1,1 ⇔ (P1,2 ∨ P2,1) becomes (B1,1 ⇒ (P1,2 ∨ P2,1)) ∧ ((P1,2 ∨ P2,1) ⇒ B1,1).
2. Eliminate Implication (⇒): Replace α⇒β with ¬α∨β.
o Example: (B1,1 ⇒ (P1,2 ∨ P2,1)) becomes (¬B1,1 ∨ P1,2 ∨ P2,1).
3. Move Negation Inwards: Apply De Morgan’s laws to ensure negation only applies to literals.
o Example: ¬(P1,2 ∨ P2,1) becomes (¬P1,2 ∧ ¬P2,1).
4. Distribute Disjunction over Conjunction: Apply distributivity to combine disjunctions and
conjunctions.
o Example: (¬B1,1 ∨ P1,2 ∨ P2,1) ∧ ((¬P1,2 ∧ ¬P2,1) ∨ B1,1) becomes
(¬B1,1 ∨ P1,2 ∨ P2,1) ∧ (¬P1,2 ∨ B1,1) ∧ (¬P2,1 ∨ B1,1).
Now the sentence is in CNF, ready for resolution-based inference.
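For experimentation, the sympy library (assuming it is installed) can perform this conversion; its to_cnf routine applies the same elimination, negation-pushing, and distribution steps:

```python
from sympy import symbols
from sympy.logic.boolalg import Equivalent, to_cnf

B11, P12, P21 = symbols('B11 P12 P21')

# R2: B1,1 <=> (P1,2 v P2,1)
print(to_cnf(Equivalent(B11, P12 | P21)))
# e.g. (B11 | ~P12) & (B11 | ~P21) & (P12 | P21 | ~B11)  (clause order may vary)
```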

Resolution Algorithm:
The resolution algorithm proves entailment by contradiction: to show KB⊨α, we prove that KB ∧ ¬α
is unsatisfiable (leads to a contradiction). This is done through the following steps:
Algorithm Steps:
1. Convert to CNF: Convert KB ∧ ¬α into Conjunctive Normal Form (CNF).
2. Apply Resolution: Apply the resolution rule to pairs of clauses with complementary literals (one
clause containing P, the other ¬P).
3. Continue Resolving:
o If no new clauses can be added, then KB does not entail α.


o If the empty clause (a contradiction indicating unsatisfiability, equivalent to False) is derived, then KB entails α.

Example (Wumpus World):


For the query α = ¬P1,2, the knowledge base is:
KB = (B1,1 ⇔ (P1,2 ∨ P2,1)) ∧ ¬B1,1
The process involves:
• Converting KB ∧ ¬α into CNF.
• Resolving clauses until the empty clause (or no new clauses) is found.

PL-RESOLUTION Algorithm
The algorithm resolves pairs of clauses, generating new clauses, and checks for the empty clause:

Fig. A simple resolution algorithm for propositional logic. The function PL-RESOLVE returns the set of
all possible clauses obtained by resolving its two inputs.
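A minimal sketch of the same algorithm, representing each clause as a frozenset of literal strings with '~' marking negation (an illustrative encoding); factoring happens automatically because sets drop duplicate literals:

```python
from itertools import combinations

def pl_resolve(ci, cj):
    """All resolvents of two clauses (frozensets of literals)."""
    resolvents = []
    for lit in ci:
        comp = lit[1:] if lit.startswith('~') else '~' + lit
        if comp in cj:
            resolvents.append(frozenset((ci - {lit}) | (cj - {comp})))
    return resolvents

def pl_resolution(kb_clauses, neg_alpha_clauses):
    """Prove KB entails alpha by showing KB AND NOT-alpha is unsatisfiable."""
    clauses = set(kb_clauses) | set(neg_alpha_clauses)
    while True:
        new = set()
        for ci, cj in combinations(clauses, 2):
            for resolvent in pl_resolve(ci, cj):
                if not resolvent:          # empty clause: contradiction found
                    return True
                new.add(resolvent)
        if new <= clauses:                 # no new clauses: not entailed
            return False
        clauses |= new

# Wumpus example: KB in CNF (R2 converted, plus R4), query alpha = ~P1,2
kb = [frozenset({'~B11', 'P12', 'P21'}), frozenset({'~P12', 'B11'}),
      frozenset({'~P21', 'B11'}), frozenset({'~B11'})]
print(pl_resolution(kb, [frozenset({'P12'})]))   # True: KB entails ¬P1,2
```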

Completeness of Resolution
The PL-RESOLUTION algorithm is complete, meaning it can always determine entailment in
propositional logic, as guaranteed by the Ground Resolution Theorem.
1. Resolution Closure (RC(S)): The set of all clauses derivable from a set S via resolution. It is finite
due to factoring.
2. Ground Resolution Theorem:
o If a set of clauses S is unsatisfiable, then RC(S) contains the empty clause (a contradiction).
o If no empty clause is found, the set is satisfiable, meaning a truth assignment exists that
satisfies all clauses.
3. Satisfiability Construction: To prove satisfiability, construct a model by assigning truth values to
literals in S based on the clauses in RC(S).


Horn and Definite Clauses


1. Definite Clause: A disjunction of literals with exactly one positive literal
(e.g., ¬L1,1 ∨ ¬Breeze ∨ B1,1).
2. Horn Clause: A disjunction with at most one positive literal. All definite clauses and goal clauses
(clauses with no positive literals) are Horn clauses. Horn clauses are closed under resolution.

Key Points:
• Definite Clauses can be written as implications (e.g., (¬L1,1 ∨ ¬Breeze ∨ B1,1) becomes
(L1,1∧Breeze) ⇒ B1,1).
o The body is the conjunction of positive literals, and the head is the single positive literal.
o A fact is a clause with a single positive literal (e.g., L1,1).
• Inference with Horn Clauses:
▪ Forward-Chaining: Derives new facts from known ones.
▪ Backward-Chaining: Attempts to prove a goal from known facts.
• Deciding Entailment for Horn clauses is linear time, making it highly efficient.
Horn clauses enable efficient forward and backward-chaining inference and fast entailment checking.

Forward and Backward Chaining


1. Forward-Chaining:
o Goal: Determines if a query q is entailed by a knowledge base (KB) of definite clauses.
o Process: Starts with known facts, applies implications when all premises are true, and adds
conclusions to the known facts. Continues until the query is inferred or no new facts are
added.
o Efficiency: Runs in linear time.
o Soundness: Based on Modus Ponens.
o Completeness: Ensures all entailed atomic sentences are derived.
o Application: Used in data-driven reasoning (e.g., wumpus agent), where new facts trigger
further inferences.
2. Backward-Chaining:
o Goal: Works backward from the query, proving it by finding implications with the query as
the conclusion and proving their premises.
o Efficiency: Linear time, focusing only on relevant facts.
o Application: Goal-directed reasoning (e.g., answering specific questions like “Where are
my keys?”).


Both techniques are efficient inference methods, with forward chaining being data-driven and
backward chaining goal-directed.
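A sketch of forward chaining over definite clauses, following the count-based bookkeeping of the standard algorithm; clauses are (premises, conclusion) pairs and atomic sentences are plain strings (an illustrative encoding):

```python
from collections import deque

def pl_fc_entails(clauses, facts, query):
    """Forward chaining: returns True if the definite-clause KB entails query."""
    count = {i: len(premises) for i, (premises, _) in enumerate(clauses)}
    inferred = set()
    agenda = deque(facts)                  # known facts awaiting processing
    while agenda:
        p = agenda.popleft()
        if p == query:
            return True
        if p in inferred:
            continue
        inferred.add(p)
        for i, (premises, conclusion) in enumerate(clauses):
            if p in premises:
                count[i] -= 1              # one fewer unsatisfied premise
                if count[i] == 0:
                    agenda.append(conclusion)
    return False

# e.g. (L1,1 AND Breeze) => B1,1, with both premises known as facts
clauses = [(['L11', 'Breeze'], 'B11')]
print(pl_fc_entails(clauses, ['L11', 'Breeze'], 'B11'))  # True
```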


The Machine Learning Landscape


Machine Learning (ML) is often misunderstood as futuristic AI, but it has been in use for decades, with
early applications like Optical Character Recognition (OCR) and spam filters. These tools, especially
the spam filter, have quietly impacted everyday life.
The chapter aims to clarify what ML is, addressing key questions like whether downloading Wikipedia
means a machine has “learned.” It will cover fundamental ML concepts, including:
• Supervised vs. unsupervised learning (labeled vs. unlabeled data),
• Online vs. batch learning (continuous vs. bulk data processing),
• Instance-based vs. model-based learning (decisions based on data points vs. models).
It also introduces the typical ML project workflow, discussing challenges and the need for evaluation
and fine-tuning.

What Is Machine Learning?


Machine Learning (ML) is the field that focuses on programming computers to learn from data. Arthur
Samuel (1959) defined it as enabling computers to learn without being explicitly programmed.
Tom Mitchell (1997) provided a more engineering-oriented definition: a computer program is said to learn from experience E with respect to some task T and performance measure P if its performance on T, as measured by P, improves with experience E.
For example, a spam filter is an ML program that learns to identify spam emails by analyzing examples
(training set) of spam and non-spam emails (training instances). The task (T) is to flag spam, experience
(E) is the training data, and performance (P) is typically measured by accuracy, the ratio of correctly
classified emails.
Downloading a large dataset like Wikipedia does not qualify as ML because it doesn’t improve
performance on any task without further processing or learning.

Why Use Machine Learning?


Traditional programming techniques can be cumbersome when solving complex problems like creating
a spam filter. In a traditional approach:
1. Identify patterns: You manually analyze spam emails to identify patterns (e.g., certain words or
phrases).
2. Write rules: You create specific rules to detect those patterns in emails.
3. Test and repeat: You continually test and refine the rules until the program performs well.


Fig. The traditional approach


This method can lead to a long, complex list of rules that’s difficult to maintain and update. If spammers
change their tactics, you’ll need to manually update the rules.

In contrast, a Machine Learning-based spam filter:


• Automatically learns from patterns in the data, detecting which words and phrases are good
predictors of spam without needing to explicitly program every rule.

Fig. Machine Learning approach


• Adapts over time, so if spammers change tactics (e.g., using "For U" instead of "4U"), the filter will
automatically recognize and flag the new pattern.

Fig. Automatically adapting to change


ML is especially valuable for:
• Complex problems: For tasks that are too complicated for traditional programming, like speech
recognition. Traditional methods might try to use hardcoded rules (e.g., detecting high-pitch


sounds for "two"), but this approach doesn't scale well to diverse languages, noisy environments,
or large vocabularies. ML can learn from examples to handle this complexity.
• Fluctuating environments: ML systems can adapt to new data automatically, making them more
flexible than traditional approaches.
• Insights into data: ML algorithms can help identify hidden patterns or trends in large datasets,
offering valuable insights through data mining.

Fig. Machine Learning can help humans learn


Summary of when to use Machine Learning:
• When traditional methods require too much manual tuning or complex rule-making: ML can
often simplify the process and improve performance.
• For problems with no good existing solution: ML can discover solutions where traditional
algorithms fail.
• In dynamic environments: ML can continuously adapt to changing data.
• For uncovering insights from complex data: ML can help identify patterns not obvious through
traditional analysis.
Machine Learning excels when problems are complex, rules are hard to define, or environments
change rapidly. It also provides valuable insights through pattern recognition in large datasets.

Types of Machine Learning Systems


Machine Learning systems can be classified based on three key criteria:
1. Supervision Level:
o Supervised Learning: Trained with labeled data.
o Unsupervised Learning: Learns from unlabeled data.
o Semi-supervised Learning: Uses a mix of labeled and unlabeled data.
o Reinforcement Learning: Learns through trial and error with feedback.


2. Learning Method:
o Online Learning: Learns incrementally as new data arrives.
o Batch Learning: Learns from a fixed dataset, retraining when new data is available.
3. Learning Approach:
o Instance-based Learning: Compares new data to known data points.
o Model-based Learning: Builds a predictive model from training data.
These categories can overlap. For example, a spam filter using a deep neural network could be an
online, model-based, supervised learning system.

Supervised/Unsupervised Learning
Machine Learning systems can be classified according to the amount and type of supervision they get during training. There are four major categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

Supervised learning
Supervised learning is a machine learning technique where the model is trained on labeled data,
meaning the input data comes with corresponding desired output (labels). The model learns the
relationship between inputs and labels to make predictions on new, unseen data.

Types of Tasks:
• Classification: Predicts a discrete label (e.g., spam vs. ham in email classification).

Fig. A labeled training set for supervised learning (e.g., spam classification)

• Regression: Predicts a continuous value (e.g., predicting car prices based on features like mileage
and age).


Fig. Regression
Common Algorithms:
• k-Nearest Neighbors (k-NN): Classifies based on nearest neighbors in feature space.
• Linear Regression: Models the relationship between input features and continuous output.
• Logistic Regression: A classification algorithm used to predict the probability of a class.
• Support Vector Machines (SVMs): Finds the hyperplane that best separates classes.
• Decision Trees/Random Forests: Decision Trees make predictions using learned rules, while
Random Forests combine multiple trees.
• Neural Networks: Learn complex patterns for both classification and regression tasks.

Attributes vs Features:
• Attribute: A data characteristic (e.g., "Mileage").
• Feature: An attribute with a specific value (e.g., "Mileage = 15,000 miles").
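
As a concrete sketch (not from the source), a k-Nearest Neighbors classifier on scikit-learn's
bundled iris dataset illustrates the supervised workflow: train on labeled examples, then predict
labels for unseen instances.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Labeled data: feature matrix X and target labels y
X, y = load_iris(return_X_y=True)

# Hold out 20% of the data to stand in for "new, unseen" instances
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a k-Nearest Neighbors classifier on the labeled training set
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# Predict labels for the held-out instances and report accuracy
print("Test accuracy:", clf.score(X_test, y_test))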

Unsupervised Learning
In unsupervised learning, the training data is unlabeled, and the algorithm tries to learn patterns and
structure from the data without explicit guidance or labeled outputs. The goal is to uncover hidden
relationships, groupings, or patterns within the data.

Fig. An unlabeled training set for unsupervised learning


Key Tasks and Algorithms:


1. Clustering: Groups similar data points together without predefined labels.
o K-Means: Partitions data into clusters based on similarity.
o DBSCAN: Groups data based on density, identifying noise points.
o Hierarchical Cluster Analysis (HCA): Builds a tree of clusters, allowing flexible grouping.
2. Anomaly and Novelty Detection: Identifies unusual or outlier data points.
o One-class SVM: Detects outliers by learning the distribution of normal data.
o Isolation Forest: Identifies anomalies by isolating them in decision trees.
3. Visualization and Dimensionality Reduction: Reduces data complexity to make it easier to analyze
and visualize.
o Principal Component Analysis (PCA): Reduces dimensions while preserving variance.
o Kernel PCA: A non-linear version of PCA for more complex data.
o Locally-Linear Embedding (LLE): Preserves local data structure in lower dimensions.
o t-SNE: Visualizes high-dimensional data by keeping similar instances close together and
dissimilar instances apart.

Fig. Example of a t-SNE visualization highlighting semantic clusters


4. Association Rule Learning: Discovers interesting relationships between attributes in large
datasets.
o Apriori and Eclat: Find frequent item-sets and rules in transactional data.

Use Cases and Examples:


• Clustering: For example, clustering blog visitors into groups (e.g., male comic book readers, young
sci-fi lovers) to target content more effectively.


Fig. Clustering
• Dimensionality Reduction: Reduces the number of features in data (e.g., merging car mileage and
age) to speed up training and improve performance.
• Anomaly Detection: Identifying unusual events like fraudulent credit card transactions or
manufacturing defects by recognizing deviations from typical patterns.

Fig. Anomaly detection


• Association Rule Learning: In retail, discovering that customers buying barbecue sauce and
potato chips also tend to purchase steak, which could inform product placement.
Unsupervised learning focuses on extracting patterns from unlabeled data, often leading to insights
that are not immediately apparent.
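
As a hedged illustration (the synthetic data below is an assumption, not from the source), K-Means
clustering can be run in a few lines with scikit-learn; the algorithm receives no labels and must
discover the groupings on its own.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled data: three hidden groupings, true labels discarded
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Partition the data into 3 clusters based on similarity alone
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)

print("Cluster sizes:", np.bincount(cluster_ids))
print("Cluster centers:\n", kmeans.cluster_centers_)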

Semi-supervised Learning
Semi-supervised learning combines both labeled and unlabeled data for training. Typically, there is a
small amount of labeled data and a large amount of unlabeled data. The algorithm leverages the
unlabeled data to discover patterns and the labeled data to refine its predictions.

Fig. Semi-supervised learning


Key Concepts:
• Combination of Supervised and Unsupervised Learning: Semi-supervised learning algorithms
often integrate unsupervised methods (like clustering) with supervised learning techniques to
improve performance.
Example:
• Photo Recognition: In services like Google Photos, the system may first use clustering to identify
groups of photos containing the same person. Once you label a few photos (e.g., identifying
Person A and Person B), the system can apply the learned pattern to automatically identify and
name people in all other photos, making photo searches easier.

Semi-supervised Algorithms:
• Deep Belief Networks (DBNs): These use unsupervised components (like restricted Boltzmann
machines, or RBMs) for pretraining, followed by supervised fine-tuning to enhance performance.

Semi-supervised learning is particularly useful when labeling data is expensive or time-consuming,


as it can learn from a small amount of labeled data while exploiting a larger amount of unlabeled
data.

Reinforcement Learning
Reinforcement Learning (RL) involves an agent that interacts with an environment by taking actions
and receiving feedback in the form of rewards or penalties. The goal is for the agent to learn the best
strategy, or policy, that maximizes its long-term reward.

Fig. Reinforcement Learning


Key Concepts:
• Agent: The learning system that observes the environment, takes actions, and receives rewards or
penalties.
• Environment: The external context in which the agent operates and interacts.
• Policy: A strategy that defines the best action for the agent to take in any given situation, aimed at
maximizing rewards over time.

Example:
• Robot Learning to Walk: A robot might use RL to learn how to walk by trial and error. It takes
actions (e.g., moving its legs), receives rewards (if it walks successfully) or penalties (if it falls),
and refines its walking policy to improve over time.
• AlphaGo: DeepMind’s AlphaGo used RL to learn the game of Go. It analyzed millions of games and
played against itself to develop a policy that helped it beat the world champion, Ke Jie, in 2017.
During its match against the champion, AlphaGo was simply applying the learned policy, without
further learning.
Reinforcement Learning is distinct in that the agent learns from its own experience and feedback,
unlike other machine learning methods where the model is trained with labeled data.

Batch Learning vs. Online Learning in Machine Learning


Batch Learning
• Process: The system is trained using the entire dataset at once. After training, it operates without
further learning (offline learning).
• Training: Requires all data at once, which is resource-intensive and time-consuming, needing
significant computational power (CPU, memory, storage).
• Re-training: New models must be trained from scratch with both old and new data, which can take
hours and is typically done periodically (e.g., daily or weekly).
• Limitations:
o High Computing Cost: Re-training on large datasets is computationally expensive.
o Slow Adaptation: Batch learning is slow to adapt to rapidly changing data, making it
unsuitable for real-time applications (e.g., stock prediction).
o Resource Intensive: It requires large amounts of data and resources, which may be
impractical in resource-constrained environments (e.g., smartphones, Mars rovers).


Online Learning

Fig. Online learning


• Process: The system learns incrementally, updating its model as new data arrives, processing data
step-by-step or in small mini-batches.
• Training: Each training step is fast and computationally cheap, with old data discarded after
learning.
• Data Handling: Ideal for continuous data streams (e.g., stock prices, real-time user interactions)
where rapid adaptation is needed.
• Resource Efficiency: Can handle huge datasets and work with limited resources, as data is
processed incrementally and not stored after use (supports out-of-core learning).

Fig. Using online learning to handle huge datasets


• Advantages:
o Adaptability: Quickly adapts to changing data, suitable for real-time applications like fraud
detection or stock prediction.
o Low Resource Usage: Requires less storage and computational power.
o Scalability: Can handle large datasets and work on devices with limited resources (e.g.,
smartphones, IoT).


• Challenges:
o Sensitivity to Bad Data: Poor-quality data can degrade performance over time (e.g., faulty
sensors or spam).
o Learning Rate: The learning rate controls adaptation speed. High rates lead to rapid
changes but may forget past data, while low rates ensure stability but slow adaptation.
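
A minimal sketch of incremental learning (the streaming loop and data are assumptions for
illustration): scikit-learn's SGDRegressor updates its weights one mini-batch at a time via
partial_fit, and its eta0 parameter plays the role of the learning rate discussed above.

import numpy as np
from sklearn.linear_model import SGDRegressor

# eta0 is the (constant) learning rate controlling adaptation speed
model = SGDRegressor(learning_rate="constant", eta0=0.01)

rng = np.random.RandomState(42)
for step in range(100):                  # simulate a stream of mini-batches
    X_batch = rng.rand(32, 1)            # 32 new instances arrive
    y_batch = 3 * X_batch.ravel() + rng.randn(32) * 0.1
    model.partial_fit(X_batch, y_batch)  # learn from them, then discard them

print("Learned slope (true value 3):", model.coef_)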

Out-of-Core Learning
• A method similar to online learning, where data is processed in chunks that fit in memory, useful
for large datasets. This typically occurs offline but is conceptually similar to online learning as data
is processed incrementally.

Choosing Between Batch and Online Learning


• Batch Learning: Use if retraining periodically is feasible, and your data changes slowly. Suitable for
systems with sufficient resources and when real-time updates aren’t needed.
• Online Learning: Best for systems needing real-time updates, limited resources, or when dealing
with large datasets. Ideal for autonomous systems or environments with rapidly changing data.

Instance-Based vs. Model-Based Learning


In machine learning, systems are often categorized based on how they generalize from training data
to unseen examples. The two main approaches for generalization are instance-based learning and
model-based learning.

Instance-Based Learning
Instance-based learning is a machine learning approach where the system "learns by heart" by storing
training examples and generalizing new instances through comparison with stored examples using a
similarity measure.

Working Principle:
• Learning Process: The system stores training data and classifies new instances by comparing them
to stored examples.
• Similarity Measure: For example, in a spam filter, similarity can be based on the number of
common words between emails. If a new email shares many words with spam emails, it is flagged
as spam.


Example:
• A spam filter flags new emails based on their similarity to known spam emails.

Fig. Instance-based learning

Advantages:
• Simple and Intuitive: Easy to implement without needing a complex model.
• Flexible: Can handle complex, non-linear patterns in data.

Disadvantages:
• Memory Intensive: Requires storing all training examples, which can consume a lot of memory.
• Slow Predictions: Making predictions involves comparing new instances to all stored examples,
which can be computationally expensive.

Model-based learning
Model-based learning involves creating a model from a set of examples to make predictions.

Fig. Model-based learning


Eg: The goal is to determine if GDP per capita correlates with life satisfaction across countries.


Process Overview:
1. Data Collection: Data on life satisfaction (OECD) and GDP per capita (IMF) is merged for various
countries.

Fig. Plot of Life satisfaction vs GDP per capita across various countries
2. Model Selection: A linear model is chosen to relate life satisfaction to GDP per capita:
Life Satisfaction = θ0 + θ1 × GDP_per_capita

Fig. A few possible linear models


3. Training the Model: The parameters θ0 and θ1 are learned using Linear Regression, minimizing
the error between predictions and actual data.

Fig. The linear model that fits the training data best
4. Prediction: After training, the model can predict life satisfaction for countries not in the original
dataset, like Cyprus.

Cyprus’s GDP per capita is $22,587; applying the trained model gives life satisfaction =
4.85 + 22,587 × 4.91 × 10⁻⁵ ≈ 5.96.

5. Comparison with Instance-Based Learning: If an instance-based algorithm (like k-Nearest
Neighbors) were used instead, the prediction would be based on the countries closest in GDP per
capita: Cyprus would be compared with Slovenia (5.7), Portugal (5.1), and Spain (6.5); averaging
these three life-satisfaction values gives 5.77.


6. Improvement: To enhance predictions, more attributes (e.g., employment rate, health) or a more
complex model (like Polynomial Regression) might be required.

Code Example:
The Python code demonstrates loading data, visualizing it, training a Linear Regression model using
Scikit-learn, and making predictions for Cyprus.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model
# Load the data
oecd_bli = pd.read_csv("oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands=',', delimiter='\t', encoding='latin1',
na_values="n/a")
# Prepare the data (prepare_country_stats, assumed defined elsewhere, merges the two datasets)
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]
# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()
# Select a linear model
model = sklearn.linear_model.LinearRegression()
# Train the model
model.fit(X, y)
# Make a prediction for Cyprus
X_new = [[22587]] # Cyprus' GDP per capita
print(model.predict(X_new)) # outputs [[ 5.96242338]]
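
To try the instance-based alternative from step 5 instead, only the model-selection lines need
to change; a sketch (reusing X and y as prepared in the listing above):

import sklearn.neighbors

# k-NN regression: average the 3 countries nearest in GDP per capita
model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)
model.fit(X, y)
print(model.predict([[22587]]))  # outputs approximately [[5.77]]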

Main Challenges of Machine Learning


Insufficient Quantity of Training Data
Machine learning requires large amounts of data to work effectively, unlike a toddler who can learn
concepts like recognizing an apple with just a few examples. For simple tasks, thousands of examples
are typically needed, and for more complex tasks like image or speech recognition, millions of
examples are often required. This is because machine learning models rely on data to learn patterns,
and without enough data, their performance can be poor, even with basic problems.

Nonrepresentative Training Data


For machine learning to generalize well, the training data must be representative of the cases it will
encounter. If the data is nonrepresentative, like missing key countries in a model, the resulting
predictions can be inaccurate. For example, adding missing countries to a linear model drastically
changes its predictions, revealing flaws in using overly simplistic models for complex scenarios.


Fig. A more representative training sample


Nonrepresentative data, whether due to small sample size or flawed sampling methods, leads to
sampling bias, which affects the accuracy of the model, especially for outliers like very poor or very
rich countries.

Poor-Quality Data
Poor-quality data, including errors, outliers, and noise, hampers machine learning model performance.
Data cleaning is crucial, with tasks like discarding or fixing outliers and handling missing features (e.g.,
filling in missing values or ignoring instances). Data scientists often spend significant time on this
process to improve model accuracy.

Irrelevant Features
Effective machine learning requires relevant features, as irrelevant ones can hinder performance.
Feature engineering involves:
• Feature selection: Choosing the most useful features.
• Feature extraction: Combining features for greater utility.
• Creating new features: Gathering additional data.

Overfitting the Training Data


Overfitting occurs when a model performs well on training data but poorly on new data, often due to
excessive complexity relative to the data. It may detect irrelevant patterns, like a relationship between
a country’s name and life satisfaction.


Fig. Overfitting the training data


To prevent overfitting:
• Simplify the model by reducing parameters or features.
• Gather more data.
• Reduce noise in the data.
Regularization helps by constraining the model, making it simpler. The amount of regularization is
controlled via a hyperparameter, which is set before training. High regularization prevents overfitting
but may hinder finding an optimal solution. Tuning hyperparameters is key to balancing data fitting
and generalization.

Fig. Regularization reduces the risk of overfitting
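
As a hedged sketch (synthetic data assumed), scikit-learn's Ridge regression exposes the
regularization strength as the hyperparameter alpha, set before training; larger values constrain
the weights more strongly.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.rand(20, 1)
y = 2 * X.ravel() + rng.randn(20) * 0.3       # few, noisy training examples

unconstrained = LinearRegression().fit(X, y)  # free to fit the noise
constrained = Ridge(alpha=10.0).fit(X, y)     # alpha constrains the model

print("Unregularized slope:", unconstrained.coef_)
print("Regularized slope:  ", constrained.coef_)  # shrunk toward 0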

Underfitting the Training Data


Underfitting happens when a model is too simple to capture the data’s complexity, resulting in
inaccurate predictions. To address underfitting:
• Use a more powerful model with more parameters.
• Improve features through feature engineering.
• Reduce model constraints, like lowering the regularization hyperparameter.

Stepping Back
Machine learning improves performance by learning from data, not explicit rules. ML systems vary
(supervised, unsupervised, batch, online, etc.), and in a project, you gather data, train an algorithm,
and tune the model to fit the data. The training set must be sufficient, representative, and clean, with
relevant features. Models should avoid underfitting (too simple) and overfitting (too complex). Finally,
evaluate and fine-tune the model to ensure it generalizes well to new cases.

Testing and Validating


1. Training and Testing:
o Training Set: Used to train the model.
o Test Set: Evaluates model performance on new, unseen data, estimating the generalization
error.
o Overfitting: Occurs when a model performs well on the training set but poorly on the test
set, indicating the model has memorized the training data rather than learning
generalizable patterns.
o Typical Data Split: 80% for training, 20% for testing, but this may vary with dataset size.
2. Hyperparameter Tuning and Model Selection:
o Model Comparison: Test different models and choose the best performer based on the
test set.
o Hyperparameter Tuning: If you tune hyperparameters based on test set performance, you
risk overfitting to the test data.
o Holdout Validation: Use a validation set to fine-tune models and hyperparameters
without contaminating the test set, which should be used only for final evaluation.
3. Cross-Validation:
o Validation Set: A portion of the training data used to evaluate and select the best model
or hyperparameters.
o Cross-Validation: Split the data into multiple validation sets to get a more accurate
estimate of model performance, though it increases computational cost.
4. Data Mismatch:
o Training Data vs. Production Data: If the training data (e.g., web images) differs from
production data (e.g., mobile app images), the validation and test sets must reflect real-
world data to assess the model's true performance.
o Train-Dev Set: A separate "train-dev" set of representative data helps distinguish between
overfitting and data mismatch issues.
In summary, to ensure robust model performance, use proper data splits, avoid overfitting, and ensure
your validation and test sets are representative of production data.
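
A hedged sketch of this workflow (dataset and split ratio are illustrative): hold out a test set,
use cross-validation on the training data for model selection, and touch the test set only once at
the end for the final generalization estimate.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# 80/20 split; the test set is reserved for the final evaluation only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Cross-validation on the training data stands in for a separate validation set
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Validation accuracy (5-fold):", scores.mean())

# One final fit and a single test-set evaluation estimate the generalization error
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))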


Well-Posed Learning Problems


A learning problem involves a computer program improving its performance on a task through
experience. A program learns from experience E with respect to a class of tasks T and a performance
measure P, if its performance on T improves as judged by P.

Example Learning Problems:


Speech Recognition:
Task: Recognizing words and phonemes (distinct units of sound).
Performance: Accuracy of recognition.
Experience: Data on speaker patterns and noise.

Autonomous Vehicle Driving:


Task: Driving on highways.
Performance: Distance driven without error.
Experience: Data from human drivers' images and steering commands.

Astronomical Classification:
Task: Classifying celestial objects.
Performance: Classification accuracy.
Experience: Astronomical image databases.

Backgammon:
Task: Playing at a competitive level.
Performance: Win rate.
Experience: Self-play games.

Well-Defined Learning Problems:


A learning problem is defined by three components:
Class of tasks (T): The set of tasks (e.g., playing a game, recognizing words).
Performance measure (P): How success is evaluated (e.g., accuracy, win rate).
Experience (E): The data or interactions used to learn (e.g., training datasets, self-play).


Examples of Well-Defined Learning Problems:


Checkers Learning Problem:
Task (T): Playing checkers.
Performance Measure (P): Percent of games won against opponents.
Training Experience (E): Playing practice games against itself.

Handwriting Recognition Problem:


Task (T): Recognizing and classifying handwritten words from images.
Performance Measure (P): Percent of words correctly classified.
Training Experience (E): A database of handwritten words with given classifications.

Robot Driving Learning Problem:


Task (T): Driving on public four-lane highways using vision sensors.
Performance Measure (P): Average distance traveled before an error (as judged by a human overseer).
Training Experience (E): A sequence of images and steering commands recorded while observing a
human driver.

Broader Definition of Learning:


The definition of learning here is broad, covering any system that improves performance with
experience, such as a database system that improves query responses based on updates. The aim is to
focus on problems solvable through experience-based improvement, rather than narrow, everyday
interpretations of "learning."

Designing a Learning System (for Checkers)


To design a machine learning system to play checkers, we need to make key decisions regarding the
training experience, the type of knowledge to be learned, and the learning mechanism.

1. Choosing the Training Experience


• Direct vs. Indirect Feedback:
o Direct Feedback: The system learns from individual board states and correct moves,
making learning easier.
o Indirect Feedback: The system learns from the final outcome (win/loss) of a game. It faces
the challenge of credit assignment, i.e., determining which moves contributed to the
result.


• Control Over Training Experience:


o Teacher-Controlled: The teacher selects board states and provides correct moves.
o Learner-Controlled: The learner proposes confusing states and asks the teacher for
guidance.
o Autonomous Learning: The system plays against itself, generating its own training data.
• Representativeness of Training Experience:
o The training data should be representative of real-world situations. If the system only
trains by playing against itself, it might miss crucial board states seen in human
competition. Diverse training examples are necessary for good generalization.

2. Choosing the Target Function


To design the checkers-playing program, we need to define the target function the system will learn
to improve performance.

Target Function Choices


1. ChooseMove Function:
o The program could learn a ChooseMove function that selects the best move from a given
board state. This function maps board states (B) to moves (M), but learning it directly is
challenging due to indirect feedback from game outcomes.
2. Evaluation Function (V):
o A more practical approach is learning an evaluation function V that assigns a score to
board states, where higher scores indicate better positions. The system can select the best
move by evaluating successor states and choosing the one with the highest score.

Defining the Evaluation Function (V)


We define V(b) recursively:
1. V(b) = 100 for a winning final state.
2. V(b) = -100 for a losing final state.
3. V(b) = 0 for a drawn final state.
4. If b is not a final state, V(b) = V(b'), where b' is the best final state achievable from b, assuming
optimal play to the end of the game.
This definition is not computationally feasible because it requires searching ahead to the game's end.


Operationalizing the Evaluation Function


The learning task is to approximate the ideal evaluation function V in a computationally efficient
form, which is often called function approximation. The system learns an approximation V̂ of V to
evaluate states and select moves in real time.

3. Choosing a Representation for the Target Function


To represent the target function V, we need to choose a method for the learning program to
approximate it. This involves balancing expressiveness with the amount of training data required.

Representation Options
• Table Lookup: Stores values for each board state but is inefficient for large boards.
• Rules or Polynomials: Less flexible or too complex.
• Neural Networks: Powerful but require large datasets.

Chosen Representation
The target function V̂(b) is represented as a linear combination of key board features:
• x1: Number of black pieces
• x2: Number of red pieces
• x3: Number of black kings
• x4: Number of red kings
• x5: Number of black pieces threatened by red
• x6: Number of red pieces threatened by black
Thus, the evaluation function is:
V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
Where w0, w1, …, w6 are weights learned by the algorithm.
Summary of Design Choices
• Task (T): Playing checkers.
• Performance Measure (P): Percentage of games won.
• Training Experience (E): Games played against itself.
• Target Function (V): Maps board states to scores.
• Representation (V̂): A linear function of board features, with weights to be learned.
This reduces the problem to learning the weights w0 through w6 for the linear function.
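
A minimal Python sketch of this evaluation function (the weights and feature values below are
made-up placeholders; extracting the six features from an actual board is assumed to happen
elsewhere):

def evaluate(features, weights):
    """Linear evaluation: V̂(b) = w0 + w1*x1 + ... + w6*x6."""
    w0, *ws = weights
    return w0 + sum(w * x for w, x in zip(ws, features))

# Hypothetical weights and a board with 3 black pieces, 2 red pieces,
# 1 black king, 0 red kings, 1 black piece threatened, 2 red threatened
weights = (0.5, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5)
features = (3, 2, 1, 0, 1, 2)
print(evaluate(features, weights))  # 4.0 for this made-up position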


4. Choosing a Function Approximation Algorithm


To learn the target function V, the system needs training examples of board states b and
corresponding values V_train(b).

4.1 Estimating Training Values


• Final States: Values are assigned directly (e.g., +100 for a win, -100 for a loss).
• Intermediate States: For non-final states, the value is estimated using the learned value of the
successor state: Vtrain(b) ≈ V̂(Successor(b)). This iterative estimation can converge to accurate
values under certain conditions.

4.2 Adjusting the Weights


To fit the training data, the Least Mean Squares (LMS) algorithm is used:
• The goal is to minimize the squared error E between predicted values and training values.
• The LMS algorithm adjusts the weights incrementally: wi ← wi + η · Error(b) · xi, where Error(b) is
the difference between the predicted and training values, xi is the feature value, and η is a small
constant.
This process refines the weights to minimize error and converge to an optimal solution.
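
A sketch of a single LMS step, reusing the linear evaluation from the previous section (weights,
features, and η below are placeholder values; x0 = 1 supplies the bias term w0):

def lms_update(weights, features, v_train, eta=0.1):
    """One LMS step: w_i <- w_i + eta * Error(b) * x_i, with x0 = 1."""
    x = (1.0,) + tuple(features)
    predicted = sum(w * xi for w, xi in zip(weights, x))
    error = v_train - predicted          # Error(b) = V_train(b) - V̂(b)
    return tuple(w + eta * error * xi for w, xi in zip(weights, x))

weights = (0.5, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5)
features = (3, 2, 1, 0, 1, 2)
weights = lms_update(weights, features, v_train=100.0)  # e.g., a won game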

5. The Final Design


The final design of the checkers learning system is composed of four key modules:
1. Performance System: Plays the game using the learned evaluation function, taking a new board
state as input and producing the game history as output. Performance improves as the evaluation
function becomes more accurate.
2. Critic: Analyzes the game history and generates training examples by estimating the target
function value for each board state in the game sequence.
3. Generalizer: Takes the training examples and learns the target function by generalizing from
them. In this case, the LMS algorithm is used to adjust the weights of the evaluation function.
4. Experiment Generator: Proposes new board states for the system to explore, aiming to maximize
learning. Initially, it starts with the same board, but more advanced strategies could explore
diverse board positions.


Fig. Final design of the checkers learning program.


Design Summary
The design process is summarized in Figure, outlining decisions from the training experience to the
representation of the learned function. The system learns a linear evaluation function based on six
specific board features. This design constrains the learning task but provides a foundation for further
exploration.

Evaluation
While the linear function may not be sufficient to beat a world champion, similar approaches with
more sophisticated representations (e.g., neural networks) have proven successful. Tesauro’s
backgammon program, for instance, used a neural network to learn a complex evaluation function
and played at a high level after extensive training.
Alternative approaches like nearest neighbor or genetic algorithms could also be applied. This design
represents one approach to structuring machine learning tasks.


Fig. Summary of choices in designing the checkers learning program.

Perspectives and Issues in Machine Learning


Learning as Search:
Machine learning is often seen as searching a large hypothesis space to find the one that best fits the
observed data and prior knowledge. For example, in checkers, the hypothesis space consists of all
evaluation functions, and the learner searches for the one that aligns with the training data. Algorithms
like LMS adjust weights to minimize errors and are effective when the hypothesis space is continuously
parameterized. Different machine learning methods (e.g., decision trees, neural networks) explore
different hypothesis spaces suited for specific types of problems.

Key Issues in Machine Learning:


1. Learning Algorithms:
o What algorithms exist for learning target functions from training examples?
o Under what conditions do they converge with sufficient data?


2. Training Data Sufficiency:


o How much training data is needed?
o What bounds can relate data size and hypothesis space to generalization confidence?
3. Prior Knowledge:
o How can prior knowledge guide generalization?
o Can approximate prior knowledge still help?
4. Next Training Experience:
o What is the best strategy for selecting the next training experience?
o How does this affect the learning task's complexity?
5. Function Approximation:
o How can the learning task be framed as function approximation?
o Can the function selection process be automated?
6. Representation Improvement:
o How can the learner adapt its representation to improve learning of the target function?
These questions address fundamental challenges in machine learning, focusing on algorithms, data,
prior knowledge, and task complexity.

A CONCEPT LEARNING TASK


The concept learning task involves predicting whether Aldo enjoys his favorite water sport on a given
day, based on attributes like Sky, AirTemp, Humidity, Wind, Water, and Forecast. The goal is to learn
a hypothesis that predicts EnjoySport = Yes.

Hypothesis Representation
Each hypothesis is a conjunction of constraints on the attributes, represented by a 6-tuple. For each
attribute:
• "?": Any value is acceptable (wildcard).
• Specific value: A required value (e.g., "Warm").
• "0": No value is acceptable.
For example, (?, Cold, High, ?, ?, ?) means Aldo enjoys his sport on cold days with high humidity,
regardless of other attributes.

Hypothesis Space
• Most General Hypothesis: (?, ?, ?, ?, ?, ?) — every day is a positive example.
• Most Specific Hypothesis: (0, 0, 0, 0, 0, 0) — no day is a positive example.


Training Examples
The learner is given examples (days with attribute values and corresponding EnjoySport values). For
instance:
Sky AirTemp Humidity Wind Water Forecast EnjoySport
Sunny Warm Normal Strong Warm Same Yes
Sunny Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes

Learning Process
• Positive Example: The hypothesis generalizes to include new positive examples.
• Negative Example: The hypothesis specializes to exclude negative examples.

Goal
Learn a hypothesis that correctly predicts EnjoySport based on the attributes, starting from the most
specific hypothesis and generalizing or specializing it with each training example.

Notation
In concept learning, the task is to predict the target concept based on given attributes.
1. Instances (X): Set of all possible days, described by attributes: Sky, AirTemp, Humidity, Wind,
Water, and Forecast.
2. Target Concept (c): A boolean function c: X → {0, 1}, where c(x) = 1 if EnjoySport = Yes, and
c(x) = 0 if EnjoySport = No.
3. Hypotheses (H): Set of possible hypotheses, each representing a boolean-valued function over X,
defined by conjunctions of constraints on the attributes. Each constraint can be:
o "?": Any value is acceptable.
o "0": No value is acceptable.
o A specific value.
4. Training Examples (D): A set of pairs (x, c(x)), where x is an instance and c(x) is the target value
(Yes or No). Positive examples have c(x)=1, and negative examples have c(x)=0.
5. Goal: The learner's task is to find a hypothesis h from H such that h(x) = c(x) for all instances x ∈ X.


The Inductive Learning Hypothesis


The Inductive Learning Hypothesis assumes that a hypothesis learned from training data that fits the
target concept well will also generalize effectively to unseen examples. Since only the training data
is available, the hypothesis is evaluated based on how well it matches the target concept within this
set. The hypothesis is not guaranteed to be exact, but it is expected to approximate the target concept
well. The effectiveness of this approach depends on the representativeness of the training data; if the
data is sufficient and well-distributed, the hypothesis will likely perform well on new instances from
the same distribution.

FIND-S: Finding a Maximally Specific Hypothesis


FIND-S is an algorithm used to find the most specific hypothesis consistent with positive training
examples by utilizing the more-general-than partial ordering.

Algorithm Steps:
1. Initialize the hypothesis as the most specific possible.
2. Process each positive example:
o If the hypothesis doesn't cover the example, generalize it by replacing specific attribute
values with ? to match the example.
3. Ignore negative examples, as the hypothesis is already consistent with them.
4. Repeat until the hypothesis covers all positive examples.
Example:
For the EnjoySport task (traced in the sketch below):
• Start with the most specific hypothesis h = <0, 0, 0, 0, 0, 0>.
• First positive example: Generalize to h = <Sunny, Warm, Normal, Strong, Warm, Same>.
• Second positive example: Generalize to h = <Sunny, Warm, ?, Strong, Warm, Same>.
• Third (negative) example: No change.
• Fourth positive example: Further generalize to h = <Sunny, Warm, ?, Strong, ?, ?>.
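
A small Python sketch of FIND-S reproduces this trace (using '0' for the most specific constraint
and '?' for the wildcard, as above):

def find_s(examples):
    """Start maximally specific; generalize only on positive examples."""
    h = ['0'] * len(examples[0][0])
    for x, label in examples:
        if label != 'Yes':               # negative examples are ignored
            continue
        for i, value in enumerate(x):
            if h[i] == '0':
                h[i] = value             # adopt the first positive example
            elif h[i] != value:
                h[i] = '?'               # conflicting value: generalize
    return h

training = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 'Yes'),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), 'Yes'),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), 'Yes'),
]
print(find_s(training))  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']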

Key Characteristics:
• Most specific hypothesis: The algorithm results in the most specific hypothesis that is consistent
with the positive examples.
• No backtracking: It does not reconsider prior generalizations.


Limitations:
1. Convergence: FIND-S cannot confirm whether it has found the correct target concept, as multiple
hypotheses can be consistent with the positive examples.
2. Preference for specificity: It always favors the most specific hypothesis, though a more general
one might sometimes be better.
3. Inconsistent training data: The algorithm assumes error-free data and does not handle noise or
inconsistencies well.
4. Multiple consistent hypotheses: If there are multiple maximally specific hypotheses, FIND-S does
not explore alternatives and would need extensions to do so.

VERSION SPACES AND THE CANDIDATE-ELIMINATION ALGORITHM


The Candidate-Elimination algorithm addresses limitations of the FIND-S algorithm by maintaining
the version space, a set of all hypotheses consistent with the training data, rather than just one
hypothesis. This is done without explicitly enumerating all hypotheses, using a more-general-than
partial ordering to refine the version space incrementally with each training example.

Key Concepts:
• Consistent Hypothesis: A hypothesis h is consistent with the training set D if it correctly classifies
all examples in D (i.e., h(x)=c(x) for each example (x, c(x))).
• Version Space: The version space, VS(H, D), is the subset of hypotheses from hypothesis space H
that are consistent with the training data D.
• Incremental Refinement: The version space is updated after each training example: hypotheses
are refined to only those consistent with the new example, either positive or negative.

Main Advantage:
The Candidate-Elimination algorithm provides a compact representation of all hypotheses consistent
with the training data, allowing for more flexibility than FIND-S, though both algorithms struggle with
noisy data.

The LIST-THEN-ELIMINATE Algorithm


The LIST-THEN-ELIMINATE algorithm initializes the version space with all hypotheses in H and
eliminates any that are inconsistent with training examples. It guarantees finding all hypotheses
consistent with the data but is inefficient for large hypothesis spaces since it requires enumerating all
hypotheses.


Key Steps:
1. Initialize Version Space: Include all hypotheses in H.
2. Eliminate Inconsistent Hypotheses: Remove hypotheses that do not match the training examples.
3. Output Version Space: After processing all examples, output the remaining hypotheses.

The Candidate-Elimination algorithm improves this by using a more compact representation with
general and specific boundary sets:
• General Boundary G: Maximally general hypotheses consistent with the data.
• Specific Boundary S: Minimally general (maximally specific) hypotheses consistent with the data.
The version space is represented by these boundaries and hypotheses between them in the general-
to-specific ordering, avoiding the need to list all hypotheses explicitly.

Version Space Representation Theorem:


The version space is fully represented by G, S, and the hypotheses in between, ensuring an efficient
representation.

CANDIDATE-ELIMINATION Learning Algorithm


The CANDIDATE-ELIMINATION algorithm computes the version space of hypotheses consistent with a
sequence of training examples by maintaining two boundary sets: the general boundary (G) and the
specific boundary (S). These sets are updated as each training example is processed.

Key Steps:
1. Initialize G and S:
o G starts with the most general hypothesis.
o S starts with the most specific hypothesis.
2. For each training example d:
o If d is a positive example:
▪ Remove inconsistent hypotheses from G.
▪ Generalize inconsistent hypotheses in S to be consistent with d.
▪ Remove any more general hypotheses in S.
o If d is a negative example:
▪ Remove inconsistent hypotheses from S.
▪ Specialize inconsistent hypotheses in G to be consistent with d.
▪ Remove any less general hypotheses in G.


3. Output: After processing all examples, the version space contains all hypotheses in G and S
consistent with the training data.

Duality:
• Positive examples: Narrow G and expand S.
• Negative examples: Expand G and narrow S.
The algorithm ensures that the version space includes only hypotheses consistent with the observed
data, using minimal generalizations and specializations. It can be applied to any learning task where
these operations are defined.
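
A hedged sketch of the full algorithm for conjunctive hypotheses (attribute domains must be
supplied; running it on the EnjoySport data yields the S and G boundaries described in the text):

def consistent(h, x):
    """Does conjunctive hypothesis h match instance x?"""
    return all(hi == '?' or hi == xi for hi, xi in zip(h, x))

def more_general_or_equal(h1, h2):
    """True if h1 >=g h2 ('0' in h2 matches nothing, so anything covers it)."""
    return all(a == '?' or a == b or b == '0' for a, b in zip(h1, h2))

def min_generalize(h, x):
    """Minimal generalization of h that matches positive example x."""
    return tuple(xi if hi == '0' else (hi if hi == xi else '?')
                 for hi, xi in zip(h, x))

def min_specializations(g, domains, x):
    """Minimal specializations of g that exclude negative example x."""
    return [g[:i] + (v,) + g[i + 1:]
            for i, gi in enumerate(g) if gi == '?'
            for v in domains[i] if v != x[i]]

def candidate_elimination(examples, domains):
    S = {tuple('0' for _ in domains)}        # specific boundary
    G = {tuple('?' for _ in domains)}        # general boundary
    for x, label in examples:
        if label == 'Yes':                   # positive: narrow G, generalize S
            G = {g for g in G if consistent(g, x)}
            S = {min_generalize(s, x) for s in S}
            S = {s for s in S if any(more_general_or_equal(g, s) for g in G)}
        else:                                # negative: narrow S, specialize G
            S = {s for s in S if not consistent(s, x)}
            newG = set()
            for g in G:
                if not consistent(g, x):
                    newG.add(g)
                    continue
                for g2 in min_specializations(g, domains, x):
                    if any(more_general_or_equal(g2, s) for s in S):
                        newG.add(g2)
            G = {g for g in newG             # keep only maximally general members
                 if not any(h != g and more_general_or_equal(h, g) for h in newG)}
    return S, G

domains = [('Sunny', 'Rainy', 'Cloudy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Change')]
training = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 'Yes'),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), 'Yes'),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), 'Yes'),
]
S, G = candidate_elimination(training, domains)
print("S:", S)  # {('Sunny', 'Warm', '?', 'Strong', '?', '?')}
print("G:", G)  # {('Sunny','?','?','?','?','?'), ('?','Warm','?','?','?','?')}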

Remarks on Version Spaces and Candidate-Elimination


1. Convergence of the Candidate-Elimination Algorithm
The Candidate-Elimination algorithm converges to the correct hypothesis if:
1. Training examples are error-free.
2. The target concept can be described in the hypothesis space.
The version space is refined as new training examples are presented, and the correct hypothesis is
identified when the S and G boundaries converge to one hypothesis. However, errors in training data
(e.g., incorrect labeling) can cause the algorithm to eliminate the correct hypothesis. Similarly, if the
target concept cannot be described by the hypothesis space, the version space may become empty.

2. Optimal Query Strategy


When the learner can request examples, the optimal strategy is to select instances that split the
version space equally. This maximizes the reduction of hypotheses with each new example, reducing
the version space by half each time. The goal is to select queries that test hypotheses in the current
version space, ensuring efficient learning. In cases where an exact split isn't possible, more queries
may be required.

3. Classifying New Instances with Partially Learned Concepts


Even with an incomplete version space, the learner can classify new instances confidently:
• Instance A: If all hypotheses classify it as positive, it is confidently classified as positive.
• Instance B: If all hypotheses classify it as negative, it is confidently classified as negative.
• Instance C: If hypotheses disagree, further data is needed to resolve ambiguity.
• Instance D: With mixed classifications, a majority vote can be used, with a confidence rating.


Instances that satisfy all S hypotheses are classified as positive, and those that satisfy none of the G
hypotheses are classified as negative. If further ambiguity exists, a majority vote can be used to classify
the instance, reflecting the most probable outcome based on the current version space.

Inductive Bias
1. A Biased Hypothesis Space
If the hypothesis space doesn’t include the target concept, the algorithm may fail to find it. For
example, with the EnjoySport task, restricting the hypothesis space to conjunctions of attributes
prevents representing disjunctive concepts like "Sky = Sunny or Cloudy." This can lead to no consistent
hypothesis being found, requiring a more expressive hypothesis space.

2. An Unbiased Learner
To ensure the target concept is in the hypothesis space, an unbiased learner would need a hypothesis
space representing all possible subsets of instances (the power set). However, while this allows any
concept to be expressed, it causes the learner to fail at generalizing beyond observed examples. For
instance, the S boundary would just be a disjunction of positive examples, and the G boundary would
exclude negative ones, meaning only training examples could be classified unambiguously. The learner
would need to see every instance to converge on the target concept, and even with partial learning, it
can't generalize to unobserved instances.

3. The Futility of Bias-Free Learning


Inductive learning requires inductive bias—prior assumptions that guide the learner to generalize
beyond observed data. Without such assumptions, a learner has no rational basis for classifying new
instances. For example, the CANDIDATE-ELIMINATION algorithm can generalize in the EnjoySport task
because it assumes the target concept can be represented as a conjunction of attribute values. If this
assumption is incorrect, it may misclassify instances.
Inductive bias is the set of assumptions that allow a learner to justify its classifications as deductive
inferences. For a learning algorithm L, the inductive bias B ensures that the classification of any new
instance xi follows from the training data D and assumptions B:
(B ∧ D ∧ xi) ⊢ L(xi, D)
In the CANDIDATE-ELIMINATION algorithm, the inductive bias is the assumption that the target
concept c is in the hypothesis space H, which ensures that classifications are based on consistent
hypotheses in the version space.


Summary of Inductive Bias in Algorithms:


1. ROTE-LEARNER: No inductive bias, simply memorizes training examples and classifies based on
direct matches.
2. CANDIDATE-ELIMINATION: Assumes the target concept is in the hypothesis space H and classifies
by version space consistency.
3. FIND-S: Assumes the target concept is in the hypothesis space and that all instances are negative
unless otherwise specified.
Stronger inductive biases, like in FIND-S, allow more classifications but make more assumptions,
potentially leading to errors if those assumptions are incorrect. The strength of inductive bias affects
the learner's ability to generalize from the training data.

Fig. Modeling inductive systems by equivalent deductive systems
