Recap
A* expands the fringe node with lowest f value where
f(n) = g(n) + h(n)
g(n) is the cost to reach n
h(n) is an admissible estimate of the least cost from n to a goal node:
0 ≤ h(n) ≤ h*(n)
A* tree search is optimal
Its performance depends heavily on the heuristic h
Creating Heuristics
Creating Admissible Heuristics
Often, admissible heuristics are solutions to relaxed problems, where new
actions are available
Problem P2 is a relaxed version of P1 if A2(s) ⊇ A1(s) for every s
Theorem: h2*(s) ≤ h1*(s) for every s, so h2*(s) is admissible for P1
Example: 8 Puzzle
(Figure: start state, available actions, and goal state of the 8-puzzle.)
What are the states?
How many states?
What are the actions?
What are the step costs?
8 Puzzle I
Heuristic: Number of tiles misplaced
Why is it admissible?
h(start) = 8
(Figure: start and goal states.)
Average nodes expanded when the optimal path has…
            …4 steps    …8 steps    …12 steps
UCS         112         6,300       3.6 × 10^6
A*TILES     13          39          227
Statistics from Andrew Moore
8 Puzzle II
What if we had an easier 8-puzzle where
any tile could slide any direction at any
time, ignoring other tiles?
Heuristic: Total Manhattan distance
(Figure: start and goal states.)
Why is it admissible?
h(start) = 3 + 1 + 2 + … = 18
Average nodes expanded when the optimal path has…
              …4 steps    …8 steps    …12 steps
A*TILES       13          39          227
A*MANHATTAN   12          25          73
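Both heuristics fit in a few lines of code. A minimal sketch, assuming a state is a tuple of 9 tiles in row-major order with 0 for the blank (this representation is my choice, not from the slides):

def misplaced_tiles(state, goal):
    # h1: count tiles (not the blank) that are out of place
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def manhattan(state, goal, width=3):
    # h2: total Manhattan distance of each tile from its goal square
    goal_pos = {tile: i for i, tile in enumerate(goal)}
    return sum(abs(i // width - goal_pos[t] // width)
               + abs(i % width - goal_pos[t] % width)
               for i, t in enumerate(state) if t != 0)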
Combining heuristics
Dominance: h1 ≥ h2 if
∀n: h1(n) ≥ h2(n)
Roughly speaking, larger is better as long as both are admissible
The zero heuristic is pretty bad (what does A* do with h=0?)
The exact heuristic is pretty good, but usually too expensive!
What if we have two heuristics, neither dominates the other?
Form a new heuristic by taking the max of both:
h(n) = max( h1(n), h2(n))
Max of admissible heuristics is admissible and dominates both!
Example: Knight’s moves
Minimum number of knight’s moves to get from A to B?
h1 = (Manhattan distance)/3
h1’ = h1 rounded up to correct parity (even if A, B same color, odd otherwise)
h2 = (Euclidean distance)/√5 (rounded up to correct parity)
h3 = (max x or y shift)/2 (rounded up to correct parity)
h(n) = max( h1’(n), h2(n), h3(n)) is admissible!
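A sketch of the combined heuristic (the parity-rounding helper and the (x, y) square representation are my framing, not from the slide):

import math

def knight_heuristic(a, b):
    # Lower bound on knight moves from square a to square b; a, b = (x, y)
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    if dx == 0 and dy == 0:
        return 0
    # A knight move flips square color, so the move count is even iff
    # a and b are the same color, i.e. (dx + dy) is even
    parity = (dx + dy) % 2

    def round_up(est):
        m = math.ceil(est)
        return m if m % 2 == parity else m + 1

    h1 = round_up((dx + dy) / 3)                       # Manhattan distance / 3
    h2 = round_up(math.hypot(dx, dy) / math.sqrt(5))   # Euclidean distance / sqrt(5)
    h3 = round_up(max(dx, dy) / 2)                     # max x or y shift / 2
    return max(h1, h2, h3)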
Optimality of A* Graph Search
A* Graph Search Gone Wrong?
(Figure: state space graph and corresponding search tree. Graph: S→A and S→B with cost 1, A→C with cost 1, B→C with cost 2, C→G with cost 3; h(S)=2, h(A)=4, h(B)=1, h(C)=1, h(G)=0. Tree: S (0+2) expands to A (1+4) and B (1+1); B is expanded first, giving C (3+1) and then G (6+0), even though the path through A reaches C more cheaply as C (2+1) and would reach G (5+0).)
Simple check against expanded set blocks C
Fancy check allows new C if cheaper than old,
but requires recalculating C’s descendants
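One common way to implement the “cheaper” check is to track the best g found per state and re-insert a node whenever a cheaper path appears (a sketch; the successors/h/is_goal interface is an assumption):

import heapq
from itertools import count

def a_star(start, successors, h, is_goal):
    tie = count()  # tiebreaker so the heap never compares states directly
    frontier = [(h(start), next(tie), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        f, _, g, state, path = heapq.heappop(frontier)
        if g > best_g.get(state, float('inf')):
            continue  # stale entry: a cheaper path to this state was found
        if is_goal(state):
            return path, g
        for nxt, cost in successors(state):
            g2 = g + cost
            if g2 < best_g.get(nxt, float('inf')):
                best_g[nxt] = g2  # cheaper path found: allow re-expansion
                heapq.heappush(frontier,
                               (g2 + h(nxt), next(tie), g2, nxt, path + [nxt]))
    return None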
Consistency of Heuristics
Main idea: estimated heuristic costs ≤ actual costs
Admissibility: heuristic cost ≤ actual cost to goal
h(A) ≤ h*(A)
Consistency: heuristic “arc” cost ≤ actual cost for each arc
h(A) − h(C) ≤ c(A,C)
or h(A) ≤ c(A,C) + h(C) (triangle inequality)
Note: h* necessarily satisfies triangle inequality
Consequences of consistency:
The f value along a path never decreases:
h(A) ≤ c(A,C) + h(C) => g(A) + h(A) ≤ g(A) + c(A,C) + h(C) = g(C) + h(C), i.e., f(A) ≤ f(C)
A* graph search is optimal
Optimality of A* Graph Search
Sketch: consider what A* does with a
consistent heuristic:
Fact 1: In tree search, A* expands nodes in increasing total f value (f-contours f1 < f2 < f3 < …)
Fact 2: For every state s, nodes that reach s optimally are expanded before nodes that reach s suboptimally
Result: A* graph search is optimal
Optimality
Tree search:
A* is optimal if heuristic is admissible
Graph search:
A* optimal if heuristic is consistent
Consistency implies admissibility
Most natural admissible heuristics tend to be
consistent, especially if from relaxed problems
But…
A* keeps the entire explored region in memory
=> will run out of space before you get bored waiting for the answer
There are variants that use less memory (Section 3.5.5):
IDA* works like iterative deepening, except it uses an f-limit instead of a depth limit (see the sketch after this list)
On each iteration, remember the smallest f-value that exceeds the current limit, and use it as the new limit
Very inefficient when f is real-valued and each node has a unique value
RBFS is a recursive depth-first search that uses an f-limit = the f-value of the best
alternative path available from any ancestor of the current node
When the limit is exceeded, the recursion unwinds but remembers the best reachable f-value on
that branch
SMA* uses all available memory for the queue, minimizing thrashing
When full, drop worst node on the queue but remember its value in the parent
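A minimal IDA* sketch, under an assumed interface (successors(s) yields (state, cost) pairs; h and is_goal as before):

import math

def ida_star(start, successors, h, is_goal):
    def dfs(state, g, path, limit):
        # Returns (path, cost) on success, else the smallest f beyond the limit
        f = g + h(state)
        if f > limit:
            return f
        if is_goal(state):
            return (path, g)
        smallest = math.inf
        for nxt, cost in successors(state):
            if nxt in path:
                continue  # avoid cycles along the current path
            result = dfs(nxt, g + cost, path + [nxt], limit)
            if isinstance(result, tuple):
                return result
            smallest = min(smallest, result)
        return smallest

    limit = h(start)
    while True:
        result = dfs(start, 0, [start], limit)
        if isinstance(result, tuple):
            return result  # (path, cost)
        if result == math.inf:
            return None  # search space exhausted, no solution
        limit = result  # smallest f-value that exceeded the old limit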
A*: Summary
A* orders nodes in the queue by f(n) = g(n) + h(n)
A* is optimal for trees/graphs with admissible/consistent heuristics
Heuristic design is key: often use relaxed problems
Local search
Local search algorithms
In many optimization problems, path is irrelevant; the goal state is the solution
Then state space = set of “complete” configurations;
find configuration satisfying constraints, e.g., n-queens problem; or, find
optimal configuration, e.g., travelling salesperson problem
In such cases, can use iterative improvement algorithms: keep a single “current”
state, try to improve it
Constant space, suitable for online as well as offline search
More or less unavoidable if the “state” is yourself (i.e., learning)
Hill Climbing
Simple, general idea:
Start wherever
Repeat: move to the best neighboring state
If no neighbors better than current, quit
Heuristic for n-queens problem
Goal: n queens on board with no conflicts, i.e., no queen attacking another
States: n queens on board, one per column
Actions: move a queen in its column
Heuristic value function: number of conflicts
Hill-climbing algorithm
function HILL-CLIMBING(problem) returns a state
current ← make-node(problem.initial-state)
loop do
neighbor ← a highest-valued successor of current
if neighbor.value ≤ current.value then
return current.state
current ← neighbor
“Like climbing Everest in thick fog with amnesia”
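A runnable sketch of the pseudocode for n-queens, minimizing conflicts (equivalently, maximizing −conflicts); no sideways moves:

import random

def conflicts(queens):
    # queens[c] = row of the queen in column c; count attacking pairs
    n = len(queens)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if queens[i] == queens[j] or abs(queens[i] - queens[j]) == j - i)

def hill_climb(n=8):
    queens = [random.randrange(n) for _ in range(n)]
    while True:
        current = conflicts(queens)
        best, best_move = current, None
        for col in range(n):            # try moving each queen within its column
            saved = queens[col]
            for row in range(n):
                if row == saved:
                    continue
                queens[col] = row
                c = conflicts(queens)
                if c < best:
                    best, best_move = c, (col, row)
            queens[col] = saved
        if best_move is None:
            return queens, current      # solved if current == 0, else stuck
        col, row = best_move
        queens[col] = row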
Global and local maxima
Random restarts: find the global optimum (duh)
Random sideways moves: escape from shoulders,
but loop forever on flat local maxima
Hill-climbing on the 8-queens problem
No sideways moves:
Succeeds w/ prob. 0.14
Average number of moves per trial:
4 when it succeeds, 3 when it gets stuck
Expected total number of moves needed:
3(1−p)/p + 4 ≈ 22 moves
(with success probability p = 0.14, expect (1−p)/p failed trials of ~3 moves each before one successful trial of ~4 moves)
Allowing 100 sideways moves:
Succeeds w/ prob. 0.94
Average number of moves per trial:
21 when succeeding, 65 when getting stuck
Expected total number of moves needed:
65(1−p)/p + 21 ≈ 25 moves
Moral: algorithms with knobs to twiddle are irritating
Simulated annealing
Resembles the annealing process used to cool metals slowly to
reach an ordered (low-energy) state
Basic idea:
Allow “bad” moves occasionally, depending on “temperature”
High temperature => more bad moves allowed, shake the system out of
its local minimum
Gradually reduce temperature according to some schedule
Sounds pretty flaky, doesn’t it?
Simulated annealing algorithm
function SIMULATED-ANNEALING(problem,schedule) returns a state
current ← problem.initial-state
for t = 1 to ∞ do
T ←schedule(t)
if T = 0 then return current
next ← a randomly selected successor of current
∆E ← next.value – current.value
if ∆E > 0 then current ← next
else current ← next only with probability e^(∆E/T)
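The same algorithm as runnable Python (a sketch; successors, value, and schedule are assumed callables):

import math, random

def simulated_annealing(initial, successors, value, schedule):
    current = initial
    t = 0
    while True:
        t += 1
        T = schedule(t)
        if T == 0:
            return current
        nxt = random.choice(successors(current))
        delta_e = value(nxt) - value(current)
        # uphill moves always accepted; downhill with probability e^(dE/T)
        if delta_e > 0 or random.random() < math.exp(delta_e / T):
            current = nxt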
Simulated Annealing
Theoretical guarantee:
Stationary distribution (Boltzmann): P(x) ∝ e^(E(x)/T)
If T decreased slowly enough, will converge to optimal state!
Proof sketch
Consider two adjacent states x, y with E(y) > E(x) [high is good]
Assume x→y and y→x are possible moves, with outdegrees D(x) = D(y) = D
Let P(x), P(y) be the equilibrium occupancy probabilities at temperature T
Let P(x→y) be the probability that state x transitions to state y
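One way to finish the sketch (assuming the Metropolis acceptance rule, which the slide leaves implicit): in equilibrium, detailed balance requires
P(x) P(x→y) = P(y) P(y→x)
With P(x→y) = 1/D (the uphill move is always accepted) and P(y→x) = (1/D) e^((E(x)−E(y))/T) (downhill, accepted with probability e^(∆E/T)), this gives
P(x)/P(y) = e^((E(x)−E(y))/T), i.e., P(x) ∝ e^(E(x)/T), the Boltzmann distribution.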
(Figure: occupation probability as a function of T, over states x ordered by E(x).)
Simulated Annealing
Is this convergence an interesting guarantee?
Sounds like magic, but reality is reality:
The more downhill steps you need to escape a local optimum,
the less likely you are to ever make them all in a row
“Slowly enough” may mean exponentially slowly
Random-restart hill climbing also converges to the optimal state…
Simulated annealing and its relatives are a key
workhorse in VLSI layout and other optimal
configuration problems
Local beam search
Basic idea:
K copies of a local search algorithm, initialized randomly
For each iteration:
Generate ALL successors from K current states
Choose best K of these to be the new current states
(or choose K randomly, with a bias towards good ones)
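A compact sketch of the loop (successors and value are assumed callables; higher value is better):

import heapq

def local_beam_search(initial_states, successors, value, iterations=100):
    states = list(initial_states)
    k = len(states)  # beam width K
    for _ in range(iterations):
        candidates = [s2 for s in states for s2 in successors(s)]
        if not candidates:
            break
        states = heapq.nlargest(k, candidates, key=value)  # keep best K overall
    return max(states, key=value)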
Beam search example (K=4)
(Figure: search tree keeping the best K = 4 successors at each level; discarded candidates marked X.)
Local beam search
Why is this different from K local searches in parallel?
The searches communicate! “Come over here, the grass is greener!”
What other well-known algorithm does this remind you of?
Evolution!
Genetic algorithms
Genetic algorithms use a natural selection metaphor
Resample K individuals at each step (selection) weighted by fitness function
Combine by pairwise crossover operators, plus mutation to give variety
Example: N-Queens
Does crossover make sense here?
What would mutation be?
What would a good fitness function be?
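One possible set of answers, as a sketch of the GA loop for n-queens (the parameters and the +1 fitness smoothing are illustrative choices, not from the slide): fitness = number of non-attacking pairs, crossover = single cut point, mutation = move one random queen.

import random

def fitness(q):
    # number of non-attacking pairs; max = n(n-1)/2 means solved
    n = len(q)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if q[i] != q[j] and abs(q[i] - q[j]) != j - i)

def genetic_nqueens(n=8, k=20, generations=1000, p_mutate=0.1):
    pop = [[random.randrange(n) for _ in range(n)] for _ in range(k)]
    best = max(pop, key=fitness)
    target = n * (n - 1) // 2
    for _ in range(generations):
        weights = [fitness(q) + 1 for q in pop]  # +1 avoids all-zero weights
        new_pop = []
        for _ in range(k):
            x, y = random.choices(pop, weights=weights, k=2)  # selection
            cut = random.randrange(1, n)
            child = x[:cut] + y[cut:]                         # crossover
            if random.random() < p_mutate:                    # mutation
                child[random.randrange(n)] = random.randrange(n)
            new_pop.append(child)
        pop = new_pop
        best = max(pop + [best], key=fitness)
        if fitness(best) == target:
            return best
    return best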
Local search in continuous spaces
Example: Siting airports in Romania
Place 3 airports to minimize the sum of squared distances from each city to its nearest airport
Airport locations: x = ((x1,y1), (x2,y2), (x3,y3))
City locations: (xc, yc)
Ca = set of cities closest to airport a
Objective: minimize
f(x) = Σa Σc∈Ca [(xa − xc)² + (ya − yc)²]
Handling a continuous state/action space
1. Discretize it!
Define a grid with increment δ, use any of the discrete algorithms
2. Choose random perturbations to the state
a. First-choice hill-climbing: keep trying until something improves the state
b. Simulated annealing
3. Compute gradient of f(x) analytically
Finding extrema in continuous space
Gradient vector: ∇f(x) = (∂f/∂x1, ∂f/∂y1, ∂f/∂x2, …)ᵀ
For the airports: f(x) = Σa Σc∈Ca [(xa − xc)² + (ya − yc)²]
∂f/∂x1 = Σc∈C1 2(x1 − xc)
At an extremum, ∇f(x) = 0
Can sometimes solve in closed form: x1 = (Σc∈C1 xc)/|C1|
Is this a local or global minimum of f?
Gradient descent: x ← x − α∇f(x)
Huge range of algorithms for finding extrema using gradients
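A sketch of gradient descent for the airport objective (the step size alpha and random initialization are illustrative; note the nearest-airport sets Ca change as the airports move, so they are recomputed each step):

import random

def airport_descent(cities, k=3, alpha=0.01, steps=1000):
    # cities: list of (x, y) pairs; returns k airport locations
    airports = [(random.random(), random.random()) for _ in range(k)]
    for _ in range(steps):
        # recompute C_a: assign each city to its nearest airport
        groups = [[] for _ in range(k)]
        for cx, cy in cities:
            a = min(range(k),
                    key=lambda i: (airports[i][0] - cx) ** 2
                                + (airports[i][1] - cy) ** 2)
            groups[a].append((cx, cy))
        # gradient step: df/dx_a = sum over c in C_a of 2(x_a - x_c)
        airports = [(ax - alpha * sum(2 * (ax - cx) for cx, _ in grp),
                     ay - alpha * sum(2 * (ay - cy) for _, cy in grp))
                    for (ax, ay), grp in zip(airports, groups)]
    return airports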
Summary
Many configuration and optimization problems can be
formulated as local search
General families of algorithms:
Hill-climbing, continuous optimization
Simulated annealing (and other stochastic methods)
Local beam search: multiple interacting searches
Genetic algorithms: break and recombine states
Many machine learning algorithms are local searches