Data Science - Part XIV - Genetic Algorithms
Presented by: Derek Kane
 What is a Genetic Algorithm?
 Biological Inspiration
 Evolution
 Algorithm Mechanics
 Practical Application Example
 Knapsack Problem
 Feature Selection
 Traveling Salesman
Charles Darwin – “It is not the strongest of the species
that survives, nor the most intelligent, but the one
most responsive to change.”
 A genetic algorithm is an adaptation procedure
based on the mechanics of natural genetics
and natural selection.
 Genetic algorithms have two essential
components:
 “Survival of the fittest”
 Genetic Diversity
 Originally developed by John Holland (1975).
 The genetic algorithm (GA) is a search heuristic
that mimics the process of natural evolution.
 Uses concepts of “Natural Selection” and
“Genetic Inheritance” (Darwin 1859).
Applications of Genetic Algorithms
 Optimization and Search Problems
 Scheduling and Timetabling
 Aerospace engineering
 Astronomy and astrophysics
 Chemistry
 Electrical engineering
 Financial markets
 Game playing
 Materials engineering
 Military and law enforcement
 Molecular biology
 Pattern recognition and data mining
 Robotics
 Nature is beautiful…
The Aye-Aye
 What can we learn from nature?
 To understand biological processes
properly, we must first have an
understanding of the cell.
 Human bodies are made up of trillions of
cells.
 Each cell has a core structure (nucleus) that
contains your chromosomes.
 Additionally, each of our 23 chromosomes
are made up of tightly coiled strands of
deoxyribonucleic acid (DNA).
 Genes are segments of DNA that
determine specific traits, such as eye
or hair color.
 Humans have more than 20,000
genes. Each gene determines some
aspect of the organism.
 A collection of genes is sometimes
called a genotype.
 A collection of aspects (like eye
characteristics) is sometimes called a
phenotype.
 A gene mutation is an alteration in
your DNA.
 It can be inherited or acquired during
your lifetime, as cells age or are
exposed to certain chemicals.
 Mutations can also be triggered
through errors within the DNA
replication process.
 Some changes in your genes result in
genetic disorders.
Joseph Merrick aka “The Elephant Man” is believed to
have suffered from a genetic disorder called proteus
syndrome.
 Reproduction involves recombination
of genes from parents and then small
amounts of mutation (errors) in
copying.
 The fitness of an organism is how
much it can reproduce before it dies.
 Here is an example of the passing of
chromosomes within human
reproduction.
Natural Selection
Darwin's theory of evolution:
 Only the organisms best adapted to their
environment tend to survive and transmit their
genetic characteristics in increasing numbers to
succeeding generations while those less
adapted tend to be eliminated.
 A genetic algorithm maintains a population of
candidate solutions for the problem at hand,
and makes it evolve by iteratively applying a set
of stochastic operators.
 The only intelligent systems on this planet
are biological.
 Biological intelligences are designed by
natural evolutionary processes.
 These intelligent organisms often work
together in groups, swarms, or flocks.
 They don't appear to use logic, mathematics,
complex planning, or complicated modeling of
their environment.
 They can achieve complex information
processing and computational tasks that
current artificial intelligences find very
challenging indeed.
 Biological organisms cope with the demands
of their environments.
 They use solutions quite unlike the
traditional human-engineered approaches
to problem solving.
 They exchange information about what
they’ve discovered in the places they have
visited.
 Bio-inspired computing is a field devoted to
tackling complex problems using
computational methods modeled after
design principles encountered in nature.
Classical computing’s strengths:
 Number-crunching
 Thought-support (glorified pen-and-paper)
 Rule-based reasoning
 Constant repetition of well-defined actions.
Classical computing’s weaknesses:
 Pattern recognition
 Robustness to damage
 Dealing with vague and incomplete
information
 Adapting and improving based on
experience
 Bio-inspired computing takes a more
evolutionary approach to learning.
 In traditional AI, the intelligence is often
programmed from above. The
Programmer creates the program and
imbues it with its intelligence.
 Bio-inspired computing, on the other
hand, takes a more bottom-up,
decentralized approach.
 Bio-inspired computing often involves
specifying a set of simple rules and a set of
simple organisms that adhere to those rules.
DARPA - Legged Squad Support System (LS3)
Suppose you have a problem.
 …And you don’t know how to solve it.
 What can you do?
 Can you use a computer to somehow find a
solution?
 This would be nice! Can it be done?
Brute-Force Solution:
 A “blind generate and test” algorithm.
 Repeat
 Generate a random possible solution
 Test the solution and see how the
solution performed.
 Stop once the solution is good enough.
Can we use this Brute-Force idea?
 Sometimes - YES
 if there are only a few possible solutions
 and you have enough time
 then such a method could be used
 For most problems - NO
 many possible solutions
 with no time to try them all
 Therefore, this method cannot be used.
Key Point: The total number of solutions for 25 data
points is 310,224,200,866,619,719,680,000.
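That figure matches the number of distinct closed tours through 25 points, (25−1)!/2: fixing the start city removes rotations, and dividing by two removes traversal direction. A quick check (illustrative Python; the function name is my own):

```python
import math

def tour_count(n):
    """Distinct closed tours through n cities: (n - 1)! / 2.

    Fixing the start city removes rotations; dividing by 2 removes
    the two traversal directions of each tour."""
    return math.factorial(n - 1) // 2

print(tour_count(25))  # 310224200866619719680000
```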
Search Techniques
 Calculus-Based Techniques
 Fibonacci
 Sort
 Guided Random Search Techniques
 Tabu Search
 Hill Climbing
 Simulated Annealing
 Evolutionary Algorithms
 Genetic Algorithms
 Genetic Programming
 Enumerative Techniques
 DFS
 Dynamic Programming
 BFS
Basic GA flow:
 Initialize Population → Evaluate Fitness → Satisfy Constraints?
 Yes → Output Results
 No → Select Survivors → Randomly Vary Individuals → Evaluate Fitness (repeat)
How do you encode a solution?
 This depends on the problem we are trying to
solve with genetic algorithms.
 Genetic algorithms often encode solutions as
fixed-length “bitstrings” (e.g. 101110, 111111,
000101), which can also be thought of as
chromosomes.
 Each bit represents some aspect of the
proposed solution to the problem.
 For genetic algorithms to work, we need to be
able to “test” any string and get a “score”
indicating how “good” that solution is.
 The set of all possible solutions [0 to 1000]
is called the search space or state space.
 In this example, it’s just one number but it
could be many numbers.
 Often genetic algorithms code numbers
in binary producing a bitstring
representing a solution.
 We choose 10 bits, which is enough to
represent 0 to 1000 (2^10 = 1024).
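A minimal sketch of this encoding (illustrative Python; function names are my own):

```python
BITS = 10  # 2**10 = 1024, enough to cover the values 0..1000

def encode(x):
    """Encode an integer in [0, 1000] as a fixed-length bitstring."""
    return format(x, '0{}b'.format(BITS))

def decode(bits):
    """Decode a bitstring back into its integer value."""
    return int(bits, 2)

print(encode(1000))          # '1111101000'
print(decode('0000000001'))  # 1
```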
Search Space
 For a simple function f(x) the search space is
one dimensional.
 But by encoding several values into the
chromosome many dimensions can be
searched e.g. two dimensions f(x,y).
 The search space can be visualized as a
surface or fitness landscape in which fitness
dictates height.
 Each possible genotype is a point in the
space.
 A genetic algorithm tries to move the points
to better places (higher fitness) in the space.
 Various fitness landscapes
Implicit fitness functions
 Most GAs use an explicit and static fitness
function.
 Some genetic algorithms (such as in Artificial
Life or Evolutionary Robotics) use dynamic
and implicit fitness functions - like “how
many obstacles did I avoid”.
Selection pressure is driven by the ratio: Individual’s fitness / Average fitness of population
Selecting Parents for breeding
 Many schemes are possible so long as better
scoring chromosomes are more likely selected.
 Score is often termed the “fitness”
“Roulette Wheel” selection can be used:
 Add up the fitnesses of all chromosomes
 Generate a random number R in that range
 Select the first chromosome in the population
that - when all previous fitnesses are added -
gives you, at a minimum, the value R
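The roulette-wheel procedure above can be sketched as follows (illustrative Python; the function name is my own):

```python
import random

def roulette_select(population, fitnesses):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(fitnesses)
    r = random.uniform(0, total)  # random number R in [0, total fitness]
    running = 0
    for chromo, fit in zip(population, fitnesses):
        running += fit
        # first chromosome whose cumulative fitness reaches R
        if running >= r:
            return chromo
    return population[-1]  # guard against floating-point round-off
```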
 The crossover point is a single point
identified at random between two
chromosomes.
 A crossover rate is predetermined (using a
high probability like 0.8 to 0.95)
and then the crossover is applied to the
parents.
 The idea is that crossover preserves “good
bits” from different parents, combining
them to produce better solutions.
 A good encoding scheme would therefore
try to preserve “good bits” during crossover
and mutation.
Crossover single
point - random
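Single-point crossover as described above can be sketched like this (illustrative Python; the function name and default rate are my own choices):

```python
import random

def crossover(parent1, parent2, rate=0.9):
    """Single-point crossover, applied with a pre-set probability."""
    if random.random() < rate:
        # pick a random cut point strictly inside the string
        point = random.randint(1, len(parent1) - 1)
        child1 = parent1[:point] + parent2[point:]
        child2 = parent2[:point] + parent1[point:]
        return child1, child2
    return parent1, parent2  # no crossover: parents pass through unchanged
```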
 With some small probability (the
mutation rate), we flip each bit in the
offspring.
 Typical values for the mutation rate are
very small, usually between 0.001 and 0.1.
 Causes movement in the search space
(local or global).
 Restores lost information to the
population.
Mutate
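The per-bit flip described above can be sketched as (illustrative Python; the function name and default rate are my own):

```python
import random

def mutate(chromosome, rate=0.01):
    """Flip each bit independently with a small probability (the mutation rate)."""
    return ''.join(
        ('1' if bit == '0' else '0') if random.random() < rate else bit
        for bit in chromosome
    )
```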
There are many variants of GA
 Different kinds of selection (not roulette)
 Tournament
 Elitism, etc.
 Different recombination procedures
 Multi-point crossover
 3-way crossover, etc.
 Different kinds of encoding other than
bitstring.
 Integer values
 Ordered set of symbols
 Different kinds of mutation
GA implementation considerations:
 Representation
 population size, mutation rate, ...
 selection, deletion policies
 crossover, mutation operators
 Termination Criteria
 Performance, scalability
 The Solution is only as good as the evaluation function (often hardest part)
Let’s go through a simple example to showcase the steps.
Basic Genetic Algorithm Process:
 Produce an initial population of individuals
 Evaluate the fitness of all individuals
 while termination condition not met do
 select fitter individuals for reproduction
 recombine between individuals
 mutate individuals
 evaluate the fitness of the modified individuals
 generate a new population
 End while
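The pseudocode above can be sketched end-to-end as a small, self-contained program (illustrative Python, not the deck's R code; the fitness here counts ones, matching the MAXONE example that follows, and the parameter values are arbitrary choices):

```python
import random

def run_ga(L=10, n=6, generations=50, cx_rate=0.6, mut_rate=0.05, seed=42):
    """Minimal GA maximizing the number of ones in an L-bit string."""
    rng = random.Random(seed)
    fitness = lambda s: s.count('1')
    # initial population of n random L-bit strings
    pop = [''.join(rng.choice('01') for _ in range(L)) for _ in range(n)]
    for _ in range(generations):
        total = sum(fitness(s) for s in pop) or 1

        def pick():  # fitness-proportionate (roulette) selection
            r, acc = rng.uniform(0, total), 0
            for s in pop:
                acc += fitness(s)
                if acc >= r:
                    return s
            return pop[-1]

        nxt = []
        while len(nxt) < n:
            p1, p2 = pick(), pick()
            if rng.random() < cx_rate:  # single-point crossover
                pt = rng.randint(1, L - 1)
                p1, p2 = p1[:pt] + p2[pt:], p2[:pt] + p1[pt:]
            # per-bit mutation
            flip = lambda s: ''.join(b if rng.random() >= mut_rate
                                     else ('1' if b == '0' else '0') for b in s)
            nxt += [flip(p1), flip(p2)]
        pop = nxt[:n]
    return max(pop, key=fitness)

best = run_ga()
print(best, best.count('1'))
```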
The MAXONE problem:
 Suppose we want to maximize the number of ones
in a string of L binary digits.
 It may seem trivial because we know the answer in
advance.
 However, we can think of it as maximizing the
number of correct answers, each encoded by 1, to L
difficult yes/no questions.
Encoding
 An individual is encoded (naturally) as a string of L
binary digits
 Let’s say L = 10. Then, 1 = 0000000001 (10 bits)
 We start with a population of n random strings.
Suppose that L = 10 and n = 6
 We toss a fair coin 60 times and get the following
initial population:
 The fitness function f(x): the number of ones in
the bitstring x.
 Next we apply fitness proportionate selection with
the roulette wheel method:
 We repeat the extraction as many times as the
number of individuals.
 We need to have the same parent population size
(6 in our case).
Area on the roulette wheel is proportional to fitness value. Individual i has a probability of being chosen:

p(i) = f(i) / Σⱼ f(j)
 Suppose that, after performing selection, we get
the following population:
 For each couple we decide according
to crossover probability (for instance
0.6) whether to actually perform
crossover or not.
 Suppose that we decide to actually
perform crossover only for couples
(s1`, s2`) and (s5`, s6`).
 For each couple, we randomly extract
a crossover point, for instance 2 for
the first and 5 for the second.
 The final step is to apply random
mutation:
 For each bit that we are to copy to the
new population we allow a small
probability of error (for instance 0.1)
 Causes movement in the search space
(local or global)
 Restores lost information to the
population
 After applying mutation:
 In one generation, the total population fitness changed from 34 to 37, thus improved
by ~9%
 At this point, we go through the same process all over again, until a stopping criterion
is met
Advantages:
 Concepts are easy to understand
 Inspired by nature
 Has many areas of application
 GAs are a powerful, general search method
 Intrinsically parallel; easily distributed
 Always produces an answer, and the answer
gets better with time
 Less time required for some special
applications
 Greater chances of getting an optimal
solution
Limitations
 The population size should be moderate and
suited to the problem (normally 20-30 or
50-100)
 The crossover rate should be 80%-95%
 The mutation rate should be low; 0.5%-1% is
usually considered best
 The selection method should be
appropriate for the problem.
 The fitness function must be written
accurately.
 The knapsack problem can be thought of as the
following:
 “You are going to spend a month in the
wilderness. You’re taking a backpack with you,
however, the maximum weight it can carry is 20
kilograms. You have a number of survival items
available, each with its own number of “survival
points”. Your objective is to maximize the
number of survival points.”
 This type of problem is called a constrained
optimization problem because we are limited to
carrying a maximum weight of 20 kilograms.
 Our goal is to devise:
 A genetic algorithm to approach solving the
knapsack problem.
 In order to better prepare the analysis, we must first understand the data we are working with.
 There are 7 row entries in our chromosome; a 0 represents “do not include” and a 1
represents “include”.
 Therefore, an example of a chromosome would be “1001011”
Each row will be encoded into a
chromosome bitstring. A single row
represents a single value in the
bitstring.
 Here is the code to create the dataset:
 This will show us how to create a chromosome to include the first, fourth, and fifth item.
 The genalg package optimizes towards the minimum value. Therefore, the
value is calculated as above and multiplied by -1.
 A configuration which leads to exceeding the weight constraint returns a value of 0 (a
higher value can also be given).
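The deck's R/genalg code is shown as images; as a hedged stand-in, here is the same fitness logic sketched in Python. The item weights and survival points below are hypothetical, made up for illustration:

```python
# Hypothetical survival items: (weight in kg, survival points)
items = [(2.5, 10), (9.0, 20), (13.0, 15), (1.0, 2), (5.0, 30), (10.0, 8), (1.5, 6)]
WEIGHT_LIMIT = 20  # knapsack capacity in kilograms

def fitness(chromosome):
    """genalg-style evaluation: the optimizer minimizes, so negate the points."""
    weight = sum(w for bit, (w, p) in zip(chromosome, items) if bit)
    points = sum(p for bit, (w, p) in zip(chromosome, items) if bit)
    if weight > WEIGHT_LIMIT:
        return 0       # weight constraint violated
    return -points     # multiply by -1 so minimizing maximizes survival points

# chromosome 1001100: include the first, fourth, and fifth item
print(fitness([1, 0, 0, 1, 1, 0, 0]))  # -42 (8.5 kg, 42 points)
```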
 Next, we choose the number of iterations, design and run the model.
 Notice that within the settings of genalg, the type is a binary chromosome by default:
Optimal Knapsack Configuration
 When we approach machine learning tasks,
we are often confronted with many variables
within a particular dataset which can be used
to build our predictive models.
 In previous lectures we discussed employing
some techniques that draw from AIC and
BIC to reduce the candidate variable pool.
These techniques include the forward,
backward, and stepwise techniques.
 Now let’s explore using genetic algorithms
for this feature selection.
 Our goal is to devise:
 A genetic algorithm to approach feature
selection.
 We will also estimate the coefficients for
an OLS regression model using Genetic
Algorithms
 Here is the “fat” dataset that we are working with from the package UsingR.
 Our goal is to create a linear regression model using “body.fat.siri” as the dependent variable and
some ideal subset of variables as the independent variables.
 Let’s first create a linear regression model.
 Note: The adjusted R squared is 0.7353 and there are a number of statistically insignificant
variables at the 0.05 level.
 We then develop some additional R code to prepare the model to run using the following:
 This fitness function estimates the regression model using predictors identified by a 1 in the
corresponding position of the string.
 Each column (variable) is converted into a binary string with 1 = include.
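The deck's fitness function is R code shown as an image. As an illustrative stand-in, here is the same idea sketched in Python with numpy: a binary mask selects predictor columns, an OLS fit is run on them, and the adjusted R² is returned as the fitness. The synthetic data below is made up; the real example uses the "fat" dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: y truly depends on columns 0 and 2 only
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=100)

def adj_r2_fitness(mask):
    """Fit OLS on the columns flagged by `mask`; return adjusted R^2."""
    cols = [i for i, bit in enumerate(mask) if bit]
    if not cols:
        return -np.inf  # no predictors selected: worst possible fitness
    A = np.column_stack([np.ones(len(y))] + [X[:, i] for i in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    n, k = len(y), len(cols)
    return 1 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))

print(adj_r2_fitness([1, 0, 1, 0, 0]))  # true predictors: close to 1
```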
 Now we run the genetic algorithm.
 This code will take the results of the genetic algorithm and pass them as a parameter:
 The Adjusted R squared is slightly better at 0.7382.
 Now imagine combining genetic algorithms with artificial neural networks…
 Here is an example of an interesting approach to using a genetic algorithm to approximate the β
coefficients. Let’s use the airquality dataset and the GA package.
 We are going to build a linear regression model with ozone as the dependent variable and wind
& temp as the independent variables.
 This is our fitness function for our genetic algorithm.
 This function evaluates a linear function with an intercept and the two independent variables to
compute the predicted y_hat.
 Then the algorithm computes and returns the SSE for each chromosome and we will try to
minimize the SSE like OLS.
 Here is the genetic algorithm for this model:
 Notice that the values are “real-based” and we specify
the minimum (-100) and maximum (+100) values for
each of the coefficients: β0, β1, β2
 The parameters are real numbers (floating-point
decimals), which are passed to the linear regression
equation/function.
 The SSE is 50990.17 with the GA compared to
50988.96 with the OLS approach.
 One of the major topics in operations research is
what is known as the “travelling salesman”
problem.
 The problem can be stated as follows:
 There is a set of cities a salesman needs to visit
and each city must be visited once.
 What’s the shortest way through all the cities?
 This problem has real-world implications (but is not
limited to) cost reduction, increased delivery
efficiency, & customer satisfaction.
 Our goal is to devise:
 A genetic algorithm to approach solving this
problem.
 In order to better prepare the analysis, we must first understand the data we are working with.
 The classic representations of this problem do not show the data munging and reshaping aspects,
which I feel are important for solving this problem in a real-world context. Therefore, we will spend
some time on this aspect in this tutorial.
 Before we begin the example, let’s say we
have 5 salesmen and we want to send
them to specific addresses.
 This is a prime opportunity to utilize
another machine learning technique
discussed earlier - clustering.
 For this example, we will use a k-means
cluster approach with k = 5.
 The results of our clustering when
plotted are shown on the right hand
side.
K-Means Cluster
 Let’s focus our analysis on a single cluster (k = 3).
 We have to transform the latitude and longitude coordinates between each of
the points into a distance matrix before we can continue.
 This is the resulting distance matrix, shown in meters.
 The fitness function to be maximized can be defined as the reciprocal of the tour
length.
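That fitness can be sketched as follows (illustrative Python; the actual example uses R's GA package with type = "permutation", and the distance matrix below is hypothetical):

```python
# Hypothetical symmetric distance matrix (meters) between 4 stops
D = [
    [0, 2451, 713, 1018],
    [2451, 0, 1745, 1524],
    [713, 1745, 0, 355],
    [1018, 1524, 355, 0],
]

def tour_length(perm):
    """Total length of the closed tour visiting stops in the given order."""
    return sum(D[perm[i]][perm[(i + 1) % len(perm)]] for i in range(len(perm)))

def fitness(perm):
    """The GA maximizes fitness, so use the reciprocal of the tour length."""
    return 1.0 / tour_length(perm)

print(tour_length([0, 1, 2, 3]))  # 5569
```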
 We then run the genetic algorithm in the GA package specifying the type =
“permutation”.
 Here is the output of the algorithm:
 The solution that corresponds to the unique path can be shown by:
Genetic Algorithm Evolution Chart
1st Route Final route optimized by
the Genetic Algorithm
Here is the optimal solution depicted on a Google map.
The solution passed the results from R through the Google API via library(ggmap).
 Reside in Wayne, Illinois
 Active Semi-Professional Classical Musician
(Bassoon).
 Married my wife on 10/10/10 and been
together for 10 years.
 Pet Yorkshire Terrier / Toy Poodle named
Brunzie.
 Pet Maine Coons named Maximus Power and
Nemesis Gul du Cat.
 Enjoy Cooking, Hiking, Cycling, Kayaking, and
Astronomy.
 Self-proclaimed Data Nerd and Technology
Lover.
Data Science - Part XIV - Genetic Algorithms

Data Science - Part XIV - Genetic Algorithms

  • 1.
  • 2.
     What isa Genetic Algorithm?  Biological Inspiration  Evolution  Algorithm Mechanics  Practical Application Example  Knapsack Problem  Feature Selection  Traveling Salesman Charles Darwin – “It is not the strongest of the species that survives, nor the most intelligent, but the one most responsive to change.”
  • 3.
     A geneticalgorithm is an adaptation procedure based on the mechanics of natural genetics and natural selection.  Genetic Algorithm’s have 2 essential components:  “Survival of the fittest”  Genetic Diversity  Originally developed by John Holland (1975).  The genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution.  Uses concepts of “Natural Selection” and “Genetic Inheritance” (Darwin 1859).
  • 4.
    Applications of GeneticAlgorithms  Optimization and Search Problems  Scheduling and Timetabling  Aerospace engineering  Astronomy and astrophysics  Chemistry  Electrical engineering  Financial markets  Game playing  Materials engineering  Military and law enforcement  Molecular biology  Pattern recognition and data mining  Robotics
  • 5.
     Nature isbeautiful… The Aye-Aye
  • 6.
     What wecan learn from nature?
  • 7.
     To understandbiological processes properly, we must first have an understanding of the cell.  Human bodies are made up of trillions of cells.  Each cell has a core structure (nucleus) that contains your chromosomes.  Additionally, each of our 23 chromosomes are made up of tightly coiled strands of deoxyribonucleic acid (DNA).
  • 8.
     Genes aresegments of DNA that determine specific traits, such as eye or hair color.  Humans have more than 20,000 genes. Each gene determines some aspect of the organism.  A collection of genes is sometimes called a genotype.  A collection of aspects (like eye characteristics) is sometimes called a phenotype.
  • 9.
     A genemutation is an alteration in your DNA.  It can be inherited or acquired during your lifetime, as cells age or are exposed to certain chemicals.  Mutations can also be triggered through errors within the DNA replication process.  Some changes in your genes result in genetic disorders. Joseph Merrick aka “The Elephant Man” is believed to have suffered from a genetic disorder called proteus syndrome.
  • 10.
     Reproduction involvesrecombination of genes from parents and then small amounts of mutation (errors) in copying.  The fitness of an organism is how much it can reproduce before it dies.  Here is an example of the passing of chromosomes within human reproduction.
  • 11.
    Natural Selection Darwin's theoryof evolution:  Only the organisms best adapted to their environment tend to survive and transmit their genetic characteristics in increasing numbers to succeeding generations while those less adapted tend to be eliminated.  A genetic algorithm maintains a population of candidate solutions for the problem at hand, and makes it evolve by iteratively applying a set of stochastic operators
  • 12.
     The onlyintelligent systems on this planet are biological.  Biological intelligences are designed by natural evolutionary processes.  These intelligent organisms often work together in groups, swarms, or flocks.  They don't appear to use logic, mathematics, complex planning, complicated modeling of their environment.  They can achieve complex information processing and computational tasks that current artificial intelligences find very challenging indeed.
  • 13.
     Biological organismscope with the demands of their environments.  They uses solutions quite unlike the traditional human- engineered approaches to problem solving.  They exchange information about what they’ve discovered in the places they have visited.  Bio-inspired computing is a field devoted to tackling complex problems using computational methods modeled after design principles encountered in nature.
  • 14.
    Classical computing’s strengths: Number-crunching  Thought-support (glorified pen-and-paper)  Rule-based reasoning  Constant repetition of well-defined actions. Classical computing’s weaknesses:  Pattern recognition  Robustness to damage  Dealing with vague and incomplete information;  Adapting and improving based on experience
  • 15.
     Bio-inspired computingtakes a more evolutionary approach to learning.  In traditional AI, the intelligence is often programmed from above. The Programmer creates the program and imbues it with its intelligence.  Bio-inspired computing, on the other hand, takes a more bottom-up, decentralized approach.  Bio-inspired computing often involve the method of specifying a set of simple rules, a set of simple organisms which adhere to those rules. DARPA - Legged Squat Support System (LS3)
  • 16.
    Suppose you havea problem.  …And you don’t know how to solve it.  What can you do?  Can you use a computer to somehow find a solution?  This would be nice! Can it be done?
  • 17.
    Brute-Force Solution:  A“blind generate and test” algorithm.  Repeat  Generate a random possible solution  Test the solution and see how the solution performed.  Stop once the solution is good enough.
  • 18.
    Can we usethis Brute-Force idea?  Sometimes - YES  if there are only a few possible solutions  and you have enough time  then such a method could be used  For most problems - NO  many possible solutions  with no time to try them all  Therefore, this method cannot be used Key Point: The total number of solutions for 25 data points is 310,224,200,866,619,719,680,000.
  • 19.
    Search Techniques Calculus Based Techniques Fibonacci Sort Guided Random Search Techniques TabuSearch Hill Climbing Simulated Annealing Evolutionary Algorithms Genetic Algorithms Genetic Programming Enumerative Techniques DFS Dynamic Programming BFS
  • 20.
    Initialize Population Evaluate Fitness Satisfy Constraints ? SelectSurvivors Output Results Randomly Vary Individuals Yes No
  • 21.
    How do youencode a solution?  This depends on the problem we are trying to solve with genetic algorithms.  Genetic algorithm’s often encode solutions as fixed length “bitstrings” (e.g. 101110, 111111, 000101) which can also be thought of as chromosomes.  Each bit represents some aspect of the proposed solution to the problem.  For Genetic Algorithm’s to work, we need to be able to “test” any string and get a “score” indicating how “good” that solution is.
  • 22.
     The setof all possible solutions [0 to 1000] is called the search space or state space.  In this example, it’s just one number but it could be many numbers.  Often genetic algorithms code numbers in binary producing a bitstring representing a solution.  We choose 1,0 bits which is enough to represent 0 to 1000
  • 23.
    Search Space  Fora simple function f(x) the search space is one dimensional.  But by encoding several values into the chromosome many dimensions can be searched e.g. two dimensions f(x,y).  The search space can be visualized as a surface or fitness landscape in which fitness dictates height.  Each possible genotype is a point in the space.  A genetic algorithm tries to move the points to better places (higher fitness) in the space.
  • 24.
  • 25.
    Implicit fitness functions Most GA’s use explicit and static fitness function.  Some genetic algorithm’s (such as in Artificial Life or Evolutionary Robotics) use dynamic and implicit fitness functions - like “how many obstacles did I avoid”. 𝐼𝑛𝑑𝑖𝑣𝑖𝑑𝑢𝑎𝑙′ 𝑠 𝑓𝑖𝑡𝑛𝑒𝑠𝑠 𝐴𝑣𝑒𝑟𝑎𝑔𝑒 𝑓𝑖𝑡𝑛𝑒𝑠𝑠 𝑜𝑓 𝑝𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛
  • 26.
    Selecting Parents forbreeding  Many schemes are possible so long as better scoring chromosomes are more likely selected.  Score is often termed the “fitness” “Roulette Wheel” selection can be used:  Add up the fitness's of all chromosomes  Generate a random number R in that range  Select the first chromosome in the population that - when all previous fitness’s are added - gives you, at a minimum, the value R
  • 27.
     The crossoverpoint is a single point identified at random between two chromosomes.  A crossover rate is pre determined (using a high probability number like 0.8 to 0.95) and then the crossover is applied to the parents.  The idea is that crossover preserves “good bits” from different parents, combining them to produce better solutions.  A good encoding scheme would therefore try to preserve “good bits” during crossover and mutation. Crossover single point - random
  • 28.
     With somesmall probability (the mutation rate), we flip each bit in the offspring.  Typical values for the mutation rate are very small and are usually values between 0.1 and 0.001.  Causes movement in the search space (local or global).  Restores lost information to the population. Mutate
  • 29.
    There are manyvariants of GA  Different kinds of selection (not roulette)  Tournament  Elitism, etc.  Different recombination procedures  Multi-point crossover  3 way crossover, etc.  Different kinds of encoding other than bitstring.  Integer values  Ordered set of symbols  Different kinds of mutation
  • 30.
    GA implementation considerations: Representation  population size, mutation rate, ...  selection, deletion policies  crossover, mutation operators  Termination Criteria  Performance, scalability  The Solution is only as good as the evaluation function (often hardest part)
  • 31.
    Lets go througha simple example to showcase the steps. Basic Genetic Algorithm Process:  Produce an initial population of individuals  Evaluate the fitness of all individuals  while termination condition not met do  select fitter individuals for reproduction  recombine between individuals  mutate individuals  evaluate the fitness of the modified individuals  generate a new population  End while
  • 32.
    The MAXONE problem: Suppose we want to maximize the number of ones in a string of L binary digits.  It may seem trivial because we know the answer in advance.  However, we can think of it as maximizing the number of correct answers, each encoded by 1, to L yes/no difficult questions. Encoding  An individual is encoded (naturally) as a string of L binary digits  Let’s say L = 10. Then, 1 = 0000000001 (10 bits)
  • 33.
     We startwith a population of n random strings. Suppose that L = 10 and n = 6  We toss a fair coin 60 times and get the following initial population: Produce an initial population of individuals Evaluate the fitness of all individuals while termination condition not met do select fitter individuals for reproduction recombine between individuals mutate individuals evaluate the fitness of the modified individuals generate a new population End while
  • 34.
 The fitness function f(x) for MAXONE simply counts the number of ones in the string x.  We evaluate f(x) for each individual in the initial population.
  • 35.
 Next we apply fitness-proportionate selection with the roulette wheel method:  We repeat the extraction as many times as there are individuals, so the parent population keeps the same size (6 in our case).  The area of each slice of the wheel is proportional to the fitness value: individual i is chosen with probability p(i) = f(i) / Σⱼ f(j)
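The roulette-wheel extraction can be sketched as follows (an illustrative Python sketch, not the deck's R code); each individual's slice of the wheel is proportional to its fitness:

```python
import random

def roulette_select(population, fitness):
    """Fitness-proportionate selection: individual i is chosen with
    probability p(i) = f(i) / sum_j f(j)."""
    total = sum(fitness)
    r = random.uniform(0, total)   # spin the wheel
    acc = 0.0
    for individual, f in zip(population, fitness):
        acc += f
        if r <= acc:
            return individual
    return population[-1]          # guard against floating-point round-off
```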
  • 36.
 Suppose that, after performing selection, we get the following population:
  • 37.
 For each couple we decide, according to a crossover probability (for instance 0.6), whether to actually perform crossover or not.  Suppose that we decide to perform crossover only for couples (s1`, s2`) and (s5`, s6`).  For each couple, we randomly extract a crossover point, for instance 2 for the first and 5 for the second.
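One-point crossover as described above can be sketched like this (illustrative Python; the example strings are made up, not the slide's s1`…s6`):

```python
import random

def one_point_crossover(p1, p2, point=None):
    """Swap the tails of two parent strings after a cut point."""
    if point is None:
        point = random.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

# Crossing over at point 2:
# one_point_crossover("1100110011", "0011001100", point=2)
#   -> ("1111001100", "0000110011")
```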
  • 38.
 The final step is to apply random mutation:  For each bit that we copy to the new population we allow a small probability of error (for instance 0.1)  Mutation causes movement in the search space (local or global)  It restores lost information to the population
  • 39.
 After applying mutation:  In one generation, the total population fitness changed from 34 to 37, an improvement of ~9%  At this point, we go through the same process all over again, until a stopping criterion is met
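The whole MAXONE loop can be sketched end to end; this is an illustrative Python version (the deck's worked numbers come from a specific coin-toss population, so outputs will differ):

```python
import random

def maxone_ga(L=10, n=6, generations=50, p_cross=0.6, p_mut=0.1, seed=42):
    """Minimal GA for MAXONE: fitness = number of 1-bits in an L-bit string."""
    rng = random.Random(seed)
    fit = sum  # fitness of a 0/1 list is just its number of ones
    pop = [[rng.randint(0, 1) for _ in range(L)] for _ in range(n)]
    best = max(pop, key=fit)

    def select():
        # Roulette-wheel (fitness-proportionate) selection.
        total = sum(fit(ind) for ind in pop) or 1
        r = rng.uniform(0, total)
        acc = 0
        for ind in pop:
            acc += fit(ind)
            if r <= acc:
                return ind[:]
        return pop[-1][:]

    for _ in range(generations):
        parents = [select() for _ in range(n)]
        nxt = []
        for a, b in zip(parents[::2], parents[1::2]):
            if rng.random() < p_cross:          # one-point crossover
                pt = rng.randint(1, L - 1)
                a, b = a[:pt] + b[pt:], b[:pt] + a[pt:]
            nxt += [a, b]
        # Bit-flip mutation on every copied bit.
        pop = [[1 - g if rng.random() < p_mut else g for g in ind] for ind in nxt]
        cand = max(pop, key=fit)
        if fit(cand) > fit(best):
            best = cand[:]
    return best
```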
  • 40.
Advantages:  Concepts are easy to understand  Inspired by nature  Many areas of application  GAs are powerful  Genetic algorithms are intrinsically parallel and easily distributed  There is always an answer, and the answer gets better with time  Less time required for some special applications  Good chances of getting close to the optimal solution
  • 41.
Limitations:  The population considered for the evolution should be moderate and suitable for the problem (normally 20-30 or 50-100)  The crossover rate should be 80%-95%  The mutation rate should be low; 0.5%-1% is usually assumed best  The method of selection should be appropriate for the problem  The fitness function must be written accurately.
  • 43.
 The knapsack problem can be thought of as the following:  “You are going to spend a month in the wilderness. You’re taking a backpack with you; however, the maximum weight it can carry is 20 kilograms. You have a number of survival items available, each with its own number of “survival points”. Your objective is to maximize the number of survival points.”  This type of problem is called a constrained optimization problem because we are limited to carrying a maximum weight of 20 kilograms.  Our goal is to devise:  A genetic algorithm to approach solving the knapsack problem.
  • 44.
 In order to better prepare the analysis, we must first understand the data we are working with.  Each row will be encoded as one position in a chromosome bitstring, where a 0 represents “do not include” and a 1 represents “include”.  With 7 row entries, an example of a chromosome would be “1001011”.
  • 45.
 Here is the code to create the dataset:  This will show us how to create a chromosome that includes the first, fourth, and fifth items.
  • 46.
 The genalg algorithm optimizes towards the minimum value. Therefore, the value is calculated as above and multiplied by -1.  A configuration which exceeds the weight constraint returns a value of 0 (a higher penalty value can also be given).
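A genalg-style evaluation function for the knapsack can be sketched as follows (illustrative Python; the item weights and survival points below are made up, not the deck's actual dataset):

```python
# Hypothetical survival items (weight in kg, survival points):
items = [(10, 20), (6, 15), (4, 10), (8, 12), (2, 6), (3, 8), (5, 11)]
WEIGHT_LIMIT = 20  # knapsack capacity in kilograms

def eval_knapsack(chromosome):
    """genalg-style evaluation: genalg minimizes, so return minus the total
    survival points, or 0 when the weight constraint is violated."""
    weight = sum(w for (w, _), bit in zip(items, chromosome) if bit)
    points = sum(p for (_, p), bit in zip(items, chromosome) if bit)
    return 0 if weight > WEIGHT_LIMIT else -points

# eval_knapsack([1, 0, 0, 1, 1, 0, 0]): 20 kg exactly, 38 points -> -38
```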
  • 47.
 Next, we choose the number of iterations, then design and run the model.  Notice that within the settings of genalg, the chromosome type is binary by default: Optimal Knapsack Configuration
  • 49.
 When we approach machine learning tasks, we are often confronted with many variables within a particular dataset which can be used to build our predictive models.  In previous lectures we discussed techniques that draw from AIC and BIC to reduce the candidate variable pool, including the forward, backward, and stepwise techniques.  Now let's explore using genetic algorithms for this feature selection.  Our goal is to devise:  A genetic algorithm to approach feature selection.  We will also estimate the coefficients for an OLS regression model using genetic algorithms.
  • 50.
 Here is the “fat” dataset that we are working with from the UsingR package.  Our goal is to create a linear regression model using “body.fat.siri” as the dependent variable and some ideal subset of variables as the independent variables.
  • 51.
 Let's first create a linear regression model.  Note: The adjusted R-squared is 0.7353 and there are a number of statistically insignificant variables at the 0.05 level.
  • 52.
 We then develop some additional R code to prepare the model to run using the following:  Each column (variable) is converted into a binary string position, with 1 = include.  This fitness function estimates the regression model using the predictors identified by a 1 in the corresponding position of the string.
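A comparable fitness function can be sketched in Python with NumPy (the synthetic data below is a made-up stand-in for the fat dataset); as with genalg, it returns the negative adjusted R-squared so that minimizing it maximizes fit:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data: y truly depends only on columns 0 and 2.
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=100)

def fitness(bits):
    """genalg-style fitness: regress y on the columns flagged by 1s and
    return the negative adjusted R-squared (the GA minimizes this)."""
    if not any(bits):
        return 0.0
    cols = [j for j, b in enumerate(bits) if b]
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    n, k = len(y), len(cols)
    adj_r2 = 1 - (resid @ resid / (n - k - 1)) / (((y - y.mean()) ** 2).sum() / (n - 1))
    return -adj_r2
```

On this synthetic data the true predictor set `[1, 0, 1, 0]` scores far better (more negative) than the irrelevant set `[0, 1, 0, 1]`.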
  • 53.
 Now we run the genetic algorithm.
  • 54.
 This code will take the results of the genetic algorithm and pass them as a parameter:  The adjusted R-squared is slightly better at 0.7382.  Now imagine combining genetic algorithms with artificial neural networks…
  • 55.
 Here is an example of an interesting approach to using a genetic algorithm to approximate the β coefficients. Let's use the airquality dataset and the GA package.  We are going to build a linear regression model with ozone as the dependent variable and wind & temp as the independent variables.
  • 56.
 This is our fitness function for our genetic algorithm.  This function evaluates a linear function with an intercept and the two independent variables to compute the predicted y_hat.  Then the algorithm computes and returns the SSE for each chromosome, and we will try to minimize the SSE just as OLS does.
  • 57.
 Here is the genetic algorithm for this model:  Notice that the values are “real-based” and we specify the minimum (-100) and maximum (+100) values for each of the coefficients: β0, β1, β2  The parameters are real numbers (floating point) that are passed to the linear regression equation/function.  The SSE is 50990.17 with the GA, compared to 50988.96 with the OLS approach.
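A real-valued GA of this kind can be sketched as follows (illustrative Python, not the GA package's implementation; the tiny dataset is made up so the true coefficients are known):

```python
import random

# Made-up dataset standing in for (wind, temp, ozone); here
# y = 1 + 2*x1 + 3*x2 exactly, so the best attainable SSE is 0.
data = [(1.0, 2.0, 9.0), (2.0, 1.0, 8.0), (3.0, 3.0, 16.0), (4.0, 2.0, 15.0)]

def sse(beta):
    """Sum of squared errors of the model y_hat = b0 + b1*x1 + b2*x2."""
    b0, b1, b2 = beta
    return sum((y - (b0 + b1 * x1 + b2 * x2)) ** 2 for x1, x2, y in data)

def real_ga(generations=800, n=40, seed=1):
    """Real-valued GA: chromosomes are (b0, b1, b2) drawn from [-100, 100]."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-100, 100) for _ in range(3)] for _ in range(n)]
    sigma = 2.0
    for _ in range(generations):
        pop.sort(key=sse)
        parents = pop[: n // 2]            # truncation selection (elitist)
        children = []
        while len(children) < n - len(parents):
            a, b = rng.sample(parents, 2)  # blend crossover + Gaussian mutation
            children.append([(x + z) / 2 + rng.gauss(0, sigma) for x, z in zip(a, b)])
        pop = parents + children
        sigma = max(0.02, sigma * 0.995)   # anneal the mutation size
    return min(pop, key=sse)
```

Minimizing the SSE this way mirrors OLS: with enough generations the GA's SSE approaches the least-squares value, just as the slide's GA result (50990.17) comes very close to OLS (50988.96).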
  • 59.
 One of the major topics in operations research is what is known as the “travelling salesman” problem.  The problem can be stated as follows:  There is a set of cities a salesman needs to visit, and each city must be visited once.  What's the shortest way through all the cities?  This problem has real-world implications (but is not limited to) related to cost reductions, increased delivery efficiency, & customer satisfaction.  Our goal is to devise:  A genetic algorithm to approach solving this problem.
  • 60.
 In order to better prepare the analysis, we must first understand the data we are working with.  The classic presentations of this problem do not show the data munging and reshaping aspects, which I feel are important to solve this problem in a real-world context. Therefore, we will spend some time on this aspect in this tutorial.
  • 61.
 Before we begin the example, let's say we have 5 salesmen and we want to send them to specific addresses.  This is a prime opportunity to utilize another machine learning technique discussed earlier - clustering.  For this example, we will use a k-means cluster approach with k = 5.  The results of our clustering, when plotted, are shown on the right-hand side. K-Means Cluster
  • 62.
 Let's focus our analysis on a single cluster (k = 3).
  • 63.
 We have to transform the latitude and longitude coordinates of each point into a distance matrix before we can continue.
  • 64.
 This is the resulting distance matrix, shown in meters.
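One common way to build such a matrix from latitude/longitude pairs is the haversine formula; here is an illustrative Python sketch (the coordinates are made up, not the deck's addresses):

```python
import math

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points."""
    R = 6371000.0  # mean Earth radius in meters
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

# Made-up stops (lat, lon) and the resulting symmetric distance matrix:
stops = [(41.88, -87.63), (41.95, -87.65), (41.90, -87.70)]
D = [[haversine_m(a, b) for b in stops] for a in stops]
```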
  • 65.
 The fitness function to be maximized can be defined as the reciprocal of the tour length.  We then run the genetic algorithm in the GA package, specifying type = “permutation”.
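The reciprocal-tour-length fitness can be sketched as follows (illustrative Python with a small made-up distance matrix):

```python
def tour_length(perm, D):
    """Length of the closed tour that visits the cities in `perm` order."""
    return sum(D[perm[i]][perm[(i + 1) % len(perm)]] for i in range(len(perm)))

def tsp_fitness(perm, D):
    """Fitness to maximize: the reciprocal of the tour length."""
    return 1.0 / tour_length(perm, D)

# A made-up symmetric 4-city distance matrix:
D = [[0, 2, 9, 10],
     [2, 0, 6, 4],
     [9, 6, 0, 3],
     [10, 4, 3, 0]]
# tour_length([0, 1, 3, 2], D) -> 2 + 4 + 3 + 9 = 18
```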
  • 66.
 Here is the output of the algorithm:  The solution that corresponds to the unique path can be shown by: Genetic Algorithm Evolution Chart
  • 67.
1st Route / Final route optimized by the Genetic Algorithm
  • 68.
Here is the optimal solution depicted on a Google map. The solution passed the results from R through the Google API via library(ggmap).
  • 69.
 Reside in Wayne, Illinois  Active semi-professional classical musician (bassoon).  Married my wife on 10/10/10 and been together for 10 years.  Pet Yorkshire Terrier / Toy Poodle named Brunzie.  Pet Maine Coons named Maximus Power and Nemesis Gul du Cat.  Enjoy cooking, hiking, cycling, kayaking, and astronomy.  Self-proclaimed data nerd and technology lover.
  • 70.
 http://www.genetic-programming.org/  http://www.slideshare.net/AMedOs/introduction-to-genetic-algorithms-26956618?related=3  http://www.slideshare.net/deg511/genetic-algorithms-in-search-optimization-and-machine-learning?related=3  http://www.slideshare.net/kancho/genetic-algorithm-by-example?related=4  http://www.slideshare.net/karthiksankar/genetic-algorithms-3626322?related=6  http://www.r-bloggers.com/genetic-algorithms-a-simple-r-example/  http://stats.seandolinar.com/genetic-algorithm-to-minimize-ols-regression-in-r/