A genetic algorithm (GA)
is a search technique used in computer science to find approximate
solutions to optimization and search problems. Genetic algorithms are a
particular class of evolutionary algorithms that use techniques
inspired by evolutionary biology such as inheritance, mutation, natural
selection, and recombination (or crossover).
Genetic
algorithms are typically implemented as a computer simulation in which
a population of abstract representations (called chromosomes) of candidate solutions (called individuals)
to an optimization problem evolves toward better solutions.
Traditionally, solutions are represented in binary as strings of 0s and
1s, but different encodings are also possible. The evolution starts
from a population of completely random individuals and happens in
generations. In each generation, the fitness of the whole population is
evaluated, multiple individuals are stochastically selected from the
current population (based on their fitness), modified (mutated or
recombined) to form a new population, which becomes current in the next
iteration of the algorithm.
Operation of a GA
Two
elements are required for any problem before a genetic algorithm can be
used to search for a solution: First, there must be a method of representing
a solution in a manner that can be manipulated by the algorithm.
Traditionally, a solution can be represented by a string of bits,
numbers or characters. Second, there must be some method of measuring
the quality of any proposed solution, using a fitness function.
For instance, if the problem involves fitting as many different weights as possible into a knapsack without breaking it, a representation
of a solution might be a string of bits, where each bit represents a
different weight, and the value of the bit (0 or 1) represents whether
or not the weight is added to the knapsack. The fitness of
the solution would be measured by determining the total weight of the
proposed solution: The higher the weight, the greater the fitness,
provided that the solution is possible.
Initialization
Initially
many individual solutions are randomly generated to form an initial
population. The population size depends on the nature of the problem,
but typically contains several hundreds or thousands of possible
solutions. Traditionally, the population is generated randomly,
covering the entire range of possible solutions (the search space). Occasionally, the solutions may be "seeded" in areas where optimal solutions are likely to be found.
Selection
During
each successive epoch, a proportion of the existing population is
selected to breed a new generation. Individual solutions are selected
through a fitness-based process, where fitter solutions (as
measured by a fitness function) are typically more likely to be
selected. Certain selection methods rate the fitness of each solution
and preferentially select the best solutions. Other methods rate only a
random sample of the population, as this process may be very
time-consuming.
Most functions are
stochastic and designed so that a small proportion of less fit
solutions are selected. This helps keep the diversity of the population
large, preventing premature convergence on poor solutions. Popular and
well-studied selection methods include roulette wheel selection and
tournament selection.
Reproduction
The
next step is to generate a second generation population of solutions
from those selected through genetic operators: crossover (or
recombination), and mutation.
For each
new solution to be produced, a pair of "parent" solutions is selected
for breeding from the pool selected previously. By producing a "child"
solution using the above methods of crossover and mutation, a new
solution is created which typically shares many of the characteristics
of its "parents." New parents are selected for each child, and the
process continues until a new population of solutions of appropriate
size is generated.
These processes
ultimately result in the next generation population of chromosomes that
is different from the initial generation. Generally the average fitness
will have increased by this procedure for the population, since only
the best organisms from the first generation are selected for breeding,
alongwith a small proportion of less fit solutions, for reasons already
mentioned above.
Termination
This generational process is repeated until a termination condition has been reached. Common terminating conditions are
- A solution is found that satisfies minimum criteria
- Fixed number of generations reached
- Allocated budget (computation time/money) reached
- The
highest ranking solution's fitness is reaching or has reached a plateau
such that successive iterations no longer produce better results
- Manual inspection
- Combinations of the above
Pseudo-code algorithm
Choose
initial population Repeat Evaluate the individual fitnesses of a
certain proportion of the population Select pairs of best-ranking
individuals to reproduce Breed new generation through crossover and
mutation Until terminating condition
Observations
There are several general observations about the generation of solutions via a genetic algorithm:
- In
many problems with sufficient complexity, GAs may have a tendency to
converge towards local optima rather than the global optimum of the
problem. The likelihood of this occurring depends on the shape of the
fitness landscape: certain problems may provide an easy ascent towards
a global optimum, others may make it easier for the function to find
the local optima. This problem may be alleviated by using a different
fitness function, or by using techniques to maintain a diverse
population of solutions.
- Operating on dynamic data sets is
difficult, as genomes begin to converge early on towards solutions
which may no longer be valid for later data. Several methods have been
proposed to remedy this by increasing genetic diversity somehow and
preventing early convergence, either by increasing the probability of
mutation when the solution quality drops (called triggered hypermutation), or by occasionally introducing entirely new, randomly generated elements into the gene pool (called random immigrants). Recent research has also shown the benefits of using biological exaptation (or preadaptation) in solving this problem.
- Selection
is clearly an important genetic operator, but opinion is divided over
the importance of crossover versus mutation. Some argue that crossover
is the most important, while mutation is only necessary to ensure that
potential solutions are not lost. Others argue that crossover in a
largely uniform population only serves to propagate innovations
originally found by mutation, and in a non-uniform population crossover
is nearly always equivalent to a very large mutation (which is likely
to be catastrophic).
- Often, GAs can rapidly locate good solutions, even for difficult search spaces.
- For
specific optimization problems and problem instantiations, simpler
optimization algorithms may find better solutions than genetic
algorithms (given the same amount of computation time). GA
practitioners may wish to try other algorithms in addition to GAs.
- GAs cannot effectively solve
problems in which there is no way to judge the fitness of an answer
other than right/wrong, as there is no way to converge on the solution.
These problems are often called "needle in a haystack" problems.
- As with all current machine
learning problems it is worth tuning the parameters such as mutation
probability, recombination probability and population size to find
reasonable settings for the problem class you are working on. A very
small mutation rate may lead to genetic drift (which is non-ergodic in
nature) or premature convergence of the genetic algorithm in a local
optimum. A mutation rate that is too high may lead to loss of good
solutions. There are theoretical but not yet practical upper and lower
bounds for these parameters that can help guide selection.
- How the fitness function is evaluated is an important factor in speed and efficiency of the algorithm.
Variants
The
simplest algorithm represents each chromosome as a bit string.
Typically, numeric parameters can be represented by integers, though it
is possible to use floating point representations. The basic algorithm
performs crossover and mutation at the bit level. Other variants treat
the chromosome as a list of numbers which are indexes into an
instruction table, nodes in a linked list, hashes, objects, or any
other imaginable data structure. Crossover and mutation are performed
so as to respect data element boundaries. For most data types, specific
variation operators can be designed. Different chromosomal data types
seem to work better or worse for different specific problem domains.
When
bit strings representations of integers are used, Gray coding is often
employed. In this way, small changes in the integer can be readily
effected through mutations or crossovers. This has been found to help
prevent premature convergence at so called Hamming walls, in
which too many simultaneous mutations (or crossover events) must occur
in order to change the chromosome to a better solution.
Other
approaches involve using arrays of real-valued numbers instead of bit
strings to represent chromosomes. Theoretically, the smaller the
alphabet, the better the performance, but paradoxically, good results
have been obtained from using real-valued chromosomes.
A
slight, but very successful variant of the general process of
constructing a new population is to allow some of the better organisms
from the current generation to carry over to the next, unaltered. This
strategy is known as elitist selection.
Parallel
implementations of genetic algorithms come in two flavours. Coarse
grained parallel genetic algorithms assume a population on each of the
computer nodes and migration of individuals among the nodes. Fine
grained parallel genetic algorithms assume an individual on each
processor node which acts with neighboring individuals for selection
and reproduction. Other variants, like genetic algorithms for online
optimization problems, introduce time-dependence or noise in the
fitness function.
Problem domains
Problems
which appear to be particularly appropriate for solution by genetic
algorithms include timetabling and scheduling problems, and many
scheduling software packages are based on GAs. GAs have also been
applied to engineering. Genetic algorithms are often applied as an
approach to solve global optimization problems.
As
a general rule of thumb genetic algorithms might be useful in problem
domains that have a complex fitness landscape as recombination is
designed to move the population away from local minima that a
traditional hill climbing algorithm might get stuck in.
History
Genetic
algorithms originated from the studies of cellular automata, conducted
by John Holland and his colleagues at the University of Michigan.
Research in GAs remained largely theoretical until the mid-1980s, when
The First International Conference on Genetic Algorithms was held at
The University of Illinois. As academic interest grew, the dramatic
increase in desktop computational power allowed for practical
application of the new technique. In 1989, The New York Times writer
John Markoff wrote about Evolver, the first commercially available
desktop genetic algorithm. Custom computer applications began to emerge
in a wide variety of fields, and these algorithms are now used by a
majority of Fortune 500 companies to solve difficult scheduling, data
fitting, trend spotting and budgeting problems, and virtually any other
type of combinatorial optimization problem.
Applications
- Artificial Creativity
- Automated
design, including research on composite material design and
multi-objective design of automotive components for crashworthiness,
weight savings, and other characteristics.
- Automated design of mechatronic systems using bond graphs and genetic programming (NSF).
- Automated design of industrial equipment using catalogs of exemplar lever patterns.
- Calculation of Bound States and Local Density Approximations.
- Configuration
applications, particularly physics applications of optimal molecule
configurations for particular systems like C60 (buckyballs).
- Container loading optimization.
- Code-breaking, using the GA to search large solution spaces of ciphers for the one correct decryption.
- Design of water distribution systems.
- Distributed computer network topologies.
- Electronic circuit design, known as Evolvable hardware.
- File allocation for a distributed system.
- Parallelization
of GAs/GPs including use of hierarchical decomposition of problem
domains and design spaces nesting of irregular shapes using feature
matching and GAs.
- Game Theory Equilibrium Resolution.
- Learning Robot behavior using Genetic Algorithms.
- Learning fuzzy rule base using genetic algorithms.
- Linguistic
analysis, including Grammar Induction and other aspects of Natural
Language Processing (NLP) such as word sense disambiguation.
- Mobile communications infrastructure optimization.
- Molecular Structure Optimization (Chemistry).
- Multiple population topologies and interchange methodologies.
- Optimisation of data compression systems, for example using wavelets.
- Protein folding and protein/ligand docking.
- Plant floor layout.
- Scheduling
applications, including job-shop scheduling. The objective being to
schedule jobs in a sequence dependent or non-sequence dependent setup
environment in order to maximize the volume of production while
minimizing penalties such as tardiness.
- Software engineering
- Solving the machine-component grouping problem required for cellular manufacturing systems.
- Tactical asset allocation and international equity strategies.
- Timetabling problems, such as designing a non-conflicting class timetable for a large university.
- Training artificial neural networks when pre-classified training examples are not readily obtainable (neuroevolution).
- Traveling Salesman Problem.
Related techniques
Genetic
programming (GP) is a related technique popularized by John Koza, in
which computer programs, rather than function parameters, are
optimized. Genetic programming often uses tree-based internal data
structures to represent the computer programs for adaptation instead of
the list, or array, structures typical of genetic algorithms. GP
algorithms typically require running time that is orders of magnitude
greater than that for genetic algorithms, but they may be suitable for
problems that are intractable with genetic algorithms.
Interactive
genetic algorithms (IGA) are genetic algorithms that use human
evaluation. They are usually applied to domains where it is hard to
design a computational fitness function, for example, evolving images,
music, artistic designs and forms to fit users' aesthetic preference.
Simulated
annealing (SA) is a related global optimization technique which
traverses the search space by testing random mutations on an individual
solution. A mutation that increases fitness is always accepted. A
mutation which lowers fitness is accepted probabilistically based on
the difference in fitness and a decreasing temperature parameter. In SA
parlance, one speaks of seeking the lowest energy instead of the
maximum fitness. SA can also be used within a standard GA algorithm,
simply by starting with a relatively high rate of mutation, which
decreases over time along a given schedule.
Tabu
search (TS) is similar to Simulated Annealing, in that both traverse
the solution space by testing mutations of an individual solution.
While simulated annealing generates only one mutated solution, tabu
search generates many mutated solutions and moves to the solution with
the lowest fitness of those generated. In order to prevent cycling and
encourage greater movement through the solution space, a tabu list is
maintained of partial or complete solutions. It is forbidden to move to
a solution that contains elements of the tabu list, which is updated as
the solution traverses the solution space.
Ant
colony optimization (ACO) uses many ants (or agents) to traverse the
solution space and find locally productive areas. While usually
inferior to genetic algorithms and other forms of local search, it is
able to produce results in problems where no global or up-to-date
perspective can be obtained, and thus the other methods cannot be
applied.
Memetic algorithm (MA) is a
term that is used by some researchers to define forms of genetic
algorithm that are combined with other forms of local search, such as
simulated annealing.
Building block hypothesis
[Goldberg, 1989, page 41], talking of binary bit string genetic algorithms, says "Short, low order, and highly fit schemata are sampled, recombined (crossed over), and
resampled to form strings of potentially higher fitness. In a way, by
working with these particular schemata (the building blocks), we have
reduced the complexity of our problem; instead of building
high-performance strings by trying every conceivable combination, we
construct better and better strings from the best partial solutions of
past samplings." "Just as a child creates magnificent fortresses through the arrangement of simple blocks of wood (building blocks), so
does a genetic algorithm seek near optimal performance through the
juxtaposition of short, low-order, high-performance schemata, or
building blocks. Note [Goldberg, 1989] suggests that building
blocks are highly fit schemata with only a few defined bits (low
order), and that these are close together (short).
|