Genetic Algorithms: A Quick Tutorial
Genetic algorithms are algorithms that model natural selection - sort of - to help solve search and optimization problems. They were originally mooted by John Holland at the University of Michigan back in the 70s, and arose out of earlier studies of cellular automata (in the in-silico sense).In a nutshell, the idea is that we start off with a "gene pool" of randomly generated solutions (note that "solution" in this context doesn't mean the right answer, just an answer) to a particular problem. As they were randomly generated, these solutions will no doubt be fairly crap. We assess their crapness with something called a fitness function, which takes a solution as input and outputs a score, where the higher the score the better the solution. After assessing all of the solutions we take the ones with the highest fitness scores and "breed" them to produce the next generation of the gene pool. Occasionally a solution is "mutated" and tweaked in some way at random. The whole process is then repeated. Over time the overall fitness of the gene pool starts to improve until eventually one of the solutions in the genepool has a high enough fitness to be a real, practical solution to whatever our problem is.
Why should anybody care when there are so many other search and optimization procedures available? Genetic algorithms are easy to implement, easy to understand and perfectly suited to certain types of problem, like feature selection for machine learning purposes. On a more personal note they're also the only algorithms I've ever had fun implementing - and how often can you say that (do say: not very, don't say: you need to get out more)?
Classically, genetic algorithms (GAs for short, from now on) operate on fixed strings of bits - 1s and 0s - which represent the real parameters of the problem in some way (in feature selection, for example, they would represent the presence or absence of a particular feature). GAs are flexible beasts, though, and as such in this tutorial we'll try something different: operating on variable length strings representing sequence motifs.
Check out this messy Perl code (I mean it: it's hacky) which, when supplied with two Fasta files - a set of test sequences and a set of control sequences (the ones supplied are sets of DNaseI hypersensitive sequences and controls, as in the SVM tutorial)- looks for overrepresented motifs in the test set. I'll explain how it does this below (I've included the code for completeness and so that you can try out GAs for yourself quickly, but you'll be able to follow the rest of this post even if you don't take a look at it).
Our first step is to define what kind of data the GA will be operating on. In our case, we want to generate and test lots of different consensus motifs, which are strings of arbitrary length made up of an alphabet defined by the IUPAC : A, T, G and C, obviously, but also N to represent any base, R to represent a purine, Y a pyrimidine and so on.
alphabet array = ("A", "T", "G", "C", "N", "Y", "R", "S", "B", "D", "H", "V", "M", "K", "W");We then need to seed the genepool. To do this, we simply create a new array and loop from 1 to k times, where k is a user defined variable specifying how big the initial genepool should be. On each loop we push a randomly generated string made up of letters from the alphabet we defined above onto the genepool array.
The next step is to assess the fitness of the genepool. To do this we'll need to implement a fitness function. Remember that a fitness function takes a solution as input and returns a score: the higher the score, the better the solution meets our needs. In our case, the fitness function will be taking a consensus motif as input and we need to return how overrepresented (or not) that motif is in the test set, compared to the controls.
function fitness(motif) {
testset_count = count(motif, testset sequences array)
control_count = count(motif, control sequences array)
# in what percentage of each set of sequences is the motif found?
testset_percentage = ( testset_count / number of testset sequences ) * 100
control_percentage = ( control_count / number of control sequences ) * 100
return testset_percentage - control_percentage
}
function count (motif, sequences array) {
count = 0
foreach sequence in sequences array {
if motif in sequence then count++
}
return count
}
In the psudeocode above, the fitness function calculates the percentage of test set sequences and the percentage of control sequences that contain the motif supplied to the function as input. It then returns the difference between them such that a motif found in all of the test set sequences but none of the controls will be scored 100 and a motif found in all of the controls but none of the test set will be scored -100.
When the genepool contains a motif with a high enough fitness, the algorithm will end. As endpoints depend on the fitness function that we are using we need to define how high "high enough" is. This will depend on the data, but in our case let's set it to 75 so that we're looking for motifs that appear in at least 75% of the test set sequences and no more than 25% of the control set (i.e. the difference between the test set percentage and the control set percentage should be at least 75)
If the genepool does contain any motifs with a fitness of 75 or more, we should finish the algorithm and print out that motif. Otherwise we move on to the next step, which is sorting the motifs in the genepool in order of fitness. This is simply to make it easier for us to pick which motifs we are going to keep for the next generation of solutions and which motifs are going to die off (it's a cruel world, but survival of the fittest rules in GA).
There are lots of different ways of picking solutions to carry on to the next generation. In general it's a bad idea simply to skim the cream from the sorted genepool array - that is, all of the top scoring motifs and none of the motifs that scored poorly. This is because keeping in some of the motifs that aren't particularly good solutions keeps the gene pool diverse and prevents premature convergence (the point where a GA can't get any better because of the genepool it has to work with) on a poor solution. One method of choosing the next generation of the genepool is to pick out a certain number of motifs with probabilities commensurate with their rank in the sorted array: so the top ranking motif has a much higher chance of being picked than the bottom ranking motif.
We remove all of the motifs not chosen to make up the next generation from the genepool array. To make up for their loss we need to "breed" new solutions by repeatedly picking a pair of motifs in the genepool at random and then producing a child motif by "crossover".
function make_child(parent_motif_a, parent_motif_b) {
half_a = random half of parent_motif_a
half_b = random half of parent_motif_b
return concat(half_a, half_b)
}
Our breeding function picks one half of each parent sequence at random, combines them and returns the result - a child motif. It's not necessary for the two parent motifs to be of the same length or fitness or anything else.
Once we've produced a certain number of child solutions, we determine whether or not any of the solutions in the genepool will mutate at random. The chance of any individual solution mutating in any one generation should be small - 2%, say. There's a balance to be struck here between introducing diversity into the genepool - giving the algorithm more to work with - and keeping the genepool static enough for a solution to evolve (as in real life, evolution works over many generations).
If a motif is to be mutated then we should change it somehow. The simplest change is to pick a letter of the sequence at random and to change it (in classical GAs it would be to pick a bit at random and to flip it). This is roughly analogous to a SNP.
That's not the only possibility, though. We could also perform operations analogous to:
- point insertions : insert a single new letter somewhere in the motif
- point deletions : delete a single letter picked at random from the motif
- inversions : select a substring of the motif at random and invert it (so GAC becomes CAG)
- gross deletions : select a substring of the motif at random and delete it
- duplications : select a substring of the motif at random and duplicate it (so GAC becomes GACGAC)
There's always the possibility that we'll never reach our endpoint, for whatever reason (the GA reaches a local maxima - that is, it keeps picking solutions that look good in the short term but can't be mutated in the long term to reach a high enough fitness - or perhaps our endpoint is just too high). We should therefore put a cap on the number of generations to run for. Again, this is fairly arbitrary as it depends on the nature of the data that we're working with.
What to do with any consensus motifs found by GAs like this one? Well, for one thing you could collect several of them and then use their presence / absence as features for an SVM classifier capable of discriminating between sequences similar to your test set and controls. Or perhaps you just want to find a single consensus motif for a particular binding site, or you want to find overrepresented sequences to assess their biological significance.
That's all there is to it. The interesting thing about genetic algorithms is that the real work isn't in implementing the algorithm per se but in designing the input format, fitness function and in picking the endpoint. The fitness function is particularly important: the suitability of any results that a GA might produce is inextricably linked to the suitability of the fitness function and it gets called once for every solution every generation - so it needs to be fairly quick to calculate.
Hope you find the above useful. Check out the code for more. Feel free to email with any comments, suggestions, questions or changes.
Anonymous
. This post has trackbacks.
