Hunting for disease genes
Up until now the traditional method - at least for relatively simple, single gene disorders like Huntington's or cystic fibrosis - has been positional cloning. The idea is to use successive rounds of linkage analysis to pinpoint the region in which the gene responsible lies, to assess the potential functions of all the genes in that region and finally to screen each possible candidate gene for relevant mutations.
Check out the graph to the left, which is from a 2002 paper in Science by Glazier et al. Well over one and a half thousand Mendelian disease genes have been mapped successfully, with the rate growing almost exponentially - the positional cloning approach seems to be working well. With simple (genetically speaking) disorders linkage analysis can pinpoint relatively small regions - a megabase or so. It's then pretty straightforward to examine each gene, promoter or other interesting looking area of the region in turn.
The problem comes when you try to apply the same strategy to more complex disorders. Diabetes, schizophrenia, autism, arthritis... these are all diseases whose genetic component involves more than one gene. As you can tell from the blue line on the graph, progress identifying the genes involved in complex disease has been much, much slower.
This is because in complex traits the link between phenotype (the disease) and genotype (causative mutation) at any one point in the genome is usually weak, so linkage analysis ends up "pinpointing" several areas of the genome that are each tens or hundreds of centiMorgans long. These regions of interest can contain hundreds of genes and cloning and then investigating each one in the lab would be a bitch.
Instead investigators rely on the candidate gene approach, which basically means "examine the good candidates first" (that's it - it's not a formal methodology, just common sense with a puffed-up name). Of course, this still involves a lot of literature searching, expression profiling and so on to work out which genes are good candidates in the first place.
This is where bioinformatics can help: software can do the donkey work involved in deciding which genes in a large region of interest to prioritize for further study. A number of different groups are already active in the field. Typically, their software packages look at the functional annotation of the genes in question.
This is because there's a longstanding assumption in the field of candidate prioritization that genes involved in the same complex disease will have similar functions, or be involved in the same or similar biological pathways. This is intuitively appealing, though I've yet to see anybody try to back it up with hard evidence. At this point there are so many groups whose systems rely on the premise that there's a bit of an elephant in the middle of the room thing going on.
POCUS (Turner et al.), for example, takes in a list of regions of interest and examines the Gene Ontology (GO) terms and Interpro domains assigned to the genes within those regions. If two genes on different regions have similar functions - to a statistically significant extent - they are flagged up as possible disease gene candidates.
SUSPECTS (Adie et al.) works on similar lines, though it relies on being given a "training set" of genes that the researcher suspects may be involved in disease etiology, in an attempt to bring the human element back into the equation.
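The training set idea can be sketched in a few lines. To be clear, this is not how SUSPECTS actually scores genes: I'm assuming each gene is described by a set of annotation terms and using Jaccard similarity to the closest training gene as a stand-in score. All the gene names and terms are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_training_similarity(candidates, training_set):
    """Rank candidates by their best similarity to any training gene."""
    scores = {
        gene: max(
            (jaccard(terms, known) for known in training_set.values()),
            default=0.0,  # no training genes supplied
        )
        for gene, terms in candidates.items()
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical genes the researcher already suspects.
training_set = {"KNOWN1": {"term_a", "term_b", "term_c"}}
# Hypothetical genes from the region of interest.
candidates = {"CAND1": {"term_a", "term_b"}, "CAND2": {"term_d"}}

ranked = rank_by_training_similarity(candidates, training_set)
```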
There are other approaches - filtering out unsuitable candidates by looking at their expression profiles (Tiffin et al.), for example. More recently, some groups have examined protein interaction networks, assuming that genes involved in the same disease will be clustered together (Franke et al.).
There are, of course, problems with all of these approaches.
Looking at the percentage of genes in the Ensembl database that have GO annotation associated with them, one might think that most of the human genome has been functionally characterized. This, obviously, is untrue. Much of the annotation has been added by algorithms working at a relatively coarse level. Unfortunately, to find statistically significant matches in GO annotation you need genes to have detailed, specific (and thus rare) terms assigned to them, so many genes can't be processed by software that works solely with Gene Ontology terms.
Obtaining normalized expression data can also be difficult - and that's before having to map, for example, Affymetrix probe set IDs to RefSeq IDs and then to Ensembl gene or transcript IDs. There's also missing data to deal with and the small question of what kind of similarity measure to use.
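The ID mapping chain is worth sketching, if only to show where data quietly goes missing. This assumes you already have two lookup tables loaded as dictionaries; the function name and every ID below are placeholders, not real probe sets or accessions.

```python
def map_probes_to_genes(probe_ids, probe_to_refseq, refseq_to_ensembl):
    """Chain probe -> RefSeq -> Ensembl, keeping track of what falls out."""
    mapped = {}
    unmapped = []
    for probe in probe_ids:
        genes = set()
        # One probe can hit several transcripts, and several transcripts
        # can belong to the same gene, so collect into a set.
        for refseq in probe_to_refseq.get(probe, ()):
            genes.update(refseq_to_ensembl.get(refseq, ()))
        if genes:
            mapped[probe] = sorted(genes)
        else:
            unmapped.append(probe)  # lost at either step of the chain
    return mapped, unmapped


# Placeholder lookup tables.
probe_to_refseq = {
    "probe_1": ["refseq_1", "refseq_2"],
    "probe_2": ["refseq_3"],
}
refseq_to_ensembl = {
    "refseq_1": ["gene_A"],
    "refseq_2": ["gene_A"],
    # refseq_3 has no Ensembl mapping in this toy table
}

mapped, unmapped = map_probes_to_genes(
    ["probe_1", "probe_2", "probe_3"], probe_to_refseq, refseq_to_ensembl
)
```

Here probe_2 is lost at the second step and probe_3 at the first, so only one of three probes survives - it's worth reporting the unmapped list rather than silently dropping it.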
Protein interaction networks are, unfortunately, notoriously incomplete. Furthermore, if derived from literature reports they are heavily biased towards known disease genes - simply because they are the genes investigated most frequently.
The future may lie in combining multiple lines of evidence somehow, as Aerts et al. have done recently. Last month a collaborative effort in NAR combined the results from several different pieces of software - each of which used a different approach - to produce putative candidates for obesity and type 2 diabetes, with promising results.
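One very simple way to combine evidence is to average each gene's rank across tools. This is a sketch of that idea only - it is not the method used by Aerts et al. or by the NAR collaboration, and the tool names, gene names and the rule for unranked genes are my own assumptions.

```python
def combine_rankings(rankings):
    """Combine per-tool rankings (best gene first) by mean rank."""
    all_genes = set().union(*rankings.values())
    combined = {}
    for gene in all_genes:
        ranks = []
        for ordered in rankings.values():
            if gene in ordered:
                ranks.append(ordered.index(gene) + 1)
            else:
                # Assumption: a gene a tool could not score is placed
                # just below that tool's last-ranked gene.
                ranks.append(len(ordered) + 1)
        combined[gene] = sum(ranks) / len(ranks)
    # Lowest mean rank first; ties broken alphabetically.
    return sorted(combined.items(), key=lambda item: (item[1], item[0]))


# Hypothetical output from three different kinds of tool.
rankings = {
    "annotation": ["G1", "G2", "G3"],
    "expression": ["G2", "G1"],  # G3 had no usable expression data
    "network": ["G2", "G3", "G1"],
}

consensus = combine_rankings(rankings)
```

Mean rank treats every tool as equally trustworthy, which is almost certainly wrong in practice; weighting the tools is the obvious next step.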
I think candidate prioritization is an exciting field and an area in which relatively simple bioinformatics tools can do some real good - saving labs time and money (and making life easier for those grad students who'd otherwise do the monkey work). It'll be interesting to see how things develop - unless HapMap renders all other methods of finding disease genes obsolete...
