Flags and Lollipops

Thursday, June 29, 2006

Hunting for disease genes

Imagine that you want to find the genetic basis for a particular human hereditary disease. You've got a large family (or several families) of affected individuals, some carefully screened unaffected individuals to act as controls, and a research lab. Where do you start?

Up until now the traditional method - at least for relatively simple, single gene disorders like Huntington's or cystic fibrosis - has been positional cloning. The idea is to use successive rounds of linkage analysis to pinpoint the region in which the gene responsible lies, to assess the potential functions of all the genes in that region and finally to screen each possible candidate gene for relevant mutations.

Check out the graph to the left, which is from a 2002 paper in Science by Glazier et al. Well over one and a half thousand Mendelian disease genes have been mapped successfully, with the rate growing almost exponentially - the positional cloning approach seems to be working well. With simple (genetically speaking) disorders, linkage analysis can pinpoint relatively small regions - a megabase or so. It's then pretty straightforward to examine each gene, promoter or other interesting-looking area of the region in turn.

The problem comes when you try to apply the same strategy to more complex disorders. Diabetes, schizophrenia, autism, arthritis... these are all diseases whose genetic component involves more than one gene. As you can tell from the blue line on the graph, progress identifying the genes involved in complex disease has been much, much slower.

This is because in complex traits the link between phenotype (the disease) and genotype (causative mutation) at any one point in the genome is usually weak, so linkage analysis ends up "pinpointing" several areas of the genome that are each tens or hundreds of centiMorgans long. These regions of interest can contain hundreds of genes and cloning and then investigating each one in the lab would be a bitch.

Instead investigators rely on the candidate gene approach, which basically means "examine the good candidates first" (that's it - it's not a formal methodology, just common sense with a puffed-up name). Of course, this still involves a lot of literature searching, expression profiling and so on to work out which genes are good candidates in the first place.

This is where bioinformatics can help: software can do the donkey work involved in deciding which genes in a large region of interest to prioritize for further study. A number of different groups are already active in the field. Typically, their software packages look at the functional annotation of the genes in question.

This is because there's a longstanding assumption in the field of candidate prioritization that genes involved in the same complex disease will have similar functions, or be involved in the same or similar biological pathways. This is intuitively appealing, though I've yet to see anybody try to back it up with hard evidence. At this point there are so many groups whose systems rely on the premise that there's a bit of an elephant in the middle of the room thing going on.

POCUS (Turner et al.), for example, takes in a list of regions of interest and examines the Gene Ontology (GO) terms and InterPro domains assigned to the genes within those regions. If two genes in different regions have similar functions - to a statistically significant extent - they are flagged up as possible disease gene candidates.
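To make the cross-region idea concrete, here is a minimal Python sketch. It is not the POCUS algorithm: the gene names, region labels, term IDs and background frequencies are all invented for illustration, and a simple rarity cutoff (the hypothetical `max_frequency` parameter) stands in for a proper significance test.

```python
from itertools import combinations

# Hypothetical input: region -> {gene: set of annotation term IDs}.
regions = {
    "locus_A": {"GENE1": {"GO:0000001", "GO:0000002"}, "GENE2": {"GO:0000003"}},
    "locus_B": {"GENE3": {"GO:0000002", "GO:0000003"}, "GENE4": {"GO:0000004"}},
}

# Hypothetical background: fraction of all genes in the genome carrying each term.
term_frequency = {
    "GO:0000001": 0.20,
    "GO:0000002": 0.001,
    "GO:0000003": 0.30,
    "GO:0000004": 0.05,
}

def shared_term_pairs(regions, term_frequency, max_frequency=0.01):
    """Return (gene_a, gene_b, shared_rare_terms) for genes in different regions.

    Only terms whose background frequency is at or below max_frequency count,
    because sharing a very common term is unsurprising.
    """
    hits = []
    # Compare every pair of distinct regions, never genes within one region.
    for (_, genes_a), (_, genes_b) in combinations(regions.items(), 2):
        for gene_a, terms_a in genes_a.items():
            for gene_b, terms_b in genes_b.items():
                rare = {
                    term
                    for term in terms_a & terms_b
                    # Terms with unknown frequency are treated as common (1.0).
                    if term_frequency.get(term, 1.0) <= max_frequency
                }
                if rare:
                    hits.append((gene_a, gene_b, sorted(rare)))
    return hits

print(shared_term_pairs(regions, term_frequency))
```

With the toy data above, only the GENE1/GENE3 pair is flagged: GENE2 and GENE3 also share a term, but it is carried by 30% of genes and so tells you very little.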

SUSPECTS (Adie et al.) works on similar lines, though it relies on being given a "training set" of genes that the researcher suspects may be involved in disease etiology, in an attempt to bring the human element back into the equation.

There are other approaches - filtering out unsuitable candidates by looking at their expression profiles (Tiffin et al.), for example. More recently, some groups have examined protein interaction networks, assuming that genes involved in the same disease will be clustered together (Franke et al.).

There are, of course, problems with all of these approaches.

Looking at the percentage of genes in the Ensembl database that have GO annotation associated with them, one might think that most of the human genome has been functionally characterized. This, obviously, is untrue. Much of the annotation has been added by algorithms working at a relatively coarse level. Unfortunately, to find statistically significant matches in GO annotation you need genes to have highly specific (and thus rare) terms assigned to them, so many genes can't be processed by software that works solely with Gene Ontology terms.

Obtaining normalized expression data can also be difficult - and that's before having to map, for example, Affymetrix probe set IDs to RefSeq IDs and then to Ensembl gene or transcript IDs. There's also missing data to deal with and the small question of what kind of similarity measure to use.
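The ID-mapping chore looks roughly like the sketch below. Everything in it is hypothetical: the lookup tables, the placeholder IDs and the `map_probes` helper are made up for illustration, and real mappings would come from annotation files rather than hand-written dictionaries. The point is that each hop in the chain loses some probes, either because there's no match or because the match is ambiguous.

```python
# Hypothetical lookup tables with placeholder IDs (not real accessions).
probe_to_refseq = {
    "probe_1": ["NM_A"],
    "probe_2": ["NM_B", "NM_C"],  # two transcripts of the same gene
    "probe_3": [],                # no RefSeq match at all
    "probe_5": ["NM_A", "NM_B"],  # transcripts from two different genes
}
refseq_to_ensembl = {
    "NM_A": "ENSG_X",
    "NM_B": "ENSG_Y",
    "NM_C": "ENSG_Y",
}

def map_probes(probe_ids, probe_to_refseq, refseq_to_ensembl):
    """Map probe IDs to gene IDs via RefSeq, keeping only unambiguous hits.

    Returns (mapped, unmapped): a dict of probe -> gene, and a list of probes
    that hit zero genes or more than one gene.
    """
    mapped = {}
    unmapped = []
    for probe in probe_ids:
        genes = {
            refseq_to_ensembl[refseq]
            for refseq in probe_to_refseq.get(probe, [])
            if refseq in refseq_to_ensembl
        }
        if len(genes) == 1:
            mapped[probe] = genes.pop()
        else:
            unmapped.append(probe)
    return mapped, unmapped

probes = ["probe_1", "probe_2", "probe_3", "probe_4", "probe_5"]
mapped, unmapped = map_probes(probes, probe_to_refseq, refseq_to_ensembl)
print(mapped)
print(unmapped)
```

Here three of the five toy probes fall out along the way, which is the kind of attrition that makes expression-based filtering harder than it sounds.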

Protein interaction networks are, unfortunately, notoriously incomplete. Furthermore, if derived from literature reports they are heavily biased towards known disease genes - simply because they are the genes investigated most frequently.

The future may lie in combining multiple lines of evidence somehow, as Aerts et al. have done recently. Last month a collaborative effort in NAR combined the results from several different pieces of software - each of which used a different approach - to produce putative candidates for obesity and type 2 diabetes, with promising results.
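One simple way to picture "combining multiple lines of evidence" is to let each method rank the candidates and then average the ranks. The sketch below does just that. It is not the method used by Aerts et al. or by the NAR collaboration, merely the most basic form of rank aggregation; the method names, gene names and the `combine_rankings` helper are all hypothetical.

```python
def combine_rankings(rankings):
    """Merge several ranked gene lists into one list ordered by mean rank.

    rankings: method name -> list of genes, best candidate first.
    A gene missing from a method's list gets that list's worst rank plus one,
    so methods that can't score a gene penalize it rather than ignore it.
    """
    all_genes = set().union(*rankings.values())
    mean_rank = {}
    for gene in all_genes:
        ranks = [
            ordered.index(gene) + 1 if gene in ordered else len(ordered) + 1
            for ordered in rankings.values()
        ]
        mean_rank[gene] = sum(ranks) / len(ranks)
    # Sort by mean rank, breaking ties alphabetically for a stable result.
    return sorted(mean_rank, key=lambda gene: (mean_rank[gene], gene))

# Hypothetical output from three different prioritization approaches.
rankings = {
    "annotation": ["GENE1", "GENE3", "GENE2"],
    "expression": ["GENE3", "GENE1"],  # GENE2 had no usable expression data
    "network": ["GENE3", "GENE2", "GENE1"],
}

print(combine_rankings(rankings))
```

How missing genes are handled matters a lot here: penalizing them, as this sketch does, pushes poorly characterized genes down the list, which is exactly the bias discussed above.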

I think candidate prioritization is an exciting field and an area in which relatively simple bioinformatics tools can do some real good - saving labs time and money (and making life easier for those grad students who'd otherwise do the monkey work). It'll be interesting to see how things develop - unless HapMap renders all other methods of finding disease genes obsolete...


2 Comments:

At August 01, 2006 4:30 PM, Anonymous Lude Franke said...

Thank you for writing this overview of various computational methods for the identification of candidate genes and the problems they pose.

However, I would like to emphasize that the paper you mentioned regarding the application of protein interaction networks to prioritize positional candidates, deals with many of the issues you raise.

As principal author of this paper, I can say that we were aware that decent annotation is available for only ~50% of all genes. As such, we put considerable effort into identifying interactions for the functionally still unknown ones by relying upon co-expression data.

While this strategy is far from perfect, and novel high-throughput techniques that employ other experimental setups would thus be very welcome, until they arrive motif and co-expression analysis can sometimes provide insight into what could be going on with a particular gene.

Kind regards,

Lude Franke

 
At November 24, 2006 10:19 AM, Anonymous jansenkoe said...

This post has been removed by a blog administrator.

 
