Flags and Lollipops

Friday, March 02, 2007

Prioritizing candidate genes with CAESAR

Last year I posted about disease gene prediction - using computational methods to prioritize candidate genes for further (human) study. It's a relatively busy field: there are half a dozen systems out there that can all help narrow down large lists of genes with varying degrees of success.

This week in Bioinformatics Advance Access there's a paper by Kyle Gaulton (watch out, PDF) from the Mohlke lab at UNC describing their new system, called CAESAR (nice name, which is a good start).

CAESAR is remarkably cool. Here's how it works:
  1. You give it a text corpus to work from - some review articles about the disease that you're interested in or an OMIM entry, for example
  2. It extracts all of the gene symbols from that corpus
  3. Again using the corpus, it finds relevant terms from the Gene Ontology, eVOC and MGD's ontology of mammalian phenotypes
  4. It expands the set of genes from (2) in two ways: by looking for interaction partners in BIND and KEGG and similar proteins in iPro, and by using the ontology terms from (3) to find relevant mouse knockouts, genes that have known associations with similar phenotypes, and genes that are expressed in the same tissues.
  5. It combines the resulting large sets of genes and ranks them mathemagically to produce the final ranked list.
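
To make the five steps concrete, here is a toy sketch in Python. It is not CAESAR's code: the function names, the in-memory dictionaries standing in for BIND, KEGG and the ontologies, the made-up gene symbols, and above all the scoring rule in step 5 (simply counting how many sources support a gene) are my own assumptions for illustration. The paper's actual ranking method is more involved.

```python
from collections import defaultdict


def extract_gene_symbols(corpus, known_symbols):
    """Step 2: known gene symbols that appear as whole tokens in the corpus."""
    tokens = {tok.strip(".,;:()[]") for tok in corpus.split()}
    return tokens & known_symbols


def match_ontology_terms(corpus, term_names):
    """Step 3: ontology term names that occur verbatim in the corpus."""
    text = corpus.lower()
    return {term for term in term_names if term.lower() in text}


def expand_candidates(seed_genes, terms, interactions, annotations):
    """Step 4: collect, per gene, the set of sources that support it.

    interactions: gene -> set of partner genes (stand-in for BIND/KEGG).
    annotations:  ontology term -> set of annotated genes (stand-in for
                  knockout, phenotype and expression annotation).
    """
    evidence = defaultdict(set)
    for gene in seed_genes:
        evidence[gene].add("corpus")
        for partner in interactions.get(gene, ()):
            evidence[partner].add("interaction")
    for term in terms:
        for gene in annotations.get(term, ()):
            evidence[gene].add("ontology")
    return evidence


def rank_candidates(evidence):
    """Step 5 (assumed rule): more supporting sources ranks higher."""
    return sorted(evidence, key=lambda gene: (-len(evidence[gene]), gene))


# Toy inputs; every symbol and link below is invented.
corpus = "Variants in GENEA and GENEB affect insulin secretion."
known_symbols = {"GENEA", "GENEB", "GENEC", "GENED"}
interactions = {"GENEA": {"GENEC"}, "GENEB": {"GENEC", "GENED"}}
annotations = {"insulin secretion": {"GENEA", "GENEC"}}

seeds = extract_gene_symbols(corpus, known_symbols)
terms = match_ontology_terms(corpus, {"insulin secretion", "bone density"})
ranking = rank_candidates(expand_candidates(seeds, terms, interactions, annotations))
print(ranking)
```

Note that GENEC never appears in the corpus but still ranks above GENEB, because two independent sources point at it. That is the appeal of the approach: the user supplies text, and the system supplies the connections.
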
Anyway, I was impressed. I really like the basic idea:
"[it] relies on human expert knowledge in order to function effectively, but it does not require that the user actually possess all of this knowledge."
CAESAR is not without issues. In particular there's a bias towards genes that are more heavily annotated - the manuscript points out that the mean number of GO terms for genes ranked in the top 2% (the 98th percentile and above) of their test sets was significantly higher than the mean for all genes.

Despite some cheeky use of misleading language in the results section ("we addressed this potential bias" means "we proved that the bias wasn't potential at all but real, then moved swiftly on" rather than "we addressed the problem and fixed it") there's not really any discussion of how future systems could avoid the same issue.
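
If you wanted to check a ranked list of your own for the same annotation bias, the comparison is easy to sketch: take the mean GO term count of the top-ranked genes and ask how often a random set of the same size does as well. The function below is an illustration under my own assumptions (the name `annotation_bias`, the 2% cut-off as the default, and the choice of a permutation test); the manuscript may well have used a different statistical test.

```python
import random
from statistics import mean


def annotation_bias(ranked_genes, go_term_counts, top_fraction=0.02,
                    n_permutations=2000, seed=0):
    """Compare mean GO term count of top-ranked genes with all genes.

    Returns (mean_top, mean_all, p_value), where p_value estimates how
    often a random gene set of the same size has a mean at least as
    high as the top-ranked set (one-sided permutation test).
    """
    counts = [go_term_counts.get(gene, 0) for gene in ranked_genes]
    n_top = max(1, int(len(ranked_genes) * top_fraction))
    mean_top = mean(counts[:n_top])

    rng = random.Random(seed)  # fixed seed so the result is repeatable
    hits = sum(
        mean(rng.sample(counts, n_top)) >= mean_top
        for _ in range(n_permutations)
    )
    # Add-one correction keeps the estimate away from an impossible zero.
    p_value = (hits + 1) / (n_permutations + 1)
    return mean_top, mean(counts), p_value


# Toy data: 100 invented genes where rank tracks annotation depth exactly.
ranked = [f"G{i}" for i in range(100)]
go_counts = {gene: 100 - i for i, gene in enumerate(ranked)}
mean_top, mean_all, p_value = annotation_bias(ranked, go_counts, top_fraction=0.1)
print(mean_top, mean_all, p_value)
```

A small p-value here says the top of the list is richer in annotation than chance would predict. It does not say whether that reflects a flaw in the ranking or the plain fact that well-studied genes are more likely to be known disease genes.
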

The worst side-effect of relying on annotation is that only 15,000 human genes (~ 50%?) have enough quality annotation from different sources to do anything with at all. This percentage will increase over time, but until then there must be other sources of data that we can use (Lude Frank left a comment about this on last year's post).

There's also a potential issue with the way that CAESAR was tested, which used a set of genes already known to be involved in a complex trait. Gaulton et al. cleaned up the corpus for each test gene by removing any direct references to it and restricting the papers included to those published before the year of association. But might bias not remain in places like BIND, KEGG and iPro, as a result of subsequent gene-driven research into the trait's etiology?

You'd expect, for example, that once a new gene was implicated in a disease somebody somewhere would immediately check to see if it interacts with the other candidate genes for that disease (mentioned in the literature corpus used during testing) - placing the results into BIND. OK, it's a bit of a weak correlation, but still...
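
For what it's worth, the corpus clean-up described above might look something like the sketch below. The `Paper` record, the `build_test_corpus` name and the decision to drop whole sentences that mention the test gene are all my assumptions; the paper only says that direct references were removed and later papers excluded. Notice that nothing in this step touches the interaction databases, which is exactly where I suspect the leak is.

```python
import re
from dataclasses import dataclass


@dataclass
class Paper:
    year: int
    text: str


def build_test_corpus(papers, test_gene, association_year):
    """Build a corpus that 'predates' the discovery of test_gene.

    Keeps only papers published before association_year and drops any
    sentence that mentions test_gene (case-insensitive, whole word).
    """
    mention = re.compile(rf"\b{re.escape(test_gene)}\b", re.IGNORECASE)
    kept = []
    for paper in papers:
        if paper.year >= association_year:
            continue  # published in or after the year of association
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", paper.text):
            if sentence and not mention.search(sentence):
                kept.append(sentence)
    return " ".join(kept)


# Toy data; the gene symbol and the sentences are invented.
papers = [
    Paper(1999, "GENEX binds insulin. Obesity is heritable."),
    Paper(2005, "GENEX causes disease."),
    Paper(2001, "Mice lacking genex are lean. Diet matters."),
]
corpus = build_test_corpus(papers, "GENEX", 2003)
print(corpus)
```
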

Anyway, all that aside it's a nice piece of software (and freely available!). I'd be interested to hear if CAESAR is going to be developed any further.
