Prioritizing candidate genes with CAESAR
This week in Bioinformatics Advance Access there's a paper by Kyle Gaulton (watch out, PDF) from the Mohlke lab at UNC describing their new system, called CAESAR (nice name, which is a good start).
CAESAR is remarkably cool. Here's how it works:
1. You give it a text corpus to work from: some review articles about the disease that you're interested in, or an OMIM entry, for example.
2. It extracts all of the gene symbols from that corpus.
3. Again using the corpus, it finds relevant terms from the Gene Ontology, eVOC and MGD's ontology of mammalian phenotypes.
4. It expands the set of genes from (2) by looking for interaction partners in BIND and KEGG and for similar proteins in iPro, and by using the ontology terms from (3) to find relevant mouse knockouts, genes that have known associations with similar phenotypes, and genes that are expressed in the same tissues.
5. It combines the resulting large sets of genes and ranks them mathemagically to produce the final ranked list.
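To make the flow of those steps concrete, here is a toy sketch in Python. It is not CAESAR's code: the function names, the placeholder gene symbols (GENE_A and friends), the tiny lookup tables and, above all, the ranking rule (count how many kinds of evidence support each gene) are assumptions made up for illustration. The paper's actual scoring is more involved than this.

```python
from collections import defaultdict


def extract_gene_symbols(corpus, known_symbols):
    """Step 2: known gene symbols that appear as whole tokens in the corpus."""
    tokens = {tok.strip(".,;:()[]") for tok in corpus.split()}
    return tokens & known_symbols


def match_ontology_terms(corpus, ontology_terms):
    """Step 3: ontology terms whose name occurs in the corpus (case-insensitive)."""
    text = corpus.lower()
    return {term for term in ontology_terms if term.lower() in text}


def expand_genes(seed_genes, terms, interactions, term_to_genes):
    """Step 4: record which kinds of evidence support each gene."""
    evidence = defaultdict(set)
    for gene in seed_genes:
        evidence[gene].add("corpus")
        for partner in interactions.get(gene, ()):
            evidence[partner].add("interaction")
    for term in terms:
        for gene in term_to_genes.get(term, ()):
            evidence[gene].add("ontology")
    return evidence


def rank_genes(evidence):
    """Step 5 (placeholder rule): more evidence types first, ties alphabetical."""
    return sorted(evidence, key=lambda g: (-len(evidence[g]), g))


# Hypothetical toy inputs; none of this is real annotation data.
corpus = "Variants in GENE_A and GENE_B affect insulin secretion in the pancreas."
known_symbols = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
ontology_terms = {"insulin secretion", "bone density"}
interactions = {"GENE_A": {"GENE_C"}}
term_to_genes = {"insulin secretion": {"GENE_C", "GENE_D"}}

seeds = extract_gene_symbols(corpus, known_symbols)
terms = match_ontology_terms(corpus, ontology_terms)
evidence = expand_genes(seeds, terms, interactions, term_to_genes)
ranked = rank_genes(evidence)
print(ranked)
```

In this toy run GENE_C ends up first even though it never appears in the corpus, because it is supported by both an interaction and an ontology term. That is the appeal of the expansion step: genes nobody has written about yet can still surface.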
"[it] relies on human expert knowledge in order to function effectively, but it does not require that the user actually possess all of this knowledge."
CAESAR is not without issues. In particular there's a bias towards genes that are more heavily annotated - the manuscript points out that the mean number of GO terms for genes ranked in the top 98th percentile of their test sets was significantly higher than the number of terms for all genes.
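If you wanted to check for this kind of annotation bias in your own ranked list, something like the following would do as a first pass. It is a rough sketch under my own assumptions: the GO term counts are invented, the gene names are placeholders, and the permutation test is simply one convenient way to ask "is this mean unusually high?", not necessarily the test used in the manuscript.

```python
import random
from statistics import mean


def annotation_bias(go_counts, top_genes, n_perm=2000, seed=0):
    """Compare the mean GO-term count of top-ranked genes with all genes.

    Returns (mean_top, mean_all, p_value). p_value is the fraction of random
    gene sets of the same size whose mean count is at least mean_top.
    """
    rng = random.Random(seed)
    all_genes = sorted(go_counts)
    mean_top = mean(go_counts[g] for g in top_genes)
    mean_all = mean(go_counts.values())
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(all_genes, len(top_genes))
        if mean(go_counts[g] for g in sample) >= mean_top:
            hits += 1
    return mean_top, mean_all, hits / n_perm


# Hypothetical GO term counts per gene.
go_counts = {
    "GENE_A": 12, "GENE_B": 9, "GENE_C": 2,
    "GENE_D": 1, "GENE_E": 3, "GENE_F": 0,
}
top_genes = ["GENE_A", "GENE_B"]

mean_top, mean_all, p_value = annotation_bias(go_counts, top_genes)
print(mean_top, mean_all, p_value)
```

With only six genes the p-value can never get very small, so treat the toy numbers as a demonstration of the mechanics rather than of the effect size.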
Despite some cheeky use of misleading language in the results section ("we addressed this potential bias" means "we proved that the bias wasn't potential at all but real, then moved swiftly on" rather than "we addressed the problem and fixed it") there's not really any discussion of how future systems could avoid the same issue.
The worst side-effect of relying on annotation is that only 15,000 human genes (~50%?) have enough good-quality annotation from different sources to do anything with at all. This percentage will increase over time, but until then there must be other sources of data that we can use (Lude Frank left a comment about this on last year's post).
There's also a potential issue with the way that CAESAR was tested using a set of genes already known to be involved in a complex trait. Gaulton et al. cleaned up the corpus for each test gene by removing any direct references to it and restricting the papers included to those published before the year of association. Even so, might bias not remain in places like BIND, KEGG and iPro, as a result of subsequent gene-driven research into the trait's etiology?
You'd expect, for example, that once a new gene was implicated in a disease somebody somewhere would immediately check to see if it interacts with the other candidate genes for that disease (mentioned in the literature corpus used during testing) - placing the results into BIND. OK, it's a bit of a weak correlation, but still...
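For reference, the corpus clean-up described above (drop papers from the year of association onwards, remove direct mentions of the test gene) might look roughly like this. The `Paper` class, the `clean_corpus` function and the choice to mask the symbol with a placeholder rather than delete the surrounding text are all my assumptions, not details from the paper.

```python
import re
from dataclasses import dataclass


@dataclass
class Paper:
    year: int
    text: str


def clean_corpus(papers, test_gene, association_year, mask="[GENE]"):
    """Keep papers published before association_year and mask test_gene mentions."""
    pattern = re.compile(rf"\b{re.escape(test_gene)}\b")
    return [
        Paper(p.year, pattern.sub(mask, p.text))
        for p in papers
        if p.year < association_year
    ]


# Hypothetical corpus with placeholder gene symbols.
papers = [
    Paper(1998, "GENE_A is linked to the trait."),
    Paper(2003, "GENE_A and GENE_B interact."),
    Paper(1995, "GENE_B regulates glucose uptake."),
]
cleaned = clean_corpus(papers, "GENE_A", association_year=2000)
for paper in cleaned:
    print(paper.year, paper.text)
```

Note what this does not do: it only touches the text corpus. An interaction record that entered BIND after the association was published would sail straight through, which is exactly the leak I'm worried about.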
Anyway, all that aside, it's a nice piece of software (and freely available!). I'd be interested to hear if CAESAR is going to be developed any further.
