Is there a super-semantic-web-enabled phenotype database out there*? I want to ask a question like 'give me a list of monogenic disorders whose locus has been confirmed by at least two labs, broken down by type of causative mutation' and get an answer.
(* on a tangent: is 23andMe's gene book thing freely available?)
OMIM falls quite a long way short of this... it never set out to be a resource for programmatic access so you can't really blame them. The morbid map is available for download and contains all of the gene -> disorder mappings in their database.
A couple of issues:
- OMIM's weird entry categorization system (#*%+...) is very confusing. There are 2229 'phenotypes' (note: not 'Mendelian phenotypes') with a known molecular basis in the database, apparently, but only 386 genes with a phenotype associated with them? Some of those phenotypes are going to be caused by gross insertions / deletions / whatever rather than small mutations in single genes, and multiple phenotypes might arise from different mutations in the same gene, but even so... what's with the disparity?
- It contains polygenic disorders (diabetes, schizophrenia) as well as monogenic ones
- You can't tell which is which - you could count the number of genes associated with the disorder but a 'monogenic' disorder might be a complex one whose OMIM entry hasn't been updated yet
- It's not a disease database - it has other phenotypes in it too. Longevity? Wet or dry ear wax? Novelty seeking personality?
The last point is interesting, really. When is a phenotype a disease? If you have a novelty seeking personality and so are relatively impulsive and prone to climbing mountains, swimming with sharks, cycling without a helmet etc. then are you ill?
Well, no, is the obvious answer. But where do you draw the line? Is autism a disease?
Neh. Beyond our remit. For us a monogenic disease = a clinically recognized disorder with a single, genetic cause.
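Going back to the morbid map for a second: the 'count the number of genes associated with the disorder' idea from the list above could be sketched roughly like this. It assumes a pipe-delimited layout of disorder | gene symbols | gene MIM number | cytogenetic location, which you should check against the file you actually download. The sample lines, gene names and MIM numbers are all made up, and the function names are mine. As noted above, a disorder with one listed gene is only a candidate monogenic disorder, not a confirmed one.

```python
from collections import defaultdict

# Invented sample lines in the ASSUMED layout:
# disorder | gene symbols | gene MIM number | cytogenetic location
SAMPLE = """\
Hypothetical disorder A, 100001 (3)|GENE1, G1ALIAS|200001|1p36.1
Hypothetical disorder B, 100002 (3)|GENE2|200002|2q11
Hypothetical disorder B, 100002 (3)|GENE3|200003|7p15
{Hypothetical susceptibility C}, 100003 (3)|GENE1, G1ALIAS|200001|1p36.1
"""


def genes_per_disorder(text):
    """Map each disorder string to the set of gene MIM numbers listed for it."""
    mapping = defaultdict(set)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("|")
        if len(fields) < 3:
            continue  # not a line in the assumed layout
        disorder, gene_mim = fields[0].strip(), fields[2].strip()
        mapping[disorder].add(gene_mim)
    return mapping


def split_by_gene_count(mapping):
    """Return (single-gene disorders, multi-gene disorders), each sorted."""
    single = sorted(d for d, genes in mapping.items() if len(genes) == 1)
    multi = sorted(d for d, genes in mapping.items() if len(genes) > 1)
    return single, multi


mapping = genes_per_disorder(SAMPLE)
single, multi = split_by_gene_count(mapping)
print("one gene listed:", single)
print("several genes listed:", multi)
```

Keying on the gene MIM number rather than the symbol column avoids counting aliases of the same gene twice.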
The question
Monogenic (classically Mendelian) disorders are caused by mutations or errors in a single gene. Many of these gene -> disease mappings have been discovered and are listed in OMIM, the Online Mendelian Inheritance in Man database.
There are ~ 1,000 'disease' genes (genes that give rise to a particular monogenic disorder when mutated in a particular way) listed in OMIM. If you compare this set to other genes some interesting differences become apparent [1, 2] (only two paragraphs before reference to own paper; this is just like a real writeup!). Check out the table below from [1]:

Median gene length (*) is particularly interesting; disease genes have a median length of 27k while the control set sits at 19k. Why?
(* the longest known transcript of each gene was used)
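For the record, here is a minimal sketch of how a 'median gene length, longest transcript per gene' figure like the one above can be computed. This is not the pipeline used in [1]; the gene names, lengths and function names are invented purely for illustration.

```python
from statistics import median


def longest_transcript_per_gene(transcripts):
    """Collapse (gene, transcript_length) pairs to the longest length per gene."""
    longest = {}
    for gene, length in transcripts:
        if length > longest.get(gene, 0):
            longest[gene] = length
    return longest


def median_gene_length(transcripts, gene_set):
    """Median of the longest-transcript lengths for the genes in gene_set."""
    longest = longest_transcript_per_gene(transcripts)
    lengths = [longest[g] for g in gene_set if g in longest]
    return median(lengths)


# Invented toy data: (gene, transcript length in bp)
TRANSCRIPTS = [
    ("GENE_A", 12000), ("GENE_A", 30000),
    ("GENE_B", 25000),
    ("GENE_C", 8000), ("GENE_C", 19000),
    ("GENE_D", 40000),
    ("GENE_E", 15000),
]
disease_genes = {"GENE_A", "GENE_B", "GENE_D"}  # hypothetical disease set
control_genes = {"GENE_C", "GENE_E"}            # hypothetical control set

disease_median = median_gene_length(TRANSCRIPTS, disease_genes)
control_median = median_gene_length(TRANSCRIPTS, control_genes)
print(disease_median, control_median)
```

Using the median rather than the mean keeps a handful of enormous genes from dominating the comparison.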
Some plausible sounding explanations
- Study bias: genes known to be responsible for disease have by definition been studied in a lab. Gene finding is an inexact science; perhaps automated systems tend to miss the last few exons and it takes a human in a wet lab to find the longer transcripts?
- Larger genes are 'less important': older, more conserved genes tend to be smaller [ref needed]. Mutations in newer, larger genes may be more likely to have no effect or give rise to a new phenotype (like monogenic disease) while mutations in these older presumably more important genes might be fatal at a very early stage.
- Correlation with some other feature: larger gene sizes are correlated (to different extents) with things like larger numbers of exons, longer 3' and 5' UTRs and expression patterns [3]. Could it be, for example, that monogenic diseases tend to affect one particular area rather than being systemic? If so, maybe the disease gene set is larger because larger genes tend to be more tissue specific.
Our approach
Let's start off by revisiting the data from [1] and making sure that the gene size / disease correlation still holds up, then throw in a few more features to look at (back in 2005 it was difficult to get normalized expression data for the control set) and search the literature for any other theories or related findings.
After that we can test a few possible explanations.
[1] Speeding disease gene discovery by sequence based candidate prioritization
[2] Human disease genes: patterns and predictions
[3] Elevated rates of protein secretion, evolution, and disease among tissue-specific genes
I've decided to get back into 'proper' science. For a week, anyway; I'm not stupid (well, stupid enough to do this in my spare time, but yeah...).
Here's the plan:
- ask an interesting yet niche and relatively simple question
- use bioinformatics tools and awesome science 2.0 websites to find the answer
- keep track of progress on this blog
- put together manuscript and submit to Precedings
- use backdoor into Nature Genetics manuscript tracking system to get paper accepted
This may not make for exciting reading - we'll see.
NPG is recruiting (a publishing / managerial role):
Head of Community Business Development
This person will play a central role in NPG's evolution as a scientific communication company. They will be based in London or New York and will report to the Publishing Director, Nature.com. This role will focus on using online approaches to develop a better understanding of, and deeper relationships with, each of our users. By serving them better we intend ultimately to attract attention and usage from all professional scientists, and by using these services as the foundation for new businesses we intend to continue NPG's rapid evolution as an online scientific communication company. This role will involve line management responsibility for our existing social software teams, as well as the appointment of further staff in the areas of online marketing and web statistics. We are seeking someone with a clear strategic sense of how the web is evolving, sufficient technical knowledge to work closely with software developers, a clear strategic vision for the future of communities on Nature.com, and experience in developing, promoting and running successful participative websites.
Egon and Noel have a paper in BMC Bioinformatics this month describing userscripts for the life sciences... nice work, guys.
Last year there was a discussion over at Pedro's of the merits of publishing individual userscripts after Ben Good's paper about a Greasemonkey based iHOP enhancement appeared in BMC. This is more of a review.
We discussed the possibility of hosting a science mashups / web services wiki at NPG - sort of like ProgrammableWeb, but listing only the APIs, databases and tools relevant to science. This sort of ties in with the post over at Nodalpoint that Alf wrote about documenting bioinformatics APIs. There's enough stuff available nowadays for it to be a useful resource, I think.
Incidentally: I started writing this post BEFORE I read the paper properly and realised that I got a namecheck for Postgenomic. Now I definitely recommend it. ;p
Labels: api, greasemonkey, mashups
Via Alf's delicious bookmarks - Linden Lab has released a beta version of the Second Life client that uses Windmark's 'atmospheric rendering technology'.
Big difference, no?