Open notebook pt2 - question, theories, approach
Monogenic (classically Mendelian) disorders are caused by mutations or errors in a single gene. Many of these gene -> disease mappings have been discovered and are listed in OMIM, the Online Mendelian Inheritance in Man database.
There are ~1,000 'disease' genes (genes that give rise to a particular monogenic disorder when mutated in a particular way) listed in OMIM. If you compare this set to other genes, some interesting differences become apparent [1, 2] (only two paragraphs before a reference to my own paper; this is just like a real writeup!). Check out the table below from [1]:

Median gene length (*) is particularly interesting; disease genes have a median length of 27k while the control set sits at 19k. Why?
(* the longest known transcript of each gene was used)
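To make the "longest known transcript" rule concrete, here is a minimal sketch of how a median gene length could be computed for each set. The input format (a list of gene ID / transcript length pairs), the function names and the toy numbers are all assumptions made up for illustration; they are not the data or the code behind [1].

```python
from statistics import median


def longest_transcript_per_gene(transcripts):
    """Map each gene ID to the length of its longest transcript.

    `transcripts` is an iterable of (gene_id, transcript_length) pairs
    (a hypothetical input format, chosen for this sketch).
    """
    longest = {}
    for gene_id, length in transcripts:
        if length > longest.get(gene_id, 0):
            longest[gene_id] = length
    return longest


def median_gene_length(transcripts, gene_ids):
    """Median of the longest-transcript lengths for the genes in `gene_ids`."""
    longest = longest_transcript_per_gene(transcripts)
    return median(longest[g] for g in gene_ids if g in longest)


# Toy data, invented for the example (not real gene lengths).
transcripts = [
    ("A", 100), ("A", 300),
    ("B", 200),
    ("C", 50), ("C", 70),
    ("D", 500),
]
disease_genes = {"A", "D"}
control_genes = {"B", "C"}

disease_median = median_gene_length(transcripts, disease_genes)
control_median = median_gene_length(transcripts, control_genes)
```

Picking the longest transcript means a gene with many short isoforms and one long one is counted at its long length, which is worth keeping in mind when thinking about the study bias explanation.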
Some plausible sounding explanations
- Study bias: genes known to be responsible for disease have by definition been studied in a lab. Gene finding is an inexact science; perhaps automated systems tend to miss the last few exons and it takes a human in a wet lab to find the longer transcripts?
- Larger genes are 'less important': older, more conserved genes tend to be smaller [ref needed]. Mutations in newer, larger genes may be more likely to have no effect or to give rise to a new phenotype (like monogenic disease), while mutations in older, presumably more important genes might be fatal at a very early stage.
- Correlation with some other feature: larger gene sizes are correlated (to different extents) with things like larger numbers of exons, longer 3' and 5' UTRs and expression patterns [3]. Could it be, for example, that monogenic diseases tend to affect one particular area rather than being systemic? If so, maybe the disease gene set is larger because larger genes tend to be more tissue specific.
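The third explanation depends on how strongly gene length tracks other features. One way to put a number on that is a rank correlation, which does not assume a linear relationship. The sketch below is a plain-Python Spearman correlation; the choice of Spearman and the function names are my assumptions for illustration, not something taken from [3].

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For example, `spearman(gene_lengths, exon_counts)` would give a value between -1 and 1, where `gene_lengths` and `exon_counts` are hypothetical per-gene lists in the same gene order.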
Our approach
Let's start off by revisiting the data from [1] and making sure that the gene size / disease correlation still holds up. We can also throw in a few more features to look at (back in 2005 it was difficult to get normalized expression data for the control set), then search the literature for any other theories or related findings.
After that we can test a few possible explanations.
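As a rough idea of what "still holds up" could mean in practice, here is a sketch of a one-sided permutation test on the difference in median length between the two sets. The test choice, the function name and the parameters are assumptions for illustration; this is not the analysis used in [1].

```python
import random
from statistics import median


def median_diff_permutation_test(disease_lengths, control_lengths,
                                 n_permutations=10000, seed=0):
    """One-sided permutation test for median(disease) - median(control).

    Returns (observed_difference, p_value). The p-value estimates how often
    a random relabelling of genes gives a difference at least as large as
    the observed one.
    """
    rng = random.Random(seed)
    observed = median(disease_lengths) - median(control_lengths)
    pooled = list(disease_lengths) + list(control_lengths)
    n_disease = len(disease_lengths)

    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = median(pooled[:n_disease]) - median(pooled[n_disease:])
        if diff >= observed:
            hits += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (hits + 1) / (n_permutations + 1)
    return observed, p_value
```

A permutation test is used here because gene lengths are heavily skewed, so a test that makes no assumption about the shape of the distribution seems like a safer starting point than a t-test.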
[1] Speeding disease gene discovery by sequence based candidate prioritization
[2] Human disease genes: patterns and predictions
[3] Elevated rates of protein secretion, evolution, and disease among tissue-specific genes
