Flags and Lollipops

Saturday, December 29, 2007

Open notebook pt2 - question, theories, approach

The question

Monogenic (classically Mendelian) disorders are caused by mutations or errors in a single gene. Many of these gene -> disease mappings have been discovered and are listed in OMIM, the Online Mendelian Inheritance in Man database.

There are ~1,000 'disease' genes (genes that give rise to a particular monogenic disorder when mutated in a particular way) listed in OMIM. If you compare this set to other genes, some interesting differences become apparent [1, 2] (only two paragraphs before a reference to my own paper; this is just like a real writeup!). Check out the table below from [1]:



Median gene length (*) is particularly interesting; disease genes have a median length of 27k while the control set sits at 19k. Why?

(* the longest known transcript of each gene was used)
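For anyone who wants to follow along, here is a rough sketch of how that comparison could be computed: take the longest transcript for each gene, then compare the medians of the two sets. The function names and the toy lengths are made up for illustration and are not the data or code behind [1]; only the 27k and 19k figures come from the table.

```python
from statistics import median


def longest_transcript_lengths(transcripts):
    """Map each gene to the length of its longest transcript.

    `transcripts` is an iterable of (gene_id, transcript_length) pairs.
    """
    longest = {}
    for gene_id, length in transcripts:
        if length > longest.get(gene_id, 0):
            longest[gene_id] = length
    return longest


def median_gene_length(transcripts, gene_ids):
    """Median of the longest-transcript lengths for the genes in `gene_ids`."""
    longest = longest_transcript_lengths(transcripts)
    return median(longest[g] for g in gene_ids if g in longest)


# Toy, made-up values purely to show the mechanics (hypothetical genes).
toy_transcripts = [
    ("geneA", 12000), ("geneA", 30000),
    ("geneB", 25000),
    ("geneC", 18000), ("geneC", 9000),
    ("geneD", 20000),
]
toy_disease = {"geneA", "geneB"}
toy_control = {"geneC", "geneD"}

toy_disease_median = median_gene_length(toy_transcripts, toy_disease)
toy_control_median = median_gene_length(toy_transcripts, toy_control)

# Ratio of the medians quoted in the post (27k for disease, 19k for control).
reported_ratio = round(27 / 19, 2)
print(toy_disease_median, toy_control_median, reported_ratio)
```

Using the median rather than the mean keeps a handful of enormous genes from dominating the comparison.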

Some plausible sounding explanations


  • Study bias: genes known to be responsible for disease have by definition been studied in a lab. Gene finding is an inexact science; perhaps automated systems tend to miss the last few exons and it takes a human in a wet lab to find the longer transcripts?
  • Larger genes are 'less important': older, more conserved genes tend to be smaller [ref needed]. Mutations in newer, larger genes may be more likely to have no effect or to give rise to a new phenotype (like monogenic disease), while mutations in older, presumably more important genes might be fatal at a very early stage.
  • Correlation with some other feature: larger gene sizes are correlated (to different extents) with things like larger numbers of exons, longer 3' and 5' UTRs and expression patterns [3]. Could it be, for example, that monogenic diseases tend to affect one particular area rather than being systemic? If so, maybe the disease gene set is larger because larger genes tend to be more tissue specific.


Our approach

Let's start off by revisiting the data from [1] and making sure that the gene size / disease correlation still holds up. We can then throw in a few more features to look at (back in 2005 it was difficult to get normalized expression data for the control set) and search the literature for any other theories or related findings.

After that we can test a few possible explanations.
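One way the "does it still hold up" check might be done is a permutation test on the difference in median gene length. This is only a sketch under assumptions: the post does not commit to any particular statistical test, and the function name, the one-sided framing and the add-one correction are my own choices for illustration.

```python
import random
from statistics import median


def median_permutation_test(disease_lengths, control_lengths,
                            n_permutations=10000, seed=0):
    """One-sided permutation test: are disease genes longer (by median)?

    Returns (observed_difference, p_value). Labels are shuffled between the
    two sets to see how often a difference at least as large as the observed
    one turns up by chance.
    """
    rng = random.Random(seed)
    observed = median(disease_lengths) - median(control_lengths)
    pooled = list(disease_lengths) + list(control_lengths)
    n_disease = len(disease_lengths)

    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = median(pooled[:n_disease]) - median(pooled[n_disease:])
        if diff >= observed:
            hits += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_permutations + 1)


# Made-up lengths, clearly separated, just to exercise the function.
toy_disease = list(range(100, 110))
toy_control = list(range(0, 10))
toy_observed, toy_p = median_permutation_test(toy_disease, toy_control,
                                              n_permutations=2000)
print(toy_observed, toy_p)
```

A permutation test makes no assumption about the shape of the length distribution, which seems sensible given how skewed gene lengths are.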

[1] Speeding disease gene discovery by sequence based candidate prioritization

[2] Human disease genes: patterns and predictions

[3] Elevated rates of protein secretion, evolution, and disease among tissue-specific genes

4 Comments:

At December 29, 2007 5:43 PM, Blogger Bill Hooker said...

Does the higher degree of protein identity with mouse orthologs indicate "older/more ancestral" genes? This is contra Plausible Sounding Explanation #2; the alternative PSE might be that older genes, having been around longer, have their fingers in more, and more fundamental, pies -- so mutations in same are more likely to cause disease without the need for other mutated genes as partners in crime.

Looking at ratios, the standout is gene length (disease = 1.42*control; all the others are < 1.29). Is this difference in gene length significantly greater than the differences in coding sequence or 3'UTR? That is, are the disease genes longer than control genes to a greater degree than the disease proteins/UTRs are larger than the control proteins/UTRs? If so, does that mean anything (e.g. should we be looking at introns in disease genes)?

 
At December 29, 2007 6:52 PM, Anonymous Stew said...

Hi Bill!

Does the higher degree of protein identity with mouse orthologs indicate "older/more ancestral" genes? This is contra Plausible Sounding Explanation #2

Good point. I wonder what the identities are like with yeast or drosophila?

the alternative PSE might be that older genes, having been around longer, have their fingers in more, and more fundamental, pies -- so mutations in same are more likely to cause disease without the need for other mutated genes as partners in crime.

Definitely plausible... number of interactions with other proteins would be a most excellent thing to look at. What dataset would minimize study bias? I guess you'd need to make sure that had a strong 'from the literature' component (assuming disease genes are better studied and so have more published interactions).

If so, does that mean anything (e.g. should we be looking at introns in disease genes)?

Another good point. I wonder if anybody has looked at the kinds of mutations that cause monogenic disease (i.e. errors in the coding sequence vs. in regulatory regions)?

 
At December 29, 2007 9:34 PM, Blogger Bill Hooker said...

Oh yeah, I meant to suggest widening the homology search -- at least to other vertebrates.

In re: protein interactions, I hadn't actually thought of including that as a metric; nice one. I bet Pedro would have some good ideas on that.

 
At December 31, 2007 6:30 AM, Blogger Bill Hooker said...

One other thought -- a search for the presence of conserved domains (src homology, Myc box, etc) might be a better predictive tool than overall homology. I was thinking about the fact that short stretches of important, high homology could be masked by less-conserved regions of "looser" function. I couldn't think of a way to automate the sort of base-by-base scanning you'd do by eye, so machine-identifiable conserved domains seemed like a reasonable proxy.

 
