Flags and Lollipops

Wednesday, August 10, 2005

Evolutionarily conserved elements

Genome Research carries a paper this month by Siepel et al. suggesting once more that about 5% (3-8%, actually) of the human genome is highly conserved throughout evolution. That is to say, roughly 5% of the genome has an important enough function for it to be pretty much the same in humans, mice, rats, chickens and Fugu. That fits with studies done just with humans, mice and rats, which came up with roughly the same percentage.

The thing is that the human genome is big. Really big. 3,400 million base pairs big. Only 1.5% of that is sequence that actually codes for proteins, and the amount of material in the difference between 5% and 1.5% is huge, even though it might not seem like it at first. Think about it this way: that 3.5% works out to roughly 119 million base pairs, and in the same amount of space you could fit almost the entire genome of, say, the fruit fly.
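For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The genome size and the percentages are the ones quoted in this post; the function name and the trick of expressing percentages as whole numbers of tenths of a percent (to keep the arithmetic exact) are just my own choices for illustration. The fruit fly comparison is left out of the code because the number depends on what you count: the sequenced, gene-rich portion of the fly genome is usually quoted at somewhere around 120 million base pairs, which is an outside figure rather than one from the paper.

```python
# Back-of-the-envelope check of the "conserved but not coding" estimate.
# All inputs are the figures quoted in the post, not values from the paper's data.

GENOME_BP = 3_400_000_000   # human genome size used in the post

# Percentages stored as tenths of a percent so the arithmetic stays in integers.
CODING_TENTHS = 15          # 1.5% protein-coding
CONSERVED_TENTHS = 50       # 5% conserved (central estimate)
CONSERVED_LOW_TENTHS = 30   # 3% (lower bound of the quoted range)
CONSERVED_HIGH_TENTHS = 80  # 8% (upper bound of the quoted range)


def conserved_noncoding_bp(genome_bp, conserved_tenths, coding_tenths):
    """Base pairs that are conserved but not protein-coding.

    Assumes (as the post does) that all coding sequence counts as conserved.
    """
    return genome_bp * (conserved_tenths - coding_tenths) // 1000


central = conserved_noncoding_bp(GENOME_BP, CONSERVED_TENTHS, CODING_TENTHS)
low = conserved_noncoding_bp(GENOME_BP, CONSERVED_LOW_TENTHS, CODING_TENTHS)
high = conserved_noncoding_bp(GENOME_BP, CONSERVED_HIGH_TENTHS, CODING_TENTHS)

print(f"central estimate: {central / 1e6:.0f} million bp")
print(f"range: {low / 1e6:.0f} to {high / 1e6:.0f} million bp")
```

Even at the bottom of the 3-8% range that is tens of millions of conserved bases that don't code for protein, so the puzzle doesn't go away if the true figure turns out to be on the low side.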

So what's the other functionally important 3.5% of the genome? Nobody really knows. The snappy name for it is "dark matter" - no reason why physics should get to keep all those cool names for mysterious expanses of the unknown - and while there are a lot of good ideas out there, mostly all that non-junk DNA just serves to remind us how much we still don't know about our genetic material.

One interesting thing about the Siepel et al. study is that they've tried to break down where these evolutionarily conserved regions occur. As you might expect, many are in the exons of protein-coding genes, signposting genes that have similar ancestry and function in most of the species examined. As the percentages from the previous paragraphs indicate, however, there is more than twice as much conserved sequence outside of those protein-coding regions.

Repeats, which make up a large proportion of the genome, tended to contain substantially fewer conserved regions. Interestingly, though, some ancestral repeats (repeats inserted before different species "split" off from their common ancestor) appear to have gained critical functions and are highly conserved. It has been suggested that some of these functions may be among those that help to differentiate mammals from ancestral vertebrates.

Other conserved regions seem to confirm the regulatory roles played by certain sequence features. Tellingly, the untranslated regions (UTRs) found at the ends of gene sequences, which are thought to play a part in where, how often and for how long a gene is expressed, were significantly enriched in highly conserved bases.

The possibilities for using the sort of data Siepel et al. have generated are fantastic - I'd love to see more research along these lines. With the ENCODE project maturing we should be able to shed more light on what dark matter really is, and there's even the possibility that we'll be able to create a set of controls large enough to train and test machine learning algorithms that finally get regulatory region prediction success rates up to reliable levels. That's maybe a whole other posting for another day.

I should point out that this certainly isn't the first paper to look at genomic alignments. With each one written, though, the field seems to get a little bit more sophisticated and the methodology a little more polished.

More importantly, all of the data they produced - including base-by-base conservation scores - is already available as a track on the UCSC genome browser, which earns them a big gold star in my book.
