Evolutionarily conserved elements
The thing is that the human genome is big. Really big. 3,400 million base pairs big. Only 1.5% of that is made up of genes that code for proteins, and the gap between 5% and 1.5% is huge, even though it might not seem like it at first. Think about it this way: that 3.5% works out to roughly 119 million base pairs, enough space to fit almost the entire genome of, say, the fruit fly.
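For anyone who wants to check the back-of-the-envelope sum, here is a minimal sketch. It uses only the figures quoted in this post (3,400 million base pairs, 5% and 1.5%); the function and variable names are my own and purely illustrative.

```python
# Figures quoted in the post, not independent measurements.
GENOME_SIZE_MB = 3_400      # "3,400 million base pairs"
CONSERVED_FRACTION = 0.05   # the 5% figure
CODING_FRACTION = 0.015     # the 1.5% that codes for proteins


def noncoding_conserved_mb(genome_mb, conserved, coding):
    """Size, in millions of base pairs, of the gap between the two fractions."""
    return genome_mb * (conserved - coding)


gap_mb = noncoding_conserved_mb(GENOME_SIZE_MB, CONSERVED_FRACTION, CODING_FRACTION)
print(f"Conserved but non-coding: about {gap_mb:.0f} million base pairs")
# prints: Conserved but non-coding: about 119 million base pairs
```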
So what's the other functionally important 3.5% of the genome? Nobody really knows. The snappy name for it is "dark matter" - no reason why physics should get to keep all those cool names for mysterious expanses of the unknown - and while there are a lot of good ideas out there, mostly all that non-junk DNA just serves to remind us how much we still don't know about our genetic material.
One interesting thing about the Siepel et al. study is that they've tried to break down where these evolutionarily conserved regions occur. As you might expect, many are in the exons of protein-coding genes, signposting genes that have similar ancestry and function in most of the species examined. As the percentages from the previous paragraphs indicate, however, there are at least twice as many conserved regions outside of those protein-coding genes.
Repeats, which make up a large proportion of the genome, tended to contain substantially fewer conserved regions. Interestingly, though, some ancestral repeats (repeats inserted before different species "split" off from their common ancestor) appear to have gained critical functions and are highly conserved; it has been suggested that some of these functions may be what helps to differentiate mammals from ancestral vertebrates.
Other conserved regions seem to confirm the regulatory roles played by certain sequence features. Tellingly, the untranslated regions (UTRs) sometimes found on the ends of gene sequences, which are thought to play a part in where, how often and for how long the gene is expressed, were significantly enriched in highly conserved bases.
The possibilities for using the sort of data Siepel et al. have generated are fantastic - I'd love to see more research along these lines. With the ENCODE project maturing we should be able to shed more light on what dark matter really is, and there's even the possibility that we'll be able to create a set of controls large enough to train and test machine learning algorithms that finally get regulatory region prediction success rates up to reliable levels. That's maybe a whole other posting for another day.
I should point out that this certainly isn't the first paper to look at genomic alignments. With each one written, though, the field seems to get a little bit more sophisticated and the methodology a little more polished.
More importantly, all of the data they produced - including base-by-base conservation scores - is already available as a track on the UCSC genome browser, which earns them a big gold star in my book.
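To give a flavour of what you can do with per-base scores, here is a rough sketch that pulls out stretches of consistently high-scoring bases. It assumes the scores have been exported as wiggle fixedStep text with a step of 1. The sample values, the 0.9 threshold, the minimum run length and both function names are made up for illustration; a plain threshold is a stand-in and not the method used in the paper.

```python
def parse_fixed_step(lines):
    """Yield (chrom, position, score) tuples from wiggle fixedStep text."""
    chrom, pos, step = None, 0, 1
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith("fixedStep"):
            # Declaration line, e.g. "fixedStep chrom=chr1 start=100 step=1"
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"])
            step = int(fields.get("step", 1))
            continue
        yield chrom, pos, float(line)
        pos += step


def conserved_runs(scores, threshold=0.9, min_length=3):
    """Return (chrom, start, end) for runs of adjacent bases scoring >= threshold.

    Adjacency means consecutive positions, so this assumes step=1 data.
    """
    runs = []
    current = None  # [chrom, start, end] of the run being extended

    def flush():
        if current and current[2] - current[1] + 1 >= min_length:
            runs.append(tuple(current))

    for chrom, pos, score in scores:
        extends = current and current[0] == chrom and pos == current[2] + 1
        if score >= threshold and extends:
            current[2] = pos
        else:
            flush()
            current = [chrom, pos, pos] if score >= threshold else None
    flush()
    return runs


# Made-up scores, purely for illustration.
SAMPLE = """\
fixedStep chrom=chr1 start=100 step=1
0.10
0.95
0.97
0.99
0.20
0.92
fixedStep chrom=chr1 start=200 step=1
0.91
0.93
0.96
0.98
"""

runs = conserved_runs(parse_fixed_step(SAMPLE.splitlines()))
print(runs)  # [('chr1', 101, 103), ('chr1', 200, 203)]
```

The lone high-scoring base at position 105 is dropped because it falls short of the minimum run length; loosening `min_length` or `threshold` trades specificity for sensitivity.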
