Flags and Lollipops

Monday, September 19, 2005

The Joy of S/MARs

Even though common sense, several seminars and many references in the literature tell me it's not true, I still have to fight the tendency to think of cell nuclei as little bags of bio-jelly in which bits of DNA float around freely in nice chromosome-shaped chunks, like you see on karyotypes.

This is possibly due to my reductionist, computer scientist brain rebelling against yet another level of complexity - the geography of the cell - when surely there's enough work to be done with the -omics we have. But anyway...

Up until the 70s, nuclear architecture was a mystery. Light and early electron microscopy couldn't pick out any structures, so everybody calmly pretended they had better things to do, like making flies grow two heads. Then new cell preparation and fluorescence microscopy techniques appeared and some light was shed on eukaryotic cell nuclei.

A complex, flexible network of protein and RNA fibrils called the nuclear matrix was discovered. Amongst other things, this network serves as a scaffold to which chromosomes are attached (the relevant parts were imaginatively titled "the chromosome scaffold").

I'll skip the in-depth review of chromatin structure (people much more clever and eloquent than I could ever be do a better job elsewhere). Essentially, DNA (for most of the time) is wrapped around histone proteins, packaged as 'beads on a string'. This string is stuck to the scaffold. What I want to talk about today are those bits of DNA which make up the "sticky bits" of the string - marked as "AT-rich regions" in the diagram above.

These are the Scaffold / Matrix Attachment Regions, or S/MARs for short. You can work out where they are in the lab - slowly - and a limited number of S/MAR sequences for different organisms are available on the internet from S/MARt DB.

There's no real consensus sequence. They do tend to share some general features, though: S/MARs are between 300bp and several kb in length, tend to be AT-rich and are enriched for features like Topoisomerase II binding and cleavage sites and curved or kinked DNA. Recently researchers have expanded on the latter feature - it turns out that S/MARs also have a high potential for stress-induced duplex destabilization (SIDD). Craig Benham at UC Davis has web-based software to calculate SIDD for short sequences - incidentally, he has also produced work showing that SIDD-prone sites might be linked to regulatory potential, at least in E. coli.
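
To make the AT-richness feature a bit more concrete, here's a minimal Python sketch of a sliding-window scan. The function names, the 300 bp window, the 50 bp step and the 0.7 threshold are all assumptions picked for illustration. This is not how MAR-Wiz, SMARTest or the SIDD software work, and AT content on its own is only one weak signal among the features listed above.

```python
def at_fraction(seq):
    """Return the fraction of A/T bases in seq (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("A") + seq.count("T")) / len(seq)


def at_rich_windows(seq, window=300, step=50, threshold=0.7):
    """Yield (start, end, fraction) for each window whose AT fraction
    is at or above threshold.

    window, step and threshold are illustrative defaults, not values
    taken from any published S/MAR predictor.
    """
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        frac = at_fraction(chunk)
        if frac >= threshold:
            yield start, start + len(chunk), frac


if __name__ == "__main__":
    # Toy sequence: a 400 bp AT-only block flanked by GC-only blocks.
    toy = "GC" * 200 + "AT" * 200 + "GC" * 200
    for start, end, frac in at_rich_windows(toy):
        print(start, end, round(frac, 3))
```

Overlapping hits would normally be merged into candidate regions and then combined with the other features (SIDD, Topoisomerase II sites, curvature) before calling anything a putative S/MAR.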

Why is anybody interested in where S/MARs are anyway? Well, there's their relationship to regulatory regions, and there's faint evidence that they have something to do with where translocation breakpoints and gross deletions happen on chromosomes. There's also the relationship between structural domains (the "loops of DNA" in-between attachment regions in the diagram above) and functional domains. It's also been mooted that there's a relationship between gene expression and the contents of each structural domain; for example, an important, highly-expressed gene is perhaps the only gene on its structural domain, while other, larger domains contain groups of less important genes.

There's no shortage of interesting future experiments, but what is lacking is the data. At the moment, the three ways to identify putative S/MARs in silico are MAR-Wiz, SMARTest and Web SIDD, all of which are web-based scripts with limits on the amount of sequence that they can handle at once; limits that make genomic studies difficult (unless you're the people who wrote the software in the first place). Their workings aren't very transparent - I'm not sure if MAR-Wiz is even peer-reviewed.

Which brings me to the point of this post... if anybody out there is looking for a neat coding project, a standalone S/MAR finder that incorporates SIDD as a feature would be great (an open source one that we could all tinker with would be even better). My attempts to create such a thing myself have exploded in a puff of Greek letters, misunderstood equations and lack of time. If you doubt how useful S/MAR finding software might be, check out the number of papers that use MAR-Wiz (aka MAR-Finder).

p.s. I know about the EMBOSS one, but it's really behind the times.

3 Comments:

At September 20, 2005 12:04 PM, Anonymous Neil said...

Sounds like an interesting project and I understand your frustrations!

I think this is a good illustration of a "real world" bioinformatics problem. We're often not looking at single, simple features (like a motif), but rather a combination of factors - sequence, position/length, associated genes and so on. Another example that springs to mind is bacterial integrons.

Writing code that scans reliably for multiple, often "fuzzy" features is a real challenge - I've had many efforts fall down at the "obtaining the appropriate raw dataset" stage...

 
At September 20, 2005 1:47 PM, Blogger Stew said...

Good point about obtaining the right dataset... I perhaps glossed over that in the post!

One of the reasons that there isn't a consensus motif, or at least a tighter definition of S/MARs, is possibly just that there aren't that many experimentally defined S/MAR sequences available to work with.

 
At October 22, 2005 7:14 PM, Anonymous Anonymous said...

That would be an interesting, albeit complex, project.

Has anyone actually tested against the existing projects? I have tried using both SMARTest and MAR-Wiz with a 0% success rate. My test was to run a set of test cases using S/MARt DB samples surrounded by junk (random) DNA. In no case did either system detect statistically significant MARs. In fact the Genomatix product said it didn't detect anything in all cases. Just to be sure there weren't formatting issues, I generated significantly larger random sequences, and in that case both systems made successful (and numerous) detections from the chaos, as you would expect.

The samples I used were SM0000006, SM0000298, SM0000309, SM0000427, and SM0000552.

Shouldn't these programs be able to detect SMARTDB samples? Isn't that what their training sets are based on? Or am I just completely out to lunch here?
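
For anyone wanting to repeat this kind of check, the test construct described above (a known S/MAR sequence surrounded by random DNA) could be built roughly like this in Python. This is only a sketch under stated assumptions: the function names and flank length are hypothetical, the flanks are uniform random A/C/G/T (so about 50% GC, unlike real genomic sequence), and the insert shown is a placeholder, not an actual S/MARt DB entry.

```python
import random


def random_dna(length, rng):
    """Return a uniform random DNA string of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def embed_in_random_dna(insert, flank_length, seed=0):
    """Surround insert with random flanks of flank_length on each side.

    Returns (construct, start, end), where construct[start:end] is the
    original insert. A fixed seed keeps the construct reproducible.
    """
    rng = random.Random(seed)
    left = random_dna(flank_length, rng)
    right = random_dna(flank_length, rng)
    construct = left + insert + right
    return construct, len(left), len(left) + len(insert)


if __name__ == "__main__":
    # Placeholder insert; a real test would use a sequence downloaded
    # from the database by hand.
    placeholder = "ATATATTTTAAATATATA"
    construct, start, end = embed_in_random_dna(placeholder, 50, seed=1)
    print(len(construct), start, end)
```

Keeping the known start and end coordinates makes it possible to check whether a predictor's hits actually overlap the embedded region rather than the random flanks.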

 
