Flags and Lollipops

Wednesday, January 18, 2006

Lights, CAMERA, Venter!

I'm of two minds about Craig Venter. Undoubtedly he is an egotistical glory hound who, at one point, could have seriously compromised the public human genome project (would computational genomics research today be the same if we all had to go through Celera? Probably not.... I'm thinking subscription-based database access... I'm thinking HGMD... I'm thinking the amount of hair pulling and fruitless mouse clicking that was involved the last time I tried to get some information about a set of Celera SNPs off the web).

On the other hand, the man has cojones. He also has a fondness for high-throughput data collection that as a bioinformatician I have to admire.

Anyway, Venter has been in the news again recently. Bio-IT World is carrying a story about Google's involvement with Venter (first seen in the Sunday Times in November last year - Snowdeal picked up on it back then). There's also a story making the rounds about a new collaboration between the Venter Institute and the UC San Diego division of the California Institute for Telecommunications and Information Technology (Calit2).

As you may know, right now Craig Venter is circumnavigating the globe on his high-tech superyacht Sorcerer II, currently sitting off the coast of Madagascar. In between BBQs on the poop deck and having bikinied lovelies massage suncream onto his sensitive pate, he collects litres of seawater, strains all of the living organisms out of it and then sends the resulting crates of freeze-dried micro-organisms back to the States to get shotgun sequenced. It's this project that Calit2 is interested in.

As Sorcerer II is collecting vast amounts of information:
"Scientists aboard the vessel are identifying about 40,000 new species at every 200-mile stop along their route, [Venter] said"
it's anticipated that scientists will need a fair amount of computing power to store, annotate and analyse all of the sequence data that is going to be generated.

The proposed solution is to set up a Grid on top of the National LambdaRail (a high-speed academic network that runs over fiber optic lines). Calit2 is calling this a "Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis", or CAMERA for short. Calit2 and the San Diego Supercomputer Center will provide the hardware while the Venter Institute will contribute sequences and "community developed genome analysis software" to help researchers make sense of the new data.

The infrastructure involved isn't trivial:
"the CAMERA complex will have a thousand processors of dedicated local cluster computing and several hundred terabytes of replicated data storage."
Fair enough.

The press release makes for fun reading. Bombast from the Venter Institute, the natural predilection for tortured acronyms shared by all Grid researchers and the love PR officers have for making science sound complex have all come together, and the whole project is swaddled in fantastically meaningless jargon. For the record, it's not a Grid, it's an environmental metagenomics data storage and computational complex run over the OptIPuter high-performance 'collaboratory'.

Oh, and it'll help cure cancer.
"The new resource will greatly enhance [UCSD's] health science researchers' ability to advance the development of new drugs and therapies from the ocean's resources to combat cancer and neurodegenerative and other diseases."

8 Comments:

At January 18, 2006 2:52 PM, Blogger e3 said...

i just thought i'd pop in and say that i thoroughly enjoyed this post - i too am of two minds about venter.

 
At January 18, 2006 3:22 PM, Blogger Pedro Beltrão said...

Thanks for the laugh :)
People like Venter, Barabasi, Aubrey de Grey, and others have this ability to connect to society and project their work. This is good for science because science and society should be more connected but it usually implies some costs.

Anyway, imagine the amount of raw data coming from this project. We can start dreaming up ways to analyse this.

 
At January 19, 2006 1:29 AM, Anonymous Neil said...

I think we're all in two minds about JCV. He is a good PR man - I work with people who collaborate with him and he's given some great talks to the microbiology students at our place. Working as I do in a department where every obstacle is put in the way of people who try to innovate, I also admire his "just do it" ethos.

On the other hand, I have serious concerns about the validity and usefulness of environmental sequencing. It generates an awful lot of short contigs and a very few large ones and I'm not convinced at all that the large ones are accurate, given the current state of shotgun assembler software. Also, far too many (micro)biologists are confusing "a lot of raw data" with "useful data for understanding biological systems". So there's a lot of genes out there - big deal. It has to be more than a massive stamp collecting exercise.

 
At January 19, 2006 3:06 AM, Anonymous Neil said...

Whilst we're on the subject of environmental sequencing - something troubles me about the Sargasso Sea sequencing project. Come with me on a short bioinformatics journey.

First, grab yourself a bacterial 23S rRNA sequence. I went with E. coli - you can get its 23S in FASTA format from this link.

Now, head over to the NCBI nucleotide-nucleotide BLAST page. Paste in your 23S query but for the database, make sure that you select "env_nt". Submit, sit back and format when ready.

This gave me 240 hits. They look good - high identities, low e-values, we are definitely hitting 23S rRNA genes. Scroll down and look at the top hit. Its GI number in my case is 44249358 - all of the Sargasso clones are named with the prefix IBEA_CTG. Scroll down further and look at the alignment. The sequence coordinates of this hit are 1763 - 2907.

Now, open up the link for the GenBank entry of GI 44249358. You'll see several features: genes for a conserved hypothetical protein, several small hypothetical proteins and a 16S rRNA gene. Look at the coordinates of the first few short hypothetical proteins:

1786-2013
1911-2345
2329-2457
2423-2545
2592-2741
2656-2802

Notice something wrong? Those coordinates lie within the region that we've identified as a 23S rRNA gene. They are not "hypothetical proteins" - they are translations of the 23S rRNA gene that just happen to resemble short ORFs. The 16S rRNA gene on the same contig is another giveaway because most bacterial rRNA operons contain the 16S, 23S and 5S genes together (typically in the order 16S-23S-5S).
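For anyone who would rather check the overlap than eyeball it, here is a minimal Python sketch of the comparison. The hit coordinates and ORF coordinates are the ones quoted above; the function and variable names are made up for illustration and are not part of any NCBI tool, and the check only tests simple containment on one strand.

```python
# Coordinates copied from the BLAST hit and GenBank entry discussed above.
# All names here are hypothetical, chosen only for this illustration.
RRNA_HIT = (1763, 2907)  # 23S rRNA alignment coordinates on the contig

ORFS = [
    (1786, 2013),
    (1911, 2345),
    (2329, 2457),
    (2423, 2545),
    (2592, 2741),
    (2656, 2802),
]


def contained_in(feature, region):
    """Return True if feature (start, end) lies entirely inside region."""
    start, end = feature
    region_start, region_end = region
    return region_start <= start and end <= region_end


def suspicious_orfs(orfs, region):
    """Return the ORFs that fall wholly within the given region."""
    return [orf for orf in orfs if contained_in(orf, region)]


if __name__ == "__main__":
    flagged = suspicious_orfs(ORFS, RRNA_HIT)
    print(f"{len(flagged)} of {len(ORFS)} ORFs lie inside the 23S hit")
```

Run against these numbers, every one of the six short "hypothetical proteins" sits inside the 23S alignment. The same containment test could be applied to other contigs once you have the hit and feature coordinates in hand.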

You see this time and time again with the Sargasso contigs. They have been annotated for 16S rRNA genes, but not 23S, and many of those so-called hypothetical proteins lie in an obvious 23S rRNA region. Result: GenBank fills with more crap.

It's such a simple, basic error, I can't believe that no-one else has picked it up.

 
At January 19, 2006 9:09 AM, Blogger Kristofer said...

Yes, you never really know where he is going. But he has cool ideas and the money to make them real.

Kristofer's computational biology blog

 
At January 19, 2006 11:33 AM, Blogger Stew said...

Interesting comment about the annotation errors, thanks Neil. I suppose the responsibility ultimately rests with the submitter, i.e. IBEA, whose philosophy of quantity vs quality might differ to everybody else's, but doesn't GenBank have any guidance for this sort of thing?

 
At January 19, 2006 2:27 PM, Anonymous Neil said...

"doesn't GenBank have any guidance for this sort of thing?"

Interestingly, not really. There are levels of curation within GenBank (such as RefSeq), but its philosophy is generally one of being a passive archive with little filtering. For instance, if you and I work in the same lab, sequence the same gene and disagree on its exact sequence, there's nothing to stop both of us submitting our personal versions of the sequence to GenBank. It's this inconsistency that led to the development of databases such as the manually curated Swiss-Prot and its computationally annotated supplement, TrEMBL.

 
At January 24, 2006 5:02 PM, Blogger Sandra Porter said...

Hi Stew,

If you're not too busy writing and you want to read a really entertaining book about JCV, check out "The Genome War" by James Shreeve. I remember this time period pretty well since my sweetie was a post-doc then in the UW genome center. The author does a good job of portraying the characters that I've met.

 
