Flags and Lollipops

Saturday, August 20, 2005

Brief Blogging Hiatus

Just Married

Well, by the time you read this, anyway. Thanks for the kind comments & email and see you in a few weeks!


Thursday, August 18, 2005

Reusable Code

One of the things that I like about working in bioinformatics is that a lot of the software I write is situational - that is to say it's suited for one particular purpose. It gets written to do a job (typically some sort of analysis or data collation) and then discarded. Sure, there are a couple of major scripts or little applications that I've written that I use fairly frequently but most of my coding time is spent on throwaway scripts. It's the same for many of my colleagues.

In Computer Science classes we were always taught that this is not a good thing (that was in the mid-nineties - nowadays you'd hope that other methodologies get a look in too).

Personally, I think throwaway code is great - I dislike documenting code in detail, I selectively ignore object orientated design principles wherever possible and UML gives me the heebie-jeebies. Just because you don't put in the extra effort to make sure that an outsider can come in and reuse all of your code doesn't mean that your program doesn't work just as well, after all - in fact, you have more time to find and fix bugs.

So we've got a Masters student (with a life sciences background) working on a project with us over the summer. The MSc in bioinformatics here is taught by the Computer Science department and his head has been filled with crazy J2EE-talk. He's doing everything by the book; drawing class diagrams, writing test harnesses - he even set up his own little local CVS repository. I should point out that I think this is a good thing. I mean, he's being graded by CS professors, or at least the coding part of his project is. If I were him I'd stick to what had been recommended to me in the lectures, too. It doesn't really bear any relation to the kind of code he'll be writing once he's actually out in the field, though. I just hope he doesn't end up believing that the only good code is reusable code.

Don't get me wrong - there's certainly a time and place for sticking to your object orientated, reusable code module guns. Bioperl is a good example of that. If you've spent some time implementing a tricksy algorithm it's worth setting things up so that you can share it with other people. If you're going to be distributing a script or program make sure that it's high quality, readable code.

To be brutally honest, though, does "everyday" code really ever get reused? What's the ratio of extra effort expended to work saved later on?

For much of bioinformatics I think that disposable software is perfectly acceptable. Part of the reason for this is that the focus isn't usually on producing fully-fledged application software for use by others but on processing data (or providing tools to process data) in the short term. What's important isn't the knowledge encapsulated in the code but the knowledge created and then published as a paper. It's horses for courses, I guess; we should be able to accept that no single software methodology or mindset necessarily suits all situations.

Or maybe I'm just lazy.

(footnote: for more on this, check out the comments and posts at Propeller Twist and Inforbiomatica)


Avian Flu

A member of my family works in foreign aid, organising disaster relief programs and similar projects. Recently she was talking about avian flu and how the plans that are currently being drawn up by European governments assume that bird flu will definitely cross over to humans - sooner rather than later. There have been headlines in the last week about outbreaks in birds as far west as the Urals and as far south as the Philippines.

Yesterday, while checking out Connotea - Nature's new "Del.icio.us for references" software - I noticed that the top three tags there are "avian flu", "H5N1" (one of the strains of flu people are currently worried about - the H and N numbers refer to the types of protein on the surface of the flu virus) and "pandemic". Great. Assuming Connotea is used mostly by those in the know - well, people working in science, anyway - presumably this means that we should all be worried.

How difficult is it to apply modern genetic science to a particular strain of flu and work out how to disable it? Have matters improved since, say, SARS?

It seems that in this case, the problem isn't really with sequencing the virus strain and developing a vaccine but with more prosaic issues like manufacturing the vast number of vaccines required while still being able to control the seasonal variations of flu that cause an estimated quarter of a million deaths worldwide each year.

A good general introduction to the biology of avian flu can be found at 2can, the EBI's bioinformatics educational resource. Snowdeal has some good links and clippings on avian flu and how bioinformatics and geographic information systems (GIS) are being used in conjunction with one another to help epidemiologists track outbreaks and strains, too.


Tuesday, August 16, 2005

Digital E.Coli

OK, judging by the number of people who've bookmarked the URL on del.icio.us I'm behind on picking up on this. My excuse is that systems biology - the modelling physical processes kind of systems biology, anyway - isn't really my thing.

One of the reasons for that is, well, I find it boring (I'm sure that lots of systems biologists find what I do boring, too, so hey). At least, I did find it boring: then I saw AgentCell, modestly subtitled "digital e. coli". Emonet et al. haven't created life in a virtual test tube quite yet but what they do have is a neato simulation of bacterial chemotaxis.

A quick aside for the bioinformaticians leaning compsci-wards (thanks to Wikipedia):
Chemotaxis is the phenomenon in which bodily cells, bacteria, and other single-celled or multicellular organisms direct their movements according to certain chemicals in their environment. This is important for bacteria to find food (for example, glucose) by swimming towards the highest concentration of food molecules, or to flee from poisons (for example, phenol).
The in-silico assays run on populations of these agents reproduce experimental data obtained from real-life bacteria.

I think that what makes AgentCell more impressive than similar simulations to an outsider like myself is, sadly, the nice AVIs. That and the (relatively) easily understandable pathways involved. Let this be a lesson in presentation to all systems biologists.

AgentCell's source code was released yesterday, if you're interested in probing a little deeper. The relevant paper is in OUP Bioinformatics (subscription required), here.

Addendum: I've just read the press release from the University of Chicago. Let AgentCell also be a lesson in fantabulous hyperbole to all systems biologists:
The simulation, called AgentCell, has possible applications in cancer research, drug development and combating bioterrorism.
Combating bioterrorism?! Whatever you've got to do to get the funding, I guess....


Monday, August 15, 2005

Labelling validity of results

Brief non-bioinformatics interlude: my rate of posting will gradually wind down over the course of the next week - this is mainly because I'm getting married at the weekend (hurray!) to my beautiful girlfriend and things are hectic. Normal service will resume in two weeks...

Anyway, Spitshine over at A Bioinformatics Blog put up a post that made me chuckle this morning; Quantifying the margin of error in high-throughput data interpretation. The idea is to help readers work out the validity of claims made in a paper by labelling them with short acronyms:
[I5] Inferred from Perl script. 5 lines of Perl can't be wrong.
[REV] Stinking reviewers didn't like our numbers. Have to put them into the acknowledgments.
[IA>] As high as we could get it to meet your expectations
...
Not that those are the three I'd end up using all of the time or anything. Go check it out...


Thursday, August 11, 2005

Getting a handle on references

So every scientific journal has its own style of referencing the past papers, books and resources that you want to cite in your manuscript. I suspect that part of the reason for this is so that if you want to resubmit your paper to a different journal after it has been turned down because the reviewers were all ignorant (for what other reason could there be?) you are forced to jump through hoops reformatting everything. This ensures that you know your place as the editor's bitch.

Software like EndNote and Reference Manager claims to make the process easier but I find that, inevitably, the output styles that ship with citation software include templates for journals like Contemporary Accounting Research and Advances in Polymer Technology but not Bioinformatics or any of the BMC titles (to be fair, I just checked and EndNote does actually carry these styles now. Not so six months ago, though).

Anyway, the two main ways of referencing works in text (at least in biomedical journals) look like so:
  • Harvard style - "thus we can conclude that protein interaction networks are scale-free (Stew, 2003)"
  • Vancouver style - "as evidenced in our previous analyses of overrepresented GO terms in microarray experiments (2)"
There are also myriad formatting issues in the actual references section - what gets put in bold, do the names of the authors come before the name of the paper etc.: you might not think that this is all that important, but sometimes the backend systems at publishers rely on you having properly formatted your references to automatically fetch crossref links and so on for the electronic version of your paper.

Personally, I find the Harvard style (citing a reference by including the surname of the primary author and the year that the paper came out) infuriating. It's the style of choice for OUP's Bioinformatics. It's not so bad when everything is hyperlinked on screen, but if you print out a long paper with lots of references you then have to waste time trying to find the actual citation at the back of your huge sheaf of paper. It's also quite intrusive, especially when more than one source is cited mid-sentence. The flow of the text gets broken by this big bracketed list of frankly meaningless names and numbers.

It's high time we all switched to the Vancouver style, I say. Also, there are two further developments that could make life easier.

The first: little explanations below selected citations in the references section. Sometimes you see this in Nature Reviews, not sure if it has cropped up anywhere else:

i.e.
31. Kanduri, C. et al. Functional association of CTCF with the insulator upstream of the H19 gene is parent of origin-specific and methylation-sensitive. Curr. Biol. 10, 853−856 (2000).
This paper shows that CTCF binding is methylation sensitive and provides important mechanistic insights into how cells distinguish between maternal and paternal alleles.
(..)
36. Dean, W. et al. (..) 13734−13738 (2001)
37. De Baun, M. R et al. (..) 156−160 (2003)
References 34−37 show that in vitro culturing of embryos can lead to epigenetic defects in animals and that this might also have a role in humans.
A little bit of commentary goes a long way, making it much easier to cull references from papers. Think back to the last time that you were reading about a new aspect or technology that you weren't familiar with and remember how difficult it was to pick out the seminal references from which to learn more...

Secondly, I don't see why references can't be grouped by the section of the manuscript that first mentions them; in theory Vancouver style references are already in the order of their appearance. Simply throw in "Background", "Methods" and "Discussion" headers into the references section and immediately we can see, for example, which papers are most closely related to the one being read (cited in the background section) and which are more distantly related (cited in the discussion).
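
To show how little work the grouping would be, here's a rough sketch in Python. It assumes the manuscript can be boiled down to a list of (section, reference) pairs in reading order; the function name and the reference labels are made up for illustration and aren't taken from any real citation tool.

```python
def group_references(citations):
    """Number references Vancouver-style and group them by first-mention section.

    citations: iterable of (section, ref_key) pairs in reading order.
    Returns a list of (section, [(number, ref_key), ...]) tuples.
    """
    numbers = {}  # ref_key -> Vancouver number (order of first appearance)
    groups = {}   # section -> references first cited there (dicts keep insertion order)
    for section, key in citations:
        if key in numbers:
            continue  # already numbered in an earlier section
        numbers[key] = len(numbers) + 1
        groups.setdefault(section, []).append((numbers[key], key))
    return list(groups.items())


# Hypothetical manuscript: labels are placeholders, not real citation keys.
citations = [
    ("Background", "Siepel2005"),
    ("Background", "Kanduri2000"),
    ("Methods", "Siepel2005"),
    ("Methods", "Dean2001"),
    ("Discussion", "DeBaun2003"),
    ("Discussion", "Kanduri2000"),
]

grouped = group_references(citations)
for section, refs in grouped:
    print(section)
    for number, key in refs:
        print(f"  {number}. {key}")
```

Because the numbering already follows order of appearance, the section headers drop in without renumbering anything; a reference cited again later simply stays under the section that mentioned it first.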



Wednesday, August 10, 2005

Links

Was cleaning out my bookmarks and found a couple of links that you might find interesting:
Standard disclaimers about this author not necessarily agreeing with the contents of all those links apply. Happy reading...


Evolutionarily conserved elements

Genome Research carries a paper this month by Siepel et al suggesting once more that about 5% (3-8%, actually) of the human genome is highly conserved throughout evolution - that is to say, at least 5% of the genome has an important enough function for it to be pretty much the same in humans, mice, rats, chicken and Fugu. That fits with studies done just with humans, mice and rats which came up with roughly the same percentage.

The thing is that the human genome is big. Really big. 3,400 million base pairs big. Only 1.5% of that is made up of genes that code for proteins and the amount of material in the difference between 5% and 1.5% is huge, even though it might not seem like it at first. Think about it this way: in the same amount of space you could fit in almost the entire genome of, say, the fruit fly.
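
For anyone who wants to check the sums, here's a quick back-of-the-envelope calculation in Python using the figures quoted above. The only outside number is the fruit fly comparison: its sequenced euchromatic genome is usually quoted at roughly 120 million base pairs, which is an approximate figure brought in for illustration rather than one taken from the paper.

```python
# Figures quoted in the post; percentages are held per thousand to avoid
# floating point rounding.
GENOME_BP = 3_400_000_000   # human genome size used in the post
CONSERVED_PER_MILLE = 50    # 5% highly conserved
CODING_PER_MILLE = 15       # 1.5% protein coding


def share_in_bp(per_mille, genome_bp=GENOME_BP):
    """Return how many base pairs a per-thousand share of the genome covers."""
    return genome_bp * per_mille // 1000


conserved_bp = share_in_bp(CONSERVED_PER_MILLE)
coding_bp = share_in_bp(CODING_PER_MILLE)
noncoding_conserved_bp = conserved_bp - coding_bp

print(f"conserved:            {conserved_bp:,} bp")
print(f"protein coding:       {coding_bp:,} bp")
print(f"conserved, noncoding: {noncoding_conserved_bp:,} bp")
```

That works out at 119 million base pairs for the 3.5% gap, which is where the fruit fly comparison comes from.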

So what's the other functionally important 3.5% of the genome? Nobody really knows. The snappy name for it is "dark matter" - no reason why physics should get to keep all those cool names for mysterious expanses of the unknown - and while there are a lot of good ideas out there, mostly all that non-junk DNA just serves to remind us how much we still don't know about our genetic material.

One interesting thing about the Siepel et al study is that they've tried to break down where these evolutionarily conserved regions occur. As you might expect, many are in the exons of protein coding genes, signposting genes that have similar ancestry and function in most of the species examined. As the percentages from the previous paragraphs indicate, however, there are at least twice as many conserved regions outside of those protein coding genes.

Repeats, which make up a large proportion of the genome, tended to contain a substantially reduced number of conserved regions although interestingly some ancestral repeats (repeats inserted before different species "split" off from their common ancestor) appear to have gained critical functions and are highly conserved; it has been suggested that maybe some of these are the functions that help to differentiate mammals from ancestral vertebrates.

Other conserved regions seem to confirm the regulatory roles played by certain sequence features: tellingly, the untranslated regions (UTRs) sometimes found on the ends of gene sequences which are thought to play a part in where, how often and for how long the gene is expressed were enriched significantly in highly conserved bases.

The possibilities for using the sort of data Siepel et al have generated are fantastic - I'd love to see more research along these lines. With the ENCODE project maturing we should be able to shed more light on what dark matter really is and there's even the possibility that we'll be able to create a set of controls large enough to train and test machine learning algorithms that finally get regulatory region prediction success rates up to reliable levels. That's maybe a whole other posting for another day.

I should point out that this certainly isn't the first paper to look at genomic alignments. With each one written, though, the field seems to get a little bit more sophisticated and the methodology a little more polished.

More importantly, all of the data they produced - including base by base conservation scores - is already available as a track on the UCSC genome browser which earns them a big gold star in my book.


Friday, August 05, 2005

The Edge of Computation

OK, enough opinionated posts and moaning about my poor state of health - on to an interesting link or two.

Edge is a web and print publication produced by the Edge Foundation, whose "informal membership includes of some of the most interesting minds in the world". Their February issue has the transcript of a pretty good panel discussion between Craig Venter, Ray Kurzweil and Rodney Brooks on the subject of biocomputation.

Everybody probably knows about Craig Venter already. Rodney Brooks is the director of MIT's Computer Science and Artificial Intelligence Laboratory and is probably best known for his insect robots based on the idea that complex, lifelike behaviour can arise from lots of simple programmed behaviours working together. Ray Kurzweil is another comp sci celebrity; probably more famous as an entrepreneur than an academic now, though.

It's a long read but also one that covers a lot of ground. The three participants have some interesting opinions about how biocomputing is changing the way we look at the natural world; Kurzweil in particular seems adamant that we'll be able to model everything from cells up to the human brain in relatively short order while Rodney Brooks is more careful: he believes that we're still missing the "essence of life" - an understanding of biology from first principles - without which our efforts at replicating the complexity of life will always be unsuccessful. Craig Venter talks a little bit about high-throughput genomics - as it relates to his genome mining trips on the high seas and to cheap DNA sequencing - amongst other things.

There's some Quicktime video of the talk available too.


Dang, I'm moving

Salaries in the field vary widely depending on education level and experience. Hughey says that annual salaries can range from $60,000 at the entry level to more than $100,000 for Ph.D.s. A recent salary survey [..] reported that the average salary in 2003 for life scientists whose primary area of specialization is bioinformatics was $75,845.
(Text taken from this article)

Entry level salaries of $60,000 (~ £35,000) for bioinformaticians? I'm definitely in the wrong country. Entry level here in the UK would be about two thirds of that and you'd be lucky to push $100,000 a year with a PhD as a team leader.


Some tips I wish I'd been told

I started working in bioinformatics without much knowledge of biology or genetics; my background is in vanilla computer science. I knew that I wanted to get into bioinformatics - I'd already dabbled in some relevant open source projects - but my knowledge of the field as a whole was pretty much restricted to what I'd gotten out of a copy of Developing Bioinformatics Computer Skills (which unfortunately is aimed more at bench biologists uncomfortable with computers rather than programmers uncomfortable with bench biology). On top of that I entered academia from industry.

Anyway, some things I wish that I'd been told when I first started:
  • You don't need a qualification in the life sciences to work in bioinformatics : it takes the same amount of time for a biologist to learn the relevant computer science skills as it does a computer scientist to learn the relevant biology.
  • Don't expect perfect solutions : I reckon that the sweet spot for accuracy in bioinformatics is 60-70%. Protein structure predictions work for 60-70% of genes. 60-70% of regulatory regions can be detected with the more recent methods. 60-70% of gene names can be successfully culled from large sets of Pubmed abstracts. Biology is complex. Current knowledge is far from perfect. Don't get into bioinformatics if you like clean, elegant solutions.
  • Learn Perl : You can try and get away with just Java or C but I assure you that at some point you're going to have to embrace Perl. 'Twas the language of the original bioinformaticians and thus shall remain ever so.
  • Stay well informed : As in computer science, you have to keep your skills current to survive. Unfortunately you also need to keep up to date with current scientific thinking on top of that. Sign up to the RSS feeds or table of contents alerts from the big bioinformatics journals (some are listed on the sidebar to the right).
  • Offer your services : your job probably involves helping out bench biologists anyway, but be on the lookout for ways that informatics could help the work going on in your lab; sometimes you'd be surprised. What might take you a couple of minutes with a simple script could be taking an unfortunate RA days (a real life example: checking a list of a hundred or so SNPs to look for those in conserved regions, running them through SIFT, etc.).
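
To give a flavour of the "couple of minutes with a simple script" in that last tip, here's a minimal sketch in Python of the conserved-region check. It assumes the SNPs and the conserved regions have already been pulled into plain dictionaries, that region coordinates are inclusive and that regions on a chromosome don't overlap; every name and coordinate below is invented for illustration.

```python
from bisect import bisect_right


def snps_in_conserved(snps, regions):
    """Return the names of SNPs that fall inside a conserved region.

    snps: dict of name -> (chromosome, position)
    regions: dict of chromosome -> list of (start, end), inclusive and
             non-overlapping (an assumption of this sketch)
    """
    # Sort each chromosome's regions once so every lookup is a binary search.
    sorted_regions = {chrom: sorted(spans) for chrom, spans in regions.items()}
    starts = {chrom: [s for s, _ in spans] for chrom, spans in sorted_regions.items()}

    hits = []
    for name, (chrom, pos) in snps.items():
        if chrom not in sorted_regions:
            continue  # no conserved regions known for this chromosome
        # Index of the last region starting at or before the SNP position.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and sorted_regions[chrom][i][1] >= pos:
            hits.append(name)
    return hits


# Made-up example data.
regions = {"chr1": [(500, 650), (100, 200)], "chr2": [(10, 20)]}
snps = {
    "snp_a": ("chr1", 150),
    "snp_b": ("chr1", 300),
    "snp_c": ("chr2", 20),
    "snp_d": ("chr3", 5),
}

conserved_hits = snps_in_conserved(snps, regions)
print(conserved_hits)
```

The hits could then be passed on to whatever comes next (SIFT, in the real life example), which is the part that would still need doing by hand or with a separate tool.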


Ouch

I'm writing this from home. I'm confined here by mysterious muscle pains down my thigh.

You may think that working in bioinformatics means that you're safe from lab health and safety concerns like pipette RSI, being blinded by exploding liquid nitrogen etc. and that's true - but don't underestimate the cost of not taking a walk around every hour and a half or doing some desk exercises like you're supposed to.

I find it difficult to tear myself away from coding projects once I've gotten started - I like to just put my head down and work away until it's time to go for lunch or to go home. Unfortunately I'm now paying the price for that attitude...

Top tip for the day: get up from your workstations and take a brisk walk.


Wednesday, August 03, 2005

Distributed computing

I've been looking into distributed services in bioinformatics (sub required) again recently. A lot of research in this area seems to be Grid related.

Like most things in computer science the Grid is an old idea dressed up in spangly pants. It's not rocket science to have a client ask a server to do some computation and then return the result. You wouldn't know it to look at the descriptions of most Grid related projects, though - there's so much guff surrounding the central idea. Here in the UK, at least, there's a great deal of money available for Grid projects, so perhaps people feel that they need to justify their grants by inventing new acronyms and giving older technologies new, swisher names.

(I'm not saying that I think the Grid is a bad idea, because it's definitely not. It's just that it's sometimes being wrapped in needless complexity and presented as some sort of magical panacea for all our computational problems.)

Anyway, you can already do various NCBI searches with REST (i.e. sending variables on the URL line). The SOAP::Lite Perl library lets you access bioinformatics web services quickly and easily - not that there are all that many out there, but hey - all of which is Grid-lite, I guess.
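
The REST idea is simple enough to sketch in a few lines. Here's a minimal Python example that only builds the query URL and doesn't send anything over the network. It assumes the present-day NCBI E-utilities address and the esearch parameters db, term and retmax; the function name esearch_url is made up for illustration.

```python
from urllib.parse import urlencode

# Present-day E-utilities base address (an assumption of this sketch).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an esearch query: all of the variables travel on the URL line."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = esearch_url("pubmed", "avian flu H5N1", retmax=5)
print(url)
```

Fetching the result is then a plain HTTP GET (for instance with urllib.request.urlopen), with no service registry or client library involved.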

Projects like BioMOBY take things one step further, introducing centralized lists of services, formally defining inputs and outputs and so on. I don't know if anybody reading this uses it frequently, or knows of any projects where it's been used successfully; as impressed as I am with the work that has gone into it, BioMOBY never grabbed me simply because it took as long to sort out all of the kinks and overhead involved as it did to obtain, parse and analyse the data I wanted to look at in the old fashioned BioPerl way. Maybe that'll change in the future, as APIs get slimmed down, network negotiations get faster and more services become available.

In any case, I'm still unsure of why the Grid is a hot topic. For large scale projects and pipelines, sure, the Grid could come in useful (check out SETI@Home or that United Devices protein folding app) but are the changes it might make to the majority of bioinformatics work really all that revolutionary? If not, why all the funding?


Tuesday, August 02, 2005

Naming conventions

Yeah, so it's unrelated to bioinformatics, but I thought that it was cool (well, maybe not cool, but interesting) - that new planet, 2003 UB313? The people who found it want to call it Xena. Yes, after that Xena. Screw Sedna or whatever poncy name is being mooted by other people.

At least it's memorable. I vote that we ditch numerical identifiers for unique, whimsical slogans wherever possible in human genetics. There's no real reason to restrict database identifiers to twelve bytes or whatever any more. Space is cheap and datapipes wide. Why not expand the dbSNP or PubMed identifier fields to 128 characters and use unique, auto-generated sentences instead of numbers? On the rare occasions where I have to remember identifiers offhand I'd remember the "clerks love monkeys snp" or "sausage rolls make you fat paper" far more easily than RSxxxxxx.

Of course, scientific papers might not get through email spam filters anymore, but it's a small price to pay, surely?
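
For the curious, the auto-generated sentence idea is easy to prototype. The sketch below, in Python, treats a numeric identifier as a number written in base N, where N is the size of a word list, so every number maps to exactly one phrase and back again. The eight-word vocabulary and both function names are invented for illustration; a real scheme would need a far bigger word list.

```python
# Toy vocabulary (hypothetical); each word stands for one base-8 digit.
WORDS = ["clerks", "love", "monkeys", "sausage", "rolls", "make", "you", "fat"]


def id_to_phrase(number, min_words=3):
    """Turn a non-negative integer into a phrase, padded to min_words words."""
    if number < 0:
        raise ValueError("identifiers must be non-negative")
    digits = []
    while number:
        number, digit = divmod(number, len(WORDS))
        digits.append(digit)
    while len(digits) < min_words:
        digits.append(0)  # pad short identifiers with the zero word
    return " ".join(WORDS[d] for d in reversed(digits))


def phrase_to_id(phrase):
    """Recover the integer from a phrase produced by id_to_phrase."""
    number = 0
    for word in phrase.split():
        number = number * len(WORDS) + WORDS.index(word)
    return number


print(id_to_phrase(10))     # clerks love monkeys
print(id_to_phrase(14711))  # sausage rolls make you fat
```

Since the mapping is reversible, the database could keep its numbers internally and only show the phrases to forgetful humans.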


Reproducible research

I was going to post something about Robert Gentleman's "reproducible research paper" later on this week but Pedro Beltrão at nodalpoint has picked up on it too (like me, via the Faculty of 1000 on BioMedCentral).

It's an interesting idea - essentially, mixing code in with your results to allow people to try things for themselves - see the nodalpoint writeup for more.


Monday, August 01, 2005

Restricted Access & the HGMD

I've used the Human Gene Mutation Database as a data source fairly frequently. In case you haven't come across it before, it does pretty much what it says on the tin - it's a database of various (disease linked) mutations grouped by gene. If you wanted to get a set of disease causing SNPs or a list of translocation breakpoints that happen within genes, for example, it'd be great.

The drawback is that there's no easy way to get at the data. Visiting the website, your only option is to search by gene; you'll then get a list of mutations that the gene contains. There's no form of advanced search and no way to bulk download the contents of the database (via a condoned channel; shadily, there's always wget configured with a time delay).

This is obviously something that the authors have considered. However, in their paper in NAR they mention that:
Since HGMD is partly dependent upon industrial funding and involves considerable editorial work over and above mere literature screening (e.g. to ensure the consistency of nucleotide sequence information, amino acid residue numbering and gene symbol usage), unsolved copyright problems have so far precluded HGMD from being downloadable in its entirety.
It disturbs me slightly that this sort of thing is an issue. I think that it's because, as opposed to lab based genetics, bioinformatics resources are usually free: programming languages (Java, Perl), libraries (Bio*, NCBI's API, Seqhound) and data (Ensembl, PubMed abstracts, GNF expression data...). Free is a tricky concept nowadays, of course, but I mean in the sense that they are usually free to obtain and to use in an academic environment.

Just to be clear, I'm not disparaging the work of the people involved in the HGMD, just the politics behind some of their policies. The fact remains that the HGMD is a good database. It has the potential to be even better, though.

Why not release copyright on this kind of data, or allow researchers to use the relevant information after signing a release to ensure that they stick to your terms and conditions? Restricting access in this way (especially without explanation, unless you've read the relevant part of the paper) surely just annoys scientists. There's no corporate peer pressure anymore; even Celera has given up trying to hoard genomic data that goes out of date by the time you've worked out how to charge for it.

Let open access work for you. The mutations in HGMD are often culled from literature and relating them to reference sequences is remarkably difficult. An internal database identifier and "Asn351 to Asp" is only great if people know which transcript is being talked about. Make the first condition of using HGMD data that any derived analyses be made publicly available too. Presumably the first thing that some people will do is take a Perl script and dbSNP and start mapping. Start including annotation derived from HGMD by places like SNPs3D.

There's a note of hope in the next paragraph of the paper in NAR.
Once the closer cooperation with publically funded bioinformatics institutions currently envisaged has been put in place, unrestricted access to the database will become possible.
Publically funded bioinformatics institutions? In the UK?

Back to OMIM and dbSNP it is.


