Saturday, August 20, 2005
Thursday, August 18, 2005
Reusable Code
In Computer Science classes we were always taught that this is not a good thing (that was in the mid-nineties - nowadays you'd hope that other methodologies get a look in too).
Personally, I think throwaway code is great - I dislike documenting code in detail, I selectively ignore object orientated design principles wherever possible and UML gives me the heebie-jeebies. Just because you don't put in the extra effort to make sure that an outsider can come in and reuse all of your code doesn't mean that your program doesn't work just as well, after all - in fact, you have more time to find and fix bugs.
So we've got a Masters student (with a life sciences background) working on a project with us over the summer. The MSc in bioinformatics here is taught by the Computer Science department and his head has been filled with crazy J2EE-talk. He's doing everything by the book; drawing class diagrams, writing test harnesses - he even set up his own little local CVS repository. I should point out that I think this is a good thing. I mean, he's being graded by CS professors, or at least the coding part of his project is. If I were him I'd stick to what had been recommended to me in the lectures, too. It doesn't really bear any relation to the kind of code he'll be writing once he's actually out in the field, though. I just hope he doesn't end up believing that the only good code is reusable code.
Don't get me wrong - there's certainly a time and place for sticking to your object orientated, reusable code module guns. Bioperl is a good example of that. If you've spent some time implementing a tricksy algorithm it's worth setting things up so that you can share it with other people. If you're going to be distributing a script or program make sure that it's high quality, readable code.
To be brutally honest, though, does "everyday" code really ever get reused? What's the ratio of extra effort expended to work saved later on?
For much of bioinformatics I think that disposable software is perfectly acceptable. Part of the reason for this is that the focus isn't usually on producing fully-fledged application software for use by others but on processing data (or providing tools to process data) in the short term. What's important isn't the knowledge encapsulated in the code but the knowledge created and then published as a paper. It's horses for courses, I guess; we should be able to accept that no single software methodology or mindset necessarily suits all situations.
Or maybe I'm just lazy.
(footnote: for more on this, check out the comments and posts at Propeller Twist and Inforbiomatica)
Avian Flu
Yesterday, while checking out Connotea - Nature's new "Del.icio.us for references" software - I noticed that the top three tags there are "avian flu", "H5N1" (one of the strains of flu people are currently worried about - the H and N numbers refer to the types of protein on the surface of the flu virus) and "pandemic". Great. Assuming Connotea is used mostly by those in the know - well, people working in science, anyway - presumably this means that we should all be worried.
How difficult is it to apply modern genetic science to a particular strain of flu and work out how to disable it? Have matters improved since, say, SARS?
It seems that in this case, the problem isn't really with sequencing the virus strain and developing a vaccine but with more prosaic issues like manufacturing the vast number of vaccines required while still being able to control the seasonal variations of flu that cause an estimated quarter of a million deaths worldwide each year.
A good general introduction to the biology of avian flu can be found at 2can, the EBI's bioinformatics educational resource. Snowdeal has some good links and clippings on avian flu and how bioinformatics and geographic information systems (GIS) are being used in conjunction with one another to help epidemiologists track outbreaks and strains, too.
Tuesday, August 16, 2005
Digital E.Coli
One of the reasons for that is, well, I find it boring (I'm sure that lots of systems biologists find what I do boring, too, so hey). At least, I did find it boring: then I saw AgentCell, modestly subtitled "digital e. coli". Emonet et al. haven't created life in a virtual test tube quite yet but what they do have is a neato simulation of bacterial chemotaxis.
A quick aside for the bioinformaticians leaning compsci-wards (thanks to Wikipedia):
Chemotaxis is the phenomenon in which bodily cells, bacteria, and other single-celled or multicellular organisms direct their movements according to certain chemicals in their environment. This is important for bacteria to find food (for example, glucose) by swimming towards the highest concentration of food molecules, or to flee from poisons (for example, phenol).
The assays done in silico on populations of these agents reproduce experimental data from real-life bacteria.
I think that what makes AgentCell more impressive than similar simulations to an outsider like myself is, sadly, the nice AVIs. That and the (relatively) easily understandable pathways involved. Let this be a lesson in presentation to all systems biologists.
AgentCell's source code was released yesterday, if you're interested in probing a little deeper. The relevant paper is in OUP Bioinformatics (subscription required), here.
Addendum: I've just read the press release from the University of Chicago. Let AgentCell also be a lesson in fantabulous hyperbole to all systems biologists:
The simulation, called AgentCell, has possible applications in cancer research, drug development and combating bioterrorism.
Combating bioterrorism?! Whatever you've got to do to get the funding, I guess....
Monday, August 15, 2005
Labelling validity of results
Anyway, Spitshine over at A Bioinformatics Blog put up a post that made me chuckle this morning; Quantifying the margin of error in high-throughput data interpretation. The idea is to help readers work out the validity of claims made in a paper by labelling them with short acronyms:
[I5] Inferred from Perl script. 5 lines of Perl can't be wrong.
[REV] Stinking reviewers didn't like our numbers. Have to put them into the acknowledgments.
[IA>] As high as we could get it to meet your expectations
...
Not that those are the three I'd end up using all of the time or anything. Go check it out...
Thursday, August 11, 2005
Getting a handle on references
Software like EndNote and Reference Manager claims to make the process easier but I find that, inevitably, the output styles that ship with citation software include templates for journals like Contemporary Accounting Research and Advances in Polymer Technology but not Bioinformatics or any of the BMC titles. (To be fair, I just checked and EndNote does actually carry these styles now. Not so six months ago, though.)
Anyway, the two main ways of referencing works in text (at least in biomedical journals) look like so:
- Harvard style - "thus we can conclude that protein interaction networks are scale-free (Stew, 2003)"
- Vancouver style - "as evidenced in our previous analyses of overrepresented GO terms in microarray experiments (2)"
Personally, I find the Harvard style (citing a reference by including the surname of the primary author and the year that the paper came out) infuriating. It's the style of choice for OUP's Bioinformatics. It's not so bad when everything is hyperlinked on screen, but if you print out a long paper with lots of references you then have to waste time trying to find the actual citation at the back of your huge sheaf of paper. It's also quite intrusive, especially when more than one source is cited mid-sentence. The flow of the text gets broken by this big bracketed list of frankly meaningless names and numbers.
It's high time we all switched to the Vancouver style, I say. Also, there are two further developments that could make life easier.
The first: little explanations below selected citations in the references section. Sometimes you see this in Nature Reviews, not sure if it has cropped up anywhere else:
i.e.
31. Kanduri, C. et al. Functional association of CTCF with the insulator upstream of the H19 gene is parent of origin-specific and methylation-sensitive. Curr. Biol. 10, 853−856 (2000).
This paper shows that CTCF binding is methylation sensitive and provides important mechanistic insights into how cells distinguish between maternal and paternal alleles.
(..)
36. Dean, W. et al. (..) 13734−13738 (2001)
37. De Baun, M. R et al. (..) 156−160 (2003)
References 34−37 show that in vitro culturing of embryos can lead to epigenetic defects in animals and that this might also have a role in humans.
A little bit of commentary goes a long way, making it much easier to cull references from papers. Think back to the last time that you were reading about a new aspect or technology that you weren't familiar with and remember how difficult it was to pick out the seminal references from which to learn more...
Secondly, I don't see why references can't be grouped by the section of the manuscript that first mentions them; in theory Vancouver style references are already in the order of their appearance. Simply throw in "Background", "Methods" and "Discussion" headers into the references section and immediately we can see, for example, which papers are most closely related to the one being read (cited in the background section) and which are more distantly related (cited in the discussion).
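To make the grouping idea concrete, here is a minimal sketch in Python. It assumes the manuscript's citations are available as (section, reference number) pairs in reading order; the function name and the example data are made up for illustration and don't correspond to any real reference manager.

```python
def group_references(citations):
    """Group reference numbers under the section that first cites them.

    `citations` is a list of (section, reference_number) pairs in the
    order they appear in the manuscript (a hypothetical input format).
    """
    grouped = {}
    seen = set()
    for section, ref in citations:
        # Make sure every section gets a header, even if all of its
        # citations were already mentioned earlier in the paper.
        grouped.setdefault(section, [])
        if ref not in seen:
            seen.add(ref)
            grouped[section].append(ref)
    return grouped


# Invented example: reference 1 is cited again in Methods and
# reference 2 again in the Discussion, but each is listed only once.
citations = [
    ("Background", 1), ("Background", 2),
    ("Methods", 3), ("Methods", 1),
    ("Discussion", 4), ("Discussion", 2), ("Discussion", 5),
]

grouped = group_references(citations)
for section, refs in grouped.items():
    print(section, refs)
```

Because Vancouver numbering already follows order of first appearance, the numbers under each header come out sorted without any extra work.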
Wednesday, August 10, 2005
Links
- A field guide to biomedical meeting creatures, part 1: Any questions? at Orac Knows - astute (and funny) guide to the different types of people who ask you questions after presentations.
- A field guide to biomedical meeting creatures, part 2: Poster Time! at Orac Knows - more of the same but as it applies to people asking you questions about your poster (and people presenting posters)
- On the Money at In the Pipeline - On the differences between working in academia and industry. Slightly skewed towards industry if you ask me, but then I've never worked in big pharma. Does have some good points though (and quotes that 'graduate school is the last bastion of feudalism' which I thought was good).
- Industry and Academia, pt 1 at In the Pipeline - more thoughts on the matter.
- Industry and Academia, pt 2 at In the Pipeline
- Industry and Academia: The Mental Aspect at In the Pipeline - Super summarized: there's a finish-the-project-or-die attitude in academia.
- Programmers Need to Learn Statistics Or I Will Kill Them All via Zed's Blog - applies equally to bioinformaticians.
- Ten Lessons I Wish I Had Been Taught - Not sure about the "publish the same result several times" advice. That's another one of my pet peeves.
Evolutionary conserved elements
The thing is that the human genome is big. Really big. 3,400 million base pairs big. Only 1.5% of that is made up of genes that code for proteins and the amount of material in the difference between 5% and 1.5% is huge, even though it might not seem like it at first. Think about it this way: in the same amount of space you could fit in almost the entire genome of, say, the fruit fly.
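The arithmetic behind that comparison is easy to check. The sketch below uses only the figures quoted above (3,400 million base pairs, 5% and 1.5%); the function name is invented for illustration.

```python
GENOME_MB = 3400       # human genome size in millions of base pairs (figure from the post)
IMPORTANT_PCT = 5.0    # share of the genome thought to be functionally important
CODING_PCT = 1.5       # share made up of protein-coding genes


def noncoding_important_mb(genome_mb, important_pct, coding_pct):
    """Millions of base pairs that are functionally important but not protein-coding."""
    return genome_mb * (important_pct - coding_pct) / 100


gap_mb = noncoding_important_mb(GENOME_MB, IMPORTANT_PCT, CODING_PCT)
print(f"{gap_mb:.0f} million base pairs")  # prints "119 million base pairs"
```

For comparison, the sequenced portion of the fruit fly genome is usually quoted at roughly 120 million base pairs; that number is not from this post, so treat it as a ballpark rather than a precise figure.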
So what's the other functionally important 3.5% of the genome? Nobody really knows. The snappy name for it is "dark matter" - no reason why physics should get to keep all those cool names for mysterious expanses of the unknown - and while there are a lot of good ideas out there, mostly all that non-junk DNA just serves to remind us how much we still don't know about our genetic material.
One interesting thing about the Siepel et al study is that they've tried to break down where these evolutionarily conserved regions occur. As you might expect, many are in the exons of protein coding genes, signposting genes that have similar ancestry and function in most of the species examined. As the percentages from the previous paragraphs indicate, however, there are at least twice as many conserved regions outside of those protein coding genes.
Repeats, which make up a large proportion of the genome, tended to contain a substantially reduced number of conserved regions although interestingly some ancestral repeats (repeats inserted before different species "split" off from their common ancestor) appear to have gained critical functions and are highly conserved; it has been suggested that maybe some of these are the functions that help to differentiate mammals from ancestral vertebrates.
Other conserved regions seem to confirm the regulatory roles played by certain sequence features: tellingly, the untranslated regions (UTRs) sometimes found on the ends of gene sequences which are thought to play a part in where, how often and for how long the gene is expressed were enriched significantly in highly conserved bases.
The possibilities for using the sort of data Siepel et al have generated are fantastic - I'd love to see more research along these lines. With the ENCODE project maturing we should be able to shed more light on what dark matter really is and there's even the possibility that we'll be able to create a set of controls large enough to train and test machine learning algorithms that finally get regulatory region prediction success rates up to reliable levels. That's maybe a whole other posting for another day.
I should point out that this certainly isn't the first paper to look at genomic alignments. With each one written, though, the field seems to get a little bit more sophisticated and the methodology a little more polished.
More importantly, all of the data they produced - including base by base conservation scores - is already available as a track on the UCSC genome browser which earns them a big gold star in my book.
Friday, August 05, 2005
The Edge of Computation
Edge is a web and print publication produced by the Edge Foundation, whose "informal membership includes of some of the most interesting minds in the world". Their February issue has the transcript of a pretty good panel discussion between Craig Venter, Ray Kurzweil and Rodney Brooks on the subject of biocomputation.
Everybody probably knows about Craig Venter already. Rodney Brooks is the director of MIT's Computer Science and Artificial Intelligence Laboratory and is probably best known for his insect robots based on the idea that complex, lifelike behaviour can arise from lots of simple programmed behaviours working together. Ray Kurzweil is another comp sci celebrity; probably more famous as an entrepreneur than an academic now, though.
It's a long read but also one that covers a lot of ground. The three participants have some interesting opinions about how biocomputing is changing the way we look at the natural world; Kurzweil in particular seems adamant that we'll be able to model everything from cells up to the human brain in relatively short order while Rodney Brooks is more careful: he believes that we're still missing the "essence of life" - an understanding of biology from first principles - without which our efforts at replicating the complexity of life will always be unsuccessful. Craig Venter talks a little bit about high-throughput genomics - as it relates to his genome mining trips on the high seas and to cheap DNA sequencing - amongst other things.
There's some Quicktime video of the talk available too.
Dang, I'm moving
Salaries in the field vary widely depending on education level and experience. Hughey says that annual salaries can range from $60,000 at the entry level to more than $100,000 for Ph.D.s. A recent salary survey [..] reported that the average salary in 2003 for life scientists whose primary area of specialization is bioinformatics was $75,845.
(Text taken from this article)
Entry level salaries of $60,000 (~ £35,000) for bioinformaticians? I'm definitely in the wrong country. Entry level here in the UK would be about two thirds of that and you'd be lucky to push $100,000 a year with a PhD as a team leader.
Some tips I wish I'd been told
Anyway, some things I wish that I'd been told when I first started:
- You don't need a qualification in the life sciences to work in bioinformatics : it takes the same amount of time for a biologist to learn the relevant computer science skills as it does a computer scientist to learn the relevant biology.
- Don't expect perfect solutions : I reckon that the sweet spot for accuracy in bioinformatics is 60-70%. Protein structure predictions work for 60-70% of genes. 60-70% of regulatory regions can be detected with the more recent methods. 60-70% of gene names can be successfully culled from large sets of Pubmed abstracts. Biology is complex. Current knowledge is far from perfect. Don't get into bioinformatics if you like clean, elegant solutions.
- Learn Perl : You can try and get away with just Java or C but I assure you that at some point you're going to have to embrace Perl. Twas the language of the original bioinformaticians and thus shall remain ever so.
- Stay well informed : As in computer science, you have to keep your skills current to survive. Unfortunately you also need to keep up to date with current scientific thinking on top of that. Sign up to the RSS feeds or table of contents alerts from the big bioinformatics journals (some are listed on the sidebar to the right).
- Offer your services : your job probably involves helping out bench biologists anyway, but be on the lookout for ways that informatics could help the work going on in your lab; sometimes you'd be surprised. What might take you a couple of minutes with a simple script could be taking an unfortunate RA days (a real life example: checking a list of a hundred or so SNPs to look for those in conserved regions, running them through SIFT, etc.).
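As a rough illustration of the "couple of minutes with a simple script" point, here is a sketch of the SNP check in Python rather than Perl. It assumes you already have SNP positions and a list of conserved intervals on the same chromosome, and that the intervals don't overlap one another; the identifiers and coordinates are invented.

```python
from bisect import bisect_right


def snps_in_conserved_regions(snps, regions):
    """Return the ids of SNPs whose position falls inside a conserved region.

    snps:    dict mapping SNP id -> position
    regions: list of (start, end) tuples, inclusive at both ends and
             assumed not to overlap one another
    """
    regions = sorted(regions)
    starts = [start for start, _ in regions]
    hits = []
    for snp_id, pos in snps.items():
        # Find the last region that starts at or before this position.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and regions[i][0] <= pos <= regions[i][1]:
            hits.append(snp_id)
    return hits


# Made-up coordinates and placeholder ids, purely for illustration.
conserved = [(500, 650), (100, 200)]
snps = {"snp_a": 150, "snp_b": 300, "snp_c": 650, "snp_d": 99}

print(snps_in_conserved_regions(snps, conserved))
```

The binary search is overkill for a hundred SNPs, but it costs nothing extra to write and still behaves when the list grows to a few million.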
Ouch
You may think that working in bioinformatics means that you're safe from lab health and safety concerns like pipette RSI, being blinded by exploding liquid nitrogen etc. and that's true - but don't underestimate the cost of not taking a walk around every hour and a half or doing some desk exercises like you're supposed to.
I find it difficult to tear myself away from coding projects once I've gotten started - I like to just put my head down and work away until it's time to go for lunch or to go home. Unfortunately I'm now paying the price for that attitude...
Top tip for the day: get up from your workstations and take a brisk walk.
Wednesday, August 03, 2005
Distributed computing
Like most things in computer science the Grid is an old idea dressed up in spangly pants. It's not rocket science to have a client ask a server to do some computation and then return the result. You wouldn't know it to look at the descriptions of most Grid related projects, though - there's so much guff surrounding the central idea. Here in the UK, at least, there's a great deal of money available for Grid projects, so perhaps people feel that they need to justify their grants by inventing new acronyms and giving older technologies new, swisher names.
(I'm not saying that I think the Grid is a bad idea, because it's definitely not. It's just that it's sometimes being wrapped in needless complexity and presented as some sort of magical panacea for all our computational problems.)
Anyway, you can already do various NCBI searches with REST (i.e. sending variables on the URL line). The SOAP::Lite Perl library lets you access bioinformatics web services quickly and easily - not that there are all that many out there, but hey - all of which is Grid-lite, I guess.
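For anyone who hasn't tried it, "sending variables on the URL line" looks roughly like the sketch below. It only builds the request URL and doesn't contact NCBI. The endpoint and the db, term and retmax parameters are taken from NCBI's E-utilities as currently documented, not from this post, and the helper function name is made up.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities (an assumption based on current NCBI docs).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL; every search variable travels in the query string."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = esearch_url("pubmed", "chemotaxis AND Escherichia coli", retmax=5)
print(url)
```

Fetching that URL with any HTTP client returns XML listing the matching record identifiers, which is about as Grid-lite as it gets.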
Projects like BioMOBY take things one step further, introducing centralized lists of services, formally defining inputs and outputs and so on. I don't know if anybody reading this uses it frequently, or knows of any projects where it's been used successfully; as impressed as I am with the work that has gone into it BioMOBY never grabbed me simply because it took as long to sort out all of the kinks and overhead involved as it did to obtain, parse and analyse the data I wanted to look at in the old fashioned BioPerl way. Maybe that'll change in the future, as APIs get slimmed down, network negotiations get faster and more services become available.
In any case, I'm still unsure of why the Grid is a hot topic. For large scale projects and pipelines, sure, the Grid could come in useful (check out SETI@Home or that United Devices protein folding app) but are the changes it might make to the majority of bioinformatics work really all that revolutionary? If not, why all the funding?
Tuesday, August 02, 2005
Naming conventions
At least it's memorable. I vote that we ditch numerical identifiers for unique, whimsical slogans wherever possible in human genetics. There's no real reason to restrict database identifiers to twelve bytes or whatever any more. Space is cheap and datapipes wide. Why not expand the dbSNP or PubMed identifier fields to 128 characters and use unique, auto-generated sentences instead of numbers? On the rare occasions where I have to remember identifiers offhand I'd remember the "clerks love monkeys snp" or "sausage rolls make you fat paper" far more easily than RSxxxxxx.
Of course, scientific papers might not get through email spam filters anymore, but it's a small price to pay, surely?
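In the same not-entirely-serious spirit, one way to auto-generate unique phrases is to write the numeric identifier in base N, using a list of N words as the digits. The word list and function names below are invented for illustration; a real scheme would want a much longer vocabulary.

```python
# Tiny illustrative vocabulary; each word acts as one "digit".
WORDS = ["clerks", "love", "monkeys", "sausage", "rolls", "make", "you", "fat"]


def id_to_phrase(number, words=WORDS):
    """Encode a non-negative integer as a phrase (base-len(words) digits)."""
    if number < 0:
        raise ValueError("identifier must be non-negative")
    base = len(words)
    digits = []
    while True:
        number, remainder = divmod(number, base)
        digits.append(words[remainder])
        if number == 0:
            break
    return " ".join(reversed(digits))


def phrase_to_id(phrase, words=WORDS):
    """Decode a phrase produced by id_to_phrase back into its integer."""
    base = len(words)
    number = 0
    for word in phrase.split():
        number = number * base + words.index(word)
    return number


print(id_to_phrase(10))                # prints "love monkeys"
print(phrase_to_id("love monkeys"))    # prints 10
```

Since every integer has exactly one base-N representation, no two identifiers end up with the same phrase, and the original number can always be recovered.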
Reproducible research
It's an interesting idea - essentially, mixing code in with your results to allow people to try things for themselves - see the nodalpoint writeup for more.
Monday, August 01, 2005
Restricted Access & the HGMD
The drawback is that there's no easy way to get at the data. Visiting the website, your only option is to search by gene; you'll then get a list of mutations that the gene contains. There's no form of advanced search and no way to bulk download the contents of the database (via a condoned channel; shadily, there's always wget configured with a time delay).
This is obviously something that the authors have considered. However, in their paper in NAR they mention that:
Since HGMD is partly dependent upon industrial funding and involves considerable editorial work over and above mere literature screening (e.g. to ensure the consistency of nucleotide sequence information, amino acid residue numbering and gene symbol usage), unsolved copyright problems have so far precluded HGMD from being downloadable in its entirety.
It disturbs me slightly that this sort of thing is an issue. I think that it's because, as opposed to lab based genetics, bioinformatics resources are usually free: programming languages (Java, Perl), libraries (Bio*, NCBI's API, Seqhound) and data (Ensembl, PubMed abstracts, GNF expression data...). Free is a tricky concept nowadays, of course, but I mean in the sense that they are usually free to obtain and to use in an academic environment.
Just to be clear, I'm not disparaging the work of the people involved in the HGMD, just the politics behind some of their policies. The fact remains that the HGMD is a good database. It has the potential to be even better, though.
Why not release copyright on this kind of data, or allow researchers to use the relevant information after signing a release to ensure that they stick to your terms and conditions? Restricting access in this way (especially without explanation, unless you've read the relevant part of the paper) surely just annoys scientists. There's no corporate peer pressure anymore; even Celera has given up trying to hoard genomic data that goes out of date by the time you've worked out how to charge for it.
Let open access work for you. The mutations in HGMD are often culled from literature and relating them to reference sequences is remarkably difficult. An internal database identifier and "Asn351 to Asp" is only great if people know which transcript is being talked about. Make the first condition of using HGMD data that any derived analyses be made publicly available too. Presumably the first thing that some people will do is take a Perl script and dbSNP and start mapping. Start including annotation derived from HGMD by places like SNPs3D.
There's a note of hope in the next paragraph of the paper in NAR.
Once the closer cooperation with publically funded bioinformatics institutions currently envisaged has been put in place, unrestricted access to the database will become possible.
Publicly funded bioinformatics institutions? In the UK?
Back to OMIM and dbSNP it is.


