When writing Friday's post about the Nature Methods 'software availability' editorial I spent some time trawling through Nodalpoint's archives looking for comments about defunct software distributions to serve as anecdotal evidence. Broken links to resources seem like a problem that many people have encountered.
I figured that I'd do some empirical research and check out all of the Application Notes published in the March issues of Bioinformatics from the past four years.
Some "this study isn't very scientific" disclaimers: It's not a huge dataset. I'm lumping databases, software and web services together to talk about 'resources' in general. There's only one resource per paper, and it's whatever is referred to in the abstract 'availability' section. I started off going through every paper in each issue to see if they mentioned resources but it rapidly became tiresome, so for 2005, 2004 and 2003 I just looked at the Application Notes.
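If you wanted to repeat this sort of check without clicking through every link by hand, something like the following might do as a first pass. It's only a sketch: the post doesn't say how the links were checked, and the `is_available` and `survey` names are made up for illustration. The `opener` argument is there so the function can be exercised without touching the network.

```python
import http.client
import urllib.request


def is_available(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if the URL answers with an HTTP status below 400.

    Assumption: "available" just means "the server answered". A human
    still has to confirm the resource itself is really there.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # URLError/HTTPError and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def survey(urls, opener=urllib.request.urlopen):
    """Map each URL to True/False using is_available."""
    return {url: is_available(url, opener=opener) for url in urls}
```

A successful response isn't proof of life, mind you: a domain parking page full of adverts answers with a cheerful 200 too, so anything this flags as 'available' still needs a quick look.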
So on to the results - the raw data is at the end of the post, but briefly:
- 12% of resources from the March 2006 issues are no longer available.
- 17% of the resources from 2005 and 2004 are no longer available.
- 11% of the resources from 2003 are no longer available.
- Only one of the resources I looked at was hosted on SourceForge. It's still available.
- Many, many resources were hosted in home directories (i.e. whatever.edu/~username/ ).
- A couple of resources that were available 'upon request' made clear that they were free for non-profit use only - is holding the software back a way of screening potential customers?
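For anyone who wants to check my sums, here's a minimal sketch that recomputes the 'unavailable' percentages from the raw counts at the end of the post. The `unavailable_percentages` name is made up for illustration, and I'm assuming the figures are truncated to whole percentages, which is what matches the numbers quoted.

```python
def unavailable_percentages(total, on_request, unavailable):
    """Return (% of all resources, % of freely available resources).

    Percentages are truncated to whole numbers (an assumption about
    how the post's figures were produced).
    """
    freely_available = total - on_request
    of_all = 100 * unavailable // total
    of_free = 100 * unavailable // freely_available
    return of_all, of_free


# Raw counts copied from the end of the post:
# year -> (resources, available on request, unavailable)
COUNTS = {
    2006: (27, 3, 3),
    2005: (33, 4, 5),
    2004: (29, 0, 5),
    2003: (22, 5, 2),
}

for year, counts in COUNTS.items():
    of_all, of_free = unavailable_percentages(*counts)
    print(f"March {year}: {of_all}% of all / {of_free}% of freely available")
```

The second number is the one quoted in the summary bullets: resources that were only ever 'available on request' are left out of the denominator, since there's no link to break.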
Two other things I noticed:
- OUP Bioinformatics used to have lots of original research and now it's all applications and databases (not necessarily a bad thing, I'm just saying. Neil has mentioned this before, too)
- People writing bioinformatics web services love frames. Stop using frames, please.
Perhaps a compromise between making software open source and keeping it locked up until you / your technology transfer officer can become fantastically rich by selling it to big pharma is to upload a tarball of the software executable (that runs on a reference platform: Windows, OS X, Linux?) and some documentation to, say, WebCite? No mailing lists, CVS access or anything fancy are necessary, after all: just a permanent snapshot of the software that you used to write your paper.
Anyway, the raw data:
March 2006
- 27 resources
- 3 available on request (11%)
- 3 unavailable (of all resources: 11% / of freely available resources: 12%)
- 1 in SourceForge
March 2005
- 33 resources
- 4 available on request (12%)
- 5 unavailable (15% / 17%)
- 1 unavailable site redirects to an ad-filled domain parking page, how rude.
March 2004
- 29 resources
- all freely available (i.e. not 'on request')
- 5 unavailable (17% / 17%)
March 2003
- 22 resources
- 5 available on request (22%)
- 2 unavailable (9% / 11%)
Labels: availability, nature, software
Nature Methods has a new editorial clarifying its position on making the software used in papers available to readers (about time a journal did this):
The minimum level of disclosure that Nature Methods requires depends on how central the software is to the paper. If a software program is the focus of the report, we expect the programming code to be made available. Without the code, the software—and thus the paper—would become a black box of little use to the scientific community. In many papers, however, the software is only an ancillary part of the method, and the focus is on the methodological approach or an insight gained from it.
In these cases, releasing the code may not be a requirement for publication, but such custom-developed software will often be as important for the replication of the procedure as plasmids or mutant cell lines. We therefore insist that software or algorithms be made available to readers in a usable form. The guiding principle is that enough information must be provided so that users can reproduce the procedure and use the method in their own research at reasonable cost—both monetary and in terms of labor.
I think it's quite a well thought out piece. The editors recognize, for instance, that some short programs and algorithms are better made available as pseudocode (well, they say 'a small set of equations', but I know which one I'd prefer).
I'm not sure it goes far enough, though. For example: if the software runs as a web service, is making that service public enough to satisfy the journal's requirements? Can you host any code releases on your own server?
The problem with answering either of those questions with a 'yes' is that there's no guarantee that the software is still going to be available after a year or two (something most bioinformaticians are acutely aware of): postdocs and grad students move on, server accounts (and labs) get closed, bugs crop up and there's nobody willing to fix them, websites get redeveloped... etc.
What happens when we read an older paper, the software isn't around any more and we report it to an editor?
When we ask authors to make sequences available we require them to be deposited in GenBank. Should we require software authors to deposit their code on SourceForge, Google Code or some other (more) permanent repository (in which case, what about executable-only software, or software that has a restrictive licence)?
There are open comment threads at both Methagora - the Nature Methods blog - and Nautilus, which covers the whole spread of Nature journals. I urge you to go forth and help shape journal policy (perhaps).
Labels: nature, opensource, software
Last year I posted about disease gene prediction - using computational methods to prioritize candidate genes for further (human) study. It's a relatively busy field: there are half a dozen systems out there that can all help narrow down large lists of genes with varying degrees of success.
This week in Bioinformatics Advance Access there's a paper by Kyle Gaulton (watch out, PDF) from the Mohlke lab at UNC describing their new system, called CAESAR (nice name, which is a good start).
CAESAR is remarkably cool. Here's how it works:
1. You give it a text corpus to work from - some review articles about the disease that you're interested in or an OMIM entry, for example
2. It extracts all of the gene symbols from that corpus
3. Again using the corpus it finds relevant terms from the Gene Ontology, eVOC and MGD's ontology of mammalian phenotypes
4. It expands the set of genes from (2) by looking for interaction partners in BIND and Kegg and similar proteins in iPro and using the ontology terms from (3) to find relevant mouse knockouts, genes that have known associations with similar phenotypes and genes that are expressed in the same tissues.
5. It combines the resulting large sets of genes and ranks them mathemagically to produce the final ranked list.
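To make the shape of that pipeline a bit more concrete, here's a toy sketch of the steps above. To be clear, this is not CAESAR's code or its ranking scheme: the lookup tables are invented stand-ins for the real databases, and the scoring (one point per piece of supporting evidence) is my own placeholder for the 'mathemagic'.

```python
import re
from collections import Counter

# Hypothetical toy data, standing in for the real resources.
KNOWN_SYMBOLS = {"INS", "TCF7L2", "PPARG", "KCNJ11"}
INTERACTIONS = {"INS": {"INSR"}, "PPARG": {"RXRA", "INSR"}}
TERM_TO_GENES = {"glucose": {"GCK", "INSR"}, "insulin": {"INSR", "IRS1"}}


def extract_genes(corpus):
    """Step 2: pull out tokens that match a known gene symbol."""
    tokens = re.findall(r"[A-Z][A-Z0-9]+", corpus)
    return {token for token in tokens if token in KNOWN_SYMBOLS}


def extract_terms(corpus):
    """Step 3: find (toy) ontology terms mentioned in the corpus."""
    words = set(re.findall(r"[a-z]+", corpus.lower()))
    return {word for word in words if word in TERM_TO_GENES}


def rank_candidates(corpus):
    """Steps 4 and 5: expand the gene set, then rank by evidence count."""
    scores = Counter()
    for gene in extract_genes(corpus):
        scores[gene] += 1
        for partner in INTERACTIONS.get(gene, ()):
            scores[partner] += 1
    for term in extract_terms(corpus):
        for gene in TERM_TO_GENES[term]:
            scores[gene] += 1
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


corpus = "Variants in INS and PPARG affect insulin secretion and glucose levels."
print(rank_candidates(corpus))
```

Notice that the top-ranked gene in the toy example never appears in the corpus at all: it gets there purely by turning up in several of the lookup tables.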
Anyway, I was impressed. I really like the basic idea:
[it] relies on human expert knowledge in order to function effectively, but it does not require that the user actually possess all of this knowledge.
CAESAR is not without issues. In particular there's a bias towards genes that are more heavily annotated - the manuscript points out that the mean number of GO terms for genes ranked in the top 98th percentile of their test sets was significantly higher than the number of terms for all genes.
Despite some cheeky use of misleading language in the results section ("we addressed this potential bias" means "we proved that the bias wasn't potential at all but real, then moved swiftly on" rather than "we addressed the problem and fixed it") there's not really any discussion of how future systems could avoid the same issue.
The worst side-effect of relying on annotation is that only 15,000 human genes (~ 50%?) have enough quality annotation from different sources to do anything with at all. This percentage will increase over time, but until then there must be other sources of data that we can use (Lude Frank left a comment about this on last year's post).
There's also a potential issue with the way that CAESAR was tested using a set of genes already known to be involved in a complex trait. Gaulton et al. cleaned up the corpus for each test gene by removing any direct references to it and restricting the papers included to those published before the year of association, but might bias not remain in places like BIND, Kegg and iPro as a result of subsequent gene-driven research into the trait's etiology?
You'd expect, for example, that once a new gene was implicated in a disease somebody somewhere would immediately check to see if it interacts with the other candidate genes for that disease (mentioned in the literature corpus used during testing) - placing the results into BIND. OK, it's a bit of a weak correlation, but still...
Anyway, all that aside it's a nice piece of software (and freely available!). I'd be interested to hear if CAESAR is going to be developed any further.
Labels: caesar, disease, software
Publish or Perish is a Windows app that generates your h-index (amongst other metrics) for you, based on citation data from Google Scholar. No NSPNAS, yet, unfortunately.
Personally I think that the simple number that makes up an h-index is a little dry. Besides, people who've never heard of it before don't have a frame of reference. What's the scale? Does it go from low to high or the other way round?
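For a frame of reference: your h-index is the largest number h such that h of your papers have each been cited at least h times, so higher is better. Here's a minimal sketch of the calculation from a list of per-paper citation counts (the `h_index` function is my own illustration, not how Publish or Perish does it):

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


print(h_index([10, 8, 5, 4, 3]))  # 4: four papers with at least 4 citations each
```

One heavily cited paper doesn't help much on its own: `h_index([25, 8, 5, 3, 3])` is only 3.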
More to the point, "I have an h-index of 30" won't impress the opposite sex. No, for that you need D&D references (hotties dig D&D). How about "I'm a level 30 biobarbarian"? Now we're talking. Behold my +1 Pipettes of Power. When you collect enough citations you level up. I'm a level 1 gnome, myself.
Go on, it'll look much better on your grant application.
Labels: citations, software