The Bergstrom lab's Eigenfactors site has just come out of preview mode and you can now query the whole dataset.
Eigenfactors are a novel alternative to journal impact factors. They're calculated using random walks through a network of citations:
The Eigenfactor algorithm corresponds to a simple model of research in which readers follow chains of citations as they move from journal to journal. Imagine that a researcher goes to the library and selects a journal article at random. After reading the article, the researcher selects at random one of the citations from the article. She then proceeds to the journal that was cited, reads a random article there, and selects a citation to direct her to her next journal volume. The researcher does this ad infinitum.
The amount of time that the researcher spends with each journal gives us a measure of that journal’s importance within network of academic citations.
One advantage of this approach is that it can handle the fact that different disciplines have different citation patterns:
The average article in a leading cell biology journal might receive 10-30 citations within two years; the average article in leading mathematics journal would do very well to receive 2 citations over the same period.
The list of the top 10 journals by eigenfactor looks pretty much as you'd expect - Nature and Science are sitting pretty at the top, natch.
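The random-reader model quoted above can be sketched in a few lines. This is only a toy illustration of the idea, not the Eigenfactor implementation: the three-journal network, the `random_walk_shares` function and the rule for journals with no outgoing citations are all assumptions made for the example.

```python
import random
from collections import Counter

# Hypothetical toy network: journal -> journals cited by its articles.
# A repeated entry means that journal is cited more often.
CITATIONS = {
    "A": ["B", "B", "C"],
    "B": ["A", "C"],
    "C": ["A"],
}

def random_walk_shares(citations, steps=100_000, seed=0):
    """Estimate the fraction of time a random reader spends at each journal."""
    rng = random.Random(seed)
    journals = sorted(citations)
    journal = rng.choice(journals)
    visits = Counter()
    for _ in range(steps):
        visits[journal] += 1
        cited = citations.get(journal)
        # Assumption of this sketch: with nothing to follow, pick any journal.
        journal = rng.choice(cited) if cited else rng.choice(journals)
    return {j: visits[j] / steps for j in journals}

shares = random_walk_shares(CITATIONS)
for name, share in shares.items():
    print(f"{name}: {share:.3f}")
```

The journal the reader keeps landing on ends up with the largest share of visits, which is the "importance" the quote describes.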
One issue: in an attempt to include more material from the social sciences, the eigenfactor dataset includes articles from newspapers and popular magazines. As newspaper articles don't typically have reference lists attached, I'm not sure how they are incorporated into the network, but in any case don't they skew eigenfactors towards those journals that have the best press releases? Could I start a Journal of Sensational Medicine, publish pseudo-scientific quackery, be spurned by academia and still have a high eigenfactor simply because I feed the London Lite headlines?
(via Three Toed Sloth)
Labels: eigenfactors, impact, journals
So Andrew has been talking about Dapper and Pedro about kapow, both screen scraping (sort of) services that let you extract data from websites. It's interesting and potentially useful stuff.
Here's my Dapped contribution: an RSS feed of advance access articles from Bioinformatics and NAR. I tried putting Nature advance publications in there but shamefully the nature.com markup isn't up to scratch or something and Dapper refuses to pull out the correct pieces of data consistently. Neh.
There's a little bit of custom PHP code involved to create links for each paper from the DOI, to extract publication dates and to merge the two dap RSS feeds - this seems like the kind of thing Pipes was designed for, but unfortunately it's not quite flexible enough yet.
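For a rough idea of what that glue code does, here's a minimal sketch. It's in Python rather than PHP and it is not the code behind the feed: the entry layout, the function names and the example DOIs and dates are all made up for illustration. The only real-world piece is the doi.org resolver prefix.

```python
from datetime import date

def doi_to_link(doi):
    """Turn a bare DOI into a resolvable URL via the doi.org resolver."""
    return "https://doi.org/" + doi.strip()

def merge_feeds(*feeds):
    """Merge several lists of entries into one list, newest first."""
    merged = [
        dict(entry, link=doi_to_link(entry["doi"]))
        for feed in feeds
        for entry in feed
    ]
    return sorted(merged, key=lambda e: e["published"], reverse=True)

# Hypothetical entries standing in for the output of the two daps.
bioinformatics = [
    {"title": "Paper one", "doi": "10.1093/example.1", "published": date(2007, 3, 1)},
]
nar = [
    {"title": "Paper two", "doi": "10.1093/example.2", "published": date(2007, 3, 5)},
]

feed = merge_feeds(bioinformatics, nar)
for entry in feed:
    print(entry["published"], entry["title"], entry["link"])
```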
I made the original bioinformatics and NAR daps public if you want to tweak them.
Labels: api, dapper, mashups, pipes, web
When writing Friday's post about the Nature Methods 'software availability' editorial I spent some time trawling through Nodalpoint's archives looking for comments about defunct software distributions to serve as anecdotal evidence. Broken links to resources seem like a problem that many people have encountered.
I figured that I'd do some empirical research and check out all of the Application Notes published in the March issues of Bioinformatics from the past four years.
Some "this study isn't very scientific" disclaimers: It's not a huge dataset. I'm lumping databases, software and web services together to talk about 'resources' in general. There's only one resource per paper, and it's whatever is referred to in the abstract 'availability' section. I started off going through every paper in each issue to see if they mentioned resources but it rapidly became tiresome, so for 2005, 2004 and 2003 I just looked at the Application Notes.
So on to the results - the raw data is at the end of the post, but briefly:
- 12% of resources from the March 2006 issues are no longer available.
- 17% of the resources from 2005 and 2004 are no longer available.
- 11% of the resources from 2003 are no longer available.
- Only one of the resources I looked at was hosted on SourceForge. It's still available.
- Many, many resources were hosted in home directories (i.e. whatever.edu/~username/ ).
- A couple of resources that were available 'upon request' made clear that they were free for non-profit use only - is holding the software back a way of screening potential customers?
Two other things I noticed:
- OUP Bioinformatics used to have lots of original research and now it's all applications and databases (not necessarily a bad thing, I'm just saying. Neil has mentioned this before, too)
- People writing bioinformatics web services love frames. Stop using frames, please.
Perhaps a compromise between making software open source and keeping it locked up until you / your technology transfer officer can become fantastically rich by selling it to big pharma is to upload a tarball of the software executable (that runs on a reference platform: Windows, OS X, Linux?) and some documentation to, say, WebCite? No mailing lists, CVS access or anything fancy are necessary, after all: just a permanent snapshot of the software that you used to write your paper.
Anyway, the raw data:
March 2006
- 27 resources
- 3 available on request (11%)
- 3 unavailable (of all resources: 11% / of freely available resources: 12%)
- 1 in SourceForge
March 2005
- 33 resources
- 4 available on request (12%)
- 5 unavailable (15% / 17%)
- 1 unavailable site redirects to an ad-filled domain parking page, how rude.
March 2004
- 29 resources
- all freely available (i.e. not 'on request')
- 5 unavailable (17% / 17%)
March 2003
- 22 resources
- 5 available on request (22%)
- 2 unavailable (9% / 11%)
Labels: availability, nature, software
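If you want to check the arithmetic, the two percentages for each year can be recomputed from the counts above. This is a small sketch that assumes "freely available" means the total minus the 'on request' resources; the `unavailable_rates` name is just for the example. It prints to one decimal place, so the figures can differ by a point from the whole numbers quoted in the post.

```python
# Counts transcribed from the raw data:
# year -> (total resources, available on request, unavailable)
COUNTS = {
    2006: (27, 3, 3),
    2005: (33, 4, 5),
    2004: (29, 0, 5),
    2003: (22, 5, 2),
}

def unavailable_rates(total, on_request, unavailable):
    """Return (% of all resources, % of freely available resources) now gone."""
    # Assumption: freely available = everything that isn't 'on request'.
    freely_available = total - on_request
    return 100 * unavailable / total, 100 * unavailable / freely_available

for year, counts in sorted(COUNTS.items(), reverse=True):
    of_all, of_free = unavailable_rates(*counts)
    print(f"March {year}: {of_all:.1f}% of all, {of_free:.1f}% of freely available")
```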
Nature Methods has a new editorial clarifying its position on making the software used in papers available to readers (about time a journal did this):
The minimum level of disclosure that Nature Methods requires depends on how central the software is to the paper. If a software program is the focus of the report, we expect the programming code to be made available. Without the code, the software—and thus the paper—would become a black box of little use to the scientific community. In many papers, however, the software is only an ancillary part of the method, and the focus is on the methodological approach or an insight gained from it.
In these cases, releasing the code may not be a requirement for publication, but such custom-developed software will often be as important for the replication of the procedure as plasmids or mutant cell lines. We therefore insist that software or algorithms be made available to readers in a usable form. The guiding principle is that enough information must be provided so that users can reproduce the procedure and use the method in their own research at reasonable cost—both monetary and in terms of labor.
I think it's quite a well thought out piece. The editors recognize, for instance, that some short programs and algorithms are better made available as pseudocode (well, they say 'a small set of equations', but I know which one I'd prefer).
I'm not sure it goes far enough, though. For example: if the software runs as a web service, is making that service public enough to satisfy the journal's requirements? Can you host any code releases on your own server?
The problem with answering either of those questions with a 'yes' is that there's no guarantee that the software is still going to be available after a year or two (something most bioinformaticians are acutely aware of): postdocs and grad students move on, server accounts (and labs) get closed, bugs crop up and there's nobody willing to fix them, websites get redeveloped... etc.
What happens when we read an older paper, the software isn't around any more and we report it to an editor?
When we ask authors to make sequences available we require them to be deposited in GenBank. Should we require software authors to deposit their code on SourceForge, Google Code or some other (more) permanent repository (in which case, what about executable-only software or software that has a restrictive licence)?
There are open comment threads at both Methagora - the Nature Methods blog - and Nautilus, which covers the whole spread of Nature journals. I urge you to go forth and help shape journal policy (perhaps).
Labels: nature, opensource, software
Last year I posted about disease gene prediction - using computational methods to prioritize candidate genes for further (human) study. It's a relatively busy field: there are half a dozen systems out there that can all help narrow down large lists of genes with varying degrees of success.
This week in Bioinformatics Advance Access there's a paper by Kyle Gaulton (watch out, PDF) from the Mohlke lab at UNC describing their new system, called CAESAR (nice name, which is a good start).
CAESAR is remarkably cool. Here's how it works:
1. You give it a text corpus to work from - some review articles about the disease that you're interested in or an OMIM entry, for example
2. It extracts all of the gene symbols from that corpus
3. Again using the corpus, it finds relevant terms from the Gene Ontology, eVOC and MGD's ontology of mammalian phenotypes
4. It expands the set of genes from (2) by looking for interaction partners in BIND and KEGG and similar proteins in iPro, and by using the ontology terms from (3) to find relevant mouse knockouts, genes that have known associations with similar phenotypes and genes that are expressed in the same tissues.
5. It combines the resulting large sets of genes and ranks them mathemagically to produce the final ranked list.
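To make the shape of that pipeline concrete, here's a toy sketch. It is not CAESAR's method: the gene names are placeholders, the interaction and phenotype data are invented, and the ranking simply counts how many evidence sets mention a gene, which is a stand-in for whatever the paper actually does when combining its sources.

```python
import re
from collections import defaultdict

def extract_genes(corpus, known_symbols):
    """Collect the known gene symbols that are mentioned in the corpus."""
    tokens = set(re.findall(r"[A-Za-z0-9-]+", corpus))
    return tokens & known_symbols

def expand(seeds, interactions):
    """Add the interaction partners of the seed genes."""
    partners = {p for gene in seeds for p in interactions.get(gene, ())}
    return seeds | partners

def rank(evidence_sets):
    """Stand-in ranking: order genes by how many evidence sets contain them."""
    scores = defaultdict(int)
    for genes in evidence_sets:
        for gene in genes:
            scores[gene] += 1
    return sorted(scores, key=lambda g: (-scores[g], g))

# Hypothetical inputs, made up purely for illustration.
corpus = "GENEA and GENEB have been linked to the trait."
known_symbols = {"GENEA", "GENEB", "GENEC"}
interactions = {"GENEA": ["GENEC"]}
phenotype_hits = {"GENEA"}

seeds = extract_genes(corpus, known_symbols)
expanded = expand(seeds, interactions)
ranked = rank([seeds, expanded, phenotype_hits])
print(ranked)
```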
Anyway, I was impressed. I really like the basic idea:
[it] relies on human expert knowledge in order to function effectively, but it does not require that the user actually possess all of this knowledge.
CAESAR is not without issues. In particular there's a bias towards genes that are more heavily annotated - the manuscript points out that the mean number of GO terms for genes ranked in the top 98th percentile of their test sets was significantly higher than the number of terms for all genes.
Despite some cheeky use of misleading language in the results section ("we addressed this potential bias" means "we proved that the bias wasn't potential at all but real, then moved swiftly on" rather than "we addressed the problem and fixed it") there's not really any discussion of how future systems could avoid the same issue.
The worst side-effect of relying on annotation is that only 15,000 human genes (~ 50%?) have enough quality annotation from different sources to do anything with at all. This percentage will increase over time, but until then there must be other sources of data that we can use (Lude Frank left a comment about this on last year's post).
There's also a potential issue with the way that CAESAR was tested using a set of genes already known to be involved in a complex trait. Gaulton et al. cleaned up the corpus for each test gene by removing any direct references to it and restricting the papers included to those published before the year of association, but might bias not remain in places like BIND, KEGG and iPro as a result of subsequent gene-driven research into the trait's etiology?
You'd expect, for example, that once a new gene was implicated in a disease somebody somewhere would immediately check to see if it interacts with the other candidate genes for that disease (mentioned in the literature corpus used during testing) - placing the results into BIND. OK, it's a bit of a weak correlation, but still...
Anyway, all that aside it's a nice piece of software (and freely available!). I'd be interested to hear if CAESAR is going to be developed any further.
Labels: caesar, disease, software

Cheeky monkeys. Can't help but feel that the money would've been better spent elsewhere, though. Yes, I know that it's an ape not a monkey.
Labels: nature, publishing, science
Publish or Perish is a Windows app that generates your h-index (amongst other metrics) for you, based on citation data from Google Scholar. No NSPNAS, yet, unfortunately.
Personally I think that the simple number that makes up an h-index is a little dry. Besides, people who've never heard of it before don't have a frame of reference. What's the scale? Does it go from low to high or the other way round?
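For anyone without that frame of reference: an h-index of h means that h of your papers have at least h citations each, so it starts at zero and higher is better. A minimal sketch of the calculation follows (the `h_index` function name and the example citation counts are mine, not taken from Publish or Perish):

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    ranked = sorted(citations, reverse=True)
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# Five papers with these citation counts: four of them have at least 4 citations.
print(h_index([10, 8, 5, 4, 3]))
```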
More to the point, "I have an h-index of 30" won't impress the opposite sex. No, for that you need D&D references (hotties dig D&D). How about "I'm a level 30 biobarbarian"? Now we're talking. Behold my +1 Pipettes of Power. When you collect enough citations you level up. I'm a level 1 gnome, myself.
Go on, it'll look much better on your grant application.
Labels: citations, software