Flags and Lollipops

Thursday, March 22, 2007

Eigenfactors

The Bergstrom lab's Eigenfactors site has just come out of preview mode and you can now query the whole dataset.

Eigenfactors are a novel alternative to journal impact factors. They're calculated using random walks through a network of citations:
The Eigenfactor algorithm corresponds to a simple model of research in which readers follow chains of citations as they move from journal to journal. Imagine that a researcher goes to the library and selects a journal article at random. After reading the article, the researcher selects at random one of the citations from the article. She then proceeds to the journal that was cited, reads a random article there, and selects a citation to direct her to her next journal volume. The researcher does this ad infinitum.

The amount of time that the researcher spends with each journal gives us a measure of that journal’s importance within network of academic citations.
One advantage of this approach is that it can handle the fact that different disciplines have different citation patterns:
The average article in a leading cell biology journal might receive 10-30 citations within two years; the average article in leading mathematics journal would do very well to receive 2 citations over the same period.
The list of the top 10 journals by eigenfactor looks pretty much as you'd expect - Nature and Science are sitting pretty at the top, natch.
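The random-reader model quoted above boils down to finding the long-run share of time a random walk spends at each node of a citation network. Here's a minimal sketch of that idea in Python. The three journals and their citation counts are invented for illustration, and the real Eigenfactor calculation has further refinements that aren't modelled here.

```python
# Toy illustration of the random-walk idea; journal names and counts are invented.
JOURNALS = ["A", "B", "C"]

# CITES[i][j] = number of citations from journal i to journal j (hypothetical).
CITES = [
    [0, 3, 1],
    [2, 0, 2],
    [4, 1, 0],
]


def stationary_distribution(cites, steps=200):
    """Estimate the share of time a random reader spends at each journal."""
    n = len(cites)
    # Turn each row of citation counts into transition probabilities.
    probs = [[count / sum(row) for count in row] for row in cites]
    time_share = [1.0 / n] * n  # the reader starts equally likely to be anywhere
    for _ in range(steps):
        # One step of the walk: follow a random citation out of the current journal.
        time_share = [
            sum(time_share[i] * probs[i][j] for i in range(n))
            for j in range(n)
        ]
    return time_share


scores = stationary_distribution(CITES)
for name, score in zip(JOURNALS, scores):
    print(f"{name}: {score:.3f}")
```

Journals that are cited often by other well-visited journals end up with the largest share, which is the sense in which the walk measures importance rather than just counting citations.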

One issue: in an attempt to include more material from the social sciences, the eigenfactor dataset includes articles from newspapers and popular magazines. As newspaper articles don't typically have reference lists attached, I'm not sure how they are incorporated into the network, but in any case don't they skew eigenfactors towards those journals that have the best press releases? Could I start a Journal of Sensational Medicine, publish pseudo-scientific quackery, be spurned by academia and still have a high eigenfactor simply because I feed the London Lite headlines?

(via Three Toed Sloth)


Tuesday, March 20, 2007

Dapper

So Andrew has been talking about Dapper and Pedro about kapow, both screen scraping (sort of) services that let you extract data from websites. It's interesting and potentially useful stuff.

Here's my Dapped contribution: an RSS feed of advance access articles from Bioinformatics and NAR. I tried putting Nature advance publications in there but shamefully the nature.com markup isn't up to scratch or something and Dapper refuses to pull out the correct pieces of data consistently. Neh.

There's a little bit of custom PHP code involved to create links for each paper from the doi, to extract publication dates and to merge the two dap RSS feeds - this seems like the kind of thing Pipes was designed for but unfortunately it's not quite flexible enough yet.
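For the curious, the glue logic is roughly this. It's a Python sketch rather than the PHP I actually used, and the item fields, titles and DOIs below are placeholders rather than real feed entries: each item gets a link built from its DOI via the public resolver, and the two feeds are merged newest first.

```python
from datetime import date

DOI_RESOLVER = "https://doi.org/"  # public DOI resolver


def doi_link(doi):
    """Build a resolvable link from a bare DOI string."""
    return DOI_RESOLVER + doi


def merge_feeds(*feeds):
    """Combine several lists of items into one list, newest first."""
    items = [
        dict(item, link=doi_link(item["doi"]))  # copy each item and add its link
        for feed in feeds
        for item in feed
    ]
    return sorted(items, key=lambda item: item["date"], reverse=True)


# Placeholder entries standing in for the two dap feeds (not real papers).
bioinformatics = [
    {"title": "Paper one", "doi": "10.0000/example.1", "date": date(2007, 3, 18)},
    {"title": "Paper two", "doi": "10.0000/example.2", "date": date(2007, 3, 15)},
]
nar = [
    {"title": "Paper three", "doi": "10.0000/example.3", "date": date(2007, 3, 17)},
]

merged = merge_feeds(bioinformatics, nar)
for item in merged:
    print(item["date"].isoformat(), item["title"], item["link"])
```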

I made the original bioinformatics and NAR daps public if you want to tweak them.


Monday, March 12, 2007

Software availability: a quick survey of OUP Bioinformatics

When writing Friday's post about the Nature Methods 'software availability' editorial I spent some time trawling through Nodalpoint's archives looking for comments about defunct software distributions to serve as anecdotal evidence. Broken links to resources seem like a problem that many people have encountered.

I figured that I'd do some empirical research and check out all of the Application Notes published in the March issues of Bioinformatics from the past four years.

Some "this study isn't very scientific" disclaimers: It's not a huge dataset. I'm lumping databases, software and web services together to talk about 'resources' in general. There's only one resource per paper, and it's whatever is referred to in the abstract 'availability' section. I started off going through every paper in each issue to see if they mentioned resources but it rapidly became tiresome and so for 2005, 2004 and 2003 I just looked at the Application Notes.

So on to the results - the raw data is at the end of the post, but briefly:

  • 12% of resources from the March 2006 issues are no longer available.
  • 17% of the resources from 2005 and 2004 are no longer available.
  • 11% of the resources from 2003 are no longer available.
  • Only one of the resources I looked at was hosted on SourceForge. It's still available.
  • Many, many resources were hosted in home directories (i.e. whatever.edu/~username/ ).
  • A couple of resources that were available 'upon request' made clear that they were free for non-profit use only - is holding the software back a way of screening potential customers?


Two other things I noticed:

  • OUP Bioinformatics used to have lots of original research and now it's all applications and databases (not necessarily a bad thing, I'm just saying. Neil has mentioned this before, too)
  • People writing bioinformatics web services love frames. Stop using frames, please.


Perhaps a compromise between making software open source and keeping it locked up until you / your technology transfer officer can become fantastically rich by selling it to big pharma is to upload a tarball of the software executable (that runs on a reference platform: Windows, OS X, Linux?) and some documentation to, say, WebCite? No mailing lists, CVS access or anything fancy are necessary, after all: just a permanent snapshot of the software that you used to write your paper.

Anyway, the raw data:

March 2006

27 resources
3 available on request (11%)
3 unavailable (of all resources: 11% / of freely available resources: 12%)
1 in SourceForge

March 2005

33 resources
4 available on request (12%)
5 unavailable (15% / 17%)
1 unavailable site redirects to an ad-filled domain parking page, how rude.

March 2004

29 resources
all freely available (i.e. not 'on request')
5 unavailable (17% / 17%)

March 2003

22 resources
5 available on request (22%)
2 unavailable (9% / 11%)
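If you want to check the arithmetic, here's a small Python sketch that recomputes the percentages from the counts above. It assumes that 'freely available' means the total minus the resources that were only available on request.

```python
# (year, total resources, available on request, unavailable), from the raw data.
SURVEY = [
    (2006, 27, 3, 3),
    (2005, 33, 4, 5),
    (2004, 29, 0, 5),
    (2003, 22, 5, 2),
]


def percentages(total, on_request, unavailable):
    """Return (% on request, % unavailable of all, % unavailable of freely available)."""
    freely_available = total - on_request  # assumption: everything not 'on request'
    return (
        100 * on_request / total,
        100 * unavailable / total,
        100 * unavailable / freely_available,
    )


for year, total, on_request, unavailable in SURVEY:
    req, of_all, of_free = percentages(total, on_request, unavailable)
    print(
        f"March {year}: {req:.1f}% on request, "
        f"{of_all:.1f}% unavailable (all) / {of_free:.1f}% (freely available)"
    )
```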


Friday, March 09, 2007

Nature Methods on software availability

Nature Methods has a new editorial clarifying its position on making the software used in papers available to readers (about time a journal did this):
The minimum level of disclosure that Nature Methods requires depends on how central the software is to the paper. If a software program is the focus of the report, we expect the programming code to be made available. Without the code, the software—and thus the paper—would become a black box of little use to the scientific community. In many papers, however, the software is only an ancillary part of the method, and the focus is on the methodological approach or an insight gained from it.

In these cases, releasing the code may not be a requirement for publication, but such custom-developed software will often be as important for the replication of the procedure as plasmids or mutant cell lines. We therefore insist that software or algorithms be made available to readers in a usable form. The guiding principle is that enough information must be provided so that users can reproduce the procedure and use the method in their own research at reasonable cost—both monetary and in terms of labor.
I think it's quite a well thought out piece. The editors recognize, for instance, that some short programs and algorithms are better made available as pseudocode (well, they say 'a small set of equations', but I know which one I'd prefer).

I'm not sure it goes far enough, though. For example: if the software runs as a web service, is making that service public enough to satisfy the journal's requirements? Can you host any code releases on your own server?

The problem with answering either of those questions with a 'yes' is that there's no guarantee that the software is still going to be available after a year or two (something most bioinformaticians are acutely aware of): postdocs and grad students move on, server accounts (and labs) get closed, bugs crop up and there's nobody willing to fix them, websites get redeveloped... etc.

What happens when we read an older paper, the software isn't around any more and we report it to an editor?

When we ask authors to make sequences available we require them to be deposited in GenBank. Should we require software authors to deposit their code on SourceForge, Google Code or some other (more) permanent repository (in which case, what about executable-only software or software that has a restrictive licence)?

There are open comment threads at both Methagora - the Nature Methods blog - and Nautilus, which covers the whole spread of Nature journals. I urge you to go forth and help shape journal policy (perhaps).


Friday, March 02, 2007

Prioritizing candidate genes with CAESAR

Last year I posted about disease gene prediction - using computational methods to prioritize candidate genes for further (human) study. It's a relatively busy field: there are half a dozen systems out there that can all help narrow down large lists of genes with varying degrees of success.

This week in Bioinformatics Advance Access there's a paper by Kyle Gaulton (watch out, PDF) from the Mohlke lab at UNC describing their new system, called CAESAR (nice name, which is a good start).

CAESAR is remarkably cool. Here's how it works:
  1. You give it a text corpus to work from - some review articles about the disease that you're interested in or an OMIM entry, for example
  2. It extracts all of the gene symbols from that corpus
  3. Again using the corpus it finds relevant terms from the Gene Ontology, eVOC and MGD's ontology of mammalian phenotypes
  4. It expands the set of genes from (2) by looking for interaction partners in BIND and KEGG and similar proteins in iPro, then uses the ontology terms from (3) to find relevant mouse knockouts, genes that have known associations with similar phenotypes and genes that are expressed in the same tissues
  5. It combines the resulting large sets of genes and ranks them mathemagically to produce the final ranked list.
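To make the shape of that pipeline concrete, here's a toy Python sketch of steps 2, 4 and 5 only. It is emphatically not CAESAR's actual scoring, which I haven't described here: the gene symbols, the interaction lookup and the half-weight given to interaction partners are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical gene symbols and interaction partners, standing in for real databases.
KNOWN_SYMBOLS = {"GENEA", "GENEB", "GENEC", "GENED"}
INTERACTIONS = {"GENEA": {"GENEC"}, "GENEB": {"GENEC", "GENED"}}


def extract_genes(corpus):
    """Step 2: count mentions of known gene symbols in the corpus."""
    tokens = re.findall(r"[A-Za-z0-9]+", corpus)
    return Counter(token for token in tokens if token in KNOWN_SYMBOLS)


def rank_candidates(corpus):
    """Steps 4 and 5: expand the seed genes via partners, then rank by score."""
    seeds = extract_genes(corpus)
    scores = Counter(seeds)
    for gene, mentions in seeds.items():
        for partner in INTERACTIONS.get(gene, ()):
            # Made-up weighting: a partner inherits half of the seed's mentions.
            scores[partner] += 0.5 * mentions
    return [gene for gene, _ in scores.most_common()]


corpus = "GENEA and GENEB are implicated in the trait; GENEA was later replicated."
ranking = rank_candidates(corpus)
print(ranking)
```

Note that GENEC never appears in the corpus but still outranks GENEB, because it picks up evidence from two seed genes. That's the appeal of the approach, and also a hint at why heavily annotated genes tend to float to the top.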
Anyway, I was impressed. I really like the basic idea:
[it] relies on human expert knowledge in order to function effectively, but it does not require that the user actually possess all of this knowledge.
CAESAR is not without issues. In particular there's a bias towards genes that are more heavily annotated - the manuscript points out that the mean number of GO terms for genes ranked in the top 98th percentile of their test sets was significantly higher than the number of terms for all genes.

Despite some cheeky use of misleading language in the results section ("we addressed this potential bias" means "we proved that the bias wasn't potential at all but real, then moved swiftly on" rather than "we addressed the problem and fixed it") there's not really any discussion of how future systems could avoid the same issue.

The worst side-effect of relying on annotation is that only 15,000 human genes (~ 50%?) have enough quality annotation from different sources to do anything with at all. This percentage will increase over time, but until then there must be other sources of data that we can use (Lude Frank left a comment about this on last year's post).

There's also a potential issue with the way that CAESAR was tested using a set of genes already known to be involved in a complex trait. Gaulton et al. cleaned up the corpus for each test gene by removing any direct references to it and restricting the papers included to those published before the year of association, but might bias not remain in places like BIND, KEGG and iPro as a result of subsequent gene-driven research into the trait's etiology?

You'd expect, for example, that once a new gene was implicated in a disease somebody somewhere would immediately check to see if it interacts with the other candidate genes for that disease (mentioned in the literature corpus used during testing) - placing the results into BIND. OK, it's a bit of a weak correlation, but still...

Anyway, all that aside it's a nice piece of software (and freely available!). I'd be interested to hear if CAESAR is going to be developed any further.


Thursday, March 01, 2007

Leave Nature HQ, walk left, see...


Cheeky monkeys. Can't help but feel that the money would've been better spent elsewhere, though. Yes, I know that it's an ape not a monkey.


Publish or Perish

Publish or Perish is a Windows app that generates your h-index (amongst other metrics) for you, based on citation data from Google Scholar. No NSPNAS, yet, unfortunately.

Personally I think that the simple number that makes up an h-index is a little dry. Besides, people who've never heard of it before don't have a frame of reference. What's the scale? Does it go from low to high or the other way round?
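For anyone without that frame of reference: your h-index is the largest number h such that h of your papers have at least h citations each, so higher is better. A quick Python sketch, with made-up citation counts:

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most cited paper first
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Five hypothetical papers: four of them have at least four citations each.
print(h_index([10, 8, 5, 4, 3]))  # prints 4
```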

More to the point, "I have an h-index of 30" won't impress the opposite sex. No, for that you need D&D references (hotties dig D&D). How about "I'm a level 30 biobarbarian"? Now we're talking. Behold my +1 Pipettes of Power. When you collect enough citations you level up. I'm a level 1 gnome, myself.

Go on, it'll look much better on your grant application.



