Flags and Lollipops

Wednesday, February 15, 2006

Postgenomic

I've been pretty quiet over the last week or so because... well, because I just got an Xbox 360 as an anniversary present (King Kong rocks). My talk of putting off any new video-game-related purchases for a while in favour of Flash-based biology games made Mrs Stew take pity on me, I think.

But it's also (and mainly) because I've been working on a new site at postgenomic.com.

Postgenomic aggregates the feeds from life science blogs in order to do useful and interesting things with them. It's kind of like Technorati crossed with a really big hot papers meeting.

Its main uses - hopefully - are to:

  • List the current top life science news stories and the hottest recent papers (or the papers most often cited by bloggers, anyway)
  • Store and index reviews of papers
  • Store and collate reports from conferences
  • Help bloggers to share their expertise and, flipside of the same coin, to find useful papers on a given topic

It achieves these by collecting blog posts via RSS (or Atom 0.3, a process not without some bugs - if you've got any experience in writing parsers for this kind of thing then please, let me know) and then collecting the URLs from them. It then does some simple pattern matching to pick out "interesting" pages which are on life science related domains (publishers, science news outlets, institutions... that sort of thing).
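As a rough illustration of the link-collection step, here is a minimal Python sketch (the site itself is written in Perl, and this is not its code). The domain list and the names `LinkCollector`, `is_interesting` and `interesting_links` are hypothetical, made up for the example; the real site's list of life science domains isn't given here.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical whitelist for illustration only; not the site's real list.
INTERESTING_DOMAINS = {"nature.com", "sciencemag.org", "plos.org", "nih.gov"}


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in a blog post body."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def is_interesting(url):
    """True if the URL's host is a listed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in INTERESTING_DOMAINS)


def interesting_links(post_html):
    """Return the links in a post that point at whitelisted domains."""
    collector = LinkCollector()
    collector.feed(post_html)
    return [url for url in collector.links if is_interesting(url)]
```

Matching on the parsed hostname rather than on the raw URL string avoids false hits such as a whitelisted domain name appearing in a query string or as the tail of an unrelated domain.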

Interesting pages are retrieved and some simple heuristics are used to try and work out if they're papers, news stories or something else. This basically consists of looking for PubMed IDs (PMIDs) and Digital Object Identifiers (DOIs) in the body of the page. If a DOI or PMID is found, PubMed is used to retrieve title, journal, abstract and author details.
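The identifier hunt might look something like the sketch below. Both regular expressions are assumptions on my part rather than the patterns the site uses: the DOI one expects the usual "10.", registrant code, slash, suffix shape, and the PMID one only catches IDs written out as "PMID: 12345678". The two-way paper/other split is also a simplification of the real heuristics, and the PubMed lookup is left out.

```python
import re

# Assumed pattern: "10." + 4-9 digit registrant code + "/" + suffix.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
# Assumed pattern: PMIDs written as "PMID: 12345678" or "PMID 12345678".
PMID_RE = re.compile(r'PMID:?\s*(\d{1,8})', re.IGNORECASE)


def find_identifiers(page_text):
    """Return the distinct DOIs and PMIDs found in a page body."""
    # Strip trailing punctuation that the greedy DOI suffix match may swallow.
    dois = {match.rstrip(".,;)") for match in DOI_RE.findall(page_text)}
    pmids = set(PMID_RE.findall(page_text))
    return {"dois": sorted(dois), "pmids": sorted(pmids)}


def classify(page_text):
    """Very rough guess: a page carrying a DOI or PMID is treated as a paper."""
    ids = find_identifiers(page_text)
    return "paper" if ids["dois"] or ids["pmids"] else "other"
```

A page that mentions several papers will yield several identifiers, which is exactly the "identifying the correct DOI or PMID" problem: this sketch makes no attempt to pick the right one.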

By counting the number of posts that link to a particular URL or paper we can achieve the first point (and, incidentally, build up cool statistics like these. Well, cool in my opinion, anyway. Note how impact factors derived from citations in blog posts match "real life" impact factors pretty well).
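The counting itself is simple. A sketch, assuming posts arrive as (post ID, list of URLs) pairs and that a post linking to the same URL twice should only count once (the function name `rank_links` and that input shape are invented for the example):

```python
from collections import Counter


def rank_links(posts):
    """Rank URLs by the number of distinct posts linking to them.

    posts: iterable of (post_id, list_of_urls) pairs.
    Returns a list of (url, post_count) pairs, most linked first.
    """
    counts = Counter()
    for _post_id, urls in posts:
        # set() so that repeated links within one post count only once
        counts.update(set(urls))
    return counts.most_common()
```

For papers you would count against the DOI or PMID rather than the raw URL, so that links to the abstract, the full text and the PDF of the same paper are pooled.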

To find reviews we look for some very simple semantic markup (if you can call it that): a rev="review" attribute in the anchor tag containing a link to the URL of the paper. Alternatively you can use the hReview microformat, enclosing the review text in a div class="hreview" and marking the URL of the paper with a class="url" (edit: Alf pointed out that it doesn't need to be a div; you can enclose the review in span class="hreview" or p class="hreview" or whatever else you like).

The index of reviews is looking pretty sparse at the moment, for obvious reasons. One limitation of Postgenomic is that, as it collects content from feeds, changes to archived posts aren't picked up, so it's no good just going back over your old posts and inserting the rev attribute in the relevant places unless your feed reflects this. I'm trying to think of ways round this. In the future it'd be an idea to look for optional, more detailed markup too, but this needs to be given more thought (what structured data does a review of a paper need to convey?).
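For anyone wondering what detecting that markup involves, here is a minimal Python sketch that handles both conventions. It isn't the site's code: `ReviewFinder` and `find_reviewed_urls` are names invented for the example, and the depth tracking assumes reasonably well-formed markup (an unclosed p inside the hreview element would throw it off).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ReviewFinder(HTMLParser):
    """Collect URLs marked as reviewed via rev="review" or hReview."""

    def __init__(self):
        super().__init__()
        self.reviewed_urls = []
        self._depth = 0  # nesting depth inside an hreview element; 0 if outside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        rev = (attrs.get("rev") or "").split()
        href = attrs.get("href")
        in_hreview = self._depth > 0
        if tag == "a" and href and (
            "review" in rev or (in_hreview and "url" in classes)
        ):
            self.reviewed_urls.append(href)
        # Start tracking at the hreview element, then follow every nested tag.
        if tag not in VOID_TAGS and (in_hreview or "hreview" in classes):
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth > 0 and tag not in VOID_TAGS:
            self._depth -= 1


def find_reviewed_urls(post_html):
    """Return the URLs a post marks as the subject of a review."""
    finder = ReviewFinder()
    finder.feed(post_html)
    return finder.reviewed_urls
```

Note that a class="url" link only counts when it sits inside an hreview element, whereas a rev="review" link counts anywhere in the post.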

Finding conference reports - and organizing them - is a different matter. I've no idea what the best way of going about this is: I'm hoping that others will have brilliant ideas and want to get involved. At the moment the site simply looks for keywords in post titles, which is far from ideal.

Hopefully, as the site develops and the database grows the fourth point can be accomplished by organizing the papers by topic (perhaps using MeSH terms, or keywords, or the Technorati tags from the posts containing links to them). If you're looking for papers on, say, Bayesian networks in molecular biology but don't know where to start then you could fire up your browser, click on the appropriate tag in the Postgenomic index and be presented with a list of relevant papers and the blog posts that talk about them.

The site is very much in beta: it has quite a few known limitations, which are listed in more detail on the "get involved" page there. Some of the more pressing issues include internationalization, problems with parsing RSS and Atom feeds in Perl, identifying the correct DOI or PMID from the HTML version of a paper, lack of a search function and the aforementioned questions about how to mark up conference report posts.

It's also probably quite slow, as my web host (Yahoo!) sucks at anything script heavy.

Bearing all that in mind, please try it out - your feedback, ideas and contributions would be very much appreciated (if you fancy improving the web interface or analysis pipeline, or munging the data in some new and useful way, let me know: it's open source, you can have the code and the database and I'll incorporate good changes into the site).

19 Comments:

At February 15, 2006 1:38 PM, Blogger Pedro Beltrão said...

Cool :) very nice work.

It would be very nice if the idea could be expanded to include the discussions on other things that are not papers (like memeorandum or tailrank) but it is more complicated since the program would have to guess that the posts are talking about the same thing to put them together.
Anyway, this will be great to generate and aggregate discussions on papers and meetings.

 
At February 15, 2006 3:41 PM, Blogger The Mad Scientist said...

Wow, this is a great site. Thanks for performing this service.

 
At February 15, 2006 4:03 PM, Anonymous alf said...

Nice site. I'd suggest using Mark Pilgrim's Universal Feed Parser (Python) to convert all the feeds to some normalised format, and to use the basic parts of hReview (wrap the review in <span="hreview"> and use <a class="url" on the link to the subject of the review). The output of the Structured Blogging plugin for reviewing journal articles should hopefully automatically include this markup soon.

 
At February 15, 2006 4:05 PM, Anonymous alf said...

That should have been: <div class="hreview">.

 
At February 15, 2006 8:24 PM, Anonymous fjossinet said...

Fantastic job Stew and very good idea. How should I export my tags in my rss channel to help u ?

 
At February 15, 2006 9:03 PM, Blogger Pierre said...

This is a really nice work! I really like your idea about the 'rel' attribute ! I will have a closer look at your site tomorrow ! A first suggestion: adding a link linking to citeulike or connotea ?

 
At February 15, 2006 9:54 PM, Anonymous Tobias said...

Great work, Stew - thanks a lot for your efforts! If more people are posting paper reviews/opinions to their blogs, postgenomic could turn into a kind of community-based "Faculty of 1000". Let's hope this will catch on...

 
At February 16, 2006 3:41 AM, Anonymous Neil said...

This is madness. I mean, you'll have to maintain this thing now. And you'll be inundated with requests for features and complaints when it breaks. What were you thinking?

Seriously, great work.

 
At February 16, 2006 3:57 AM, Blogger e3 said...

yes, very nice. very nice, indeed.

 
At February 16, 2006 5:37 AM, Anonymous Deepak said...

I am so glad that you have taken the lead on this. This site has the potential to become the memeorandum of the life science world. Good luck!!!!

 
At February 16, 2006 2:41 PM, Anonymous Enro said...

Great work!! But still, you have all these nice RSS icons but no way to actually syndicate the latest paper reviews, meeting reports, top links or, why not, the aggregated RSS feeds for a given tag or all blogs... It would be nice if you could (would have time to?) implement this! But again, kudos...

 
At February 17, 2006 2:06 AM, Blogger Greg Tyrelle said...

Let me add my voice to the chorus line: "Very cool, and great work". I've posted some more thoughts on the forums.

 
At February 18, 2006 2:51 AM, Anonymous Mauricio said...

Bravo Stew!! You've done a great job!

I agree with Enro, it would be fantastic if we could syndicate the latest paper reviews, meeting reports, top links, etc...

Keep the nice work! :)

 
At February 20, 2006 12:57 AM, Blogger The Bioinformatics Blog said...

This is a really great idea. I've started putting rev='review' and Technorati links on all my article reviews.
I can't wait till this comes to critical mass!

 
At February 25, 2006 8:36 AM, Blogger Sandra said...

Awesome!

 
At April 23, 2006 12:31 AM, Blogger Bill Hooker said...

Trackback.

 
At September 16, 2006 11:50 PM, Anonymous Anonymous said...

Well done!
[url=http://kdzdusbx.com/juvm/vrav.html]My homepage[/url] | [url=http://ffgjtpxw.com/wrwr/daqp.html]Cool site[/url]

 
At September 16, 2006 11:50 PM, Anonymous Anonymous said...

Thank you!
My homepage | Please visit

 
At September 16, 2006 11:50 PM, Anonymous Anonymous said...

Good design!
http://kdzdusbx.com/juvm/vrav.html | http://vbneaxtt.com/uayx/ycsw.html

 
