Postgenomic
I've been pretty quiet over the last week or so because... well, because I just got an Xbox 360 as an anniversary present (King Kong rocks). My talk of putting off any new video game related purchases for a while in favour of flash based biology games made Mrs Stew take pity on me, I think. But it's also (and mainly) because I've been working on a new site at postgenomic.com.
Postgenomic aggregates the feeds from life science blogs in order to do useful and interesting things with them. It's kind of like Technorati crossed with a really big hot papers meeting.
Its main uses - hopefully - are to:
- List the current top life science news stories and the hottest recent papers (or the papers most often cited by bloggers, anyway)
- Store and index reviews of papers
- Store and collate reports from conferences
- Help bloggers to share their expertise and, flipside of the same coin, to find useful papers on a given topic
It achieves these by collecting blog posts via RSS (or Atom 0.3, a process not without some bugs - if you've got any experience in writing parsers for this kind of thing then please, let me know) and then collecting the URLs from them. It then does some simple pattern matching to pick out "interesting" pages which are on life science related domains (publishers, science news outlets, institutions... that sort of thing).
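To make the link collection and pattern matching step a bit more concrete, here's a rough Python sketch. It isn't the site's actual code: the `INTERESTING_DOMAINS` list and the function names are made up for illustration, and the real list of domains is much longer.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical whitelist: the real list of life science domains is not shown here.
INTERESTING_DOMAINS = {"nature.com", "sciencemag.org", "plos.org", "nih.gov"}


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a blog post body."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)


def is_interesting(url):
    """True if the URL's host is a listed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in INTERESTING_DOMAINS)


def interesting_links(post_html):
    """Return the links in a post that point at a listed domain."""
    collector = LinkCollector()
    collector.feed(post_html)
    return [u for u in collector.urls if is_interesting(u)]
```

Matching on the host name rather than the whole URL string means a link to notnature.com isn't mistaken for one to nature.com.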
Interesting pages are retrieved and some simple heuristics are used to try and work out if they're papers, news stories or something else. This basically consists of looking for PubMed IDs (PMIDs) and Digital Object Identifiers (DOIs) in the body of the page. If a DOI or PMID is found, PubMed is used to retrieve title, journal, abstract and author details.
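As an illustration of the identifier hunt, something like the following would do. The two regular expressions are my assumptions about what a DOI or a labelled PMID looks like in page text, not the patterns the site uses, and the sketch only separates "paper" from "other" (telling news stories apart is left out).

```python
import re

# Assumed patterns: a DOI starts with "10.", a registrant code, a slash and a suffix.
# PMIDs are only matched when labelled, to avoid picking up arbitrary numbers.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
PMID_RE = re.compile(r'\bPMID:?\s*(\d{1,8})\b', re.IGNORECASE)


def find_identifiers(page_text):
    """Return (dois, pmids) found in the text, in order of first appearance."""
    dois = []
    for match in DOI_RE.finditer(page_text):
        doi = match.group(0).rstrip(".,;)")  # drop trailing sentence punctuation
        if doi not in dois:
            dois.append(doi)
    pmids = []
    for match in PMID_RE.finditer(page_text):
        if match.group(1) not in pmids:
            pmids.append(match.group(1))
    return dois, pmids


def classify(page_text):
    """Very rough: a page carrying a DOI or PMID is treated as a paper."""
    dois, pmids = find_identifiers(page_text)
    return "paper" if dois or pmids else "other"
```

A page often carries several DOIs (in its reference list, say), so picking the right one is a separate problem that this sketch doesn't attempt.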
By counting the number of posts that link to a particular URL or paper we can achieve the first point (and, incidentally, build up cool statistics like these. Well, cool in my opinion, anyway. Note how impact factors derived from citations in blog posts match "real life" impact factors pretty well).
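The counting itself is simple. A minimal sketch, assuming posts arrive as (post id, list of URLs) pairs, which is a shape I've invented for the example:

```python
from collections import Counter


def rank_links(posts):
    """posts: iterable of (post_id, [urls]). Each URL counts once per post."""
    counts = Counter()
    for _post_id, urls in posts:
        counts.update(set(urls))  # set() so repeated links in one post count once
    return counts.most_common()
```

Counting each URL once per post stops a single post that links to the same paper five times from pushing it up the list.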
To find reviews we look for some very simple semantic markup (if you can call it that): a rev="review" attribute in the anchor tag containing a link to the URL of the paper. Alternatively you can use the hReview microformat, enclosing the review text in a div class="hreview" and marking the URL of the paper with a class="url" (edit: Alf pointed out that it doesn't need to be a div, you can enclose the review in span class="hreview" or p class="hreview" or whatever else you like). The index of reviews is looking pretty sparse at the moment, for obvious reasons (one limitation of Postgenomic is that as it collects content from feeds, changes to archived posts aren't picked up, so it's no good just going back over your old posts and inserting the rev attribute in the relevant places, unless your feed reflects this. I'm trying to think of ways round this). In the future it'd be an idea to look for optional, more detailed markup too, but this needs to be given more thought (what structured data does a review of a paper need to convey?)
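Here's a sketch of how the rev="review" check could be done in Python. The class and function names are hypothetical, and it only covers the rev attribute, not hReview.

```python
from html.parser import HTMLParser


class ReviewLinkFinder(HTMLParser):
    """Collects hrefs of <a> tags whose rev attribute includes "review"."""

    def __init__(self):
        super().__init__()
        self.reviewed_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # rev can hold several space-separated values, so split before checking
        rev_values = (attr.get("rev") or "").lower().split()
        if "review" in rev_values and attr.get("href"):
            self.reviewed_urls.append(attr["href"])


def find_reviewed_urls(post_html):
    """Return the URLs a post marks as being reviewed."""
    finder = ReviewLinkFinder()
    finder.feed(post_html)
    return finder.reviewed_urls
```

So a post containing `<a rev="review" href="http://example.org/paper">this paper</a>` would be indexed as a review of that URL, while an ordinary link in the same post would be ignored.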
Finding conference reports - and organizing them - is a different matter. I've no idea what the best way of going about this is: I'm hoping that others will have brilliant ideas and want to get involved. At the moment the site simply looks for keywords in post titles, which is far from ideal.
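To show why title keywords are far from ideal, here's a minimal sketch of that kind of matching. The keyword list is invented for the example and isn't the one the site uses.

```python
import re

# Hypothetical keyword list, for illustration only.
CONFERENCE_KEYWORDS = ("conference", "meeting", "symposium", "workshop")

_KEYWORD_RE = re.compile(
    r"\b(" + "|".join(CONFERENCE_KEYWORDS) + r")\b", re.IGNORECASE
)


def looks_like_conference_report(title):
    """True if the post title contains any of the keywords as a whole word."""
    return _KEYWORD_RE.search(title) is not None
```

A title like "Day two at the symposium" matches, but so does "Meeting my new lab mates", which is exactly the sort of false positive that makes this approach unsatisfying.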
Hopefully, as the site develops and the database grows the fourth point can be accomplished by organizing the papers by topic (perhaps using MeSH terms, or keywords, or the Technorati tags from the posts containing links to them). If you're looking for papers on, say, Bayesian networks in molecular biology but don't know where to start then you could fire up your browser, click on the appropriate tag in the Postgenomic index and be presented with a list of relevant papers and the blog posts that talk about them.
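One way the tag index might be put together, sketched in Python. The post dictionaries with 'url', 'tags' and 'papers' keys are an assumed shape, not the site's database schema.

```python
from collections import defaultdict


def build_tag_index(posts):
    """Map tag -> paper -> list of post URLs that link to that paper.

    posts: iterable of dicts with 'url', 'tags' and 'papers' keys (assumed shape).
    """
    index = defaultdict(lambda: defaultdict(list))
    for post in posts:
        for tag in post["tags"]:
            for paper in post["papers"]:
                # lower-case tags so "Bayesian networks" and "bayesian networks" merge
                index[tag.lower()][paper].append(post["url"])
    return index
```

Looking up `index["bayesian networks"]` would then give the papers filed under that tag, each with the posts that talk about it.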
The site is very much in beta: it has quite a few known limitations, which are listed in more detail on the "get involved" page there. Some of the more pressing issues include internationalization, problems with parsing RSS and Atom feeds in Perl, identifying the correct DOI or PMID from the HTML version of a paper, lack of a search function and the aforementioned questions about how to mark up conference report posts.
It's also probably quite slow, as my web host (Yahoo!) sucks at anything script heavy.
Bearing all that in mind, please try it out - your feedback, ideas and contributions would be very much appreciated (if you fancy improving the web interface or analysis pipeline, or munging the data in some new and useful way, let me know: it's open source, you can have the code and the database and I'll incorporate good changes into the site).
