Flags and Lollipops

Sunday, February 26, 2006

Postgenomic update

Sorry I haven't been writing any general bioinformatics posts recently...

I've made quite a few changes to the code for Postgenomic over the last week or so - to the extent that I feel it's relatively stable, though still very much in beta. I've also been able to document the code a little (domain is a bit flaky; hit reload if you get a "page not found" error) and set up a Subversion repository on SourceForge.

You can check out the code with Subversion using

svn co https://svn.sourceforge.net/svnroot/postgenomic postgenomic

The directory structure that results is explained in marginally more detail in the docs. Basically, there are three main parts to Postgenomic: the database, the pipeline (blogs => parsed data) and the web interface. The pipeline and web interface are independent of one another, so if you want to make any interface improvements (or speed it up) you can do that without getting mixed up in Perl and Python.

Vague future plans include prettifying the pipeline output, implementing a REST API for browser extensions / other web apps and picking out gene and protein names from the text.

Along the latter lines, Egon is doing some great work adding support for InChIs to the code.

There're more development bits and pieces at postgenomic.org (see flaky domain notice above) including mailing list details (postgenomic.org will also always have a dump of the latest database at this link, if you just want to get your hands on the raw data). From now on I'll post Postgenomic announcements and updates there rather than on F&L.

Well, most of the time.


Thursday, February 16, 2006

Postgenomic comments

Thanks for the comments on Postgenomic. I'm gradually fixing all the bugs and adding new feeds, etc.

If you've got a blog, then the best things to do (in terms of Postgenomic, of course - you might have reason to ignore some or all of the points below) are:
  1. Make sure that your feed contains complete posts, not just excerpts
  2. Make sure that your feed isn't stripped of all tags (this seems to be the case with feeds from some Movable Type installations)
  3. When linking to papers, link to the HTML abstract or fulltext rather than the PDF
  4. List tags in post bodies so that they appear in the feed
  5. If your post is about a paper, mark the anchor tag that links to the paper with a "rev=review" attribute (i.e. [a href="whatever" rev="review"]some paper[/a])
  6. Alternatively, use structured blogging (see Alf's comments on the last post) when writing reviews of papers

Implementing the RSS feeds for top papers, etc. is what I'm working on next. The suggestion to provide tag or blog collection feeds is a good one, thanks!

Linking in to Connotea / Citeulike / Hubmed is another good suggestion... not quite sure how to go about it yet, though.


Wednesday, February 15, 2006

Postgenomic

I've been pretty quiet over the last week or so because... well, because I just got an Xbox 360 as an anniversary present (King Kong rocks). My talk of putting off any new video game related purchases for a while in favour of flash based biology games made Mrs Stew take pity on me, I think.

But it's also (and mainly) because I've been working on a new site at postgenomic.com.

Postgenomic aggregates the feeds from life science blogs in order to do useful and interesting things with them. It's kind of like Technorati crossed with a really big hot papers meeting.

Its main uses - hopefully - are to:

  • List the current top life science news stories and the hottest recent papers (or the papers most often cited by bloggers, anyway)
  • Store and index reviews of papers
  • Store and collate reports from conferences
  • Help bloggers to share their expertise and, flipside of the same coin, to find useful papers on a given topic

It achieves these by collecting blog posts via RSS (or Atom 0.3, a process not without some bugs - if you've got any experience in writing parsers for this kind of thing then please, let me know) and then collecting the URLs from them. It then does some simple pattern matching to pick out "interesting" pages which are on life science related domains (publishers, science news outlets, institutions... that sort of thing).
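As a rough illustration of the link collection and domain matching step, here's a minimal Python sketch. It isn't the Postgenomic code: the domain list, the LinkCollector class and the function names are all made up for the example, and it assumes the post body has already been pulled out of the feed as HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical whitelist: the post doesn't say which domains the site uses.
INTERESTING_DOMAINS = {"plos.org", "nature.com", "biomedcentral.com"}


class LinkCollector(HTMLParser):
    """Collect the href of every anchor tag in a chunk of post HTML."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)


def is_interesting(url, domains=INTERESTING_DOMAINS):
    """True if the URL's host is a listed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


def interesting_links(post_html):
    """Return the links in a post that point at 'interesting' domains."""
    collector = LinkCollector()
    collector.feed(post_html)
    return [u for u in collector.urls if is_interesting(u)]


post = (
    '<p>See <a href="http://biology.plos.org/some/paper">this</a> '
    'and <a href="http://example.com/cat">that</a>.</p>'
)
print(interesting_links(post))
```

Matching on the hostname rather than the whole URL means a link that merely mentions a publisher in its path or query string doesn't slip through.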

Interesting pages are retrieved and some simple heuristics are used to try and work out if they're papers, news stories or something else. This basically consists of looking for PubMed IDs (PMIDs) and Digital Object Identifiers (DOIs) in the body of the page. If a DOI or PMID is found, PubMed is used to retrieve title, journal, abstract and author details.
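The identifier hunt could look something like the sketch below. The regular expressions are my own loose guesses at what DOIs and PMIDs look like in page text (they're assumptions, not the patterns the site uses, and they will miss or mis-match some real pages), and the identifiers in the sample string are invented.

```python
import re

# Loose patterns, assumed for illustration only.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
PMID_RE = re.compile(r'\bPMID:?\s*(\d{1,8})\b', re.IGNORECASE)


def find_identifiers(page_text):
    """Return any DOI-like and PMID-like strings found in the page text."""
    # Trailing punctuation is usually sentence punctuation, not part of the DOI.
    dois = [d.rstrip(".,;)") for d in DOI_RE.findall(page_text)]
    pmids = PMID_RE.findall(page_text)
    return {"dois": dois, "pmids": pmids}


def classify(page_text):
    """Crude heuristic: a page carrying a DOI or PMID is treated as a paper."""
    ids = find_identifiers(page_text)
    return "paper" if ids["dois"] or ids["pmids"] else "other"


# Made-up identifiers, not a real paper.
sample = "Cite as doi:10.1234/example.5678. PMID: 12345678"
print(find_identifiers(sample))
print(classify("Just a news story with no identifiers."))
```

A page with several DOIs on it (a reference list, say) is exactly where picking "the correct" one gets hard; this sketch makes no attempt at that.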

By counting the number of posts that link to a particular URL or paper we can achieve the first point (and, incidentally, build up cool statistics like these. Well, cool in my opinion, anyway. Note how impact factors derived from citations in blog posts match "real life" impact factors pretty well).
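The counting itself is simple enough to sketch in a few lines of Python. The input shape (a list of post id / URL list pairs) and the rank_links name are assumptions for the example; the one design choice worth noting is that each post counts at most once per URL, so a post that links to the same paper three times doesn't inflate its score.

```python
from collections import Counter


def rank_links(posts):
    """posts: iterable of (post_id, [urls]). Returns (url, n_posts) pairs, most linked first."""
    counts = Counter()
    for _post_id, urls in posts:
        counts.update(set(urls))  # de-duplicate within a single post
    return counts.most_common()


# Toy data: the post ids and URLs are placeholders.
posts = [
    ("post-a", ["u1", "u2", "u1"]),
    ("post-b", ["u1"]),
    ("post-c", ["u2", "u3"]),
    ("post-d", ["u1"]),
]
print(rank_links(posts))
```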

To find reviews we look for some very simple semantic markup (if you can call it that): a rev="review" attribute in the anchor tag containing a link to the URL of the paper. Alternatively you can use the hReview microformat, enclosing the review text in a div class="hreview" and marking the URL of the paper with a class="url" (edit: Alf pointed out that it doesn't need to be a div, you can enclose the review in span class="hreview" or p class="hreview" or whatever else you like).

The index of reviews is looking pretty sparse at the moment, for obvious reasons. One limitation of Postgenomic is that, as it collects content from feeds, changes to archived posts aren't picked up, so it's no good just going back over your old posts and inserting the rev attribute in the relevant places unless your feed reflects this. I'm trying to think of ways round this. In the future it'd be an idea to look for optional, more detailed markup too, but this needs to be given more thought (what structured data does a review of a paper need to convey?).
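For the rev="review" case, a minimal sketch of the lookup in Python might be the following. The ReviewLinkFinder class is hypothetical (it isn't taken from the Postgenomic pipeline) and it only handles the anchor attribute, not hReview.

```python
from html.parser import HTMLParser


class ReviewLinkFinder(HTMLParser):
    """Collect hrefs of anchors marked with rev="review"."""

    def __init__(self):
        super().__init__()
        self.review_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # rev may hold several space-separated values, so split before checking.
        rev_values = (attrs.get("rev") or "").lower().split()
        if "review" in rev_values and attrs.get("href"):
            self.review_urls.append(attrs["href"])


html = (
    '<a href="http://example.org/paper" rev="review">some paper</a> '
    '<a href="http://example.org/other">something else</a>'
)
finder = ReviewLinkFinder()
finder.feed(html)
print(finder.review_urls)
```

This only works if the feed keeps the attribute intact, which is why feeds stripped of their tags are a problem.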

Finding conference reports - and organizing them - is a different matter. I've no idea what the best way of going about this is: I'm hoping that others will have brilliant ideas and want to get involved. At the moment the site simply looks for keywords in post titles, which is far from ideal.

Hopefully, as the site develops and the database grows the fourth point can be accomplished by organizing the papers by topic (perhaps using MeSH terms, or keywords, or the Technorati tags from the posts containing links to them). If you're looking for papers on, say, Bayesian networks in molecular biology but don't know where to start then you could fire up your browser, click on the appropriate tag in the Postgenomic index and be presented with a list of relevant papers and the blog posts that talk about them.

The site is very much in beta: it has quite a few known limitations, which are listed in more detail on the "get involved" page there. Some of the more pressing issues include internationalization, problems with parsing RSS and Atom feeds in Perl, identifying the correct DOI or PMID from the HTML version of a paper, lack of a search function and the aforementioned questions about how to mark up conference report posts.

It's also probably quite slow, as my web host (Yahoo!) sucks at anything script heavy.

Bearing all that in mind, please try it out - your feedback, ideas and contributions would be very much appreciated (if you fancy improving the web interface or analysis pipeline, or munging the data in some new and useful way, let me know: it's open source, you can have the code and the database and I'll incorporate good changes into the site).


Tuesday, February 14, 2006

REPRINT: Detecting linear motifs in interaction networks

(this is an older post, reprinted so that it'll appear in today's feed: I'll explain why in my next new post)

There's an interesting paper in November's PLoS Biology by Neduva et al., about finding short linear motifs using protein interaction networks.

Many aspects of cell signalling, trafficking, and targeting are governed by interactions between globular protein domains and short peptide segments. These domains often bind multiple peptides that share a common sequence pattern, or “linear motif” (e.g., SH3 binding to PxxP). Many domains are known, though comparatively few linear motifs have been discovered. Their short length (three to eight residues), and the fact that they often reside in disordered regions in proteins makes them difficult to detect through sequence comparison or experiment.

The idea is that for each protein in an interaction network you take its interactors, remove the parts of each that are unlikely to contain linear motifs (like globular domains, coiled coils and signal peptides) and then search the remaining peptide sequences for overrepresented motifs, compared to a control set of 15,000 proteins selected at random from SWISSPROT. The motifs are then ranked according to their p-value, which represents how unlikely the motif is to be so frequently observed in so few proteins.
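To give a feel for what "how unlikely" means here, the sketch below ranks candidate motifs by a plain binomial tail probability: the chance of seeing a motif in at least k of n interactors if each one carried it at the background rate seen in the control set. This is a stand-in I've picked for illustration, not the scoring scheme from the paper, and the motif names and counts are invented.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


CONTROL_SIZE = 15000  # size of the random control set mentioned above
N_INTERACTORS = 10    # hypothetical number of interactors for one protein

# Invented counts: motif -> (interactors containing it, control proteins containing it)
candidates = {
    "motif_A": (4, 150),   # rare in the control set, common among the interactors
    "motif_B": (2, 3000),  # common everywhere, so not very surprising
}


def p_value(motif):
    hits, control_hits = candidates[motif]
    background = control_hits / CONTROL_SIZE
    return binomial_tail(hits, N_INTERACTORS, background)


ranked = sorted(candidates, key=p_value)  # smallest p-value first
for motif in ranked:
    print(motif, p_value(motif))
```

A real analysis would also need to correct for the number of motifs tested, which this toy version ignores.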

Three of the previously uncharacterized linear motifs they found in Drosophila and yeast were tested in the lab, confirming two of them (doesn't seem like a big enough set to draw any conclusions from, but this is essentially an in-silico paper, after all).

The authors also used the same approach on sets of interacting proteins from the Eukaryotic Linear Motif database and found that often the curated linear motif from ELM was the same as the top ranking motif in their results.

While there isn't anything particularly exciting about the methodology here, it's interesting to see protein interaction networks being used for something other than protein classification or hand waving (about network architecture, evolutionary pressures, etc.).

I'm also surprised that nobody has done anything similar up until now. I remember a paper about globular domains being used to predict new protein interactors, but nothing the other way round...


Tuesday, February 07, 2006

Biomedical PDFs

Alf @ Hublog did a quick survey of the PDFs available from a variety of different biomedical publishers. He looked at things like authentication methods, whether or not simple metadata was included in the PDF document properties and what the default filename for downloaded PDFs was. The results are quite interesting.

None of the publishers did everything right, even though none of the things Alf was looking for are particularly difficult to implement. BioMedCentral came quite close, but then they're an internet based publisher, so perhaps you'd expect that.

I have an inherent hatred of all things PDF, mainly because my PC at work has a strange problem with embedding Acrobat in Firefox (it hangs for up to a minute whenever I click on a link that leads to a PDF). I've always preferred to just capture the relevant fulltext HTML with the ScrapBook extension, set to capture the appropriate depth of links.


Monday, February 06, 2006

Distributed text corpus tagging

I've been thinking about Amazon's Mechanical Turk (a scheme which gets humans to perform short, repetitive classification tasks that are easy but boring for them, but very difficult for computers) and about user driven annotation, as in the call for a gene function wiki (via Nodalpoint).

To build up a sufficiently large corpus for biomedical related natural language parsing tasks you could develop a freely available Firefox extension - a toolbar - that appears when it thinks that you're reading an abstract. The toolbar has buttons on it: buttons for different entities ("gene symbol", "gene product", "chemical", "cell line", "disease", "locus" ...) and relationships ("interacts with", "does not interact with", "belongs to", "involved in", "associated with").

If a user feels helpful then (s)he can highlight text in the abstract and then click the relevant button to tag it. The extension uses AJAX to call a central server in the background and to pass the current URL, the tag and the highlighted text (along with its position in the abstract, so that we can extract some context). Machine learning algorithms on the server get incrementally updated as new data comes in. Ideally these algorithms should eventually be able to tag new abstracts correctly (well, to an extent) by themselves.
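On the server side, the message from the extension might be modelled along these lines. The field names, the Annotation class and the context helper are assumptions for the sake of the example (no such server exists yet), and the gene names in the sample abstract are made up.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Annotation:
    url: str    # page the user was reading
    tag: str    # e.g. "gene symbol", "interacts with"
    text: str   # the highlighted text
    start: int  # character offset of the highlight within the abstract


def context(abstract, ann, window=10):
    """Return the text just before and just after the highlighted span."""
    end = ann.start + len(ann.text)
    if abstract[ann.start:end] != ann.text:
        raise ValueError("offset does not match the highlighted text")
    return abstract[max(0, ann.start - window):ann.start], abstract[end:end + window]


# Invented abstract and placeholder URL.
abstract = "We show that ABC1 interacts with XYZ2 in vivo."
ann = Annotation(
    url="http://example.org/abstract/1",
    tag="gene symbol",
    text="ABC1",
    start=abstract.index("ABC1"),
)

payload = json.dumps(asdict(ann))  # what the toolbar would send in the background
left, right = context(abstract, ann)
print(payload)
print(repr(left), repr(right))
```

Checking that the offset really does point at the highlighted text is a cheap guard against abstracts that have changed, or clients that count characters differently.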

If a user feels even more helpful then they can visit the server which is running an active learning algorithm of some sort. The algorithm provides abstracts which it doesn't think it can classify correctly: the user provides the correct answer and the algorithm learns from this. This is much more useful than highlighting the same old gene symbols again and again.
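One common way of choosing which abstracts to show is uncertainty sampling: ask about the items the classifier is least sure of. The sketch below assumes a binary classifier that outputs a probability for each abstract; the function name and the scores are invented, and it's only one of several possible active learning strategies.

```python
def most_uncertain(predictions, n=2):
    """predictions: {abstract_id: P(positive)}. Returns the n ids closest to 0.5."""
    ranked = sorted(predictions, key=lambda abstract_id: abs(predictions[abstract_id] - 0.5))
    return ranked[:n]


# Made-up classifier outputs.
predictions = {"abs1": 0.97, "abs2": 0.52, "abs3": 0.10, "abs4": 0.45}
print(most_uncertain(predictions))
```

Abstracts scored near 0 or 1 are the "same old gene symbols" case: the model already knows the answer, so a human label adds little.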

Any classifiers and data (including the raw markup from users) to come out of the project would, naturally, be freely available. As there is a relatively small set of possible classes, hopefully the weight of correct tags would outweigh the work of anybody deliberately sabotaging the system.
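A simple way of letting the correct tags win out is a majority vote per highlighted span, accepted only once enough people agree. The thresholds and the consensus function below are assumptions chosen for illustration, not a tested anti-sabotage scheme.

```python
from collections import Counter


def consensus(votes, min_votes=3, min_agreement=0.6):
    """Return the winning tag for one span, or None if there's no clear agreement yet."""
    if len(votes) < min_votes:
        return None
    tag, count = Counter(votes).most_common(1)[0]
    return tag if count / len(votes) >= min_agreement else None


print(consensus(["gene symbol"] * 4 + ["disease"]))
print(consensus(["gene symbol", "disease"]))
```

With only a handful of possible tags, a lone saboteur has to outvote everybody else on every span to do any real damage.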

Of course, ideally PubMed should require that authors provide the correct semantic markup in abstracts themselves. Even if that starts tomorrow, though, there's still a tremendous backlog of valuable data available.


Thursday, February 02, 2006

Popular science books

Sandra posted a comment a few days ago recommending The Genome War by James Shreeve (which sounds interesting; I'm going to pick it up the next time I'm in town). That got me thinking about good popular science books about genetics that I've read in the past few years.

Anyway, my list (linked through to Amazon*) would include:

  • Genome, by Matt Ridley : it's a bit of a gimmick, but Ridley organizes his book into 23 chapters, one for each pair of chromosomes. Each chapter then uses a single interesting gene (FOXP2, HD...) from that chromosome as a jumping off point to explore human genetics in more detail. There're many anecdotes, the science is solid and you can't fault Ridley's writing skills - it's a great read. Frankly I don't know why it's not used as a textbook in schools...
  • Mutants, by Armand Leroi : Leroi's book is about the extremes of human genetic variability (variation to the Elephant Man extent). He covers the basics of developmental biology before using famous historical cases of human mutants as a platform to delve deeper. Fascinating stuff. As a sidenote, Gene Expression has an interview with Leroi here.
  • The Origins of Virtue, by Matt Ridley (again) : How did cooperation and moral virtue develop during human evolution? Ridley tries to find some answers through experiments using game theory (which isn't as complicated as it sounds). It's another well written book from Ridley about an interesting topic.
  • DNA: The Secret of Life, by James Watson : Not autobiographical like his other books - there are no unwanted pregnancies, broken marriages or relentless chasing of younger women here. It does have a lot of personal anecdotes in it, naturally - Ewan Birney stayed at Watson's house during his gap year? - and it's that charm that sets Watson's book apart from the crowd. Covers the past, present and future (in Watson's opinion) of DNA - a good read for the lay person.
  • The Extended Phenotype, by Richard Dawkins : Dawkins' follow-up to The Selfish Gene, in which he expounds on what he calls the "extended phenotype" : the effect that a gene has upon the world. In Dawkins' view, there's no reason why phenotypes should stop at the skin or bark. An example he gives is beavers building a dam: a mutated gene which makes one of the beavers, say, build the dam a bit higher (an extended phenotype) might affect its survival just as much as a mutated gene which gives the beaver a slightly longer tail (a traditional phenotype).

Any other suggestions?

* I tried looking for an affiliate code that would send any clickthrough money to charity, but to no avail (if somebody is looking for an interesting web project, they could do worse than develop a clearinghouse for charity affiliate links). Therefore any profit from this post is going straight towards feeding my voracious donut habit.


