Flags and Lollipops

Monday, February 06, 2006

Distributed text corpus tagging

I've been thinking about Amazon's Mechanical Turk (a scheme which gets humans to perform short, repetitive classification tasks that are easy, if boring, for people but very difficult for computers) and about user-driven annotation, as in the call for a gene function wiki (via Nodalpoint).

To build up a sufficiently large corpus for biomedical natural language parsing tasks, you could develop a freely available Firefox extension - a toolbar - that appears when it thinks that you're reading an abstract. The toolbar has buttons for different entities ("gene symbol", "gene product", "chemical", "cell line", "disease", "locus" ...) and relationships ("interacts with", "does not interact with", "belongs to", "involved in", "associated with").

If a user feels helpful, they can highlight text in the abstract and then click the relevant button to tag it. The extension uses AJAX to call a central server in the background, passing the current URL, the tag and the highlighted text (along with its position in the abstract, so that we can extract some context). Machine learning algorithms on the server are updated incrementally as new data comes in. Ideally these algorithms should eventually be able to tag new abstracts correctly (well, to an extent) by themselves.
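As a rough sketch of what each background call might carry, and how the server could use the position to recover some context, something like the following would do. The field names, the example URL and the made-up abstract sentence are all assumptions for illustration, not part of any existing service.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class TagEvent:
    """One user annotation (all field names are hypothetical)."""
    url: str    # page on which the abstract was being read
    tag: str    # button clicked, e.g. "gene symbol" or "interacts with"
    text: str   # the highlighted text
    start: int  # character offset of the highlight within the abstract


def context_window(abstract: str, event: TagEvent, width: int = 30) -> str:
    """Return the highlight plus up to `width` characters on either side."""
    end = event.start + len(event.text)
    # Reject events whose offset does not point at the highlighted text.
    if abstract[event.start:end] != event.text:
        raise ValueError("offset does not match highlighted text")
    return abstract[max(0, event.start - width):min(len(abstract), end + width)]


# Made-up abstract sentence, used only to exercise the functions.
abstract = "We show that BRCA1 interacts with RAD51 in human cells."
event = TagEvent(
    url="https://example.org/abstract/1",  # placeholder URL
    tag="gene symbol",
    text="BRCA1",
    start=abstract.index("BRCA1"),
)
payload = json.dumps(asdict(event))  # body of the background request
```

Sending the offset as well as the text means the server can tell apart two occurrences of the same symbol in one abstract.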

If a user feels even more helpful, they can visit the server, which runs an active learning algorithm of some sort. The algorithm presents abstracts that it doesn't think it can classify correctly; the user provides the correct answer and the algorithm learns from this. This is much more useful than highlighting the same old gene symbols again and again.
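One common way to pick "abstracts it doesn't think it can classify correctly" is uncertainty sampling: show the user the items for which the model's most likely class has the lowest probability. The sketch below assumes a model that returns class probabilities; the function name and the stand-in scores are hypothetical, and other active learning strategies would fit the idea equally well.

```python
from typing import Callable, List, Sequence


def least_confident(
    abstracts: Sequence[str],
    predict_proba: Callable[[str], Sequence[float]],
    k: int = 1,
) -> List[str]:
    """Return the k abstracts whose top predicted class probability is lowest."""
    return sorted(abstracts, key=lambda a: max(predict_proba(a)))[:k]


# Stand-in for a trained model: hypothetical class probabilities per abstract.
fake_scores = {
    "abstract A": [0.95, 0.05],
    "abstract B": [0.55, 0.45],
    "abstract C": [0.80, 0.20],
}

# The two abstracts the model is least sure about are queued for a human.
queue = least_confident(list(fake_scores), fake_scores.__getitem__, k=2)
```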

Any classifiers and data (including the raw markup from users) to come out of the project would, naturally, be freely available. As there is a relatively small set of possible classes, hopefully the weight of correct tags would outweigh the work of anybody deliberately sabotaging the system.
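One simple way to let correct tags outweigh sabotage is to accept a tag for a span only once enough users agree on it. This is a minimal sketch of that idea; the function name and both thresholds are arbitrary assumptions, and a real system might also weight users by their track record.

```python
from collections import Counter
from typing import Iterable, Optional


def consensus_tag(
    votes: Iterable[str],
    min_votes: int = 3,      # assumed minimum number of users per span
    min_share: float = 0.6,  # assumed share the winning tag must reach
) -> Optional[str]:
    """Return the agreed tag for one highlighted span, or None if undecided."""
    counts = Counter(votes)
    total = sum(counts.values())
    if total < min_votes:
        return None  # too few users have tagged this span so far
    tag, n = counts.most_common(1)[0]
    return tag if n / total >= min_share else None


# Four honest votes outweigh one stray or malicious vote.
accepted = consensus_tag(["gene symbol"] * 4 + ["disease"])
```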

Of course, ideally PubMed should require that authors provide the correct semantic markup in abstracts themselves. Even if that starts tomorrow, though, there's still a tremendous backlog of valuable data available.

4 Comments:

At February 06, 2006 11:11 AM, Blogger Pierre said...

In my opinion, it would be best if publishers also attached an RDF version of the abstracts. Such semantic web abstracts would be a great source of curated data for knowledge discovery. One could also imagine MeSH terms defined in SKOS, authors defined with FOAF...

 
At February 06, 2006 1:55 PM, Blogger Pedro Beltrão said...

This should probably be part of the submission procedure. A small effort from every author would make the whole job much easier.
I once talked to an editor of FEBS Letters who is also in charge of one of the protein interaction databases, and he agreed with the idea in principle, at least as far as giving authors the option to do it. He thought that forcing them to do it would be too much to start with. I think that if someone started such a service, it would then be possible to email a few editors and propose that authors use it as part of the submission procedure.

 
At February 10, 2006 4:48 PM, Anonymous Matthew Cockerill said...

BioMed Central is actively working in this area.
For one thing, if you view the source of any BioMed Central HTML article, you will see it already contains embedded RDF for bibliographic and licensing data - but that is just the start.

Some relevant initiatives BioMed Central is involved in to semantically enhance the literature:

Neurocommons: http://sciencecommons.org/data/neurocommons

The W3C Semantic Web for Health Care and Life Sciences SIG
http://www.w3.org/2001/sw/hcls/

And see this upcoming symposium at the EBI:
http://www.ebi.ac.uk/Rebholz/SemanticEnrichment.html


Note also the recent establishment in the US of a National Center for BioMedical Ontologies - a key enabling step, in terms of defining standards:
http://bioontology.org/


This stuff may finally be starting to happen - exciting times....

 
At March 13, 2006 10:00 PM, Blogger Liam D. Gray said...

Great ideas! I agree with all of you. I'm with Matthew: it's a time for such ideas. (Oblivious to Stew's post until today, I posted a similar one to my blog on Feb. 26.)

I want to add what may be my only innovation, if I have one: It seems like we could, Wiki-style, recruit lay volunteers (for example, from patient support groups) to do rough, NON-AUTHORITATIVE markup, which in this case would then be vetted in successively larger spheres (where spheres might be defined by social networking technologies) before finally being "approved." The patient support groups are highly motivated, and some of their members eventually become quite savvy, because the literature topics related to their conditions are so "near and dear to their hearts." More musings at:

http://nearish.blogspot.com/2006/03/nearish-ideas-spirit-of-age.html

 
