Distributed text corpus tagging
To build up a sufficiently large corpus for biomedical natural language parsing tasks, you could develop a freely available Firefox extension (a toolbar) that appears when it thinks that you're reading an abstract. The toolbar has two sets of buttons: one for entities ("gene symbol", "gene product", "chemical", "cell line", "disease", "locus" ...) and one for relationships ("interacts with", "does not interact with", "belongs to", "involved in", "associated with").
If a user feels helpful then they can highlight text in the abstract and click the relevant button to tag it. The extension uses AJAX to call a central server in the background, passing the current URL, the tag and the highlighted text (along with its position in the abstract, so that we can extract some context). Machine learning algorithms on the server get incrementally updated as new data comes in. Ideally these algorithms should eventually be able to tag new abstracts correctly (well, to an extent) by themselves.
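As a rough sketch of what the server might do with each submission: the record below holds the four pieces of information mentioned above, and a helper pulls out the surrounding context from the abstract. The names `Annotation` and `context`, the field names and the 30-character window are all assumptions made for illustration, not part of any existing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One tag submitted by the toolbar (hypothetical field names)."""
    url: str    # page the user was reading
    tag: str    # e.g. "gene symbol"
    text: str   # the highlighted text
    start: int  # character offset of the highlight within the abstract


def context(abstract: str, ann: Annotation, window: int = 30) -> str:
    """Return the highlight plus up to `window` characters on either side."""
    end = ann.start + len(ann.text)
    # Reject submissions whose offset does not line up with the abstract text.
    if abstract[ann.start:end] != ann.text:
        raise ValueError("highlighted text does not match the abstract at that offset")
    left = max(0, ann.start - window)
    return abstract[left:end + window]
```

Checking the highlighted text against the stored offset is a cheap first sanity check before anything reaches the learning algorithms.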
If a user feels even more helpful then they can visit the server, which is running an active learning algorithm of some sort. The algorithm presents abstracts that it doesn't think it can classify correctly: the user provides the correct answer and the algorithm learns from this. This is much more useful than highlighting the same old gene symbols again and again.
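One simple way to choose which abstracts to show is least-confidence sampling: ask about the items whose best guess has the lowest predicted probability. This is only a sketch of that one strategy; the function name `least_confident` and the shape of the `predictions` dictionary are assumptions, and a real system might well use a different selection criterion.

```python
from typing import Dict, List


def least_confident(predictions: Dict[str, Dict[str, float]], k: int = 1) -> List[str]:
    """Return the ids of the k abstracts the classifier is least sure about.

    `predictions` maps an abstract id to its predicted class probabilities.
    Confidence is taken to be the probability of the most likely class.
    """
    def confidence(abstract_id: str) -> float:
        return max(predictions[abstract_id].values())

    # Lowest confidence first; sorted() is stable, so ties keep input order.
    return sorted(predictions, key=confidence)[:k]
```

For example, with three abstracts whose top-class probabilities are 0.9, 0.55 and 0.7, asking for `k=2` returns the second and third, in that order.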
Any classifiers and data (including the raw markup from users) to come out of the project would, naturally, be freely available. As there is a relatively small set of possible classes, hopefully the weight of correct tags would outweigh the work of anybody deliberately sabotaging the system.
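To make the "weight of correct tags" idea concrete, here is a minimal sketch of a majority vote over the tags that different users gave the same span of text. The function name `consensus_tag` and the thresholds (at least 3 votes, at least 60% agreement) are arbitrary assumptions chosen for illustration.

```python
from collections import Counter
from typing import List, Optional


def consensus_tag(votes: List[str], min_votes: int = 3,
                  min_share: float = 0.6) -> Optional[str]:
    """Return the agreed tag for one span, or None if there is no clear winner."""
    if len(votes) < min_votes:
        return None  # too few independent users have tagged this span
    tag, count = Counter(votes).most_common(1)[0]
    # Accept the most common tag only if enough of the voters chose it.
    return tag if count / len(votes) >= min_share else None
```

With four votes for "gene symbol" and one for "disease", the span is accepted as a gene symbol; a lone saboteur cannot change the outcome, and spans without agreement are simply held back.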
Of course, ideally PubMed should require that authors provide the correct semantic markup in abstracts themselves. Even if that starts tomorrow, though, there's still a tremendous backlog of valuable data available.
