Flags and Lollipops

Wednesday, October 05, 2005

Bioinformatics Zeitgeist '05

Introduction

In the bath the other day I was wondering: if you downloaded all of the titles and abstracts of papers published in bioinformatics journals over the last eight years and then did a little bit of text mining to look for certain patterns, would anything interesting emerge? Could you track fads and fashions? Would you see hype-cycles?

I'll present the full results below, but if you're impatient the answers to the questions above are "sort of", "kind of" and "no", respectively. Otherwise, prepare yourself.... for a lot of dodgy graphs... for wild speculation (and that's just the abstracts, ha ha).... for the Bioinformatics Zeitgeist 2005!

Methodology

All of the abstracts from the last 2,920 days published in Bioinformatics or BMC Bioinformatics were retrieved from PubMed (at first I also included NAR, Genome Research and Genome Biology, but there was too much noise from non-bioinformatics papers). They were grouped by year of publication and counted.

Subject areas were chosen and exemplar abstracts for each were selected. These exemplars were scanned for key phrases, which were then used to count the rough number of abstracts in each year dealing with a particular subject.
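
For the curious, the counting step might look something like the sketch below. It is a minimal Python illustration rather than the scripts actually used: the function name `count_by_year` and the three sample records are made up, and it assumes the abstracts are already available as (year, text) pairs.

```python
import re
from collections import Counter

def count_by_year(records, phrases):
    """Return {year: (matching, total)} for abstracts mentioning any key phrase.

    `records` is an iterable of (year, abstract_text) pairs; `phrases` is a
    list of key phrases. Matching is case-insensitive and literal.
    """
    pattern = re.compile("|".join(re.escape(p) for p in phrases), re.IGNORECASE)
    totals = Counter()
    matches = Counter()
    for year, text in records:
        totals[year] += 1
        if pattern.search(text):
            matches[year] += 1
    return {year: (matches[year], totals[year]) for year in sorted(totals)}

# Invented sample records, purely for illustration.
records = [
    (2001, "We present a fast aligner for protein sequences."),
    (2005, "The package is open source and released under the GPL."),
    (2005, "A support vector machine classifier for microarray data."),
]
result = count_by_year(records, ["open source", "GPL"])
print(result)
```

Keeping the per-year total alongside the match count makes it easy to turn the counts into percentages later.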

Note that when we talk about "Bioinformatics" it means the journal of that name; "bioinformatics" refers to the discipline.

Results

First thing: in terms of raw numbers, the amount of work published in the two journals has grown tremendously. This isn't really surprising when you think about it: journals have a harder time attracting authors when they're just starting out. Bioinformatics has grown from 127 papers per year in 1999 to more than 782 ppy in 2005. Similarly BMC Bioinformatics has gone from publishing 90 papers in 2003 to 299 so far this year. Without a more exhaustive look at other bioinformatics journals unfortunately we can't say how much of this growth is down to an explosion in bioinformatics research and how much is down to these two journals simply becoming better known.

Anyway, what does all that mean? Why, that we'll be dealing in percentages of all available abstracts from now on so that we can look at trends. If you just look at the raw numbers almost every conceivable topic will have experienced growth partly as a side effect of us only looking at these two journals.
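
To see why raw counts mislead, here is a small Python sketch. The yearly totals are the Bioinformatics figures quoted above (782 being a lower bound for 2005); the 5% topic share is an invented number, chosen only to show that a topic with a flat share still "grows" in raw terms.

```python
def percentage(matching, total):
    """Share of abstracts matching a topic, as a percentage of the year's total."""
    return 100.0 * matching / total if total else 0.0

# Yearly totals for Bioinformatics as quoted in the post.
totals = {1999: 127, 2005: 782}

# Hypothetical topic that accounts for a constant 5% of papers every year.
share = 0.05
raw = {year: round(n * share) for year, n in totals.items()}
pct = {year: round(percentage(raw[year], totals[year]), 1) for year in totals}

print(raw)  # raw counts rise sharply between the two years
print(pct)  # the percentages barely move
```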

Some Trends

Open source is pretty hot right now. It was pretty hot everywhere else in IT a decade ago but you'd expect things to take a while to percolate here in the patent-strewn biotech world. And look! The data sort of bears this out. This year ~ 4.5% of abstracts contained the words "open source" or "GPL", up from ~ 1.5% in 2001. Presumably bioinformatics software was being released under open source licences before 2001 but that wasn't considered important enough to be mentioned in the abstract (or perhaps authors thought that their audience didn't care).

Speaking of bioinformatics software, authors handily tend to stick to the same basic title structure when presenting new systems and databases. That structure looks something like this:
SUPERDOODAH [title]: A Novel Tool for ... etc. [description]
The percentage of abstracts whose title matches that pattern stays pretty steady - after a quick drop off in the early years - at a little under 15%. I'd postulate that the drop off represents the switch from bioinformatics being all about software to accomplish something specific (like multiple alignments) to more "pure" research (like modelling interaction networks) but who knows?
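
A regular expression for that title structure could be sketched roughly as follows. This is an assumption about what "matches that pattern" means (a single-token name, a colon, then a description), not the exact expression used for the graphs; `TITLE_PATTERN` and `looks_like_tool_title` are illustrative names.

```python
import re

# One name token (letters, digits and a few punctuation marks, no spaces),
# then a colon, then the start of a description.
TITLE_PATTERN = re.compile(r"^\s*[A-Za-z][\w.+/-]*\s*:\s+\w")

def looks_like_tool_title(title):
    """True if the title has the 'NAME: description' shape."""
    return TITLE_PATTERN.match(title) is not None

print(looks_like_tool_title("SUPERDOODAH: A Novel Tool for Aligning Things"))
print(looks_like_tool_title("Modelling interaction networks in yeast"))
print(looks_like_tool_title("Protein structure prediction: a review"))
```

Requiring a single token before the colon is a deliberate choice: it keeps titles like "Protein structure prediction: a review" out of the count, at the cost of missing tools with multi-word names.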

The "same" drop off is seen in the number of new databases, although they sometimes also match the title structure outlined above (so perhaps there's no drop off in software after all, just in databases). Apparently 2001 is the turning point for databases - suddenly we start seeing growth again after a big drop. One semi-plausible sounding theory: locus or task specific databases stop appearing as the genome sequence is completed and the big public repositories and viewers become available, hence the drop; then, once there's enough data to work with, new databases are started up to hold the research coming out of genome-wide studies.

Some Things Never Change

Take four pretty big areas in bioinformatics research: detecting regulatory regions, multiple alignments, predicting protein structure and motif discovery. What has happened to them over the years?

First of all, contrary to my belief it turns out that motif discovery isn't such a big part of bioinformatics research after all; either that or the key phrases that we're looking for aren't very good. Leaving that aside you'll notice that regulatory region detection and protein structure prediction are fairly stable - though perhaps protein structure prediction has tailed off a little (now that it's become much harder?).

Mention of multiple-alignments, though, has gone way down. One explanation might be that the golden days of sequence alignment happened when there was a push to put together the draft human genome (in the late nineties); now interest has flagged.

Newer Ideas

What about new-fangled systems like the Gene Ontology, microarray technology and large scale protein interaction networks? GO started off as a collaboration between FlyBase, MGD (mouse) and SGD (yeast) in 1998: at that point 2% of papers mentioned it in their abstracts. Now we're up to 8%, which is actually lower than I'd expected.

There's quite a nice peak in protein interaction network and microarray analysis papers - hurray! Hurray in that the peak has already passed, I mean. I'll be happy to see more analyses of network topology once there's more data available, but until then... enough with the "is it scale-free or isn't it, and what does that mean?" question.

AI Smackdown

I have a soft spot for prediction and classification in bioinformatics, hence this (slightly more complicated) graph of machine learning techniques. Machine learning techniques and Markov Chains, anyway.

Mining the literature is getting more popular - just over 5% of papers this year mentioned it. "Text mining" in this case means everything from simple entity extraction to inferring protein interactions. As with GO I expected a higher percentage here: there are a lot of papers that don't deal specifically with text mining but that describe systems that leverage the data in PubMed abstracts one way or another.

Neural Networks have gotten less popular over time and Support Vector Machines have gotten more popular, which is perhaps what you'd expect. Apart from anything else SVMs are more popular in text mining nowadays and text mining abstracts are up, so there's a correlation there.

Discussion

Q. Would anything interesting emerge?
A. Sort of. Depends on your point of view.

Q. Could you track fads and fashions?
A. Kind of - neural networks are down, SVMs and GO are up, microarray analysis and protein interaction networks were popular a year or two ago and are now dipping.

Q. Would you see hype-cycles?
A. No. Perhaps the editors reject papers on the basis that they've seen too many dealing with the same topic that month. Perhaps the time-lag between reading about a new technique, implementing it and writing something up about it is too long.

If you'd like the source data, feel free to email me.

8 Comments:

At October 07, 2005 5:13 AM, Anonymous Neil said...

This is fun, I enjoyed it a lot.

We could quibble about methods of course :) but that's not the point. What I found interesting was how the numbers matched up (or not) with the impressions I have in my mind of "what's out there". Sometimes I think every other paper is "Bla: a bla for bla bla", but perhaps I'm overly-cynical.

Databases is interesting - I think you're right in that we've gone from purely sequence/structure databases through an explosion of more specialised subsets (think GenBank -> Pfam, Tfam or PDB->OWL, SCOP, Atlas or indeed everything->InterPro) to a quieter period. There are a lot of really crappy ultra-specialised dbs that just don't work at all, perhaps journals are getting pickier.

SVMs...really must teach myself SVMs...

 
At October 07, 2005 7:01 PM, Anonymous Mauricio said...

Really nice experiment!

Did you try anything else about AI (specifically cellular automata and genetic algorithms/programming)?

I'd like to take a look at the source data. Surely there will be something interesting that I haven't read before.

Keep up the good work. Regards.

 
At October 08, 2005 10:08 PM, Blogger Stew said...

Heh, glad it went down well. Experiment is probably too strong a word - yeah, I wouldn't want to suggest that the methods used were particularly scientific.

After downloading everything and writing scripts to run regular expressions over abstracts and yakkity yak I realised that PubMed actually lets you do this sort of thing fairly easily using search codes:

bioinformatics[ta] 1999[dp]

will tell you how many papers were published in journals with "Bioinformatics" in their name (BMC, Bioinformatics and Briefings In AFAIK) in 1999, for example, and

bioinformatics[ta] ("cellular automata" OR "artificial life") 2004[dp]

will tell you how many papers in 2004 were about CA in those journals, and so on and so forth. No need for MySQL databases and Perl after all, bah.
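
If you wanted to generate a batch of those search strings rather than typing them out, a small Python helper along these lines would do. `pubmed_term` is a made-up name and this only builds the text to paste into the PubMed search box; it doesn't talk to PubMed itself.

```python
def pubmed_term(journal, year, phrases=()):
    """Build a search string like: bioinformatics[ta] ("a" OR "b") 2004[dp]"""
    parts = [f"{journal}[ta]"]
    if phrases:
        quoted = " OR ".join(f'"{p}"' for p in phrases)
        parts.append(f"({quoted})")
    parts.append(f"{year}[dp]")
    return " ".join(parts)

# One query per year for a single topic.
for year in range(1999, 2006):
    print(pubmed_term("bioinformatics", year, ["cellular automata", "artificial life"]))
```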

Selecting subject areas to look at was a bit tricky in that there wasn't that much data to work with. The "cellular automata" search like the one above only comes up with one paper in the last few years (in BMC or Bioinformatics) so it wouldn't be enough to look pretty on a graph, which was obviously the prime concern... Genetic algorithms seem more promising. I'll include them in Zeitgeist '06 (or let me know if you find anything else that looks interesting and I'll post it / link to it as an update).

 
At October 08, 2005 11:03 PM, Blogger Pedro Beltrão said...

That was interesting :) Reminded me of a weird graph I saw in Molecular Systems Biology recently. It looked nice to spot trends.

 
At October 11, 2005 7:06 AM, Anonymous Mauricio said...

Excellent!! That was real beta! Now I'm trying different topics and getting fast results.

Thanks again Stew.

 
At August 29, 2006 1:54 PM, Anonymous Anonymous said...

This post has been removed by a blog administrator.

 
At August 29, 2006 1:54 PM, Anonymous Anonymous said...

This post has been removed by a blog administrator.

 
At August 29, 2006 1:54 PM, Anonymous Anonymous said...

Great work!

 
