Bioinformatics Zeitgeist '05
In the bath the other day I was wondering: if you downloaded all of the titles and abstracts of papers published in bioinformatics journals over the last eight years and then did a little bit of text mining to look for certain patterns, would anything interesting emerge? Could you track fads and fashions? Would you see hype-cycles?
I'll present the full results below, but if you're impatient the answers to the questions above are "sort of", "kind of" and "no", respectively. Otherwise, prepare yourself.... for a lot of dodgy graphs... for wild speculation (and that's just the abstracts, ha ha).... for the Bioinformatics Zeitgeist 2005!
Methodology
All of the abstracts from the last 2,920 days published in Bioinformatics or BMC Bioinformatics were retrieved from PubMed (at first I also included NAR, Genome Research and Genome Biology, but there was too much noise from non-bioinformatics papers). They were grouped by year of publication and counted.
Subject areas were chosen and exemplar abstracts for each were selected. These exemplars were scanned for key phrases which were then used to count the rough number of abstracts in each year dealing with a particular subject.
Note that when we talk about "Bioinformatics" it means the journal of that name; "bioinformatics" refers to the discipline.
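The counting step described above could look roughly like the sketch below. To be clear, this is not the script that produced the graphs: the record layout (year, title, abstract), the `KEY_PHRASES` table and the function name are all made up for illustration, and the only phrases taken from this post are "open source" and "GPL".

```python
import re
from collections import defaultdict

# Hypothetical key phrases per subject. Only "open source" and "GPL" come from
# the post itself; the text mining phrases are placeholders.
KEY_PHRASES = {
    "open source": ["open source", "GPL"],
    "text mining": ["text mining", "literature mining"],
}


def count_abstracts_by_subject(records, key_phrases):
    """Return {subject: {year: number of abstracts mentioning a key phrase}}.

    `records` is an iterable of (year, title, abstract) tuples. An abstract is
    counted at most once per subject, however many phrases it matches.
    """
    # One case-insensitive, whole-word pattern per subject.
    patterns = {
        subject: re.compile(
            r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
            re.IGNORECASE,
        )
        for subject, phrases in key_phrases.items()
    }
    counts = {subject: defaultdict(int) for subject in key_phrases}
    for year, title, abstract in records:
        text = f"{title} {abstract}"
        for subject, pattern in patterns.items():
            if pattern.search(text):
                counts[subject][year] += 1
    return {subject: dict(by_year) for subject, by_year in counts.items()}
```

Matching on whole words keeps a short phrase like "GPL" from firing inside longer tokens, but this is still only a rough count: an abstract that mentions a phrase in passing is counted the same as one that is all about the subject.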
Results
First thing: in terms of raw numbers, the amount of work published in the two journals has grown tremendously. This isn't really surprising when you think about it: journals have a harder time attracting authors when they're just starting out. Bioinformatics has grown from 127 papers per year in 1999 to more than 782 ppy in 2005. Similarly BMC Bioinformatics has gone from publishing 90 papers in 2003 to 299 so far this year. Without a more exhaustive look at other bioinformatics journals unfortunately we can't say how much of this growth is down to an explosion in bioinformatics research and how much is down to these two journals simply becoming better known.
Anyway, what does all that mean? Why, that we'll be dealing in percentages of all available abstracts from now on so that we can look at trends. If you just look at the raw numbers almost every conceivable topic will have experienced growth partly as a side effect of us only looking at these two journals.
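To see why percentages matter here, a small sketch of the normalisation. The function name and the numbers in the example are invented for illustration and are not the real counts behind the graphs.

```python
def percent_by_year(topic_counts, totals):
    """Turn raw per-year topic counts into percentages of all abstracts that year.

    `topic_counts` maps year -> abstracts mentioning the topic.
    `totals` maps year -> all abstracts retrieved for that year.
    Years with no abstracts at all are skipped to avoid dividing by zero.
    """
    return {
        year: 100.0 * topic_counts.get(year, 0) / total
        for year, total in totals.items()
        if total > 0
    }


# Invented numbers: the raw count triples, yet the topic's share shrinks.
example_totals = {2001: 200, 2005: 1000}
example_topic = {2001: 10, 2005: 30}
example_share = percent_by_year(example_topic, example_totals)
```

In this made-up case the topic goes from 10 abstracts to 30, which looks like growth, but as a share of everything published it falls from 5% to 3%.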
Some Trends
Open source is pretty hot right now. It was pretty hot everywhere else in IT a decade ago but you'd expect things to take a while to percolate here in the patent-strewn biotech world. And look! The data sort of bears this out. This year ~4.5% of abstracts contained the words "open source" or "GPL", up from ~1.5% in 2001. Presumably bioinformatics software was being released under open source licences before 2001 but that wasn't considered important enough to be mentioned in the abstract (or perhaps authors thought that their audience didn't care).
Speaking of bioinformatics software, authors handily tend to stick to the same basic title structure when presenting new systems and databases. That structure looks something like this:
SUPERDOODAH [title]: A Novel Tool for ... etc. [description]
The percentage of abstracts whose title matches that pattern stays pretty steady - after a quick drop off in the early years - at a little under 15%. I'd postulate that the drop off represents the switch from bioinformatics being all about software to accomplish something specific (like multiple alignments) to more "pure" research (like modelling interaction networks) but who knows?
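One way to test a title against the structure above is a regular expression along these lines. The exact pattern is a guess at what "matches that pattern" could mean (a single-word name, a colon, then a description); the names `TITLE_PATTERN`, `is_tool_title` and `percent_tool_titles` are hypothetical.

```python
import re

# Assumed shape: a single-word system name, a colon, then a free-text description.
TITLE_PATTERN = re.compile(r"^\s*(?P<name>[^\s:]+):\s+(?P<description>\S.*)$")


def is_tool_title(title):
    """True if the title looks like 'NAME: description of the tool'."""
    return TITLE_PATTERN.match(title) is not None


def percent_tool_titles(titles):
    """Percentage of titles that match the 'NAME: description' structure."""
    titles = list(titles)
    if not titles:
        return 0.0
    matching = sum(1 for title in titles if is_tool_title(title))
    return 100.0 * matching / len(titles)
```

Requiring a single-word name means a title like "Protein structure prediction: a review" is not counted as a software announcement, although it also means tools with multi-word names slip through.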
The "same" drop off is seen in the number of new databases, although they sometimes also match the title structure outlined above (so perhaps there's no drop off in software after all, just in databases). Apparently 2001 is the turning point for databases - suddenly we start seeing growth again after a big drop. One semi-plausible sounding theory: locus or task specific databases stop appearing as the genome sequence is completed and the big public repositories and viewers become available, hence the drop. Then, once there's enough data to work with, new databases are started up to hold the research coming out of genome-wide studies.
Some Things Never Change
Take four pretty big areas in bioinformatics research: detecting regulatory regions, multiple alignments, predicting protein structure and motif discovery. What has happened to them over the years?
First of all, contrary to my belief it turns out that motif discovery isn't such a big part of bioinformatics research after all; either that or the key phrases that we're looking for aren't very good. Leaving that aside you'll notice that regulatory region detection and protein structure prediction are fairly stable - though perhaps protein structure prediction has tailed off a little (now that it's become much harder?).
Mention of multiple alignments, though, has gone way down. One explanation might be that the golden days of sequence alignment happened when there was a push to put together the draft human genome (in the late nineties); now interest has flagged.
Newer Ideas
What about new-fangled systems like the Gene Ontology, microarray technology and large scale protein interaction networks? GO started off as a collaboration between Flybase, MGD (mouse) and SGD (yeast) in 1998: at that point 2% of papers mentioned it in their abstracts. Now we're up to 8%, which is actually lower than I'd expected.
There's a quite nice peak in protein interaction network and microarray analysis papers - hurray! Hurray in that the peak has already passed, I mean. I'll be happy to see more analyses of network topology once there's more data available, but until then... enough with the "is it scale-free or isn't it, and what does that mean?" question.
AI Smackdown
I have a soft spot for prediction and classification in bioinformatics, hence this (slightly more complicated) graph of machine learning techniques. Machine learning techniques and Markov Chains, anyway.
Mining the literature is getting more popular - just over 5% of papers this year mentioned it. "Text mining" in this case means everything from simple entity extraction to inferring protein interactions. As with GO I expected a higher percentage here: there are a lot of papers that don't deal specifically with text mining but that describe systems that leverage the data in PubMed abstracts one way or another.
Neural Networks have gotten less popular over time and Support Vector Machines have gotten more popular, which is perhaps what you'd expect. Apart from anything else SVMs are more popular in text mining nowadays and text mining abstracts are up, so there's a correlation there.
Discussion
Q. Would anything interesting emerge?
A. Sort of. Depends on your point of view.
Q. Could you track fads and fashions?
A. Kind of - neural networks are down, SVMs and GO are up, microarray analysis and protein interaction networks were popular a year or two ago and are now dipping.
Q. Would you see hype-cycles?
A. No. Perhaps the editors reject papers on the basis that they've seen too many dealing with the same topic that month. Perhaps the time-lag between reading about a new technique, implementing it and writing something up about it is too long.
If you'd like the source data, feel free to email me.
