Dawkins officially bigger than Jesus - datamining Scienceblogs.com
I'm in the process of pulling things together for a visualization, but here's a quick answer to the 'who are Sciblings talking about?' question. The 'count' is the number of times that each entity was seen (could be multiple times in the same post) across 2007.
+-----------------------------------------------+-------+
| term                                          | count |
+-----------------------------------------------+-------+
| Michael Egnor                                 |  1855 |
| Richard Dawkins                               |  1737 |
| Bush                                          |  1669 |
| Congress                                      |  1430 |
| Charles Darwin                                |  1226 |
| Michael Behe                                  |  1031 |
| Chris Mooney                                  |   927 |
| FDA                                           |   920 |
| DCA                                           |   765 |
| National Aeronautics and Space Administration |   745 |
| National Institute of Health                  |   741 |
| Bush administration                           |   721 |
| Google                                        |   700 |
| Guillermo Gonzalez                            |   691 |
| White House                                   |   658 |
| Supreme Court                                 |   655 |
| Thomas Jefferson                              |   632 |
| John Edwards                                  |   614 |
| Casey Luskin                                  |   605 |
| George W. Bush                                |   603 |
| Jesus Christ                                  |   601 |
| Discovery Institute                           |   596 |
| the New York Times                            |   587 |
| Larry Moran                                   |   576 |
| World Health Organization                     |   543 |
| Hillary Clinton                               |   517 |
+-----------------------------------------------+-------+
Bear in mind that ClearForest, the service behind the list above, extracts entities, not key terms. It can't tell us how often blog posts are talking about mammoth DNA, supernovae or dicyemid mesozoa. That's a different dataset entirely...
... this one, in fact, generated using the Yahoo! Term Extraction API, which pulls out important concepts (terms) from text. This dataset is about half the size of the one above, as I'm only including ScienceBlogs blogs indexed in Postgenomic. Here 'count' is the number of distinct posts containing a term:
+---------------------+-------+
| term                | count |
+---------------------+-------+
| evolution           |   963 |
| carnival            |   923 |
| global warming      |   640 |
| intelligent design  |   543 |
| new york times      |   542 |
| blogosphere         |   468 |
| religion            |   460 |
| brain               |   437 |
| climate change      |   432 |
| creationist         |   420 |
| birds               |   415 |
| creationism         |   409 |
| creationists        |   398 |
| pz                  |   378 |
| darwin              |   367 |
| discovery institute |   354 |
| atheists            |   351 |
| atheist             |   333 |
| biology             |   314 |
| richard dawkins     |   301 |
| skeptics            |   290 |
| love                |   289 |
| genes               |   288 |
| job                 |   286 |
| money               |   283 |
| orac                |   281 |
| god                 |   276 |
| atheism             |   266 |
| animals             |   261 |
| bush                |   258 |
| google              |   258 |
+---------------------+-------+
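The two tables count differently: the entity table counts every mention (possibly several per post), while the term table counts distinct posts. A minimal sketch of that difference follows. The sample posts and the function names are made up for illustration; this is not the code that produced the numbers above.

```python
from collections import Counter

# Hypothetical input: each post is the list of terms/entities extracted from it.
posts = [
    ["evolution", "evolution", "darwin"],
    ["evolution", "carnival"],
    ["carnival"],
]

def mention_counts(posts):
    """Count every occurrence, so one post can contribute several times."""
    return Counter(term for terms in posts for term in terms)

def post_counts(posts):
    """Count each term at most once per post (distinct posts containing it)."""
    return Counter(term for terms in posts for term in set(terms))

mentions = mention_counts(posts)
distinct = post_counts(posts)
print(mentions["evolution"], distinct["evolution"])
```

With mention counting, a single long post about one person can inflate that person's total, which is worth remembering when comparing the two tables.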
In light of this data it's tempting to revisit that Bayblab post suggesting that Sciblings spend too much time discussing ID. That'd be a mistake, though: the numbers above are absolute counts, not proportions. 963 posts had 'evolution' as a key term, but that's only 2.4% of all posts that year (my 2c: I think that Sciblings do talk about Egnor, ID and creationism too much, but hey, they're their blogs - I just skip over those posts).
I also had a look at linking patterns - who do ScienceBloggers link to the most? Here 'count' is the number of unique posts that have a link to a particular domain.
+-------------------------+-------+
| domain                  | count |
+-------------------------+-------+
| www.scienceblogs.com    | 15966 |
| en.wikipedia.org        |  2016 |
| www.technorati.com      |  1797 |
| www.nytimes.com         |  1388 |
| www.amazon.com          |  1078 |
| www.sciencedaily.com    |   661 |
| www.washingtonpost.com  |   478 |
| feeds.feedburner.com    |   467 |
| www.nature.com          |   453 |
| news.yahoo.com          |   401 |
| news.bbc.co.uk          |   333 |
| www.youtube.com         |   305 |
| www.del.icio.us         |   297 |
| www.cnn.com             |   260 |
| www.eurekalert.org      |   260 |
| farm3.static.flickr.com |   259 |
| www.sciencemag.org      |   231 |
| www.ncbi.nlm.nih.gov    |   225 |
| www.pandasthumb.org     |   224 |
| www.google.com          |   219 |
| www.latimes.com         |   213 |
| www.gnxp.com            |   208 |
| sandwalk.blogspot.com   |   197 |
| www.dailykos.com        |   196 |
| www.donorschoose.org    |   194 |
+-------------------------+-------+
Presumably the Technorati links come from tags. ScienceBloggers link to scienceblogs.com far more than anywhere else - but I'd guess that this is simply because there are a lot of good science blogs on that one domain.
Wikipedia's reliability might be in question but it's interesting that almost everybody uses it to define terms.
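For anyone wanting to reproduce this kind of domain table, here is a rough sketch of counting unique posts per linked domain. It assumes each post has already been reduced to a list of outgoing URLs; the sample URLs and the `domain_post_counts` name are invented for the example.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_post_counts(posts):
    """Return, per domain, the number of posts with at least one link to it."""
    counts = Counter()
    for links in posts:
        # A set ensures a post linking to the same domain twice counts once.
        domains = {urlparse(url).netloc.lower() for url in links}
        domains.discard("")  # drop relative or malformed links with no host
        counts.update(domains)
    return counts

# Hypothetical sample: two posts, each a list of outgoing links.
posts = [
    ["http://en.wikipedia.org/wiki/Evolution", "http://en.wikipedia.org/wiki/Darwin"],
    ["http://www.nature.com/news", "http://en.wikipedia.org/wiki/Gene"],
]
domain_counts = domain_post_counts(posts)
for domain, count in domain_counts.most_common():
    print(domain, count)
```

Note that this treats each hostname separately, which is why hosts such as farm3.static.flickr.com show up as their own rows.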
Drilling down, where do ScienceBloggers link to papers?
+--------------------------------+-------+
| domain                         | count |
+--------------------------------+-------+
| www.nature.com                 |   241 |
| www.sciencemag.org             |   194 |
| www.dx.doi.org                 |   177 |
| www.ncbi.nlm.nih.gov           |   111 |
| www.pnas.org                   |   104 |
| www.plosone.org                |    89 |
| biology.plosjournals.org       |    76 |
| content.nejm.org               |    67 |
| medicine.plosjournals.org      |    65 |
| www.sciencedirect.com          |    43 |
| www.arxiv.org                  |    33 |
| genetics.plosjournals.org      |    22 |
| www.jneurosci.org              |    15 |
| www.cell.com                   |    14 |
| compbiol.plosjournals.org      |    10 |
| pediatrics.aappublications.org |    10 |
| www.jcb.org                    |    10 |
| mbe.oxfordjournals.org         |     9 |
| www.ajp.psychiatryonline.org   |     8 |
| www.current-biology.com        |     8 |
| www.journals.uchicago.edu      |     8 |
| www.plosntds.org               |     8 |
| www.blackwell-synergy.com      |     7 |
+--------------------------------+-------+
Nature and Science are at the top, perhaps unsurprisingly - but if you add up the counts from the different PLoS journals, PLoS would be up there too.
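To put a number on that, the PLoS rows from the table can simply be summed. The snippet below is a quick sketch using only the counts listed above; matching on the substring "plos" is my own shortcut for picking out the PLoS domains, and links that reach PLoS papers via dx.doi.org or NCBI are not included.

```python
# Counts copied from the paper-links table above.
paper_links = {
    "www.nature.com": 241,
    "www.sciencemag.org": 194,
    "www.plosone.org": 89,
    "biology.plosjournals.org": 76,
    "medicine.plosjournals.org": 65,
    "genetics.plosjournals.org": 22,
    "compbiol.plosjournals.org": 10,
    "www.plosntds.org": 8,
}

# Assumption: any domain containing "plos" belongs to a PLoS journal.
plos_total = sum(count for domain, count in paper_links.items() if "plos" in domain)
print(plos_total)  # 270
```

Taken together, the six PLoS domains come to 270 posts, which would put PLoS ahead of www.nature.com (241) in this table.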
