Flags and Lollipops

Wednesday, March 19, 2008

Dawkins officially bigger than Jesus - datamining Scienceblogs.com

I've run all of the posts from Scienceblogs.com in 2007 through the ClearForest API. ClearForest extracts entities - people, places, organizations - from plain text.

I'm in the process of pulling things together for a visualization, but here's a quick answer to the 'who are Sciblings talking about?' question. The 'count' is the number of times that each entity was seen (could be multiple times in the same post) across 2007.


+-----------------------------------------------+-------+
| term                                          | count |
+-----------------------------------------------+-------+
| Michael Egnor                                 |  1855 |
| Richard Dawkins                               |  1737 |
| Bush                                          |  1669 |
| Congress                                      |  1430 |
| Charles Darwin                                |  1226 |
| Michael Behe                                  |  1031 |
| Chris Mooney                                  |   927 |
| FDA                                           |   920 |
| DCA                                           |   765 |
| National Aeronautics and Space Administration |   745 |
| National Institute of Health                  |   741 |
| Bush administration                           |   721 |
| Google                                        |   700 |
| Guillermo Gonzalez                            |   691 |
| White House                                   |   658 |
| Supreme Court                                 |   655 |
| Thomas Jefferson                              |   632 |
| John Edwards                                  |   614 |
| Casey Luskin                                  |   605 |
| George W. Bush                                |   603 |
| Jesus Christ                                  |   601 |
| Discovery Institute                           |   596 |
| the New York Times                            |   587 |
| Larry Moran                                   |   576 |
| World Health Organization                     |   543 |
| Hillary Clinton                               |   517 |
+-----------------------------------------------+-------+


Bear in mind that ClearForest extracts entities, not key terms. It can't tell us how often blog posts are talking about mammoth DNA, supernovae or dicyemid mesozoa. That's a different dataset entirely...

... this one, in fact, generated using the Yahoo! term extraction API, which pulls out important concepts (terms) from text. The dataset is about half the size of the one above as I'm only including ScienceBlogs indexed in Postgenomic. Here 'count' is the number of distinct posts containing a term:


+---------------------+-------+
| term                | count |
+---------------------+-------+
| evolution           |   963 |
| carnival            |   923 |
| global warming      |   640 |
| intelligent design  |   543 |
| new york times      |   542 |
| blogosphere         |   468 |
| religion            |   460 |
| brain               |   437 |
| climate change      |   432 |
| creationist         |   420 |
| birds               |   415 |
| creationism         |   409 |
| creationists        |   398 |
| pz                  |   378 |
| darwin              |   367 |
| discovery institute |   354 |
| atheists            |   351 |
| atheist             |   333 |
| biology             |   314 |
| richard dawkins     |   301 |
| skeptics            |   290 |
| love                |   289 |
| genes               |   288 |
| job                 |   286 |
| money               |   283 |
| orac                |   281 |
| god                 |   276 |
| atheism             |   266 |
| animals             |   261 |
| bush                |   258 |
| google              |   258 |
+---------------------+-------+
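
The two tables count differently: the ClearForest table counts every mention (repeats within a post included), while the Yahoo! table counts distinct posts. Here is a minimal Python sketch of that difference. The function names and the toy data are made up for illustration; this is not the script that produced the numbers above.

```python
from collections import Counter


def mention_counts(posts):
    """Total occurrences of each term; repeats within one post all count."""
    counts = Counter()
    for terms in posts:
        counts.update(terms)
    return counts


def post_counts(posts):
    """Number of distinct posts containing each term; repeats are ignored."""
    counts = Counter()
    for terms in posts:
        counts.update(set(terms))
    return counts


# Hypothetical toy data: each inner list holds the terms found in one post.
posts = [
    ["evolution", "evolution", "darwin"],
    ["evolution", "carnival"],
    ["carnival"],
]

print(mention_counts(posts).most_common())
print(post_counts(posts).most_common())
```

On the toy data 'evolution' is mentioned three times but appears in only two posts, which is why the two tables are not directly comparable.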


In light of this data it's tempting to revisit that Bayblab post suggesting that Sciblings spend too much time discussing ID. That'd be a mistake, though: the numbers above are absolute counts, not proportions. 963 posts had 'evolution' as a key term but that's only 2.4% of all posts that year (my 2c: I think that Sciblings do talk about Egnor, ID and creationism too much, but hey, it's their blogs - I just skip over those posts).
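
To put the absolute counts in proportion, the quoted 2.4% can be turned back into an implied number of posts. The total is not stated anywhere in this post, so the figure computed below is only what 963 and 2.4% imply together, and it is approximate because 2.4% is rounded. The function names are illustrative.

```python
def share_of_posts(term_count, total_posts):
    """Percentage of all posts that contain a term."""
    return 100.0 * term_count / total_posts


def implied_total(term_count, percent):
    """Back out the total number of posts from a count and its percentage."""
    return round(term_count / (percent / 100.0))


evolution_posts = 963  # from the term table
quoted_share = 2.4     # percent, as quoted in the text

# Assumption: the 2.4% figure uses the same set of posts as the term table.
total = implied_total(evolution_posts, quoted_share)
print("implied total posts:", total)

# The same base gives a rough share for any other row, e.g. 'intelligent design'.
print("intelligent design: %.1f%%" % share_of_posts(543, total))
```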

I also had a look at linking patterns - who do ScienceBloggers link to the most? Here 'count' is the number of unique posts that have a link to a particular domain.


+-------------------------+-------+
| domain                  | count |
+-------------------------+-------+
| www.scienceblogs.com    | 15966 |
| en.wikipedia.org        |  2016 |
| www.technorati.com      |  1797 |
| www.nytimes.com         |  1388 |
| www.amazon.com          |  1078 |
| www.sciencedaily.com    |   661 |
| www.washingtonpost.com  |   478 |
| feeds.feedburner.com    |   467 |
| www.nature.com          |   453 |
| news.yahoo.com          |   401 |
| news.bbc.co.uk          |   333 |
| www.youtube.com         |   305 |
| www.del.icio.us         |   297 |
| www.cnn.com             |   260 |
| www.eurekalert.org      |   260 |
| farm3.static.flickr.com |   259 |
| www.sciencemag.org      |   231 |
| www.ncbi.nlm.nih.gov    |   225 |
| www.pandasthumb.org     |   224 |
| www.google.com          |   219 |
| www.latimes.com         |   213 |
| www.gnxp.com            |   208 |
| sandwalk.blogspot.com   |   197 |
| www.dailykos.com        |   196 |
| www.donorschoose.org    |   194 |
+-------------------------+-------+


Presumably the Technorati links are from tags. ScienceBloggers link to scienceblogs.com far more than anywhere else - but I'd guess that this is simply because there are a lot of good science blogs on that one domain.

Wikipedia's reliability might be in question but it's interesting that almost everybody uses it to define terms.
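
Counting 'unique posts that link to a domain' could be done roughly as follows. This is a sketch only: it assumes the links have already been pulled out of each post as absolute URLs, and the function name and sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_post_counts(posts_links):
    """Count, for each domain, the number of posts with at least one link to it."""
    counts = Counter()
    for links in posts_links:
        # A set ensures a post with several links to one domain counts once.
        domains = {urlparse(url).netloc.lower() for url in links}
        domains.discard("")  # skip relative or malformed links
        counts.update(domains)
    return counts


# Hypothetical toy data: one list of outgoing links per post.
posts_links = [
    ["http://en.wikipedia.org/wiki/Evolution", "http://en.wikipedia.org/wiki/Gene"],
    ["http://www.nature.com/news/", "http://en.wikipedia.org/wiki/DNA"],
]

link_counts = domain_post_counts(posts_links)
print(link_counts.most_common())
```

Note that the full hostname is used as the key, so 'www.nature.com' and any other nature.com subdomain would be counted separately, as they appear to be in the table.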

Drilling down, where do ScienceBloggers link to papers?


+--------------------------------+-------+
| domain                         | count |
+--------------------------------+-------+
| www.nature.com                 |   241 |
| www.sciencemag.org             |   194 |
| www.dx.doi.org                 |   177 |
| www.ncbi.nlm.nih.gov           |   111 |
| www.pnas.org                   |   104 |
| www.plosone.org                |    89 |
| biology.plosjournals.org       |    76 |
| content.nejm.org               |    67 |
| medicine.plosjournals.org      |    65 |
| www.sciencedirect.com          |    43 |
| www.arxiv.org                  |    33 |
| genetics.plosjournals.org      |    22 |
| www.jneurosci.org              |    15 |
| www.cell.com                   |    14 |
| compbiol.plosjournals.org      |    10 |
| pediatrics.aappublications.org |    10 |
| www.jcb.org                    |    10 |
| mbe.oxfordjournals.org         |     9 |
| www.ajp.psychiatryonline.org   |     8 |
| www.current-biology.com        |     8 |
| www.journals.uchicago.edu      |     8 |
| www.plosntds.org               |     8 |
| www.blackwell-synergy.com      |     7 |
+--------------------------------+-------+


Nature and Science are at the top, perhaps unsurprisingly - but if you add up the counts from the different PLoS journals it'd be up there too.
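
The PLoS claim is easy to check by adding up the rows. The counts below are copied from the paper-links table; treating any hostname containing 'plos' as a PLoS journal is an assumption made for this sketch.

```python
# Counts copied from the paper-links table (Nature, Science and the PLoS rows).
paper_links = {
    "www.nature.com": 241,
    "www.sciencemag.org": 194,
    "www.plosone.org": 89,
    "biology.plosjournals.org": 76,
    "medicine.plosjournals.org": 65,
    "genetics.plosjournals.org": 22,
    "compbiol.plosjournals.org": 10,
    "www.plosntds.org": 8,
}

# Assumption: every hostname containing 'plos' belongs to a PLoS journal.
plos_total = sum(n for domain, n in paper_links.items() if "plos" in domain)

print("PLoS combined:", plos_total)  # 270
print("Nature:", paper_links["www.nature.com"])
print("Science:", paper_links["www.sciencemag.org"])
```

Combined, the six PLoS domains come to 270 posts, which puts PLoS ahead of both Nature (241) and Science (194) in this table.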

5 Comments:

At March 19, 2008 1:25 PM, Blogger T Tague said...

Stew:

Tom Tague from the Calais team here.

This is very interesting - first time we've seen the Calais service used to do a retrospective analysis of a corpus of documents - though this has been a very common use of the commercial versions of the software.

If, in the future, you're looking for a bit better accuracy you might try bundling a whole series of posts together into a large chunk (up to 100K) and submitting that. With small documents (like postings) there is often inadequate context for an NLP tool like Calais to fully extract all entities. The other benefit you'll see is better normalization - for example your various flavors of "Bush" would most likely be understood and normalized as a single person.

Thanks for giving us a try!

Regards,

 
At March 19, 2008 2:40 PM, Anonymous Anonymous said...

"Dawkins officially bigger than Jesus"

I'm sure Jesus is impressed. </s>

What is interesting is that Jesus Christ, not generally known as a science writer Himself, would be mentioned that many times in Scienceblogs. What's up with that?

In terms of the number of references in all writings of human history, Mr. Dawkins has a very long way to go to catch up.

Cordially,

 
At March 19, 2008 7:52 PM, OpenID maxine said...

So who is this Michael Egnor guy, then?

 
At April 17, 2008 7:41 PM, Blogger RPM said...

Really late to the game, but the PLoS links might be biased by Bora who frequently posts links to recent PLoS papers. That's because he works for PLoS.

 
At April 17, 2008 7:52 PM, Anonymous Dave Munger said...

When I'm writing up a journal article I don't necessarily link to it. Usually the article is behind a paywall anyway and even if your library has a subscription you need to access it a different way.

Now that I use ResearchBlogging.org, many of my posts do link to the DOI information, though.

 
