Flags and Lollipops

Wednesday, March 19, 2008

Dawkins officially bigger than Jesus - datamining Scienceblogs.com

I've run all of the posts from Scienceblogs.com in 2007 through the ClearForest API. ClearForest extracts entities - people, places, organizations - from plain text.

I'm in the process of pulling things together for a visualization, but here's a quick answer to the 'who are Sciblings talking about?' question. The 'count' is the number of times that each entity was seen (could be multiple times in the same post) across 2007.


+-----------------------------------------------+-------+
| term                                          | count |
+-----------------------------------------------+-------+
| Michael Egnor                                 |  1855 |
| Richard Dawkins                               |  1737 |
| Bush                                          |  1669 |
| Congress                                      |  1430 |
| Charles Darwin                                |  1226 |
| Michael Behe                                  |  1031 |
| Chris Mooney                                  |   927 |
| FDA                                           |   920 |
| DCA                                           |   765 |
| National Aeronautics and Space Administration |   745 |
| National Institute of Health                  |   741 |
| Bush administration                           |   721 |
| Google                                        |   700 |
| Guillermo Gonzalez                            |   691 |
| White House                                   |   658 |
| Supreme Court                                 |   655 |
| Thomas Jefferson                              |   632 |
| John Edwards                                  |   614 |
| Casey Luskin                                  |   605 |
| George W. Bush                                |   603 |
| Jesus Christ                                  |   601 |
| Discovery Institute                           |   596 |
| the New York Times                            |   587 |
| Larry Moran                                   |   576 |
| World Health Organization                     |   543 |
| Hillary Clinton                               |   517 |
+-----------------------------------------------+-------+


Bear in mind that ClearForest extracts entities, not key terms. It can't tell us how often blog posts are talking about mammoth DNA, supernovae or dicyemid mesozoa. That's a different dataset entirely...

... this one, in fact, generated using the Yahoo! term extraction API, which pulls out important concepts (terms) from text. The dataset is about half the size of the one above, as I'm only including the ScienceBlogs blogs indexed in Postgenomic. Here 'count' is the number of distinct posts containing a term:


+---------------------+-------+
| term                | count |
+---------------------+-------+
| evolution           |   963 |
| carnival            |   923 |
| global warming      |   640 |
| intelligent design  |   543 |
| new york times      |   542 |
| blogosphere         |   468 |
| religion            |   460 |
| brain               |   437 |
| climate change      |   432 |
| creationist         |   420 |
| birds               |   415 |
| creationism         |   409 |
| creationists        |   398 |
| pz                  |   378 |
| darwin              |   367 |
| discovery institute |   354 |
| atheists            |   351 |
| atheist             |   333 |
| biology             |   314 |
| richard dawkins     |   301 |
| skeptics            |   290 |
| love                |   289 |
| genes               |   288 |
| job                 |   286 |
| money               |   283 |
| orac                |   281 |
| god                 |   276 |
| atheism             |   266 |
| animals             |   261 |
| bush                |   258 |
| google              |   258 |
+---------------------+-------+
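
The two tables count differently: the first counts every mention of an entity, the second counts each term at most once per post. Here's a minimal Python sketch of that difference. The `posts` list is made-up sample input, not the real dataset, and the function names are just for illustration.

```python
from collections import Counter

# Hypothetical input: one list of extracted terms per post (not real data).
posts = [
    ["Richard Dawkins", "Richard Dawkins", "Bush"],
    ["Richard Dawkins", "Congress"],
    ["Bush"],
]


def count_mentions(posts):
    """Count every occurrence, so repeats within one post add up."""
    counts = Counter()
    for terms in posts:
        counts.update(terms)
    return counts


def count_posts(posts):
    """Count each term at most once per post (distinct posts)."""
    counts = Counter()
    for terms in posts:
        counts.update(set(terms))
    return counts


mentions = count_mentions(posts)
distinct = count_posts(posts)
print(mentions.most_common(3))
print(distinct.most_common(3))
```

The first style rewards a single post that repeats a name many times; the second doesn't, which is worth remembering when comparing the two tables.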


In light of this data it's tempting to revisit that Bayblab post suggesting that Sciblings spend too much time discussing ID. That'd be a mistake, though: the numbers above are absolute counts, not proportions. 963 posts had 'evolution' as a key term, but that's only 2.4% of all posts that year (my 2c: I think that Sciblings do talk about Egnor, ID and creationism too much, but hey, it's their blogs - I just skip over those posts).
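
To turn an absolute count into a share you divide by the total number of posts. The total isn't given above, so the sketch below backs it out from the 963 and 2.4% figures. Since 2.4% is rounded, treat the implied total as approximate; the function names are hypothetical.

```python
def share_of_posts(count, total_posts):
    """Return count as a percentage of total_posts."""
    if total_posts <= 0:
        raise ValueError("total_posts must be positive")
    return 100.0 * count / total_posts


def implied_total(count, percent):
    """Back out the total number of posts from a count and its percentage."""
    return count / (percent / 100.0)


# Assumption: the total is inferred from the rounded 2.4% figure, so it is
# only a rough estimate (about 40,000 posts).
total = implied_total(963, 2.4)

# The same normalisation applied to another row of the table.
intelligent_design_share = share_of_posts(543, total)
print(round(total), round(intelligent_design_share, 1))
```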

I also had a look at linking patterns - who do ScienceBloggers link to the most? Here 'count' is the number of unique posts that have a link to a particular domain.


+-------------------------+-------+
| domain                  | count |
+-------------------------+-------+
| www.scienceblogs.com    | 15966 |
| en.wikipedia.org        |  2016 |
| www.technorati.com      |  1797 |
| www.nytimes.com         |  1388 |
| www.amazon.com          |  1078 |
| www.sciencedaily.com    |   661 |
| www.washingtonpost.com  |   478 |
| feeds.feedburner.com    |   467 |
| www.nature.com          |   453 |
| news.yahoo.com          |   401 |
| news.bbc.co.uk          |   333 |
| www.youtube.com         |   305 |
| www.del.icio.us         |   297 |
| www.cnn.com             |   260 |
| www.eurekalert.org      |   260 |
| farm3.static.flickr.com |   259 |
| www.sciencemag.org      |   231 |
| www.ncbi.nlm.nih.gov    |   225 |
| www.pandasthumb.org     |   224 |
| www.google.com          |   219 |
| www.latimes.com         |   213 |
| www.gnxp.com            |   208 |
| sandwalk.blogspot.com   |   197 |
| www.dailykos.com        |   196 |
| www.donorschoose.org    |   194 |
+-------------------------+-------+


Presumably the Technorati links are from tags. ScienceBloggers link to scienceblogs.com far more than anywhere else - but I'd guess that this is simply because there are a lot of good science blogs on that one domain.

Wikipedia's reliability might be in question but it's interesting that almost everybody uses it to define terms.
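
For the curious, counting 'unique posts that link to a domain' can be sketched roughly like this in Python. The `links_by_post` mapping and its URLs are invented sample input, and this isn't necessarily how the numbers above were produced.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical input: post id -> outbound link URLs found in that post.
links_by_post = {
    "post-1": [
        "http://en.wikipedia.org/wiki/Evolution",
        "http://en.wikipedia.org/wiki/Charles_Darwin",
        "http://www.nature.com/news/",
    ],
    "post-2": ["http://www.nature.com/nature/", "/relative/link"],
}


def domain_post_counts(links_by_post):
    """Count, for each domain, the number of distinct posts linking to it."""
    counts = Counter()
    for urls in links_by_post.values():
        # A set means a domain is counted once per post, however many links.
        domains = {urlparse(url).netloc.lower() for url in urls}
        domains.discard("")  # relative links have no domain
        counts.update(domains)
    return counts


domain_counts = domain_post_counts(links_by_post)
print(domain_counts.most_common())
```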

Drilling down, where do ScienceBloggers link to papers?


+--------------------------------+-------+
| domain                         | count |
+--------------------------------+-------+
| www.nature.com                 |   241 |
| www.sciencemag.org             |   194 |
| www.dx.doi.org                 |   177 |
| www.ncbi.nlm.nih.gov           |   111 |
| www.pnas.org                   |   104 |
| www.plosone.org                |    89 |
| biology.plosjournals.org       |    76 |
| content.nejm.org               |    67 |
| medicine.plosjournals.org      |    65 |
| www.sciencedirect.com          |    43 |
| www.arxiv.org                  |    33 |
| genetics.plosjournals.org      |    22 |
| www.jneurosci.org              |    15 |
| www.cell.com                   |    14 |
| compbiol.plosjournals.org      |    10 |
| pediatrics.aappublications.org |    10 |
| www.jcb.org                    |    10 |
| mbe.oxfordjournals.org         |     9 |
| www.ajp.psychiatryonline.org   |     8 |
| www.current-biology.com        |     8 |
| www.journals.uchicago.edu      |     8 |
| www.plosntds.org               |     8 |
| www.blackwell-synergy.com      |     7 |
+--------------------------------+-------+


Nature and Science are at the top, perhaps unsurprisingly - but if you add up the counts from the different PLoS journals (270 in total) they'd come out ahead of both.
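
Adding up the PLoS rows is quick to check. A small Python sketch, using the counts from the table above and assuming (my assumption) that any domain containing 'plos' is a PLoS journal:

```python
# Counts copied from the paper-links table above.
paper_links = {
    "www.nature.com": 241,
    "www.sciencemag.org": 194,
    "www.plosone.org": 89,
    "biology.plosjournals.org": 76,
    "medicine.plosjournals.org": 65,
    "genetics.plosjournals.org": 22,
    "compbiol.plosjournals.org": 10,
    "www.plosntds.org": 8,
}


def is_plos(domain):
    """Assumption: a PLoS journal is any domain with 'plos' in its name."""
    return "plos" in domain


plos_total = sum(count for domain, count in paper_links.items() if is_plos(domain))
print(plos_total)  # 270
```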


Science streaming

Michael Barton has a nice post up:


I currently use Subversion to back up my project files, and I noticed Twitter status updates are very similar in length to subversion log messages. I created a short script so that every time I do a subversion repository check in, the message is also sent to Twitter.


I'd like to see activity aggregators accept arbitrary updates - sort of like Facebook's Beacon updating people's News Feed, but done properly.
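
The commit-to-Twitter idea in Michael's post could look something like the Python sketch below. This is not his script. It assumes it's called from a Subversion post-commit hook (which receives the repository path and revision), reads the log message with `svnlook log`, and hands the result to `send`, a hypothetical callable standing in for whatever actually posts the update.

```python
import subprocess

MAX_STATUS_LENGTH = 140  # Twitter's status limit at the time of this post


def commit_message(repo_path, revision):
    """Read the log message for a revision using `svnlook log`."""
    result = subprocess.run(
        ["svnlook", "log", "-r", str(revision), repo_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def to_status(message, limit=MAX_STATUS_LENGTH):
    """Collapse whitespace and truncate so the message fits in one update."""
    text = " ".join(message.split())
    if len(text) <= limit:
        return text
    return text[: limit - 3].rstrip() + "..."


def on_commit(repo_path, revision, send):
    """Post-commit entry point; `send` is a hypothetical posting function."""
    send(to_status(commit_message(repo_path, revision)))
```

Keeping `send` as a parameter means the same hook could feed any activity aggregator, not just Twitter.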


Tuesday, March 18, 2008

Nature archive visualized - draft

I'm using up my annual carry-over vacation days by taking some time off work this week. Normal people probably use this valuable breathing space to bond with their loved ones, play badminton and learn exciting new hobbies. So far I've sat alone in my flat for thirty-six hours straight writing Processing sketches *.

So... here's a draft visualization (14 MB MP4, should play in your browser with QuickTime) of the key words and phrases found in the journal Nature over the past thirty-seven years.

The video starts with the phrases from 1970 and continues until 2007.

Phrases appear on the right in the year that they were first seen, then travel leftwards, disappearing in the year they were last seen.

The size of each phrase is related to how often it was seen relative to all the other phrases.

The hue of each phrase is related to how many distinct journal issues it appeared in - green / yellow phrases are relatively transient while red / brown phrases are stable, appearing in many different contexts.
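
The mapping described above can be sketched in a few lines. The real thing is a Processing sketch; this is a Python illustration under my own assumptions: positions move linearly between first and last year, size scales with the most frequent phrase, and hue runs from 120 degrees (green) down to 0 (red). The `Phrase` class and function names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Phrase:
    text: str
    first_year: int
    last_year: int
    count: int   # how often the phrase was seen
    issues: int  # distinct journal issues it appeared in


def x_position(phrase, year, width):
    """Right edge in first_year, left edge in last_year, linear in between."""
    if not phrase.first_year <= year <= phrase.last_year:
        return None  # phrase is not on screen this year
    span = phrase.last_year - phrase.first_year
    if span == 0:
        return float(width)
    progress = (year - phrase.first_year) / span
    return width * (1.0 - progress)


def font_size(phrase, max_count, min_size=8.0, max_size=48.0):
    """Scale text size by frequency relative to the most frequent phrase."""
    return min_size + (max_size - min_size) * phrase.count / max_count


def hue_degrees(phrase, max_issues):
    """120 (green) for transient phrases down to 0 (red) for stable ones."""
    stability = phrase.issues / max_issues
    return 120.0 * (1.0 - stability)
```

Brown isn't a pure hue, so getting the red / brown end would also need lower brightness for the most stable phrases; the sketch leaves that out.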

The data is incomplete (it's a bit sparse after '88) and I took lots of shortcuts to see how things might look, so don't read too much into which phrases appear and when for now... a better version will follow - this is just a release early, release often draft.

Eventually I'd like to have a sort of Pop-up Video timeline of science from the 50s till today, with major events (and relevant terms) flashing up on screen.

If you're particularly impatient here's a version from Vimeo. The quality is rubbish, mainly because I munged the file with iMovie (which is crap) to add some rockin' beats. I still suggest you get the mp4 instead, though.

Tomorrow I'm going to the park.



* I watched Rendition, too. It was quite good.


Tuesday, March 11, 2008

Seattle

I'm going to be in Seattle the first week of April for the ICWSM.

There are a whole bunch of awesome-looking talks and one or two wild cards. Wild cards like:

Spontaneous Inference of Personality Traits from Online Profiles
Kristin Stecher, Scott Counts

Which sounds interesting, anyway.

Let me know if you're in the area and fancy meeting up for lunch or a drink. I'm in town from the 29th of March to the 5th of April.


Monday, March 10, 2008

New JoVE blog & commenting on papers

Anna Kushnir's new blog for JoVE is up and running (actually it has been up and running for a while; I'm a bit behind with blogging. Those January Open Science posts are coming at some point, too). It's a nice mix of content.

Of particular interest are a couple of entries about the online participation - or lack thereof - of scientists. See also Noah Gray's take on neuroscientists and web 2.0 and David Crotty's 'why web 2.0 is failing in biology' post.

Did you skip over all those links? You shouldn't, really. At least read David Crotty's.

So, yeah, anyway, why scientists don't comment on papers - my take is that being too busy and being afraid of the consequences don't come into it.

Sure, they're valid concerns - but everybody is busy at work and everybody realizes that what you say on the internet is recorded forever by Googlebot. People still write ranty forum posts and blog comments.

IMHO the main reasons scientists don't leave comments are:

There's no point - who's going to read it? Will you get any feedback? Will you get any credit for it?

and

It's too much work - writing a comment should be a one-click operation. Well, two clicks: one to get the focus in the textbox and the other to press 'submit'.

Science publishers can address both of these issues, but we've been failing to do so.
