EPUBs beyond the PDF

Martin Fenner wrote an interesting post about EPUBs after they came up at the Beyond the PDF workshop organized by computational bio hero Phil Bourne and Elsevier’s Anita de Waard.

We’ve been working with scholarly content in EPUBs at NPG for about a year now. An internal web service generates them automatically from content in an XML article store; it’s how we deliver content to the iPad and iPhone mobile apps.

(Actually we cheat. The mobile app EPUBs have different styles, reference fonts embedded in the app bundle and don’t always contain valid XHTML 1.1. Basically they don’t all conform to the EPUB standard, but it’s expedient while our apps are the only consumer)

It’d be a bad idea to use HTML through the entire workflow for scholarly content. EPUB is, again, designed to solve a presentation problem, not for reusability or storage or easy querying. Though the status quo often sucks, in this case there’s good reason everybody is moving towards XML.

All that said I love the idea of hacking together a WordPress tool for writing, storing and publishing papers so wanted to contribute a few thoughts:

  • EPUB requires XHTML 1.1, so Martin’s 5th slide isn’t true, strictly speaking (see what I did there?). XHTML also isn’t very exciting after the hundredth “no block elements inside inline elements” validation error
  • If I understand correctly you’re not that bothered about supporting lots of different eReaders; the attraction of EPUB is that it’s a single file that holds a bunch of other files. If that’s the case then it’s not EPUB you want, per se, just the OPF and OEBPS bits. EPUB is the combination of OCF (the OEBPS Container Format – a zip file), OPF (a manifest describing the zip file’s contents) and OPS (the content, in XHTML) – see the sketch after this list
  • There isn’t currently a standard way, understood by reading software, of representing information like “file x is a letter from reviewer 2” or “file y is the dataset used in paragraph 4”, etc. There’s a limited set of types to pick from when describing files in the manifest, but you can make up new ones.
  • You can include the same content in different formats (e.g. HTML and PDF) inside the same OEBPS container
  • Web based eReaders (including ones on mobile devices that use webkit to render pages) probably won’t care if you’ve got XHTML 1.1 or HTML5 inside your container, so you’ll still be supported by many clients even if you drop the XHTML 1.1 requirement and don’t call them .epubs.
  • … or you could just wait for EPUB v3
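
To make the container/manifest point concrete, here’s a minimal sketch of packing an article (plus a PDF alternative) into an EPUB-style zip: a stored, uncompressed mimetype entry, a META-INF/container.xml pointing at the package file, and an OPF manifest listing the content. The file names and metadata are illustrative and this isn’t our production service – just enough Python to show the moving parts:

import zipfile

OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="bookid" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="bookid">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>Example article</dc:title>
    <dc:language>en</dc:language>
  </metadata>
  <manifest>
    <item id="article" href="article.html" media-type="application/xhtml+xml"/>
    <item id="article-pdf" href="article.pdf" media-type="application/pdf"/>
  </manifest>
  <spine><itemref idref="article"/></spine>
</package>"""

CONTAINER = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

def build_epub(path, files):
    """files: dict mapping archive name to bytes for the content documents."""
    with zipfile.ZipFile(path, "w") as z:
        # the mimetype entry must come first and be stored without compression
        z.writestr("mimetype", "application/epub+zip", zipfile.ZIP_STORED)
        z.writestr("META-INF/container.xml", CONTAINER)
        z.writestr("content.opf", OPF)
        for name, data in files.items():
            z.writestr(name, data, zipfile.ZIP_DEFLATED)
        # note: a strictly valid EPUB 2 file also needs an NCX table of contents,
        # omitted here (see the point above about not quite conforming)

build_epub("article.epub", {"article.html": b"<html>...</html>", "article.pdf": b"%PDF-..."})
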
Posted in Uncategorized

Igor – a Google Wave robot to manage your references

(Google Wave hasn’t been released yet but if you’re interested in working with the preview you can request a developer account on the sandbox here)

Google Wave is a new open source project from Google that holds a lot of promise as a platform for scholarly communication. It’s a little bit like email but allows for collaborative document editing, versioning and real time conversation within groups – check out Cameron and Martin’s archives for more.

Igor is a proof of concept Wave robot that allows Wave users to pull in citations from Pubmed or their libraries on Connotea and CiteULike as they type.

To use it invite helpmeigor@appspot.com to join a wave.

Say you’d like to cite ‘Chaperonin overexpression promotes genetic variation and enzyme evolution’ by Nobuhiko Tokuriki and Dan Tawfik from last month’s Nature.

In the Wave you’d write:

… as shown by Tokuriki et al. (cite chaperonin tokuriki)

Igor will notice the (cite x), connect to PubMed, search for articles where the title, authors or journal contain “chaperonin” and “tokuriki” and then pull in the relevant citation. The (cite x) will be replaced with a number and the citation will be appended to the end of the document.

… as shown by Tokuriki et al. [1]

References

1. Chaperonin overexpression promotes genetic variation and enzyme evolution. Tokuriki et al 2009 Nature
(http://www.ncbi.nlm.nih.gov/pubmed/19494908)
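
Under the hood the flow is roughly: spot the (cite …) markers, query PubMed, and swap in numbered references. Igor itself is a Java robot running on App Engine, but here’s a minimal Python sketch of the same flow against NCBI’s E-utilities – the function names and matching logic are illustrative, not Igor’s actual code:

import json
import re
import urllib.parse
import urllib.request

CITE = re.compile(r"\(cite ([^)]+)\)")

def pubmed_search(terms):
    """Return the PubMed IDs matching a free-text query."""
    url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
           + urllib.parse.urlencode({"db": "pubmed", "term": terms, "retmode": "json"}))
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["esearchresult"]["idlist"]

def expand_citations(text):
    """Replace each (cite x) marker with [n] and append a numbered reference list."""
    references = []  # PMIDs in order of first appearance

    def replace(match):
        pmids = pubmed_search(match.group(1))
        if len(pmids) != 1:
            return match.group(0) + " [Igor: no unique match]"   # mirrors Igor's behaviour below
        if pmids[0] not in references:
            references.append(pmids[0])
        return "[%d]" % (references.index(pmids[0]) + 1)

    body = CITE.sub(replace, text)
    refs = "\n".join("%d. http://www.ncbi.nlm.nih.gov/pubmed/%s" % (i + 1, pmid)
                     for i, pmid in enumerate(references))
    return body + "\n\nReferences\n" + refs

print(expand_citations("… as shown by Tokuriki et al. (cite chaperonin tokuriki)"))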

If Igor comes up empty-handed or multiple articles match the cite query then it’ll tell you by dropping in a message after the relevant part of the document.

To cite a web page just do

Google Wave (cite http://wave.google.com)

To switch to using your Connotea or CiteULike library you can use the (cite from x) command.

e.g.

(cite from citeulike dullhunk)
(cite from connotea euanadie)
(cite from pubmed)

You can switch between citation libraries in the same session:

candidate genes include NRG1 (cite from connotea euanadie)(cite schizophrenia neuregulin) and (cite from pubmed)DISC1 (cite PDE4B evans schizophrenia)

Commands are processed in order of appearance so Igor will search Connotea for “schizophrenia neuregulin” and PubMed for “PDE4B evans schizophrenia”.
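
That order-of-appearance behaviour is easiest to picture as a single left-to-right pass that keeps track of the active library. A rough sketch (not Igor’s actual parser; the regex and the handling of account names are simplified):

import re

COMMAND = re.compile(r"\(cite (from \w+(?: \w+)?|[^)]+)\)")

def plan_lookups(text):
    """Return (library, query) pairs in the order Igor would run them."""
    library = "pubmed"            # the default source
    plan = []
    for match in COMMAND.finditer(text):
        body = match.group(1)
        if body.startswith("from "):
            library = body.split()[1]   # e.g. "connotea"; any trailing account name is ignored here
        else:
            plan.append((library, body))
    return plan

text = ("candidate genes include NRG1 (cite from connotea euanadie)"
        "(cite schizophrenia neuregulin) and (cite from pubmed)"
        "DISC1 (cite PDE4B evans schizophrenia)")
print(plan_lookups(text))
# [('connotea', 'schizophrenia neuregulin'), ('pubmed', 'PDE4B evans schizophrenia')]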

References are always numbered in order of their first appearance in the text. If you move a reference from the bottom of the article to the top then the reference numbers will change accordingly.

Igor is written in Java and runs on App Engine. It’s almost inevitable that you’ll experience some turbulence, especially when introducing him to a new Connotea or CiteULike account for the first time – Wave robots are very unforgiving of sites timing out. If something looks broken try leaving the wave and coming back to it later or reloading the page. Let us know how you get on!

Posted in Hacks, apps & mashups, Originally posted on Nascent

Lies, damn lies and article download statistics

Shirley Wu posted on Friendfeed earlier about some of the things she’d overheard people saying about PLoS ONE papers. PLoS ONE Managing Editor Peter Binfield weighed in early to point out that the best way of combating misconceptions about the journal is to push out positive info, and mentioned the journal’s article-level metrics program.

Near the end of the (long) thread was this exchange:

“You could try asking them exactly how many downloads their last paper in a ‘high impact’ journal got…” – Peter Binfield

“Fair enough, but you know, I really don’t think they think about that. They think “what will be in my CV?” and they think any journal that is somewhat competitive [includes other PLoS journals, BMC journals, etc] looks better than one that accepts anything that’s methodologically sound. Again, not my view, but perhaps one that is held by many. Do people list # of downloads on their CV for publications?” – Shirley Wu

“They don’t, because they don’t have the data. However, people do list if their paper was rated by F1000; or if BMC designated it a ‘highly accessed’ article. So I think they will start to say “this paper was downloaded 5000 times in the first 3 months which put it in the top x% of all PLoS ONE articles, the top y% of all PLoS articles, and the top z% of ALL articles” (when the rest of the world starts quoting this data)” – Peter Binfield

Do people here think that article downloads stats should be put on academic CVs? (serious question)

It feels wrong to me. IMHO encouraging anybody to take download statistics seriously as a measure of success / quality would be a mistake. Taken on their own they’re meaningless, surely – nice to know for the author, but meaningless. For them to be at all useful you’d have to supply a lot of context – as Peter suggests – though I don’t think the journal level “top 10% of papers in first three months” context he outlined would be enough either.

(just to be clear I don’t think Peter was necessarily saying that people should put only the download count on their CV – am using his comment above simply as a jumping off point for discussion)

A download counter can’t tell if the person visiting your paper is a grad student looking for a journal club paper, a researcher interested in your field or… somebody who typed in an obscure porn related search that turned up unconnected words in the abstract. A search bot. Somebody on Google Images looking for free clipart. Got a blog? Check your traffic stats. Journals get those crazy queries too, lots of them. Mainstream search engines are a major source of traffic for journals but not always for the reasons publishers might want.

As a publisher do you account for this and only record ‘good’ traffic? What if your competitors don’t?

Institutions and ISPs transparently cache pages. If my lab mate and I both download your paper then, depending on the publisher’s stats package, it might register as only one hit (from the university proxy server). Do you compensate for that somehow?

Am I going to be penalized if I host my papers on my homepage? In my institutional repository? Should I add all those counts up for my CV? Do I need to cite my sources?

Should I tell my mum to set my paper as her homepage (and to be sure to delete her cookies each morning)?

If Science spends $50m on SEO next year and hits on their article pages double, will the articles in 2010 be twice as good as those in 2009?

As an author should I be repeating keywords in my title to get more Google traffic? Should I try to include a figure of Britney Spears?

If we stick to giving ‘top x percentage’ context then do we make concessions for smaller disciplines publishing in multidisciplinary journals? More people work and publish in genetics than in quantum physics. Even if every important person in your field downloads your paper they might be outnumbered by grad students from the three dozen groups working on Rab4A effectors that download the genetics paper next to yours in the TOC.

I’m not saying that download stats aren’t useful in aggregate, or that authors don’t have a right to know how many hits their papers received, but they’re so potentially misleading (& open to misinterpretation) that they don’t seem to me to be the type of metric we want to be bandying about as an impact factor replacement.

Posted in Originally posted on Nascent, Scholarly publishing

Streamosphere update

This month’s iteration of Streamosphere is now up. It’s still more of a preview than a product, but IMHO it’s approaching usefulness!


The main changes are:

  • a new way of exploring the site – the list view shows you the most popular items within a given time frame. It’s sort of like Digg but to vote an item up you need to have commented on it or shared it on a social media site.
  • simplified sidebar, visual cues on the grid / timeline view and a help link will hopefully help new users work out what they’re seeing
  • the aggregation logic now uses Friendfeed’s SUP feed and connects directly to Twitter, so messages are picked up much faster.
  • trending topics – this is a list of topics that are appearing more frequently than you might expect. Bear in mind that it’s generated algorithmically so items are sometimes grouped together in odd (but technically correct ;) ) ways…
  • clicking on “see details” in the list view or on an item in the grid view brings up a breakdown of comments and tweets which you can use to jump straight into a conversation on, for example, Friendfeed.

There are still lots of little niggles. On smaller timescales (anything under four hours) there are lots of items that aren’t strictly speaking about science, too. Still not sure if that’s a bug or a feature.

The next version will focus on people – both the people being followed by Streamosphere and visitors to the site – and grouping items by topic.

Posted in Hacks, apps & mashups, Originally posted on Nascent

Welcome to the Streamosphere

Web publishing as a discipline has few tenets but I think release early, release often and don’t be afraid to fail are pretty sound. That was the philosophy behind Connotea when Timo and Ben Lund launched it in 2004 and it’s the spirit in which I’ve just put up an early version of Streamosphere.

Streamosphere is a pet side project which I’m running according to what I guess you could call the Paul Graham principles (it’d be disingenuous to say “as a start-up” as most startups don’t have NPG-level resources. OTOH we lack a foosball table and free M&Ms). Think of it as a pre-alpha alpha.

The elevator pitch

Streamosphere lets you track scientific discussion on the web, in real time.

What it does

If you visit streamosphere.nature.com/preview.php#24 you’ll see a page of stacked timelines like these:

[Screenshot: a page of stacked timelines]

Each timeline shows discussion around a particular item, for now always a web page. The portrait on the left is of one of the people who first started talking about the item. The slice of time in which the discussion was active (people were leaving comments, tweeting, liking or bookmarking it) is coloured a shade of magnolia. Behind the active slice is a graph – this shows you how much activity there was at any one point.

Click on an item’s active slice to pop up more details about it including an activity breakdown and a selection of associated comments and tweets. If the item is a video or photograph it should be embedded in the popup. If the item description is in a foreign language hover your mouse cursor over it to get the English translation.

[Screenshot: the item detail popup]

Streamosphere only ever shows the most active items in a given time period. Use the controls on the right hand side of the screen to see the most active items in the past few hours, day, week or month. You can also filter items by domain or by keywords in their description.

In smaller time periods you’ll see some items that aren’t anything to do with science: recently there’s been stuff about Iran and a viral video, for example. I’m not sure if this is a bug or a feature, or how to filter out non-science stuff if that’s a requirement – suggestions welcome.

In the future I’d like to see the page update dynamically as new activity gets tracked but for now to refresh the page you need to reload or choose a new time period.

How it works

Streamosphere tracks ~ 4k accounts on half a dozen different social media sites including Friendfeed, Twitter and bookmarking services like Delicious. The account owners have all self-identified (sometimes implicitly) as scientists or people interested in science.

It uses a combination of polling, web hooks (via GNIP) and SUP feeds to aggregate public updates from tracked accounts as soon after they happen as possible. Average latency is ~ 3 minutes for Friendfeed and a few seconds for Twitter.
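
The SUP part is worth a word of explanation. FriendFeed publishes a Simple Update Protocol file – a frequently regenerated list of which feeds have changed in the last minute or so – so an aggregator only re-fetches feeds that actually updated instead of hammering everything on a schedule. A rough sketch of that loop; the URL and document shape are from memory of the current API, and the SUP-IDs below are made up:

import json
import time
import urllib.request

SUP_URL = "http://friendfeed.com/api/sup.json"
TRACKED = {"53924729": "alice", "8e2b4c11": "bob"}   # SUP-ID -> account; both invented for this example

def refetch_feed(account):
    print("re-fetch", account)       # stand-in for actually pulling that account's entries

def poll_sup():
    """Return the tracked accounts whose feeds changed in the last SUP window."""
    with urllib.request.urlopen(SUP_URL) as resp:
        sup = json.load(resp)
    changed = {sup_id for sup_id, _token in sup["updates"]}   # entries are (SUP-ID, update token) pairs
    return [account for sup_id, account in TRACKED.items() if sup_id in changed]

while True:
    for account in poll_sup():
        refetch_feed(account)
    time.sleep(60)                   # SUP files only cover a short window, so poll frequently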

Right now there’s only one view on the data: by item. Items are the URIs associated with or mentioned in updates: if I tweet “I love http://lolcats.com” and you bookmark it on delicious then the streamosphere database will record a single item (lolcats.com) associated with two updates.
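
In practice that grouping is keyed on a normalised URL. Something along these lines – a simplified sketch, and the shape of the update dicts is hypothetical:

import re
from collections import defaultdict
from urllib.parse import urlparse

URL = re.compile(r"https?://\S+")
items = defaultdict(list)    # normalised item key -> updates mentioning it

def item_key(url):
    """Normalise a URL so the same page seen on different services maps to one item."""
    parsed = urlparse(url.rstrip(".,)"))
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parsed.path.rstrip("/")

def record(update):
    """update: dict with 'service', 'user' and either 'text' or 'links' (illustrative shape)."""
    for url in URL.findall(update.get("text", "")) + update.get("links", []):
        items[item_key(url)].append(update)

record({"service": "twitter", "user": "alice", "text": "I love http://lolcats.com"})
record({"service": "delicious", "user": "bob", "links": ["http://www.lolcats.com/"]})
assert len(items["lolcats.com"]) == 2    # one item, two updates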

Items are currently always websites but in the future I’d like to add views for users and topics; these are non-trivial because of problems with account owner disambiguation and classifying short messages respectively.

Owner disambiguation relies on the Google Social Graph API. We need to disambiguate owners because otherwise the same person could post a single link on multiple services and Streamosphere would believe it’s amazingly popular.

Sometimes users have set up rules to automatically route updates from one service to another (e.g. they share an item on Google Reader, which appears in their Friendfeed stream, which gets pushed out to their Twitter account). Rules like this are the bane of Streamosphere’s existence – it’s non-trivial to detect these chains and handle them correctly.

I’m collecting hashtags, tags and extracting key terms from all updates but don’t quite know what to do with them yet – still need a good algorithm to detect trending topics. Links are extracted from updates but right now there’s no disambiguation for papers (Buggotea is alive and well in Streamosphere). There’s a best effort attempt to resolve shortened URLs though occasionally one will slip through.
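
For the shortened URLs, “best effort” means something like: spot a known shortener domain, issue a HEAD request, and keep whatever URL the redirects end up at. A sketch – the shortener list is obviously incomplete:

import urllib.parse
import urllib.request

SHORTENERS = {"bit.ly", "tinyurl.com", "is.gd", "ow.ly", "j.mp"}    # partial list

def resolve(url, timeout=5):
    """Best-effort expansion of a shortened URL by following its redirects."""
    host = urllib.parse.urlparse(url).netloc.lower()
    if host not in SHORTENERS:
        return url
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.geturl()     # the final URL after any redirects
    except OSError:
        return url                   # leave it alone if the shortener times out

print(resolve("http://bit.ly/example"))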

There’s no API but if anybody has a good use for the data I’m happy to set something up using GNIP or long polling to support real time updates if necessary – just send me a use case.

Posted in Hacks, apps & mashups, Originally posted on Nascent

Aggregating activity from Twitter

Update: you can’t follow a specific set of users using GNIP any more – their feed is equivalent to the ‘spritzer’ method in the official Twitter API.

Interested in building a real time aggregator for Twitter? Who isn’t? You have lots of options:

Just the vanilla API

Simply call user_timeline for each user that you are interested in every x minutes.

The standard rate limit on the Twitter API is 100 requests per hour, e.g. checking 25 users every 15 minutes is pretty much the best you’ll be able to do. If you’re a lazy chancer you can try and get your application whitelisted, which removes rate limits.
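
In code the whole approach is just a scheduling loop. A sketch against the unauthenticated JSON endpoint as it exists at the time of writing (the exact URL is from memory, so check the current API docs):

import json
import time
import urllib.request

USERS = ["twalf", "ianmulvany"]          # accounts to aggregate
INTERVAL = 15 * 60                       # 25 users every 15 minutes stays under 100 requests/hour
last_seen = {}                           # screen name -> newest status id already stored
archive = []

def poll(user):
    url = "http://twitter.com/statuses/user_timeline.json?screen_name=" + user
    if user in last_seen:
        url += "&since_id=%d" % last_seen[user]    # skip tweets we've already seen
    with urllib.request.urlopen(url) as resp:
        tweets = json.load(resp)
    if tweets:
        last_seen[user] = max(t["id"] for t in tweets)
    return tweets

while True:
    for user in USERS:
        archive.extend(poll(user))       # one rate-limited API call per user per round
    time.sleep(INTERVAL)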

Good:

  • Very simple

Not so good:

  • Too simple – won’t scale.
  • Slow update time (while the number of calls you can make per hour is limited)
  • Seeing so much redundant data returned for each call makes the internet cry.

Vanilla API + robot

Create a new Twitter account, log in and follow the people you’re interested in aggregating tweets from. You don’t have to follow people manually – you could do it programmatically using the friendships/create API call.

Now just check the friends_timeline for that user as often as you like (up to the hourly rate limit, obviously). Page through results if necessary.

Twitter has some (sensible) rules about follower / following ratios. Once you’re following ~ 800 people further follow requests will be blocked; you have to wait until you have more followers before adding anybody else. You can’t whitelist your way out of this.

Good:

  • Again, pretty simple.
  • Better update time (aggregation within a couple of minutes of a tweet)

Not so good:

  • Can only follow ~ 800 people before Twitter starts blocking your follow requests.
  • Users will know that you’re aggregating them (is this a bug or a feature?). Can’t keep following / unfollowing people – they’ll get spammed by emails telling them about it.

GNIP

GNIP works with activity streams from a bunch of different web 2.0 sites. Here’s how it works in a nutshell:

  1. you set up a GNIP account
  2. you add rules to your account (“give me all tweets by @twalf”, “give me all tweets by @ianmulvany”) and set up a web hook (a script on your server). You can have up to 25k rules per site for free.
  3. GNIP receives data in real time from Twitter
  4. If any data matches your rule set then GNIP POSTs to your web hook with some metadata about the matching tweet (a unique id, the tweeter’s username, a URI for the actual message)

Now you’ll get pinged whenever anybody in your rules tweets – in close to real time.

Rules can be added programmatically or by hand. GNIP’s API docs are pretty opaque but it’s actually a fairly simple, efficient system once you’ve gotten to grips with it.

Unfortunately the metadata that gets POSTed to you doesn’t contain the actual tweet. For that you have to go back to Twitter using the supplied URI, which points to the message in XML format. Remember that there’s a rate limit on the Twitter API so by default you won’t be able to aggregate more than a hundred messages per hour. This sucks. Whitelisting is pretty much the only way you’re going to overcome this.
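
Put together, the web hook end of this is a tiny HTTP endpoint that accepts GNIP’s POST and then makes the extra round trip to Twitter for each message. A sketch – the payload shape here is a simplification (a JSON list of activities, each with a “url” field), so check GNIP’s docs for the real activity format:

import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class GnipHook(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        for activity in json.loads(body):            # metadata only: id, username, message URI
            with urllib.request.urlopen(activity["url"]) as resp:
                tweet = resp.read()                   # the extra, rate-limited trip back to Twitter
            print(tweet)                              # ...or hand it off to your aggregation queue
        self.send_response(200)
        self.end_headers()

HTTPServer(("", 8080), GnipHook).serve_forever()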

Twitter on GNIP is unique in this respect; none of the other services require you to call the originating site to get messages. It’s especially annoying as tweets are only 140 characters long – it’s definitely not a space / bandwidth issue!

Good:

  • Fast update time (pretty close to real time)
  • GNIP infrastructure can help you aggregate from other sites (Digg, Delicious…) in the future.
  • Follow up to 25k people for free and without scaling issues.

Not so good:

  • Relatively complex.
  • GNIP can be a bit flaky – occasionally it goes down and you lose updates for a few hours.
  • Requires whitelisting by Twitter once you’re collecting more than a hundred tweets p/h.

Twitter streaming API

Twitter has a streaming API in alpha.

You can follow up to 200k users by POSTing their ids to http://stream.twitter.com/birddog.json – after you’ve been approved by Twitter and signed a usage agreement.

You can follow up to 2k users for free using http://stream.twitter.com/shadow.json which is similar.

You can follow up to 200 users for free using http://stream.twitter.com/follow.json which is similar.

Once you’ve opened a connection to shadow or birddog it’ll never close. When a followed user tweets it’ll come down the wire as a line of JSON (ending with a carriage return). Think Comet.
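
Consuming the stream looks something like this. A sketch that assumes HTTP basic auth and a comma-separated follow parameter of numeric user ids, which is roughly how the alpha works at the moment – check the current documentation before relying on it:

import base64
import json
import urllib.request

USER_IDS = [12345, 67890]                 # numeric ids of the accounts to follow
credentials = base64.b64encode(b"username:password").decode()

req = urllib.request.Request(
    "http://stream.twitter.com/follow.json",
    data=("follow=" + ",".join(str(i) for i in USER_IDS)).encode(),
    headers={"Authorization": "Basic " + credentials},
)

with urllib.request.urlopen(req) as stream:     # the connection stays open indefinitely
    for line in stream:                         # one JSON object per tweet, one tweet per line
        line = line.strip()
        if line:                                # keep-alive newlines are empty
            print(json.loads(line)["text"])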

Good:

  • As fast an update as you’re ever going to get.
  • Don’t need to rely on third parties (like GNIP)

Not so good:

  • Still in alpha.
  • Need an agreement from Twitter to follow more than 2k users.
  • Complex (in that it requires you to move away from reactive, asynchronous scripts towards an app that can keep an HTTP connection open for hours)
Posted in Uncategorized

Pubmed Faceoff

I find the science of face perception fascinating. The human brain is highly tuned to identify, process and interpret faces – understandable, as they play a tremendously important role in our social interactions. It’s a hardwired proficiency that kicks in early and, if anything, works too well (Toast. Ebay. $28k. Say no more).

Chernoff Faces are a visualization technique developed in the 70s to take advantage of our innate ability to detect small differences in the size, shape and expressions of human faces. The idea is to take a dataset and then map each dimension to a different facial feature, be it the slant of the eyebrows, the size of the nose or the chubbiness of the cheeks (Herman Chernoff, who came up with the idea, suggested ten different possibilities).

It’s an appealing concept. Sadly Chernoff Faces never really took off, possibly because existing implementations don’t produce anything that looks like a face. You’d have a hard time finding anybody who prefers the faces produced by R to the data table they were derived from.

Computer graphics have moved on a bit from 2D lines and circles, though. Photorealistic 3D facial models are de rigueur nowadays in everything from Second Life to video games. What if we took the technology from there and applied it to Chernoff Faces?

I gave it a go. Check out Pubmed Faceoff (and be gentle – it hooks into other webservices and can be quite slow).

Pubmed Faceoff is a mashup of Pubmed, Carl Bergstrom’s Eigenfactors dataset and Scopus, inspired by something that Pierre Lindenbaum mentioned on Twitter. It renders PubMed results as a set of photorealistic Chernoff Faces whose facial features are determined by the age, citation count and journal impact factor associated with each paper. The idea is that you can tell at a glance which papers are new, exciting and high impact and which are languishing, uncited and unread.
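
The mapping step itself is nothing magic: squash each metric into a 0–1 range and hand it to whichever facial parameter it drives. A toy sketch – the feature names, the scaling constants and the shape of the paper dict are all illustrative, and the real renderer is a 3D model with far more knobs:

from dataclasses import dataclass

@dataclass
class Face:
    smile: float          # 0 = frown, 1 = broad grin    <- citation count
    eye_openness: float   # 0 = drowsy, 1 = wide awake   <- journal impact (Eigenfactor)
    apparent_age: float   # 0 = young, 1 = elderly       <- years since publication

def clamp(x):
    return max(0.0, min(1.0, x))

def face_for(paper, max_citations=100, max_impact=50, max_age_years=10):
    """paper: dict with 'citations', 'impact' and 'age_years' keys (illustrative shape)."""
    return Face(
        smile=clamp(paper["citations"] / max_citations),
        eye_openness=clamp(paper["impact"] / max_impact),
        apparent_age=clamp(paper["age_years"] / max_age_years),
    )

print(face_for({"citations": 30, "impact": 31.4, "age_years": 0.2}))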

I’m quite pleased with how the system turned out although to be honest I still think the usefulness of Chernoff Faces is debatable. Does it actually work? Is the amount of time it takes you to adjust to scanning the faces more than the amount of time it’d take to simply scan a table of data? Or is it just cute?

The gender and ethnicity of each face are picked at random to add a bit of visual interest but personally I find it slightly easier to interpret the faces when they’re all male and European. That I’m rubbish at reading women comes as no surprise but the ethnicity thing is interesting as it fits with research into cross-race facial recognition that suggests we’re each better at recognizing the types of faces that we see every day.

While the photorealism helps, it’s important with Chernoff Faces to map dimensions to the right features to aid comprehension. It definitely helps that it’s a short logical leap from ‘happy faces’ to ‘happy papers’ (in good journals that have been cited lots). Mapping apparent age to the age of the paper is also a no-brainer.

It’d be interesting to incorporate other dimensions into the faces, though. Perhaps the number of authors of a paper could determine how fat or thin a face is? A spotty complexion could indicate a first time author? Nature papers could be represented by Chuck Norris?

Update: for more on the ‘sort by impact’ idea have a look at the commentary surrounding Pierre’s original tweet.

Posted in Hacks, apps & mashups, Originally posted on Nature Network, Visualization

Sentiment analysis on science blogs

We’ve been thinking about new features for Nature.com Blogs recently, after spending a lot of time on the back end doing boring yet vital things like enabling trackbacks for journal articles on nature.com.

One particularly cool potential new feature is sentiment analysis. Nature.com Blogs already performs entity extraction, pulling out all of the names, places and things mentioned in each blog post. We use this to cluster posts about the same topic together in the “stories” section.

Sentiment analysis tries to give emotional context to entities. For example, if I blog:

“I love Biology. It rules, Physics drools”

and Nature.com Blog processes my post then it might store the following metadata alongside it:

<entities>
   <entity name="Biology" emotion="Positive" score="0.6" />
   <entity name="Physics" emotion="Negative" score="0.3" />
</entities>

… here “Biology” and “Physics” are the entities; each has an emotion associated with it in the text. There are more positive emotions associated with “biology” than there are negative emotions associated with “physics” – that’s the score part.

Sentiment analysis is still a young field and frankly it gets things wrong a lot of the time. It’s also difficult to find a system that can do both entity extraction and sentiment analysis properly – to build a proof of concept I had to use a combination of Yahoo! Term Extraction and OpenAmplify.
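
Assembling the metadata once you have the two services’ output is the easy bit. A sketch with placeholder functions standing in for the Yahoo! Term Extraction and OpenAmplify calls – the lambdas below are toy stand-ins, not real clients:

import xml.etree.ElementTree as ET

def entity_metadata(post_text, extract_entities, score_sentiment):
    """Build the <entities> block from an entity extractor and a sentiment scorer."""
    root = ET.Element("entities")
    for name in extract_entities(post_text):
        emotion, score = score_sentiment(post_text, name)    # e.g. ("Positive", 0.6)
        ET.SubElement(root, "entity", name=name, emotion=emotion, score=str(score))
    return ET.tostring(root, encoding="unicode")

print(entity_metadata(
    "I love Biology. It rules, Physics drools",
    extract_entities=lambda text: ["Biology", "Physics"],
    score_sentiment=lambda text, name: ("Positive", 0.6) if name == "Biology" else ("Negative", 0.3),
))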

Having said that, I think results over large datasets are promising. I’ve run a couple of thousand posts through the proof of concept system and compiled lists of the entities most strongly associated with positive and negative emotions in science blogs this week (published in the next couple of posts). Is this information useful? Interesting? Fun? Misleading? Any suggestions for how it might be presented are welcome!

Posted in Analysis, Machine learning, Originally posted on Nascent

Wolfram Alpha has potential but I can’t see scientists using it for a while yet

Wolfram|Alpha should have launched officially by the time you read this, though it has been live since Friday evening. The execution is slick. The different result visualizations are a great idea. It’s loaded up with cool widgets and APIs. Most of the time the servers don’t fall over (despite some glaring security holes). To quote FriendFeeder Iddo Friedberg it’s “a free, somewhat simple interface to Mathematica”. Free for personal, non-commercial use, anyway. If you’ve got any questions about the GDP of Singapore then wolframalpha.com is the place to go.

I think that it’s a very interesting project and that it’s important to bear in mind that as the homepage says:

Today’s Wolfram|Alpha is the first step in an ambitious, long-term project to make all systematic knowledge immediately computable by anyone

(Emphasis mine.) WA certainly has lots of potential, but was anybody who used it over the weekend not left feeling mildly let down? You’d have thought that we’d all have learned not to believe interweb hype after the Powerset and Cuil launches, but even if you took all the pre-launch media guff with a liberal sprinkling of salt it was hard not to expect a lot from Alpha. A breathless Andrew Johnson suggested that it was “the biggest internet revolution for a generation” in The Independent: “Wolfram Alpha has the potential to become one of the biggest names on the planet”.

Personally I was disappointed because I’d been expecting the wrong thing. I’d assumed that WA was akin to Cyc, which is a computational engine that takes a large manually curated database of “common sense” facts and relations and uses it to infer new knowledge. For example: searching photos for “someone at risk for skin cancer” might return a photo captioned “girl reclining on a beach”. Reclining at the beach implies suntanning and suntanning implies a risk of skin cancer.

A few years back a Paul Allen venture called Project Halo took the engine behind Cyc and taught it facts and rules from chemistry textbooks; it took a lot of time and money but the resulting system had a good go at answering college level chemistry exam questions.

It turns out that WA doesn’t do anything like this. One of the most interesting posts about the system that I’ve read comes from Doug Lenat, who, perhaps not coincidentally, is the founder of Cyc. Lenat was impressed by WA but notes that it’s a different beast altogether:

It does not have an ontology, so what it knows about, say, GDP, or population, or stock price, is no more nor less than the equations that involve that term… [it’s] able to report the number of cattle in Chicago but not (even a lower bound on) the number of mammals because it doesn’t know taxonomy and reason that way

If a connection isn’t represented by a manually curated equation it isn’t represented at all. Apparently the Mathematica theorem prover is currently turned off as it’s too computationally expensive.

One example of this is: “How old was Obama when Mitterrand was elected president of France?” It can tell you demographic information about Obama, if you ask, and it can tell you information about Mitterrand (including his ruleStartDate), but doesn’t make or execute the plan to calculate a person’s age on a certain date given his birth date, which is what is being asked for in this query.

It might seem harsh to criticize WA for not being what people (OK, I) wanted it to be but bear in mind that Wolfram’s About and FAQ pages suggest that WA is an amazing leap forward that brings “expert level knowledge” to everybody and “implements every known model, method, and algorithm” – it’s not like they were managing expectations particularly well.

Even if the computational inference part is lacking, the system is still potentially useful as a well-presented structured data almanac – but I’m not convinced that it’s a winner for life sciences data.

Wolfram|Alpha for genetics questions

If I search for “DISC1” I get back information about the human gene (genetics coverage in WA is lacking, despite Stephen Wolfram using a sequence search in the video demo; only the human genome is available). It tells me the transcripts, the reference sequence, the coordinates of DISC1, protein functions and a list of nearby genes.

That data is useless without proper citations, though. What genome assembly release are the gene coordinates on? Are the “nearby genes” nearby on the same assembly, or do they come from a different source? Who and what predicted the transcripts, and what data did they use? Were the protein functions confirmed by work in the lab or just predicted by algorithm (if so, what’s the confidence score)?

The “sources” link at the bottom provides a bunch of high level papers describing different genome databases but doesn’t specifically match these to elements of data on the page: furthermore there’s a disclaimer suggesting that actually the data could be from somewhere else entirely that isn’t listed. Not much help.

What happens with contradictory data? The GDP of North Korea varies depending on who I ask. How does WA – or rather whoever curates that data for WA – decide which version of the answer to show?

I’m also worried about how current the data is. Lenat mentions that:

In a small number of cases, he also connects via API to third party information, but mostly for realtime data such as a current stock price or current temperature. Rather than connecting to and relying on the current or future Semantic Web, Alpha computes its answers primarily from [Wolfram's] own curated data to the extent possible; [Stephen Wolfram] sees Alpha as the home for almost all the information it needs, and will use to answer users’ queries.

I can see why you wouldn’t want to rely on connections to third party data sources for anything that looks like a search engine; users expect a quick response. But in fast moving scientific fields the systematic knowledge that’s useful to researchers isn’t static like dates of birth or melting points – datapoints get updated, corrected and deleted all the time. Does Wolfram bulk import whole datasets regularly? If I correct an error in a record at the NCBI when will Wolfram pick it up?

Can a monolithic, generalized datastore run by Wolfram staff work as well as smaller specialized databases run by experts? What’s the incentive for the specialized databases to release data to Wolfram in the first place, given that WA will be a commercial product?

(For more science-tinged coverage: there’s lots of Wolfram|Alpha chatter on Friendfeed, there’s a new room dedicated to collecting life sciences feedback for Wolfram, and Deepak has a good blog post out.)

Posted in Originally posted on Nascent, Thumbs down

Which Web 2.0 services do scientists use?

Which web services are scientists actively contributing to?

There are ~ 1,240 Friendfeeders in science-related rooms (the-life-scientists, scienceapps, science-2-0, science-online…). What percentage have listed usernames associated with the science-related tools supported by Friendfeed?


Service Count
citeulike 41
connotea 31
delicious 431
digg 208
googlereader 394
reddit 68
slideshare 143
twitter 675
youtube 341
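
For the curious, the counting itself is a couple of loops over the FriendFeed API: fetch the members of each room, then fetch each member’s profile and tally the services they’ve listed. A rough sketch – the endpoint paths and response keys below are assumptions rather than gospel, so check the API docs before reusing this:

import json
import urllib.request
from collections import Counter

ROOMS = ["the-life-scientists", "scienceapps", "science-2-0", "science-online"]
API = "http://friendfeed.com/api"

def get(path):
    with urllib.request.urlopen("%s/%s" % (API, path)) as resp:
        return json.load(resp)

members = set()
for room in ROOMS:
    for member in get("room/%s/profile" % room).get("members", []):
        members.add(member["nickname"])            # assumed key name

service_counts = Counter()
for nickname in members:
    for service in get("user/%s/profile" % nickname).get("services", []):
        service_counts[service["id"]] += 1         # e.g. "twitter", "delicious", "googlereader"

for service, count in service_counts.most_common():
    print(service, count)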

Why this dataset isn’t very good…

There’s a bias towards services formally supported by Friendfeed – it’s easy to add feeds from supported services. Connotea and CiteULike aren’t formally supported though you can add your library RSS feeds manually. Many Friendfeed users won’t bother to do this.

People may be contributing to services (like YouTube…) for reasons that have nothing to do with science.

People who use Friendfeed aren’t a representative sample of scientists (though they may well be a representative sample of blog friendly, web savvy scientists).

People sometimes remove their Twitter feeds from Friendfeed to help keep the conversations that they start there in one place.

I picked the set of services to look at, which is why you don’t see, say, Wikipedia or OpenWetWare above (some preliminary analysis suggested that the numbers would be negligible).

That said…

We can still use it to guess at broad trends.

Almost a third of Friendfeed scientists have delicious bookmarks. Don’t discount non-academic bookmarking services as a source of paper metadata.

A similar number use the share functionality in Google Reader.

Despite rumors to the contrary not everybody is on Twitter.

A surprising (to me) number of people are uploading and favouriting items on Slideshare.

Posted in Analysis, Originally posted on Nascent, Scholarly publishing