Flags and Lollipops

Saturday, June 13, 2009

Aggregating activity from Twitter

Update: you can't follow a specific set of users using GNIP any more - their feed is equivalent to the 'spritzer' method in the official Twitter API.

Interested in building a real time aggregator for Twitter? Who isn't? You have lots of options:

Just the vanilla API

Simply call user_timeline for each user that you are interested in every x minutes.

The standard rate limit on the Twitter API is 100 requests per hour, so checking 25 users every 15 minutes is pretty much the best you'll be able to do. If you're a lazy chancer you can try to get your application whitelisted, which removes the rate limit.
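To make the arithmetic concrete, here's a minimal sketch of the polling budget. The function names are made up for illustration, and the only number taken from above is the limit of 100 requests per hour.

```python
def max_users(requests_per_hour, interval_minutes):
    """How many users can each be polled once per interval within the hourly limit."""
    polls_per_user_per_hour = 60 / interval_minutes
    return int(requests_per_hour // polls_per_user_per_hour)


def min_interval_minutes(requests_per_hour, n_users):
    """Shortest polling interval (in minutes) that keeps n_users within the limit."""
    return 60 * n_users / requests_per_hour


# With the standard limit: 25 users at 15 minute intervals uses up all 100 requests.
print(max_users(100, 15))
print(min_interval_minutes(100, 25))
```

Double the number of users and the interval doubles with it, which is the "won't scale" problem in a nutshell.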

Good:
  • Very simple

Not so good:
  • Too simple - won't scale.
  • Slow update time (because the number of calls you can make per hour is limited)
  • Seeing so much redundant data returned for each call makes the internet cry.

Vanilla API + robot

Create a new Twitter account, log in and follow the people you're interested in aggregating tweets from. You don't have to follow people manually - you could do it programmatically using the friendships/create API call.

Now just check the friends_timeline for that user as often as you like (up to the hourly rate limit, obviously). Page through results if necessary.
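Checking the timeline often means seeing the same tweets more than once, so you'll want to remember the newest one you've already handled. A rough sketch, assuming each tweet has been parsed into a dict with a numeric "id" field and that newer tweets have larger ids (the function name is hypothetical, and the actual API call is left out):

```python
def unseen(timeline, last_seen_id):
    """Return the tweets newer than last_seen_id (oldest first) and the new high-water mark."""
    fresh = sorted(
        (tweet for tweet in timeline if tweet["id"] > last_seen_id),
        key=lambda tweet: tweet["id"],
    )
    newest = fresh[-1]["id"] if fresh else last_seen_id
    return fresh, newest


# A page of results as it might look after parsing, newest first.
page = [
    {"id": 30, "text": "third"},
    {"id": 20, "text": "second"},
    {"id": 10, "text": "first"},
]
fresh, mark = unseen(page, 10)
print([tweet["text"] for tweet in fresh], mark)
```

Store the returned mark between runs and pass it back in on the next check.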

Twitter has some (sensible) rules about follower / following ratios. Once you're following ~ 800 people further follow requests will be blocked; you have to wait until you have more followers before adding anybody else. You can't whitelist your way out of this.

Good:
  • Again, pretty simple.
  • Better update time (aggregation within a couple of minutes of a tweet)

Not so good:
  • Can only follow ~ 800 people before Twitter starts blocking your follow requests.
  • Users will know that you're aggregating them (is this a bug or a feature?). You can't keep following / unfollowing people - they'll get spammed by emails telling them about it.

GNIP

GNIP works with activity streams from a bunch of different web 2.0 sites. Here's how it works in a nutshell:

  1. you set up a GNIP account
  2. you add rules to your account ("give me all tweets by @twalf" "give me all tweets by @ianmulvany") and set up a web hook (a script on your server). You can have up to 25k rules per site for free.
  3. GNIP receives data in real time from Twitter
  4. If any data matches your rule set then GNIP POSTs to your web hook with some metadata about the matching tweet (a unique id, the tweeter's username, a URI for the actual message)

Now you'll get pinged whenever anybody in your rules tweets - in close to real time.

Rules can be added programmatically or by hand. GNIP's API docs are pretty opaque but it's actually a fairly simple, efficient system once you've gotten to grips with it.

Unfortunately the metadata that gets POSTed to you doesn't contain the actual tweet. For that you have to go back to Twitter using the supplied URI, which points to the message in XML format. Remember that there's a rate limit on the Twitter API so by default you won't be able to aggregate more than a hundred messages per hour. This sucks. Whitelisting is pretty much the only way you're going to overcome this.
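If you're not whitelisted, one way to cope is to queue up the URIs that arrive at your web hook and only fetch as many as the rate limit allows. This is a sketch under assumptions, not anything GNIP or Twitter provides: the class name is invented, the limit of 100 per hour is the figure quoted above, and the actual fetching is left to the caller.

```python
from collections import deque


class FetchBudget:
    """Queue tweet URIs and release at most `limit` of them per rolling window."""

    def __init__(self, limit=100, window_seconds=3600):
        self.limit = limit
        self.window = window_seconds
        self.pending = deque()  # URIs waiting to be fetched
        self.sent = deque()     # timestamps of fetches already released

    def add(self, uri):
        self.pending.append(uri)

    def release(self, now):
        """Return the URIs that may be fetched at time `now` (in seconds)."""
        # Forget fetches that have dropped out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        ready = []
        while self.pending and len(self.sent) < self.limit:
            ready.append(self.pending.popleft())
            self.sent.append(now)
        return ready


budget = FetchBudget(limit=2)
for uri in ["uri-a", "uri-b", "uri-c"]:  # placeholder URIs
    budget.add(uri)
print(budget.release(now=0))     # the first two go out straight away
print(budget.release(now=3600))  # the third waits until the window has passed
```

The queue only delays the problem: if tweets arrive faster than the limit for long enough, the backlog just keeps growing.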

Twitter on GNIP is unique in this respect; none of the other services require you to call the originating site to get messages. It's especially annoying as tweets are only 140 characters long - it's definitely not a space / bandwidth issue!

Good:
  • Fast update time (pretty close to real time)
  • GNIP infrastructure can help you aggregate from other sites (Digg, Delicious...) in the future.
  • Follow up to 25k people for free and without scaling issues.

Not so good:
  • Relatively complex.
  • GNIP can be a bit flaky - occasionally it goes down and you lose updates for a few hours.
  • Requires whitelisting by Twitter once you're collecting more than a hundred tweets p/h.

Twitter streaming API

Twitter has a streaming API in alpha.

You can follow up to 200k users by POSTing their ids to http://stream.twitter.com/birddog.json - after you've been approved by Twitter and signed a usage agreement.

You can follow up to 2k users for free using http://stream.twitter.com/shadow.json which is similar.

You can follow up to 200 users for free using http://stream.twitter.com/follow.json which is similar.

Once you've opened a connection to shadow or birddog it'll never close. When a followed user tweets it'll come down the wire as a line of JSON (ending with a carriage return). Think Comet.
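Reading from the open connection boils down to splitting a never-ending byte stream into lines and decoding each one as JSON. Here's a minimal sketch, assuming `stream` is any object with a `read` method (an HTTP response, say); the function name is hypothetical, and opening and authenticating the connection isn't shown.

```python
import io
import json


def iter_tweets(stream, chunk_size=1024):
    """Yield one decoded JSON object per complete line read from `stream`."""
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # the connection was closed
        buffer += chunk
        lines = buffer.splitlines(keepends=True)
        buffer = b""
        for line in lines:
            if line.endswith((b"\r", b"\n")):
                text = line.strip()
                if text:  # skip blank lines
                    yield json.loads(text)
            else:
                buffer = line  # incomplete last line, wait for more data


# A fake stream standing in for the real connection.
fake = io.BytesIO(b'{"id": 1}\r{"id": 2}\r\n\r\n{"id": 3}\r')
for tweet in iter_tweets(fake, chunk_size=5):
    print(tweet["id"])
```

On a live connection a large chunk size may leave `read` waiting for more bytes before it returns, so a small chunk size (or a read call that returns whatever is available) is probably what you want there.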

Good:
  • As fast an update as you're ever going to get.
  • Don't need to rely on third parties (like GNIP)

Not so good:
  • Still in alpha.
  • Need an agreement from Twitter to follow more than 2k users.
  • Complex (in that it requires you to move away from reactive, asynchronous scripts towards an app that can keep an HTTP connection open for hours)

Friday, March 20, 2009

Postgenomic hiatus

A couple of weeks ago I switched off the Postgenomic aggregation pipeline.

This is mainly because the pipeline scripts were hogging disk / memory resources on the server, which it shares with a bunch of other applications. I'm not sure exactly where the process is sticking, but to be honest it's not a complete surprise.

Writing a blog aggregator is actually pretty easy; the hard part is dealing with all the weird edge cases. I haven't been paying close attention to the Postgenomic pipeline recently; I think what's currently going wrong is a combination of slow queries across what's now a very large database and one or more odd posts or blogs clogging up the pipeline (I'd post more details if I had them).

NPG doesn't officially support Postgenomic any more, though it does host it ably. Patching the code is something I do 'on the side', which is why it hasn't been fixed yet - I'm really pushed for time with other projects that need to take priority and will be for at least another three or four weeks.

In the meantime, no new blogs will be picked up and posts won't be aggregated at postgenomic.com. The site itself and the API will continue to work.

If you use postgenomic.com for any mashups or scripts then I apologise for the outage - sorry! It will be fixed, it's just the timing that's an issue.

In the meantime, please consider switching to Nature.com Blogs - the user facing features aren't as complete but the backend is. What's more it's fully supported by NPG developers and IT staff.

Monday, February 23, 2009

Paperview

About to put some software here, need to know the permanent URI.

Er, check back later?

Friday, January 30, 2009

Sony eReader on OSX

For future Google reference, the Sony PRS-505 is perfectly compatible with OS X (just like the Kindle). If you plug it into your Mac's USB port it should show up as a new disk ("Untitled", but still); just drag and drop EPUB, PDF or text files into the database/media/books directory et voilà.

Unfortunately it won't charge through USB while plugged into MacBook Pros (don't know about other laptops or desktop Macs) - seems like there's not enough power. I had to find a PC to recharge at.

Sunday, January 25, 2009

Graham Lawton and Darwin was Wrong

New Scientist this week has an eye grabbing cover.


The cover sports a big green tree with the words “Darwin Was Wrong.” I hope they sell a lot of magazines with that load of tripe, since they certainly were not thinking about the generations of school kids and church-goers who will now be treated to that cover in every creationist power point presentation between now and the Rapture. How many people do you think will actually read the article to discover what it was, precisely, that Darwin got wrong?

(from EvolutionBlog)

There's some fair (I think) coverage in a couple of places like Sandwalk and lots of not so fair coverage everywhere else.

I don't really understand what the big deal is. How dare a mainstream publication use a sensational cover to help sell copies? How dare a journalist cover a story that might be quote mined selectively by creationists?

It doesn't really matter if you're on a magazine front cover or tucked away on pg 127 - if somebody wants to quote you out of context then they can. Surely the thing to do at that point is to confront the person doing the mis-quoting, not to berate the original author.

The cover does make a lovely image for ID proponents to include in PowerPoint presentations, yes. But why should New Scientist care? Why should they pander to creationists and sell fewer copies of a magazine that probably does more than any number of science blogs to get schoolkids interested in science?

Graham Lawton is not the enemy.

(New Scientist is full of crap sometimes, though)

Saturday, January 24, 2009

Unforgiving UTF-8 to ASCII conversion

The bulk loader for App Engine doesn't support Unicode (?). Irksome.

Here's a quick and dirty solution if you've got iconv installed.


iconv -c -f UTF-8 -t ASCII utf8_data.csv > ascii_data.csv


This drops any Unicode characters that can't be converted (i.e. anything that doesn't have a direct ASCII match). Did say it was dirty...
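If you don't have iconv to hand, roughly the same thing can be done in Python by encoding to ASCII and ignoring whatever doesn't fit. A small sketch (the function name is made up, and it's just as lossy as the command above):

```python
def to_ascii(text):
    """Drop every character that has no ASCII encoding."""
    return text.encode("ascii", "ignore").decode("ascii")


print(to_ascii("café, naïve"))
```

Accented letters are removed outright rather than replaced with their unaccented equivalents, so check that's acceptable for your data first.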
