The movie above is a (blurry when streamed, try the 3.6mb MPEG download) visualization of the signup and invite data from Bookshare from July 18th of last year to now. As users sign up they are geolocated and plotted on the map. More users signing up in a particular location mean bigger, brighter spots. Invites from one user to another are drawn as snaking lines that travel from the city of the inviter to the city of the invitee.
I'd love to hear from anybody who owns a large Facebook application and has kept id / install date / invitee data (bearing in mind Facebook's data privacy rules...). See also this.
I've spent the last six months combing Google for 'slow mysql' results. It was enlightening. Now in the spirit of giving back to the community I give you:
Euan's Top Tips For Medium Sized LAMP Powered Web Apps
Where 'medium sized' means you have four or five concurrent users on ten tables with around half a million rows each and 'top' means that you've already done all the basic stuff - picking table types, adding indexes, designing the database properly in the first place etc.
Some are common sense. Some are not appropriate for all situations. Use at your own risk.
If most operations are reads turn the MySQL query cache on.
Don't use mysqldump for backups. Users don't like five minutes of being unable to write anything to the database because the tables are locked. Use mysqlhotcopy or an LVM based system if you're feeling adventurous.
If you're retrieving large amounts of data with PHP but you don't need the entire rowset in memory at once (or mysql_num_rows()) then use mysql_unbuffered_query instead of mysql_query. Stops you running out of memory so fast.
xdebug is a PHP profiler. With kcachegrind it's like hot coder porn. It's incredibly useful to see how long your code takes to run and why.
in_array()s are innocent looking... and very slow. If the values are unique then use associative arrays instead - i.e. instead of
$data = array("a", "b" .. large amount of data .. "zzz");
for ($i=0; $i < 1000; $i++) { print "is c in array? ".in_array("c", $data); }
use
$data = array("a", "b" .. large amount of data .. "zzz"); $a_data = array(); foreach ($data as $point) {$a_data[$point] = true;}
for ($i=0; $i < 1000; $i++) { print "is c is array? ".isset($a_data["c"]); }
key_buffer_size is the amount of memory that MySQL puts aside to keep table indexes in memory, which is a very good thing. As a rule of thumb it should be 25%-50% of your database server's total RAM (reduce this if you're also using it for other things, natch). The default is 8mb, so once you've got more than 8mb worth of indexes...
How can you tell if you need a bigger key buffer size?
mysql> SHOW STATUS LIKE '%key_read%'; +-------------------+---------+ | Variable_name | Value | +-------------------+---------+ | Key_read_requests | 6375479 | | Key_reads | 130562 | +-------------------+---------+ 2 rows in set (0.00 sec)
There should be at least 100 key_read_requests (from memory) to every key_reads (from disk), as above.
Do a lot of sorting? Increase sort_buffer_size but bear in mind that unlike key_buffer_size it is allocated per connection - i.e. increase it to 32mb and MySQL will gobble 32mb x 4 = 128mb the next time you have four connections open at once (also remember that your code will - probably - sometimes fail or forget to shut down the connection).
Do a lot of ORDER BYs? Increase read_rnd_buffer_size. Allocated per connection, as above.
Fetching frequently used data from disk can sometimes be faster than using memcached or similar (no, really. Far less overhead + the file is cached by OS and so is in memory anyway).
serialize() and unserialize() are slow. Use JSON where possible instead (nowhere near as flexible, sadly)
To paraphrase the best optimization is avoiding doing the work in the first place: do you really need to return ALL of the rows, or just the first few hundred? Do you really need to sort them? Do they need to be distinct? Is it faster to get more data than you need and process it in code?
Avoid ORDER BY RAND(). Do something clever in code (pick some randomly selected ids?)
SQL_CALC_FOUND_ROWS can be faster than rerunning a query with COUNT() - but not always.
Yes, nothing to do with bioinformatics, but social networking:
Bebo announced that it'd support Facebook style apps as well as OpenSocial late last year. They've launched properly now.
The documentation has been up for a while already and there aren't any real surprises, though there's a couple of new Bebo specific tags (which seems like a bad idea to me; I mean, the whole point is code compatibility between multiple systems).
One major omission is Javascript support; there's no equivalent to FBJS (Facebook's sandboxed Javascript version). This means that a lot of the bigger apps will hold off moving over, I'd have thought - though it's not a problem for simpler quizzes and Flash based stuff like Scrabulous.
There are a couple of OpenSocial containers now but I haven't heard of any Facebook server components being made available (just a 'top level overview' document on their wiki). We've just finished building an API for Nature Network (private, for now), I wonder how much work it'd be to start supporting apps?
OK, so it's going to be a week of open science spread out over the month. Probably. I've been distracted by:
Silverlight: actually it's quite cool - writing simple multimedia apps is a cinch. Shame that it's probably not going to catch on... it's not that much better than Flash.
All that Zed Shaw ragging on the Rails community stuff - mainly cause I think Rails is overhyped (though definitely has its place). I also liked his quote about implementing the semantic web over existing web infrastructure: "Einsteins brain on a crack whores body isn’t going to happen". I disagree but made me laugh...
Flags and Lollipops is a blog about bioinformatics, genomics and science on the web. I work at Nature in the web publishing department, but the views and opinions expressed here are personal and do not represent those of Nature Publishing Group.
You can email me at e dot adie at nature.com.
"Full disclosure" information about this blog can be found here.