Continuing yesterday's theme of well-executed web applications, I've drafted a list of common-sense "golden rules" for bioinformatics web application interfaces. After all, the underlying algorithm might be fantastic, but if nobody can use it then you may as well have kept it to yourself.
When I say "golden rule", of course, I mean "generally, and in my opinion". If you've got more to add, or disagree with any of them, add a comment and I'll check it out.
1. Know your Audience
- Explain jargon
- Describe promises and limitations of application at outset
- Have common presets for options
Work out who your application is designed for then tailor its presentation to them. Make it easy for people to get what they want out of your software: that's why you wrote it in the first place, isn't it?
Ideally all of your users will be your peers, or will have read your research papers and done some background literature searches on the algorithm that you've implemented. In practice your users will probably be lab monkeys who are interested in X and want X without having to care about the precise details of how your system works.
This means you have to achieve a balance between complexity and usability. Have presets for sets of options ("best precision", "best recall"). Explain jargon as it appears on input forms and results pages. List your system's promises and limitations on the first page so that users don't have to be an expert in the field to know what kind of things will work and what won't.
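As a rough illustration of the presets idea, the sketch below maps friendly preset names onto bundles of underlying options and still lets an expert override individual values. The option names (`score_threshold`, `max_hits`) and their values are hypothetical, invented purely for this example.

```python
# Hypothetical option names and values, for illustration only.
PRESETS = {
    "best precision": {"score_threshold": 0.9, "max_hits": 10},
    "best recall": {"score_threshold": 0.5, "max_hits": 100},
}


def resolve_options(preset, overrides=None):
    """Start from a named preset, then apply any expert overrides on top."""
    if preset not in PRESETS:
        raise ValueError(f"Unknown preset {preset!r}; choose from {sorted(PRESETS)}")
    options = dict(PRESETS[preset])  # copy so the preset itself is never mutated
    options.update(overrides or {})
    return options
```

A casual user would only ever pick a preset name from a drop-down; someone who knows the algorithm could call something like `resolve_options("best precision", {"max_hits": 25})`.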
2. Keep It Simple, Stupid
- Don't add features that nobody is going to use
- Let user decide when to ramp up complexity
Related to the above - don't add features that nobody is ever going to use. Adding lots of hyperlinks to GeneCards and RefSeq after every gene name doesn't add any value to your software. Neither does displaying internal database ids or algorithm score breakdowns which don't make any sense out of context.
Where possible, start off simple and let the user ramp up complexity when they are comfortable with it. If you've got a genome browser that can display lots of features, show a basic subset first and let the user add more rather than bombarding them with almost everything at once (*cough* Ensembl *cough*).
3. Guide your users
- Produce a tutorial (and possibly a user manual)
- Always provide examples of input
You don't need to write a manual. Just provide some guidance for your users: a short tutorial would do just fine. Always provide examples of valid input. If possible, try to explain any output on the results page.
4. Use standard formats
- Always accept relevant standard input formats
- Output relevant standard output formats
Accept standard inputs and export standard outputs. Why waste a hundred people's time on data munging (and put off a hundred more) when you could write a simple script now and implement it server side?
By standard, I mean proper standard - FASTA, please, for sequences. And make the output machine readable - that doesn't necessarily mean XML, tab delimited text files will do.
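To give an idea of how little server-side code this takes, here is a minimal sketch in Python that reads FASTA and writes a tab-delimited table of sequence lengths. The function names and the example sequences are made up for illustration, and a real application would probably lean on an established parser rather than this hand-rolled one.

```python
import csv
import io


def parse_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text stream."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The identifier is the first word of the header line.
            fields = line[1:].split()
            identifier = fields[0] if fields else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def write_lengths_tsv(records, out):
    """Write one tab-delimited row per sequence: identifier and length."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "length"])
    for identifier, sequence in records:
        writer.writerow([identifier, len(sequence)])


# Made-up example input: two short sequences.
example = io.StringIO(">seq1 example protein\nMKV\nLAA\n>seq2\nACGT\n")
output = io.StringIO()
write_lengths_tsv(parse_fasta(example), output)
print(output.getvalue(), end="")
```

The output is a header row followed by one row per sequence, which anyone can load into a spreadsheet or pipe into another script.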
5. Software wants to be free
- Let your users decide how best to use your software
If your application can be run standalone, make it available standalone, preferably under an open source license. If your software is good enough to be peer reviewed as an application note, the source code should be good enough to be peer reviewed too.
This will work to your advantage. If you don't believe me, just wait until your server crashes because some postdoc halfway across the world is trying to incorporate your application into a pipeline and is generating hundreds of requests per minute.
There's an interesting application note by Yusuf et al. from the Wasserman lab at Vancouver's Centre for Molecular Medicine and Therapeutics this week, in BMC Bioinformatics. Gene Set Builder - what, no acronym? - does exactly what it says on the tin. It's an online database which allows you to create, store and manipulate sets of genes.
It's a fairly basic idea. The rationale behind GSB is that:
while tremendous effort has been invested in developing tools that can analyze a set of genes, minimal effort has been invested in developing tools that can help researchers compile, store, and annotate gene sets in the first place.
For something like this to actually be useful the benefits of using it have to outweigh the hassle of involving yet another system and its associated peculiarities in your scripts and analyses. Luckily that seems to be the case here - it's a simple system that's straightforward to access via the web, it exports different kinds of data quite happily and it comes with tutorials and a Perl API.
The benefits of keeping gene sets on GSB rather than in, say, a text file on your hard drive are to do with GSB's built-in functionality. The GSB backend integrates with Ensembl and the Wasserman lab's own GeneLynx database to allow users to
search and import genes in batches; synchronize missing and outdated gene annotations with currently available information; compile and export gene sets as FASTA sequences, cDNA transcripts, tables, or as lists of identifiers; share data with other users; and create sets of homologs to facilitate comparative studies across species.
Nothing ground breaking - but then that's not the point. It works, and it's easy to use. It reminded me a bit of ORegAnno, in that it's a simple concept well executed. The only problem I'd have with it is... is it going to disappear six months after I put my data on it? Are the authors going to update it when Ensembl changes its API or database schema and the backend code breaks?
... Drosophila scientists get all the fun.
cleopatra: The mutant's interaction with the asp gene is lethal. Queen Cleopatra is said to have committed suicide by a poisonous asp.
(via The Old New Thing)
I've been busy at work the past couple of days, attempting to get things finished up by Christmas. I've also been experimenting with genetic algorithms, about which I'll post later on this week.
OK, so this might be old news to some people. But screw Google Video, the Research Channel (via Inforbiomatica) has lots of bioinformatics-related programming... mainly videotaped lectures. Streaming videos in their archives cover ncRNA, the basics of clustering (including clustering of gene expression data), regulatory networks, motif discovery and an appearance by William Noble (previously mentioned here in the SVM tutorial) talking about transcriptional regulatory modules.
I didn't have time to watch all of the talks because Desperate Housewives was on TV. However, there's a lot there to explore. I particularly liked Breaking the Code, about sequencing Arabidopsis, a cress so dangerous that you have to wear futuristic safety goggles when handling it in case it explodes in your face. So I gathered, anyway.
ResearchChannel is a consortium of research universities and corporate research divisions dedicated to broadening the access to and appreciation of our individual and collective activities, ideas, and opportunities in basic and applied research. [..] For our many viewers on cable, direct broadcast satellite, and the Internet, ResearchChannel is the C-SPAN of scientific and medical research.
Nice to see bioinformatics on TV, even if it is C-SPAN.
Alf has an informative post over at Hublog about bioinformatics workflows, which is an interesting area (you could also check out Fabrice's post on this subject at Propeller Twist). I had a look at Taverna a wee while ago when I wrote this faintly ranty post about the Grid, but only really sat down with it properly last week.
I thought it was promising (there's a "but" coming up in the next paragraph, but my take home message is that it's pretty nice). You select pipeline components from a big list of web services - which you can add to, obviously - and pipe input from one to the other, then the final output goes to a component which draws a graph, or outputs some text, or whatever. I'm not sure who these workflows are aimed at, though - people who do a lot of work with the same components all of the time?
The thing that stops me from using it regularly... well, the main thing is that I don't think it would save me any time or effort, so the cost / benefit of leaving behind my comfortable IDE just doesn't work out. That might change in the future, but anyway - there's also the fact that while 50% of the components I use during bioinformatics work might be stable objects that I need regularly - to fetch sequences, convert from GFF to FASTA, get some GO terms, etc. - the remaining 50% change frequently, and there's often some non-pluggable piece of software involved. To be able to add it to my workflow I need to wrap it somehow (so it can be used as a component) and have some sort of code glue to convert inputs and outputs into recognizable formats. I've not delved into the Taverna docs deeply enough to know for sure that there's not an easy way to do this, but I suspect Beanshell has to be involved as glue. That's a lot of coding in different languages when I can make both SOAP and system calls in a three line Perl script. The increase in complexity just doesn't translate into added productivity, yet.
Given that there's a skills shortage in bioinformatics, how do you convince students to spend their college years studying multiple alignments and talking about semantic life science databases when pretty much every other course on offer either sounds sexier or is more likely to make them rich?
Start young, that's how. Get to them in high school. You can talk all you want about role models but really the best way to do it is to appeal through what high school kids like best, at least in my male white middle class experience: porn and computer games. America's Army knows this already.
There's no bioinformatics porn, yet. There are plenty of bioinformatics computer games, though.
Starting off with the very basics, the Nobel Foundation has a host of different Flash games based on various prizes awarded over the years. For example, there's a simple DNA game based on the 1962 Prize for Medicine that went to Crick, Watson and Wilkins and a game called Cell Division Supervisor based on the 2001 Prize.
The UK's Royal Society - who've been in the science news recently - have a marginally more complex game with a celebrity tie-in. Activision may have Tony Hawk, but the Royal Society have Terri Attwood of PRINTS fame. It involves drag 'n' drop sequence alignments. I actually quite like it - it'd keep kids occupied for a couple of minutes, anyway - but I'm perturbed by the fact that the animations depicting sequence database searches and the like take place in an Internet Explorer window that is clearly titled "Welcome to MSN" - does the Society know something that we don't?
Finally, the genuinely impressive Origin: Unknown from the Southwest Biotechnology and Informatics Centre is a complex web based sci-fi game combining space laboratories and holo-supervisors with BLAST and ClustalW. Beats Myst any day.
(I'm joking about the computer games and porn thing, obviously. But getting kids interested in bioinformatics is an interesting topic: check out Sandra Porter's blog or this article about high school kids being taught bioinformatics, which is via Snowdeal).
There's an interesting paper in November's PLoS Biology by Neduva et al., about finding short linear motifs using protein interaction networks.
Many aspects of cell signalling, trafficking, and targeting are governed by interactions between globular protein domains and short peptide segments. These domains often bind multiple peptides that share a common sequence pattern, or “linear motif” (e.g., SH3 binding to PxxP). Many domains are known, though comparatively few linear motifs have been discovered. Their short length (three to eight residues), and the fact that they often reside in disordered regions in proteins makes them difficult to detect through sequence comparison or experiment.
The idea is that for each protein in an interaction network you take its interactors, remove the parts of each that are unlikely to contain linear motifs (like globular domains, coiled coils and signal peptides) and then search the remaining peptide sequences for overrepresented motifs, compared to a control set of 15,000 proteins selected at random from SWISSPROT. The motifs are then ranked according to their p-value, which represents how unlikely the motif is to be so frequently observed in so few proteins.
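To make the "overrepresented compared to a control set" step concrete, here is a toy sketch in Python. It is not the authors' statistic: as an assumption for illustration, it estimates the background rate as the fraction of control sequences containing the motif and scores the interactor set with a plain binomial tail probability. The sequences are invented, and the motif is written as a regular expression (PxxP becomes `P..P`).

```python
import re
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def motif_p_value(pattern, interactors, background):
    """Score a motif by how surprising its frequency among the interactors is.

    Assumption for this sketch: a simple binomial model, which stands in
    for whatever statistic the paper actually uses.
    """
    regex = re.compile(pattern)
    # k: number of interactor sequences that contain the motif at least once.
    k = sum(1 for seq in interactors if regex.search(seq))
    # p: fraction of control sequences that contain the motif.
    p = sum(1 for seq in background if regex.search(seq)) / len(background)
    return binomial_tail(k, len(interactors), p)


# Invented toy sequences, far too few to mean anything statistically.
interactors = ["AAPLLPAA", "GPQRPG", "TTTT"]
background = ["AAAA", "PAAP", "GGGG", "CCCC"]
p_value = motif_p_value(r"P..P", interactors, background)
```

Ranking candidate motifs is then a matter of sorting by this value, smallest first. A real analysis would also need to correct for the very large number of candidate motifs being tested.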
Three of the previously uncharacterized linear motifs they found in Drosophila and yeast were tested in the lab, and two of them were confirmed (doesn't seem like a set big enough to draw any conclusions from, but this is essentially an in silico paper, after all).
The authors also used the same approach on sets of interacting proteins from the Eukaryotic Linear Motif database and found that often the curated linear motif from ELM was the same as the top ranking motif in their results.
While there isn't anything particularly exciting about the methodology here, it's interesting to see protein interaction networks being used for something other than protein classification or hand waving (about network architecture, evolutionary pressures, etc.).
I'm also surprised that nobody has done anything similar up until now. I remember a paper about globular domains being used to predict new protein interactors, but nothing the other way round...
Another day, another interface to PubMed. This one is from Muin et al. from the NLM, where you'd think they'd have the home field advantage (to be fair, the NLM is a big place with lots of projects going on).
It's called SLIM, for 'Slider Interface for MEDLINE / PubMed searches'. Basically, instead of all those drop-down boxes and things, you use sliders to limit your search to papers submitted after a certain date, in certain types of journal and so on.
Let me first say that I like the sliders idea. It's just plain friendlier to be able to graphically manipulate limits. Let me then say... why wasn't the potential here maximized? How about six months developing the idea further, and then submitting a manuscript describing it?
When I read the abstract I envisioned some kind of AJAX / XML powered intelligent filter. You type in a broad search term, it comes back with 1,000 papers (and displays the first page so you get an idea of the type of results returned). You move the sliders around and as you do so the total number of papers changes as you watch. Something like this Laszlo demo (watch out: link leads to Flash). Er, it's not like that, at all. I was disappointed.
OK, the fault lies more with my heightened expectations than with the authors. But there are already excellent PubMed interfaces out there with more good ideas presented every month. You have to do something a bit more special than an interface mockup and some reports on stability during alpha and beta testing, really, to stand out. Everybody who has access to the NCBI E-Utilities can build their own search interface, which is great. From the NLM we need stuff that we can't whip up ourselves in a couple of days...
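For a sense of how low the barrier is, the sketch below builds an E-Utilities ESearch URL for PubMed with an optional publication date range, which is roughly what a date slider would need to produce. The function name and defaults are my own invention, it only constructs the URL without sending a request, and it says nothing about how SLIM itself is implemented.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term, min_year=None, max_year=None, max_results=20):
    """Build an ESearch URL for PubMed, optionally limited by publication date."""
    params = {"db": "pubmed", "term": term, "retmax": max_results}
    if min_year is not None and max_year is not None:
        # ESearch expects mindate and maxdate to be supplied together.
        params.update({"datetype": "pdat", "mindate": min_year, "maxdate": max_year})
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_pubmed_search_url("linear motif", min_year=2004, max_year=2005)
```

Fetching that URL returns a list of matching PubMed IDs plus a total count, so a slider handler could re-run the query as the limits move and update the number on screen.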