Flags and Lollipops

Friday, July 29, 2005

Machine Learning

I wonder if there's anybody left in bioinformatics who doesn't know what a support vector machine is? SVMs definitely seem to be the current machine learning technique of choice. Personally I'm glad, if only because they've reduced the amount of hideous statistics in manuscripts. I still don't understand exactly how SVMs actually work (it's something to do with multidimensional data envelopes, right?) but I feel that I've got a handle on them, while anything to do with discriminant analysis and Greek symbols just makes my head hurt - it's a psychosomatic maths lecture flashback thing.

Anyway, there are lots of other machine learning algorithms out there, some of which are a lot better suited to working with biological data. Unfortunately, unless you're collaborating with the local CS department or have a knowledgeable enthusiast to hand, it can be difficult to actually find implementations of those algorithms. Even if you do obtain one you're usually forced to jump through hoops to get data into the correct format (my pet hate is software that needs more than one file to describe a single dataset).

That's where Weka comes in. It's a freely available Java-based suite of machine learning algorithms written by Ian Witten and Eibe Frank - amongst many others - at the University of Waikato in New Zealand (a weka is a kind of curious, flightless bird native to NZ).

The good thing about it is that once you have your data in Weka's ARFF format you can perform any amount of data manipulation, clustering, data mining and rule learning with it, although my installation suffers from an unfortunate memory leak that means I have to periodically restart the software. It's a flexible system; there's a GUI for those just interested in performing one-off tasks or you can access the underlying code via the Weka API. Lots of algorithms, too: decision trees, support vector machines, k-nearest neighbour, rule tables...
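For anyone who hasn't seen ARFF before, here is a minimal sketch of what getting data into it can look like. It assumes only the basic layout of the format (an `@relation` line, one `@attribute` line per column, then `@data` followed by comma-separated rows). The `to_arff` helper, the dataset name and the values are all made up for illustration, and the sketch doesn't handle quoting of names that contain spaces, missing values, or string and date attributes.

```python
def to_arff(relation, attributes, rows):
    """Render a small dataset as ARFF text.

    attributes: list of (name, kind) pairs, where kind is either the string
    "numeric" or a list of allowed nominal values.
    rows: list of tuples, one value per attribute, in attribute order.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            # Simple type keyword such as "numeric".
            lines.append("@attribute " + name + " " + kind)
        else:
            # Nominal attribute: allowed values listed inside braces.
            lines.append("@attribute " + name + " {" + ",".join(kind) + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical toy dataset: two expression measurements and a class label.
arff_text = to_arff(
    "toy_expression",
    [
        ("gene_a", "numeric"),
        ("gene_b", "numeric"),
        ("label", ["tumour", "normal"]),
    ],
    [(0.12, 1.80, "normal"), (2.45, 0.30, "tumour")],
)
print(arff_text)
```

The nice part is that everything lives in one file: the attribute declarations and the data travel together, which is exactly what my pet hate above is missing.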

I discovered Weka a while back via the previous edition of Ian and Eibe's excellent book and that's probably the easiest way to get started with it (usually I'd say screw the book and just dive in, but elements of Weka's online documentation are sadly lacking). The book is also a pretty good introduction to machine learning and explains a lot of the underlying concepts in a clear and concise fashion.

Furthermore:
KDnuggets has a list of other machine learning software available on the web - if you're looking for more information, then that's a good place to start.

Thursday, July 28, 2005

Information visualisation

I think there's definitely some interesting work to be done with new and improved visualisations of large genomic datasets. I spent a couple of weeks earlier this year trying to build a DAS viewer with Laszlo, until it became apparent that Flash isn't the language to use if you're manipulating lots of data (it gets sloooow). If only Java was capable of prettier graphics...

Anyway, I recently came across information aesthetics, which is a blog about all things visualisation. They link to some great ideas - I loved this one, and this one is nice, too. A lot of it is art installation sort of stuff.

More specific to bioinformatics is Ben Fry's work in "genomic cartography". Ben is also one of the people behind Processing, a Java-based graphical scripting language (though perhaps that description does it a disservice) which has been on my to-learn list for months. Robert Kosara also has some interesting ideas and papers on his personal web site concerning molecule visualisation and a very cool depth-of-field-like effect.

Wednesday, July 27, 2005

Tired Topics

Bioinformatics suffers quite a lot from fads and bandwagons. Does this happen as much in other scientific disciplines, I wonder?

It's not difficult to do a literature search to find out how much work has already been done in a particular field; either people do these but ignore the results, or they're convinced that they can add some valuable insight (experience says: probably not), or they've started a project and can't back out of publishing because they've already got too much invested in it. Why do journals keep publishing this stuff?

Anyway, if I never read a paper on any of the following topics again, I'll be a happy man (not least because I'm guilty of dabbling in one or two of them myself):
  • Anything to do with analysing microarray data using GO terms: this was a good idea three years ago. Finding overrepresented GO terms - how many web-based systems to do this do we need?
  • Grouping proteins somehow to help automatically annotate genes: unless the specificity of your system is a lot better than what already exists, please keep it to yourself. Poor quality annotation is far worse than no annotation at all. Sometimes I swear the only difference between some systems is the tortured acronym used for a name.
  • Text mining for protein interactions: does anybody actually trust this kind of data?
  • Analysing the structure of protein interaction networks: they're scale free! They're not scale free! Nobody knows! If current protein interaction networks are actually just incomplete graphs of all possible interactions (under a variety of different conditions, given the combined networks used nowadays) then how relevant to actual biological processes can any such analyses be anyway?
Bioinformatics is a fast moving subject, but sometimes it's just not fast moving enough.

Ensembl redesign

Ensembl have redesigned their web interface to go along with the v32 release. Looks nice - much more functional. Haven't really explored in any great depth but certainly it's much easier to get information about how the Ensembl MySQL databases are set up and the instructions on getting the Perl API are finally out of PDF format, hurray!

As a bonus, they recommend using Firefox. Always nice to see.

Full disclosure (almost)

(updated Feb '07)

Hsien over at Genetics and Health proposed a while back that medical and health bloggers should introduce "full disclosure" to help visitors evaluate the resources on each site.

The vast majority of posts on Flags and Lollipops are pitched at people working in bioinformatics or computer science who hopefully can already critically evaluate science writing. Some of the questions still apply, though, so in the interests of openness:

1. Who runs this site?

Me (Stew). I work for Nature Publishing Group in the web publishing department. Sorry for the pseudo-anonymity. I started off blogging that way and it seems wrong to stop.

2. Who pays for the site?

I do, out of my own pocket. It doesn't cost much.

3. What is the purpose of the site?

To talk about new developments in bioinformatics, genomics and science on the web; to highlight interesting topics in those areas; to provide a forum for my long inarticulate rants.

4. Where does the information come from?

Chats with colleagues, buzz at conferences, posts on other blogs, literature searches - I also keep an eye on del.icio.us, newspapers, that kind of thing.

5. What is the basis of the information?

You have to rely on my interpretation of the facts, bearing in mind that my molecular biology is mostly self-taught. If there's a peer reviewed paper to be referenced then I'll include the link.

6. How is the information selected?

To borrow a paragraph from Hsien:
Unlike a scientific journal, magazine, or newspaper, there is no editor for [Flags and Lollipops]. Hence, like most other blogs, there is no fact checker other than myself. I rely on you, my readers, to correct me when I'm mistaken and to share your experiences. I welcome all comments whether you agree with me or not.

7. How current is the information?

Bioinformatics is a fast moving area. The information on this site is current as of the date that the information was posted.

8. How does the site choose links to other sites?

If I like a site enough to subscribe to its RSS feed then I include it on the linkbar on the right hand side of the screen. I'm interested in any bioinformatics related blog though I try and keep the list down to those which are regularly updated (twice a month, say).

I don't do any background checks on sites that I link to from posts, but in general they should be relatively trustworthy. Sometimes I link to Wikipedia for background information on a topic, but it's usually to fairly uncontroversial science related pages.

9. What information about you does the site collect, and why?

I use Google Analytics, so this site collects information about your web browser, screen resolution and whether or not you use Javascript. In theory it also tells me how long you spend on the site and that sort of thing, but I don't actually look at all that stuff: it's too depressing. Last year I discovered that most people reach the site via searches for 'lollipops' or the 'biggest breasts in Europe' and so only stay for a few seconds.

10. How does the site manage interactions with visitors?

You can comment on most posts, or email me directly. Feel free to be critical of posts - I won't remove anything unless it's truly heinous (or spam).

Modelling Phenotypes with Bayesian Networks

Read an interesting paper this morning in Nature Genetics (via a commentary article in EJHG) by Paola Sebastiani et al at the Boston University School of Public Health.

The abstract and supplementary data are available here. Essentially, Sebastiani used Bayesian networks to analyse a set of ~100 SNPs in candidate genes for sickle cell anaemia to see if any of them modulated the risk of overt stroke, a severe complication that happens to around 1 in 14 SCA patients.

SCA is classed as a monogenic disease; that is to say, a fault in a single gene gives rise to the disease phenotype. Of course, things are never that simple in human genetics and it turns out that many monogenic diseases - SCA included - are affected by mutations in other genes that alter things like the age of onset, the types and frequency of complications, disease severity and response to treatment. The SNPs that Sebastiani looked at, some of which have been shown to contribute negatively to the risk of stroke and some positively, were spread out over 39 different genes.

The Bayesian approach allowed Sebastiani to look simultaneously at many mutations in genes suspected to play a role in SCA phenotype modification. Most statistical methods used to analyse the effect of SNPs on disease phenotypes deal with the mutations one at a time. Trained on a set of markers from 92 SCA patients who had suffered strokes and 1306 who had not, the resulting network was tested to see if it could be used to predict the likelihood of a patient suffering from stroke, given their genotype.
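To get a feel for what "likelihood of stroke, given their genotype" means, here's a rough sketch of Bayes' rule for a single marker. This is much simpler than the network in the paper, which combines many markers and the dependencies between them. The prior is just the proportion of stroke patients in the training set quoted above (using that as a prior is itself an assumption), and the two conditional probabilities are invented purely for illustration.

```python
def posterior(prior, p_marker_given_stroke, p_marker_given_no_stroke):
    """P(stroke | marker present) via Bayes' rule for one binary marker."""
    # Total probability of seeing the marker at all, across both groups.
    evidence = (prior * p_marker_given_stroke
                + (1 - prior) * p_marker_given_no_stroke)
    return prior * p_marker_given_stroke / evidence


# Prior taken from the training set sizes quoted in the post.
prior = 92 / (92 + 1306)

# Hypothetical marker frequencies, NOT taken from the paper:
# present in 60% of stroke patients and 20% of everybody else.
risk = posterior(prior, 0.60, 0.20)

print(f"prior risk:   {prior:.3f}")
print(f"updated risk: {risk:.3f}")
```

Even a marker that is three times more common in stroke patients only moves the estimate so far when the starting risk is low, which is part of why looking at lots of markers at once is attractive.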

Rather promisingly, they report a success rate of 98.2% on an independent test set of patients, with 100% of the true positives and 98% of the true negatives detected. 25 of the SNPs in 11 different genes were found to directly modulate stroke risk.
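As a sanity check on how those three percentages fit together, here's a small sketch that computes accuracy, sensitivity and specificity from a confusion matrix. The counts below are illustrative only: I've picked them because they reproduce the quoted figures, and nothing above gives the actual size of the test set.

```python
def summarise(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,      # all correct calls
        "sensitivity": tp / (tp + fn),      # stroke cases detected
        "specificity": tn / (tn + fp),      # non-stroke cases detected
    }


# Illustrative counts (an assumption, chosen to match the quoted percentages).
stats = summarise(tp=7, fn=0, tn=105, fp=2)

for name, value in stats.items():
    print(f"{name}: {value * 100:.1f}%")
```

Note how few positives you need for a test set to give "100% of true positives": with counts this small, a single missed stroke would change the sensitivity a lot.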

Those numbers are great, but there are caveats perhaps not immediately apparent: it's worth bearing in mind that the success rate might be population specific. Both the training set and independent test set of patients were African Americans - other populations might have subtle differences in the way particular mutations affect phenotype.

It also seems strange that there is so little environmental contribution to stroke risk in these patients; the 108 SNPs chosen for the study presumably don't make up an exhaustive list of potential disease-modifying mutations, so the 98.2% success rate based on genotype alone isn't even an upper bound - there may well be a number of SNPs not considered for inclusion in this study for whatever reason that complete the picture even further.

It's nice to see research into modelling phenotypes like this produce good results. This kind of thing isn't really my area, so I don't know if anybody has done similar studies with Bayesian approaches that include environmental factors; maybe one day it'll be possible to create predictive tests and model disease outcomes for schizophrenia, or autism, or heart disease.

Still seems very far away, though.

Tuesday, July 26, 2005

First Post

Decided I wanted somewhere I could post interesting bioinformatics links and stories to (nodalpoint is another option, but it's temporarily down). Thus Flags and Lollipops is born.

By Flags and Lollipops, incidentally, I'm referring to the ways of marking base pairs in diagrams of short stretches of DNA - it's not some strange fetish.
