Flags and Lollipops

Friday, July 29, 2005

Machine Learning

I wonder if there's anybody left in bioinformatics who doesn't know what a support vector machine is? SVMs definitely seem to be the current machine learning technique of choice. Personally I'm glad, if only because they've reduced the amount of hideous statistics in manuscripts. I still don't understand exactly how SVMs actually work (it's something to do with multidimensional data envelopes, right?) but I feel that I've got a handle on them, while anything to do with discriminant analysis and Greek symbols just makes my head hurt - it's a psychosomatic maths lecture flashback thing.

Anyway, there are lots of other machine learning algorithms out there, some of which are a lot better suited to working with biological data. Unfortunately, unless you're collaborating with the local CS department or have a knowledgeable enthusiast to hand, it can be difficult to actually find implementations of those algorithms. Even if you do obtain one, you're usually forced to jump through hoops to get data into the correct format (my pet hate is software that needs more than one file to describe a single dataset).

That's where Weka comes in. It's a freely available, Java-based suite of machine learning algorithms written by Ian Witten and Eibe Frank - amongst many others - at the University of Waikato in New Zealand (a weka is a kind of curious, flightless bird native to NZ).

The good thing about it is that once you have your data in Weka's ARFF format you can perform any amount of data manipulation, clustering, data mining and rule learning with it, although my installation suffers from an unfortunate memory leak that means I have to periodically restart the software. It's a flexible system; there's a GUI for those just interested in performing one-off tasks or you can access the underlying code via the Weka API. Lots of algorithms, too: decision trees, support vector machines, k-nearest neighbour, rule tables...
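For anyone who hasn't met ARFF, here's a rough sketch of what getting data into it might look like. An ARFF file is a single text file: a header naming the relation and declaring each attribute, followed by an `@data` section of comma-separated rows, with `?` standing in for missing values. The little Python helper below is my own illustration and not part of Weka; the function name `to_arff`, the attribute names and the values are all made up, and it ignores details such as quoting names that contain spaces.

```python
def to_arff(relation, attributes, rows):
    """Render a small dataset as ARFF text (illustrative helper, not Weka code).

    attributes: list of (name, kind) pairs, where kind is either the string
                "numeric" or a list of allowed nominal values.
    rows:       list of tuples, one value per attribute; None marks a
                missing value and is written as '?'.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            # Simple type keyword such as "numeric"
            lines.append(f"@attribute {name} {kind}")
        else:
            # Nominal attribute: list the allowed values inside braces
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row length does not match attribute count")
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical two-sample expression dataset, purely for illustration
arff_text = to_arff(
    "expression",
    [("expr_a", "numeric"), ("expr_b", "numeric"), ("class", ["tumour", "normal"])],
    [(2.1, 0.4, "tumour"), (0.3, None, "normal")],
)
print(arff_text)
```

The point is that the header and the data live in the same file, so one file describes one dataset.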

I discovered Weka a while back via the previous edition of Ian and Eibe's excellent book and that's probably the easiest way to get started with it (usually I'd say screw the book and just dive in, but elements of Weka's online documentation are sadly lacking). The book is also a pretty good introduction to machine learning and explains a lot of the underlying concepts in a clear and concise fashion.

Furthermore:
KDnuggets has a list of other machine learning software available on the web - if you're looking for more information, then that's a good place to start.


6 Comments:

At July 30, 2005 1:27 PM, Anonymous Spitshine said...

Did you check out GIST for both an introduction to SVMs and a simple implementation of one on the web? Recommended.

Anyway, welcome to the tiny club of bioinformatics bloggers...

 
At August 01, 2005 4:24 PM, Blogger Stew said...

Hey, good link - cheers. Hadn't seen it before.

And thanks for the welcome!

 
At November 17, 2005 4:26 PM, Anonymous Anonymous said...

Did you see the Orange data mining suite - www.ailab.si/orange? They have a great interface, a lot of functionality and a lot of stuff for bioinformatics.

 
At November 18, 2005 6:42 PM, Blogger Stew said...

Yeah, had a look at Orange a wee while ago when I was looking for a machine learning toolkit, though I've never used it - always thought that I should learn Python first...

Would be interested in hearing about any bioinformatics work done with Orange, though.

 
At February 16, 2006 3:55 AM, Anonymous Neil said...

Coming a little late to this post, but it set me thinking about my approach to the problem of learning these methods.

Coming from a biological background, I didn't get much maths or stats training and really regret that. It's absurd that most undergrad biologists get little or no training in multivariate stats when most real-world biological data is multivariate.

I find with a lot of effort, I can get my head around these concepts and gain at least some empirical understanding of what they do, when to use them and how to interpret the output. Lots of googling for introductory material - maths undergrad lectures online, tutorials and so on. I often find that once I have my data in the required input format for a piece of software, that goes a long way to aid understanding. This is especially the case with something like R, which has many of these very powerful methods (SVMs, discriminant analysis and so on), but has terrible "official" documentation. Once you see your data and can say "it's a matrix of x columns and y rows and column x1 contains the classes", then correlate the scores etc. back to that, it often starts to make intuitive sense.

 
At May 12, 2010 12:54 AM, Anonymous NS said...

Have you looked into RVMs at all? Relevance Vector Machines, a Bayesian analog to SVMs...

 
