Flags and Lollipops

Thursday, November 17, 2005

Classification Trees & Weka Docs

Adhanom Tewolde at the Katholieke Universiteit Leuven in Belgium has put up a nice resource for people interested in getting started with the Weka machine learning platform (which implements a lot of different machine learning algorithms; see this previous post on the subject) and with decision trees in general.

Decision trees are used for all sorts of things in bioinformatics: for example, predicting genetic regulatory response to different experiments in yeast, determining the extent of resistance to antiretroviral drugs in HIV patients, and annotating proteins.

Adhanom's guide explains how decision trees work, how they are created and what the advantages of using them are.

Simplistically: imagine that you have a training set of data labeled either class A or B. A tree-building algorithm will start off with a single node, representing the entire training set. It'll then decide on a "split" which produces two child nodes, each representing a subset of the training data. The goal of each split is to maximize the purity of the child nodes: a node is purest when it only contains one class. Each child node can have further splits, producing child nodes of child nodes, and so on. Eventually you end up with a tree-like structure with a "root" node at the base representing your entire training set (a mix of As and Bs) and lots of pure leaf nodes at the top (ideally, some of which are all As and some of which are all Bs).
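To make the idea of "purity" a bit more concrete, here is a minimal Python sketch of how a single split might be chosen on one numeric variable. It assumes Gini impurity as the purity measure, which is only one common choice (the guide and the Weka algorithms may well use something else, such as information gain), and the function names and toy data are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 0.0 when the node holds a single class."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Try each midpoint between adjacent sorted values of one numeric variable.

    Returns (threshold, weighted impurity of the two children). The threshold
    is None if no candidate improves on the impurity of the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated by a threshold
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy training set: one numeric variable, two classes.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["A", "A", "A", "B", "B", "B"]
threshold, impurity = best_split(values, labels)
```

On this toy data the split lands between 3.0 and 10.0, and both child nodes come out pure. A real tree builder would repeat the same search on each child, across all variables, until the leaves are pure or some stopping rule kicks in.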

You can then feed a different, unlabeled dataset into that tree and by following the splits, classify each element of that dataset on the basis of which leaf node it ends up in.
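As a rough sketch of what "following the splits" looks like, the Python snippet below walks one unlabeled item down a small hand-written tree. The nested-dictionary layout, the variable names ("expression", "length") and the thresholds are all invented for illustration; this is not how Weka stores its trees.

```python
# A hypothetical, hand-built tree: internal nodes test one variable against
# a threshold, leaf nodes carry a class label.
tree = {
    "feature": "expression",
    "threshold": 6.5,
    "left": {"label": "A"},
    "right": {
        "feature": "length",
        "threshold": 100,
        "left": {"label": "B"},
        "right": {"label": "A"},
    },
}

def classify(node, item):
    """Follow the splits from the root until a leaf is reached."""
    while "label" not in node:
        branch = "left" if item[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

# Classify a small unlabeled dataset.
unlabeled = [
    {"expression": 2.0, "length": 50},
    {"expression": 9.0, "length": 50},
    {"expression": 9.0, "length": 150},
]
predictions = [classify(tree, item) for item in unlabeled]
```

Each item ends up with the label of whichever leaf it falls into, so the three items above are classified from the two tests alone.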

One advantage of decision trees is that they can easily handle categorical as well as numerical variables. Another is that (depending on the algorithm you use and how many variables are involved) it can be a lot easier to interpret a tree than the workings of a "black box" neural network or an SVM.
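One way to see why mixed variable types are not a problem: the test at each node only has to send an item left or right. The sketch below, with a made-up helper name, assumes numeric variables are compared against a threshold and categorical ones are checked for membership in a set of categories. Actual algorithms differ in the details.

```python
def goes_left(value, split_point):
    """Decide which child an item goes to at one node.

    A numeric split_point is treated as a threshold; anything else is
    treated as a collection of categories that are sent to the left child.
    """
    if isinstance(split_point, (int, float)):
        return value <= split_point
    return value in split_point

numeric_result = goes_left(3.2, 5.0)                          # threshold test
categorical_result = goes_left("yeast", {"yeast", "fly"})     # membership test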

Weka contains a number of different tree-building algorithms. Another option linked to from the page above, though, is Shih Data Miner, which I hadn't heard of before. It seems to be quite well documented and its feedback is a bit more visual, so maybe it would be a good place to start experimenting?
