Flags and Lollipops

Wednesday, October 19, 2005

Support Vector Machines: A Quick Tutorial

A recent paper by Noble et al. (watch out, PDF) in Bioinformatics spurred me on to finally getting to grips with support vector machines (SVMs). Afterwards I figured that it might be nice if there was a brief tutorial specifically geared towards sequence-based SVMs on the web: so here we are. I'm going to go through the steps involved in replicating (sort of) the results from that paper; given a PC and the ability to install some basic software packages you should be able to follow easily.

Some background

DNA is packaged into chromatin rather than floating around free and easy, so most of the time much of the DNA sequence isn't accessible to regulatory processes in the cell. Some bits of the sequence, however, are associated with "openings" in the chromatin structure. You can detect them in vivo as DNaseI hypersensitive sites (HSs). These hypersensitive sites are very strongly associated with cis regulatory elements in that almost every cis regulatory element is also an HSs.

Up until fairly recently it was fairly difficult to detect HSs in vivo. Recently more high-throughput methods have been developed, but wouldn't it be even better if there was a way to locate all HSs (and thus cis-regulatory regions...) in silico?

The Noble et al. paper talks about how they trained an SVM to differentiate between 280 known hypersensitive sites and 737 controls, acheiving a cross-validation accuracy of ~ 85%. This seems awfully high for a genomics machine learning problem. Then again, Bill Noble the first author wrote GIST, a well known SVM package, so he obviously knows what he's doing. Anyway, to prove that it's possible we're going to do a similar experiment.

Prerequisites

You'll need to download and install libsvm (the latest version) to follow this tutorial.

Libsvm comes as a .zip file which includes Windows binaries: on other platforms you'll need to compile it which is a bit outwith the scope of this tutorial (as is the Java version, which is also in the download package).

Optionally you'll need Python and Gnuplot, but I'll also provide the output of any scripts that use them along the way.

Getting Started

The first thing to do is to put our data into a format suitable for Libsvm. That format looks something like this:


Each line represents an example (in our case, a sequence). The first number is the class of the example, either positive (1) or negative (-1). Then there's some whitespace. Each pair of numbers seperated by a colon (e.g. 1:0.42, 2:0.4 ...) after that represent features. Features are just numbers that describe the example.

Imagine that we were building an SVM that would classify genes as being either housekeeping genes (we'll make that the positive class) or not (the negative class). We would collect, say, 100 examples of housekeeping genes and 100 non-housekeeping genes. As features we could pick gene length, the number of exons, tissue specificity based on some high-throughput set and so on.

There's a nice introduction to how SVMs actually work available at the DTREG homepage, but for the impatient: to create an SVM the relevant software plots the training set of examples we give it as vectors (points) on a graph with a high number of dimensions. It then mathemagically works out the hyperplane (i.e. a plane in those dimensions) which best seperates the positive class of examples from the negative. After that, to classify a new object with the same kind of features we just need to plot it on the graph and see which side of the hyperplane it's on. The support vectors are vectors which are on the "edge" of the space in-between the two classifications on the graph and which help define the hyperplane.

Before we can construct a training set, we need to decide what features should we use to describe our examples. The number of times that certain motifs appear in each sequence? What about sequences of different length?

Noble et al. use the spectrum kernel, which is a fancy name for a feature set representing the distribution of every possible k-mer in a sequence. For example, the spectrum kernel for K=2 would have 16 features (AA,AC,AT,AG,CA,CC,CT,CG, etc..). The value for each feature is the number of times that particular feature appears in the sequence divided by the number of times any feature of the same length appears in the sequence. The Noble et al. paper uses K=6, but we're going to go with K=3 to keep file sizes down and because it's quicker to train and optimize the SVM that way (extending things to K=6 is left as an exercise to you, the reader). Well, I say K=3, but actually we're going to use all 3mers, 2mers and 1mers as our features.

Download this file, which is the HS and non-HS examples from the Noble Lab homepage converted into libsvm format using the 3mer - 1mer spectrum kernel by a quick Perl script.

Then download this file which is 1000 HS clusters and 1000 non-HS examples from a high-throughput dataset, similarly converted. We'll use it for further testing.

Scaling Features

Preferably, the features of the training and testing sets of sequences that we supply the SVM will be scaled betwen -1 and 1. Furthermore, we should scale the test set in exactly the same way that we scaled the training set: so if 10 becomes 1 in the training set we need to scale 11 to 1.1 in the test set. Why? Because otherwise features with much larger variations than the others skew the results.

Luckily libsvm has a program that does all the scaling stuff for us. Copy the two files that you downloaded above into the same directory as the svmscale program and then type:

svmscale -s noble.range 3mer_noble.svm > noble.train

This will scale the contents of 3mer_noble.svm (our training file) and output the results in noble.train. It'll also save the max and min values for each feature in the noble.range file, which we'll now use to scale the testing data in the same way:

svmscale -r noble.range 3mer_ht.svm > noble.test

The -r flag tells svmscale to use the max / min range we saved from the training set. You should now have three new files in the directory - noble.range, noble.train and noble.test.

Training the SVM

To train an SVM you use the svmtrain command. There are a couple of options which we may want to play around with to optimize performance later on, but for now let's see what happens with the defaults. Type in:

svmtrain -v 10 noble.train

The -v option tells svmtrain to perform 10-fold cross validation on the training set. What's 10-fold cross validation? It's a way of predicting how good a classifier might perform on unseen data. The training set is split up into 10 sections that have the same distribution of positive and negative examples. Then an SVM is trained on 9/10 of the sections and the section left out is tested with it. This is repeated so that each section is the one tested against the other nine and the average accuracy is reported.

This is important in machine learning because of the danger of overfitting to the training data - producing a classifer that is too specific to your training set. Ideally we want a classifer to be able to generalize what it has learned from the training set when it encounters novel data.

In this case the cross validation accuracy is reported as being 84.75%. Cool. But how is performance on our test set? To find out, first we need to train an SVM with the entire training set:

svmtrain noble.train noble.model

This saves the SVM trained from the training set into the noble.model file. You can then use the svmpredict program to classify new data.

svmpredict noble.test noble.model noble.test.predictions

The first argument to svmpredict is the name of the test file. The second is the SVM to use to classify the test file and the third is where to put the predicted classifications that result. As you'll see from running the above program, accuracy on our testing set is 73.65%. Considering that the high-throughput set is pretty noisy - not everything marked as a positive class is positive and not all the negatives are negative - that's pretty good going. But can it get better?

Optimizing

One of the nice things about libsvm is that it tries to be friendly to people who've never trained an SVM before. To that end there are two python scripts - grid.py and easy.py - in the "tools" directory which do some useful things.

grid.py is an automated way of searching for the best gamma and cost options to use with svmtrain. Setting the cost option might be particularly important when the training set is imbalanced; that is to say that you have more positive examples than negative, or vice versa.

If you've got Python and Gnuplot installed, copy noble.train to the tools directory and run grid.py now.

grid.py -gnuplot [path to your gnuplot binaries] noble.train

The script may take a while to run (it'd be a lot worse with 6mers, which is why we opted for K=3 when preparing the data above). Once it finishes, you can get the values of C (cost) and gamma from the final line that it outputs along with the cross validation accuracy rate that those values achieve on the training set you supplied:

32.0 0.0078125 85.8407

The first value is the suggested cost, the second the suggested gamma and the third the cross validation accuracy, which has increased to 85.84% with the new parameters (from 84.75%).

This isn't necessarily the whole story, though. Go back to the directory with the svmtrain and svmpredict binaries and give the new values a go:

svmtrain -c 32 -g 0.0078125 noble.train noble.model
svmpredict noble.test noble.model noble.test.predictions

Notice that accuracy on the test set has falled to 72.1% (from 73.65%). Och well - you win some, you lose some. Perhaps we're overfitting to the training set, or maybe the test data is just noisy (a distinct possibility).

Conclusion & Further Exploration

So it looks like cross validation rates of ~ 85% are definitely possible on the Noble et al. HSs training data. This tutorial was a little rough but the best way to learn is by doing. It's not all that difficult to train an SVM, really: most of the work goes into preparing the data.

Libsvm has a lot of nifty features. Check out the -b option, which trains the model to output probability estimates instead of just positive / negative classes. It's a bad idea to rely on "accuracy" as your only performance measurement; unfortunately to get more detailed statistics you'll have to write your own scripts to interpret the predictions output by svmpredict (to determine precision and recall or the Kappa statistic, for example).

If you're looking for a good book that introduces other machine learning concepts, you could do a lot worse than Data Mining: Practical Machine Learning Tools & Techniques written by two of the authors of Weka.

Corrections and amendments welcome.

Comments and trackbacks Feel free to post your comments Anonymous Neil Anonymous skippy59FR Blogger Greg Tyrelle . This post has trackbacks.

Trackbacks:

3 Comments:

At October 20, 2005 5:07 AM, Anonymous Neil said...

This is great. Understanding SVMs has been on my to-do list for a long time, especially with application to signal peptides (not a new idea I know). I might just be able to tackle the problem with this information.

 
At May 16, 2006 7:18 PM, Anonymous skippy59FR said...

I am a student in statistics and I have to compare the data classification's technique of LIB-SVM with KXEN.

So, I have found lot's of tutorials and sure yours . I wanted to use gnuplot with the grid.py fonction, but I have some problems.

I've tryed to go on libsvm software for a month.
I've first download all the requiered softwares (i.e.: libsvm, python & gnuplot )
then I checked the forums on internet to understand how to use starts and stops.

In a first time, it appeared that programms were written for
linux'environment. I tryed into this environment , but unsuccessfully since
I don't know anything of administration.
Today I fully ran this tutorial until Python usage. For that, I had to translate linux instructions in DOS.
Starting from data I managed to execute svmtrain.exe (instead of ./svm-train in linux for exemple )
Then I arrived at: "grid.py -gnuplot [path gnuplot binaries ] in-file" that I cannot run.
I checked the grid.py scripts to understand [path gnuplot binaries ] and I
flagged the programm to follow step by step but unsuccessfully.
The other day svmtrain was running all PC process, the day before it was gnuplot that take all ressource of the computer: in the two cases, a very small part of memory was used , I ran it all night long but it gave nothing.

Could you tell me some search directions ?

Best Regards,


Je suis étudiante en statistique et
j'aimerais comparer la technique de classification des données de LIB-SVM avec KXEN.
J'ai donc trouvé beaucoup de tutoriels et bien sûr, le votre. Je voulais utiliser gnuplot avec la fonction grid.py, mais j'ai eu quelques problémes.

Cela fait un mois que je tâche d'avancer sur le projet libsvm.
J'ai tout d'abord collecté tous les logiciels requis (i.e.: libsvm, python & gnuplot )
Puis j'ai mené mon enquête sur les forums pour comprendre comment les
utiliser, les tenants et les aboutissants.

Dans un premier temps il s'est avéré qu'à priori ces programmes étaient
écrits pour Linux , j'ai essayé dans cet environnement, mais me suis heurtée à l'administration très vérouillée, et au fait que je ne connaisse pas assez
l'apprentissage de cet OS en plus des logiciels.
Aujourd'hui,j'ai réussi à dérouler complètement votre tutoriel jusqu'à l'utilisation de Python. Pour celà j'ai dû faire le
parrallèle entre les commandes linux et le DOS ( navigation, executables etc. )
En partant d'un ensemble de données j'ai su executer "svmtrain.exe" ( au
lieu de .\svm-train souos linux ), svm....
ainsi je suis arrivé à "grid.py -gnuplot [path gnuplot binaries ]
in-file" , que je n'arrive pas à lancer.

J'ai alors lu en diagonale le code de grid.py, pour comprendre ce que je devais mettre à la place de [path gnuplot binaries ] , j'ai inserré des print pour suivre le déroulé du programme. Mais en vain, il y a toujours un
problème.
L'autre jours c'était svmtrain_exe qui prenait toute la ressource système, le jour d'avant c'était wgnuplot qui prenait toute la ressource , mais dans
les deux cas peu de RAM respectivement ( 80K et rien ) j'ai laissé tourner
pendant toute la nuit mais celà n'a rien donné.

Pourriez vous me donner des pistes de recherches ?

merci beaucoup

 
At November 17, 2006 2:19 AM, Blogger Greg Tyrelle said...

This tutorial just saved me. I have to talk about SVMs in a presentation this afternoon. One of those horrible warm and fuzzy data analysis talks where you try and make complex analysis tasks friendly to people who don't know and don't care (i.e. lab people).

 

Post a Comment

<< Home


See all posts from: July 2005 August 2005 September 2005 October 2005 November 2005 December 2005 January 2006 February 2006 March 2006 April 2006 May 2006 June 2006 July 2006 September 2006 October 2006 November 2006 December 2006 January 2007 February 2007 March 2007 April 2007 May 2007 June 2007 July 2007 August 2007 October 2007 November 2007 December 2007 January 2008 February 2008 March 2008