Support Vector Machines: A Quick Tutorial
Some background
DNA is packaged into chromatin rather than floating around free and easy, so most of the time much of the DNA sequence isn't accessible to regulatory processes in the cell. Some bits of the sequence, however, are associated with "openings" in the chromatin structure. You can detect them in vivo as DNaseI hypersensitive sites (HSs). These hypersensitive sites are very strongly associated with cis regulatory elements in that almost every cis regulatory element is also an HSs.
Up until fairly recently it was fairly difficult to detect HSs in vivo. Recently more high-throughput methods have been developed, but wouldn't it be even better if there was a way to locate all HSs (and thus cis-regulatory regions...) in silico?
The Noble et al. paper talks about how they trained an SVM to differentiate between 280 known hypersensitive sites and 737 controls, acheiving a cross-validation accuracy of ~ 85%. This seems awfully high for a genomics machine learning problem. Then again, Bill Noble the first author wrote GIST, a well known SVM package, so he obviously knows what he's doing. Anyway, to prove that it's possible we're going to do a similar experiment.
Prerequisites
You'll need to download and install libsvm (the latest version) to follow this tutorial.
Libsvm comes as a .zip file which includes Windows binaries: on other platforms you'll need to compile it which is a bit outwith the scope of this tutorial (as is the Java version, which is also in the download package).
Optionally you'll need Python and Gnuplot, but I'll also provide the output of any scripts that use them along the way.
Getting Started
The first thing to do is to put our data into a format suitable for Libsvm. That format looks something like this:

Each line represents an example (in our case, a sequence). The first number is the class of the example, either positive (1) or negative (-1). Then there's some whitespace. Each pair of numbers seperated by a colon (e.g. 1:0.42, 2:0.4 ...) after that represent features. Features are just numbers that describe the example.
Imagine that we were building an SVM that would classify genes as being either housekeeping genes (we'll make that the positive class) or not (the negative class). We would collect, say, 100 examples of housekeeping genes and 100 non-housekeeping genes. As features we could pick gene length, the number of exons, tissue specificity based on some high-throughput set and so on.
There's a nice introduction to how SVMs actually work available at the DTREG homepage, but for the impatient: to create an SVM the relevant software plots the training set of examples we give it as vectors (points) on a graph with a high number of dimensions. It then mathemagically works out the hyperplane (i.e. a plane in those dimensions) which best seperates the positive class of examples from the negative. After that, to classify a new object with the same kind of features we just need to plot it on the graph and see which side of the hyperplane it's on. The support vectors are vectors which are on the "edge" of the space in-between the two classifications on the graph and which help define the hyperplane.
Before we can construct a training set, we need to decide what features should we use to describe our examples. The number of times that certain motifs appear in each sequence? What about sequences of different length?
Noble et al. use the spectrum kernel, which is a fancy name for a feature set representing the distribution of every possible k-mer in a sequence. For example, the spectrum kernel for K=2 would have 16 features (AA,AC,AT,AG,CA,CC,CT,CG, etc..). The value for each feature is the number of times that particular feature appears in the sequence divided by the number of times any feature of the same length appears in the sequence. The Noble et al. paper uses K=6, but we're going to go with K=3 to keep file sizes down and because it's quicker to train and optimize the SVM that way (extending things to K=6 is left as an exercise to you, the reader). Well, I say K=3, but actually we're going to use all 3mers, 2mers and 1mers as our features.
Download this file, which is the HS and non-HS examples from the Noble Lab homepage converted into libsvm format using the 3mer - 1mer spectrum kernel by a quick Perl script.
Then download this file which is 1000 HS clusters and 1000 non-HS examples from a high-throughput dataset, similarly converted. We'll use it for further testing.
Scaling Features
Preferably, the features of the training and testing sets of sequences that we supply the SVM will be scaled betwen -1 and 1. Furthermore, we should scale the test set in exactly the same way that we scaled the training set: so if 10 becomes 1 in the training set we need to scale 11 to 1.1 in the test set. Why? Because otherwise features with much larger variations than the others skew the results.
Luckily libsvm has a program that does all the scaling stuff for us. Copy the two files that you downloaded above into the same directory as the svmscale program and then type:
svmscale -s noble.range 3mer_noble.svm > noble.train
This will scale the contents of 3mer_noble.svm (our training file) and output the results in noble.train. It'll also save the max and min values for each feature in the noble.range file, which we'll now use to scale the testing data in the same way:
svmscale -r noble.range 3mer_ht.svm > noble.test
The -r flag tells svmscale to use the max / min range we saved from the training set. You should now have three new files in the directory - noble.range, noble.train and noble.test.
Training the SVM
To train an SVM you use the svmtrain command. There are a couple of options which we may want to play around with to optimize performance later on, but for now let's see what happens with the defaults. Type in:
svmtrain -v 10 noble.train
The -v option tells svmtrain to perform 10-fold cross validation on the training set. What's 10-fold cross validation? It's a way of predicting how good a classifier might perform on unseen data. The training set is split up into 10 sections that have the same distribution of positive and negative examples. Then an SVM is trained on 9/10 of the sections and the section left out is tested with it. This is repeated so that each section is the one tested against the other nine and the average accuracy is reported.
This is important in machine learning because of the danger of overfitting to the training data - producing a classifer that is too specific to your training set. Ideally we want a classifer to be able to generalize what it has learned from the training set when it encounters novel data.
In this case the cross validation accuracy is reported as being 84.75%. Cool. But how is performance on our test set? To find out, first we need to train an SVM with the entire training set:
svmtrain noble.train noble.model
This saves the SVM trained from the training set into the noble.model file. You can then use the svmpredict program to classify new data.
svmpredict noble.test noble.model noble.test.predictions
The first argument to svmpredict is the name of the test file. The second is the SVM to use to classify the test file and the third is where to put the predicted classifications that result. As you'll see from running the above program, accuracy on our testing set is 73.65%. Considering that the high-throughput set is pretty noisy - not everything marked as a positive class is positive and not all the negatives are negative - that's pretty good going. But can it get better?
Optimizing
One of the nice things about libsvm is that it tries to be friendly to people who've never trained an SVM before. To that end there are two python scripts - grid.py and easy.py - in the "tools" directory which do some useful things.
grid.py is an automated way of searching for the best gamma and cost options to use with svmtrain. Setting the cost option might be particularly important when the training set is imbalanced; that is to say that you have more positive examples than negative, or vice versa.
If you've got Python and Gnuplot installed, copy noble.train to the tools directory and run grid.py now.
grid.py -gnuplot [path to your gnuplot binaries] noble.train
The script may take a while to run (it'd be a lot worse with 6mers, which is why we opted for K=3 when preparing the data above). Once it finishes, you can get the values of C (cost) and gamma from the final line that it outputs along with the cross validation accuracy rate that those values achieve on the training set you supplied:
32.0 0.0078125 85.8407
The first value is the suggested cost, the second the suggested gamma and the third the cross validation accuracy, which has increased to 85.84% with the new parameters (from 84.75%).
This isn't necessarily the whole story, though. Go back to the directory with the svmtrain and svmpredict binaries and give the new values a go:
svmtrain -c 32 -g 0.0078125 noble.train noble.model
svmpredict noble.test noble.model noble.test.predictions
Notice that accuracy on the test set has falled to 72.1% (from 73.65%). Och well - you win some, you lose some. Perhaps we're overfitting to the training set, or maybe the test data is just noisy (a distinct possibility).
Conclusion & Further Exploration
So it looks like cross validation rates of ~ 85% are definitely possible on the Noble et al. HSs training data. This tutorial was a little rough but the best way to learn is by doing. It's not all that difficult to train an SVM, really: most of the work goes into preparing the data.
Libsvm has a lot of nifty features. Check out the -b option, which trains the model to output probability estimates instead of just positive / negative classes. It's a bad idea to rely on "accuracy" as your only performance measurement; unfortunately to get more detailed statistics you'll have to write your own scripts to interpret the predictions output by svmpredict (to determine precision and recall or the Kappa statistic, for example).
If you're looking for a good book that introduces other machine learning concepts, you could do a lot worse than Data Mining: Practical Machine Learning Tools & Techniques written by two of the authors of Weka.
Corrections and amendments welcome.
Neil
skippy59FR
Greg Tyrelle
. This post has trackbacks.
