Rough Sets: A Quick Tutorial
The latter paper, by Cao et al., is interesting in that rough sets theory is rarely used in bioinformatics despite the fact that the field is full of exactly the kind of noisy, imprecise datasets that rough sets are designed to handle. One reason for this is possibly the lack of freely available rough set software: Rosetta (site currently down)- which Cao et al. used - has a commercial licence, though there exists a free alternative in RSES from the Institute of Mathematics at Warsaw University.
Rough set theory was introduced in the 1980s by Zdzislaw Pawlak. Rough sets were designed to be used for the classification of imprecise, uncertain or incomplete information. Mathematically it's relatively simple (at least to mathematicians). Let's have a look at a toy example, avoiding greek letters and funny symbols as much as possible...
Imagine a table containing information about some proteins. Each row represents a different protein and each column contains a different attribute that describes the protein (the attributes could be, say, percentage glutamine, percentage proline and positive charge).
We'll describe five different proteins, each with three attributes (% Gln, % Pro and + Chg) and with an associated structure (all-alpha, all-beta, alpha / beta or alpha + beta).
| id | % Gln | % Pro | + Chg | Structure |
| 1 | 12% | 6% | 0.2 | all-a |
| 2 | 12% | 6% | 0.2 | all-a |
| 3 | 8% | 6% | 0.2 | all-b |
| 4 | 12% | 2% | 0.12 | a / b |
| 5 | 12% | 2% | 0.12 | a + b |
| 6 | 12% | 2% | 0.12 | a + b |
Using this training data we want to use rough sets to derive some rules that will enable us to determine the structure of a novel protein given the attributes describing that protein. In rough set speak, the structure will be the decision attribute and % Gln, % Pro and + Chg are the condition attributes.
The first step is to determine equivalence classes. These are groups of objects in which all of the condition attributes are the same for each object and thus cannot be distinguished between. In our example there are three equivalence classes: ids (1 & 2), (3) and (4, 5 & 6) - because the set of ids 1 and 2 share all the same condition attributes, as does the set 4, 5 & 6, etc. We'll name our equivalence classes E1, E2 and E3.
| Equivalence Class | % Gln | % Pro | + Chg |
| E1 (ids 1 & 2) | 12% | 6% | 0.2 |
| E2 (id 3) | 8% | 6% | 0.2 |
| E3 (ids 4, 5 & 6) | 12% | 2% | 0.12 |
Note that equivalence classes can contain ids that have different decision attributes (i.e. structures).
The next step is to construct a discernability matrix, a chart which in our case will look something like this:
| E1 | E2 | E3 | |
| E1 | - | % Gln | % Pro, + Chg |
| E2 | % Gln | - | % Gln, % Pro, + Chg |
| E3 | % Pro, + Chg | % Gln, % Pro, + Chg | - |
The axes are the equivalence classes and the cells contain the condition attributes that differentiate between those classes. For example, the % Gln attribute is the only thing that differentiates between equivalence classes E1 and E2, while all three condition attributes are different between equivalence classes E2 and E3.
Using the discernability matrix we can calculate relative discernability functions, which give the minimum set of attributes necessary to differentiate a given class from the others. To do this, simply take each row in turn and concatenate the cells with logical ANDs. Concatenate the attributes within each cell with logical ORs.
The relative discernability functions for our example will be:
- f(E1) = (% Gln) AND (% Pro OR +Chg)
- f(E2) = (% Gln) AND (% Gln OR % Pro OR +Chg)
- f(E3) = (% Pro OR +Chg) AND (% Gln OR % Pro OR +Chg)
Remember that we're looking for the minimum set of attributes necessary to differentiate each class. Take the relative discernability function for E1: to differentiate between E1 and E2 we need to look at the % Gln attribute, but to differentiate between E1 and E3 we could use either % Pro or +Chg.
We're getting close to deriving some rules from our discernability functions. However, it might be the case that not all of the attributes we've been looking at are required. To simplify things we can determine which attributes will come in useful by next calculating the relative reduct.
The relative reduct is calculated by taking the relative discernability functions and removing superfluous attributes. For example, in f(E2) we only really need to look at % Gln to satisfy the whole function. In f(E3) we only need to look at one of either % Pro or +Chg.
The relative reducts for our example, then, are:
- RED(E1) = (% Gln AND % Pro) OR (% Gln AND +Chg)
- RED(E2) = % Gln
- RED(E3) = % Pro OR +Chg
Let's derive some rules from our reducts. To do this we need to bind the condition attribute values of the equivalence class from which the reduct originated to the corresponding attributes of the reduct.
For example, all of the proteins making up equivalence class E1 have the condition attributes 12%, 6% and 0.2 for % Gln, % Pro and +Chg respectively. We can feed those values into RED(E1) to derive a relevant rule.
The rules for our example are below. If a rule contained an "or" it was split up into two rules, to keep things simple.
- from RED(E1) : if% Gln = 12%and% Pro = 6%==> structure = all-a
- from RED(E1) : if% Gln = 12%and+Chg = 0.2==> structure = all-a
- from RED(E2) : if% Gln = 8%==> structure = all-b
- from RED(E3) : if% Pro = 2%==> structure = ?
- from RED(E3) : if+Chg = 0.12==> structure = ?
Not that the last two rules don't specify a structure. This is because equivalence class E3 contained proteins with more than one structure: such classes are called vague classes. A question mark is a bit of a rubbish answer - instead we should return a probability and a class using something called the rough membership function. All the rough membership function does is look at the distribution of different decision attributes in a particular vague class. Looking back at our original table we can see that equivalence class E3 describes three proteins, two of which are a + b and the other a / b. Therefore we could replace the question mark with:
- from RED(E3) : if% Pro = 2%==> structure = (2/3 chance of a+b, 1/3 chance a / b)
- from RED(E3) : if+Chg = 0.12==> structure = (2/3 chance of a+b, 1/3 chance a / b)
Using the rules we have generated we can determine the structure of novel proteins. Each type of structure will have a lower approximation - the set of proteins which definitely have that structure - an upper approximation - the set of proteins which may possibly have that structure (though there's evidence that they have a different structure) and a boundary region, the set of proteins whose structure cannot be proven either way. Proteins can belong to more than one set.Like everything else in data mining rough sets are relatively simple but dressed up in complicated looking equations and peculiar terms. Hopefully the worked example above is enough to give you a firm grasp of what Cao et al. are doing: to explore rough sets further, check out the homepage of the International Rough Set Society or the RSES software package mentioned at the beginning of this post.
(credit is due to Helge Grenager Solheim, from whom I've copied much of the format of this text, the animated gif is by Michael Hadjimichael).
RPM
Stew
milo gardner
Anonymous
Anonymous
Piero Pagliani
Anonymous
Anonymous
. This post has trackbacks.
