Flags and Lollipops

Saturday, January 14, 2006

Rough Sets: A Quick Tutorial

There have been two interesting machine learning papers published in BMC Bioinformatics in the last week, one to do with guessing function from sequence using a fairly complex mix of different algorithms and one using rough sets to determine simple protein structure from amino acid composition (and some derived statistics).

The latter paper, by Cao et al., is interesting in that rough sets theory is rarely used in bioinformatics despite the fact that the field is full of exactly the kind of noisy, imprecise datasets that rough sets are designed to handle. One reason for this is possibly the lack of freely available rough set software: Rosetta (site currently down)- which Cao et al. used - has a commercial licence, though there exists a free alternative in RSES from the Institute of Mathematics at Warsaw University.

Rough set theory was introduced in the 1980s by Zdzislaw Pawlak. Rough sets were designed to be used for the classification of imprecise, uncertain or incomplete information. Mathematically it's relatively simple (at least to mathematicians). Let's have a look at a toy example, avoiding greek letters and funny symbols as much as possible...

Imagine a table containing information about some proteins. Each row represents a different protein and each column contains a different attribute that describes the protein (the attributes could be, say, percentage glutamine, percentage proline and positive charge).

We'll describe five different proteins, each with three attributes (% Gln, % Pro and + Chg) and with an associated structure (all-alpha, all-beta, alpha / beta or alpha + beta).

id% Gln% Pro+ ChgStructure
112%6%0.2all-a
212%6%0.2all-a
38%6%0.2all-b
412%2%0.12a / b
512%2%0.12a + b
6
12%
2%
0.12
a + b

Using this training data we want to use rough sets to derive some rules that will enable us to determine the structure of a novel protein given the attributes describing that protein. In rough set speak, the structure will be the decision attribute and % Gln, % Pro and + Chg are the condition attributes.

The first step is to determine equivalence classes. These are groups of objects in which all of the condition attributes are the same for each object and thus cannot be distinguished between. In our example there are three equivalence classes: ids (1 & 2), (3) and (4, 5 & 6) - because the set of ids 1 and 2 share all the same condition attributes, as does the set 4, 5 & 6, etc. We'll name our equivalence classes E1, E2 and E3.

Equivalence Class
% Gln% Pro+ Chg
E1 (ids 1 & 2)
12%6%0.2
E2 (id 3)
8%6%0.2
E3 (ids 4, 5 & 6)
12%2%0.12

Note that equivalence classes can contain ids that have different decision attributes (i.e. structures).

The next step is to construct a discernability matrix, a chart which in our case will look something like this:


E1
E2
E3
E1
-
% Gln
% Pro, + Chg
E2
% Gln
-
% Gln, % Pro, + Chg
E3
% Pro, + Chg% Gln, % Pro, + Chg-

The axes are the equivalence classes and the cells contain the condition attributes that differentiate between those classes. For example, the % Gln attribute is the only thing that differentiates between equivalence classes E1 and E2, while all three condition attributes are different between equivalence classes E2 and E3.

Using the discernability matrix we can calculate relative discernability functions, which give the minimum set of attributes necessary to differentiate a given class from the others. To do this, simply take each row in turn and concatenate the cells with logical ANDs. Concatenate the attributes within each cell with logical ORs.

The relative discernability functions for our example will be:

  • f(E1) = (% Gln) AND (% Pro OR +Chg)
  • f(E2) = (% Gln) AND (% Gln OR % Pro OR +Chg)
  • f(E3) = (% Pro OR +Chg) AND (% Gln OR % Pro OR +Chg)

Remember that we're looking for the minimum set of attributes necessary to differentiate each class. Take the relative discernability function for E1: to differentiate between E1 and E2 we need to look at the % Gln attribute, but to differentiate between E1 and E3 we could use either % Pro or +Chg.

We're getting close to deriving some rules from our discernability functions. However, it might be the case that not all of the attributes we've been looking at are required. To simplify things we can determine which attributes will come in useful by next calculating the relative reduct.

The relative reduct is calculated by taking the relative discernability functions and removing superfluous attributes. For example, in f(E2) we only really need to look at % Gln to satisfy the whole function. In f(E3) we only need to look at one of either % Pro or +Chg.

The relative reducts for our example, then, are:

  • RED(E1) = (% Gln AND % Pro) OR (% Gln AND +Chg)
  • RED(E2) = % Gln
  • RED(E3) = % Pro OR +Chg

Let's derive some rules from our reducts. To do this we need to bind the condition attribute values of the equivalence class from which the reduct originated to the corresponding attributes of the reduct.

For example, all of the proteins making up equivalence class E1 have the condition attributes 12%, 6% and 0.2 for % Gln, % Pro and +Chg respectively. We can feed those values into RED(E1) to derive a relevant rule.

The rules for our example are below. If a rule contained an "or" it was split up into two rules, to keep things simple.

  • from RED(E1) : if% Gln = 12%and% Pro = 6%==> structure = all-a
  • from RED(E1) : if% Gln = 12%and+Chg = 0.2==> structure = all-a
  • from RED(E2) : if% Gln = 8%==> structure = all-b
  • from RED(E3) : if% Pro = 2%==> structure = ?
  • from RED(E3) : if+Chg = 0.12==> structure = ?

Not that the last two rules don't specify a structure. This is because equivalence class E3 contained proteins with more than one structure: such classes are called vague classes. A question mark is a bit of a rubbish answer - instead we should return a probability and a class using something called the rough membership function. All the rough membership function does is look at the distribution of different decision attributes in a particular vague class. Looking back at our original table we can see that equivalence class E3 describes three proteins, two of which are a + b and the other a / b. Therefore we could replace the question mark with:

  • from RED(E3) : if% Pro = 2%==> structure = (2/3 chance of a+b, 1/3 chance a / b)
  • from RED(E3) : if+Chg = 0.12==> structure = (2/3 chance of a+b, 1/3 chance a / b)


    Using the rules we have generated we can determine the structure of novel proteins. Each type of structure will have a lower approximation - the set of proteins which definitely have that structure - an upper approximation - the set of proteins which may possibly have that structure (though there's evidence that they have a different structure) and a boundary region, the set of proteins whose structure cannot be proven either way. Proteins can belong to more than one set.

    Like everything else in data mining rough sets are relatively simple but dressed up in complicated looking equations and peculiar terms. Hopefully the worked example above is enough to give you a firm grasp of what Cao et al. are doing: to explore rough sets further, check out the homepage of the International Rough Set Society or the RSES software package mentioned at the beginning of this post.

    (credit is due to Helge Grenager Solheim, from whom I've copied much of the format of this text, the animated gif is by Michael Hadjimichael).

Comments and trackbacks Feel free to post your comments Blogger RPM Blogger Stew Blogger milo gardner Anonymous Anonymous Anonymous Anonymous Anonymous Piero Pagliani Anonymous Anonymous Anonymous Anonymous . This post has trackbacks.

Trackbacks:

8 Comments:

At January 15, 2006 2:31 PM, Blogger RPM said...

It seems like the last step (where you apply probabilities to the reducts) would be subject to ascertainment bias. For example, if the data set from which you are training your algorithm is not an accurate representation of all proteins. Of course, this type of problem will affect the entire algorithm, but the probabilities stood out to me for some reason.

 
At January 15, 2006 3:55 PM, Blogger Stew said...

True... as I understand it rough sets theory tries to deal with this by allowing objects to belong to more than one class: so novel proteins classed with rules derived from E3 would belong to the boundary regions of two sets of proteins: those with structure a+b and those with structure a/b.

If we are looking for a solid decision either way then the 2/3 a+b probability applies: perhaps it's a bit dodgy, but then all the algorithm has to go on is the training data. You'd get the same problem with other rule based systems, I think?

 
At August 05, 2006 12:11 PM, Blogger milo gardner said...

The U. of Washington genetics
department is working on a related
matter. On DishTV, a 2006 program showed that 500 man-years of computer time created 19 of 20 'gene strings', based on an algorithm or set of algorithms that employed 'equivalence class' relationships. Were all 20 algorithms the same?

My question goes to the issue of the one exception, the 20th case, where the algorithm or algorithms did not work. Does this mean that nature is using non-algorithmic relationships? I know of many non-algorithms, one being remainder baed arithmetic working in cycylical two-part or three part systems.

Modeling nature is subtle work. Confirming a given 'brute force' created gene model, based on any situation, needs to be confirmed,
algorithmically vs non-algorithm,
right?

Best Regards,

Milo Gardner

 
At August 05, 2006 12:12 PM, Anonymous Anonymous said...

The U. of Washington genetics
department is working on a related
matter. On DishTV, a 2006 program showed that 500 man-years of computer time created 19 of 20 'gene strings', based on an algorithm or set of algorithms that employed 'equivalence class' relationships. Were all 20 algorithms the same?

My question goes to the issue of the one exception, the 20th case, where the algorithm or algorithms did not work. Does this mean that nature is using non-algorithmic relationships? I know of many non-algorithms, one being remainder baed arithmetic working in cycylical two-part or three part systems.

Modeling nature is subtle work. Confirming a given 'brute force' created gene model, based on any situation, needs to be confirmed,
algorithmically vs non-algorithm,
right?

Best Regards,

Milo Gardner

 
At September 19, 2006 1:14 PM, Anonymous Anonymous said...

The ROSETTA system is freely available for academic users: http://rosetta.lcb.uu.se/general/

 
At June 08, 2007 10:28 AM, Anonymous Piero Pagliani said...

A minor comment.

Curiously enough, the animation looks like the old animations of mine at www.surf.it/logic.


Best wishes.

 
At March 23, 2008 10:14 AM, Anonymous Anonymous said...

oh, man, that's a great post!

well, the rosetta is really hard to use, anyone has idea about any other package?

 
At May 14, 2008 7:59 AM, Anonymous Anonymous said...

Try RSES http://logic.mimuw.edu.pl/~rses/

 

Post a Comment

<< Home


See all posts from: July 2005 August 2005 September 2005 October 2005 November 2005 December 2005 January 2006 February 2006 March 2006 April 2006 May 2006 June 2006 July 2006 September 2006 October 2006 November 2006 December 2006 January 2007 February 2007 March 2007 April 2007 May 2007 June 2007 July 2007 August 2007 October 2007 November 2007 December 2007 January 2008 February 2008 March 2008 April 2008 May 2008