Remember Eugene Koonin et al. and Biology Direct, announced last October?
Interestingly [Biology Direct are] embracing an open peer review model where authors pick their reviewers, who aren't anonymous and can choose to have their comments published alongside the paper.
Biology Direct today carries an editorial (PDF, watch out) which discusses information overload, why reviewers like to remain anonymous and the journal's lofty goals. It's a good read.
The editorial suggests that the reason reviewers like to remain anonymous is because they're afraid that giving a review critical enough to sink a paper will, basically, stir up ill feeling in the authors (there's the hassle involved in rewriting and reformatting the paper, the delays involved in resubmitting and, well, feelings of rejection).
Biology Direct's solution (sort of - it's intended to complement existing peer review based journals) is to publish anything interesting enough to be accepted for review by three of its editorial board, no matter how good or bad the reviews. This way, in theory, there shouldn't be any rejection involved in giving a bad review and thus there's no need for reviewers to remain anonymous.
This also helps with information overload as everything published - even if it carries universally bad reviews - is, in theory, at least interesting enough to hold the attention of three well qualified researchers. In this way it's sort of like the Faculty of 1000, except with proper critical assessment built in.
(via BBGM)
There's an interesting paper about carbon nanotubes wrapped in DNA being used as sensors in living cells in Science this week (subscription required). Carbon nanotubes are tiny - as the name might suggest - cylindrical carbon molecules that are incredibly strong. Amongst other things they also display distinctive near-infrared photoluminescence when in aqueous solution.
Researchers at the University of Illinois wrapped particular double stranded oligonucleotides around the nanotubes. When exposed to certain ions the negative charges along the DNA backbone get neutralized and the oligonucleotide shifts from B to Z form, which decreases the nanotube's near-infrared emission energy: these changes can be readily detected.
'This is really one of the most imaginative applications of nanotubes in the life sciences arena,' said Tobias Hertel, a physical chemist at Vanderbilt University in Nashville. These nanotube sensors are cool because they work in "strongly scattering or absorbing media" - they were tested in blood, black ink and living mammalian cells and tissues. So far the authors have used oligonucleotides sensitive to mercury ions, but the hope is that eventually the sensors will be able to detect small drugs or chemicals associated with different diseases, like cancers (a new scientific discovery pitched as helping to cure cancer? Novel PR approach). Other fluorescent dyes used to tag molecules in experiments aren't very stable when exposed to light: nanotubes are. They're also less toxic than quantum dots, another possible nanotech sensor technology.
Moving swiftly on: the second neat thing I read today was (via Boing Boing) about a Dance Dance Revolution like game (for the uninitiated: it's this big floormat with pressure sensors on it which you dance on in time to music played by a games console) in which you build strings of DNA. It's called "Codon Hoedown" and it's part of the Sea of Genes exhibit at the Birch Aquarium in La Jolla, California. Tying in loosely with an earlier post, Sea of Genes is funded by a grant from the National Science Foundation to Brian Palenik and Ian Paulsen at TIGR. Yes, it's Venter again. There is no escape (at least not on this blog, seemingly).
Genetic algorithms are algorithms that model natural selection - sort of - to help solve search and optimization problems. They were originally mooted by John Holland at the University of Michigan back in the 70s, and arose out of earlier studies of cellular automata (in the in-silico sense).
In a nutshell, the idea is that we start off with a "gene pool" of randomly generated solutions (note that "solution" in this context doesn't mean the right answer, just an answer) to a particular problem. As they were randomly generated, these solutions will no doubt be fairly crap. We assess their crapness with something called a fitness function, which takes a solution as input and outputs a score, where the higher the score the better the solution. After assessing all of the solutions we take the ones with the highest fitness scores and "breed" them to produce the next generation of the gene pool. Occasionally a solution is "mutated" and tweaked in some way at random. The whole process is then repeated. Over time the overall fitness of the gene pool starts to improve until eventually one of the solutions in the genepool has a high enough fitness to be a real, practical solution to whatever our problem is.
Why should anybody care when there are so many other search and optimization procedures available? Genetic algorithms are easy to implement, easy to understand and perfectly suited to certain types of problem, like feature selection for machine learning purposes. On a more personal note they're also the only algorithms I've ever had fun implementing - and how often can you say that (do say: not very, don't say: you need to get out more)?
Classically, genetic algorithms (GAs for short, from now on) operate on fixed strings of bits - 1s and 0s - which represent the real parameters of the problem in some way (in feature selection, for example, they would represent the presence or absence of a particular feature). GAs are flexible beasts, though, and as such in this tutorial we'll try something different: operating on variable length strings representing sequence motifs.
Check out this messy Perl code (I mean it: it's hacky) which, when supplied with two Fasta files - a set of test sequences and a set of control sequences (the ones supplied are sets of DNaseI hypersensitive sequences and controls, as in the SVM tutorial) - looks for overrepresented motifs in the test set. I'll explain how it does this below (I've included the code for completeness and so that you can try out GAs for yourself quickly, but you'll be able to follow the rest of this post even if you don't take a look at it).
Our first step is to define what kind of data the GA will be operating on. In our case, we want to generate and test lots of different consensus motifs, which are strings of arbitrary length made up of an alphabet defined by IUPAC: A, T, G and C, obviously, but also N to represent any base, R to represent a purine, Y a pyrimidine and so on.
alphabet array = ("A", "T", "G", "C", "N", "Y", "R", "S", "B", "D", "H", "V", "M", "K", "W");
We then need to seed the genepool. To do this, we simply create a new array and loop k times, where k is a user defined variable specifying how big the initial genepool should be. On each loop we push a randomly generated string made up of letters from the alphabet we defined above onto the genepool array.
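Seeding is simple enough that a few lines of Python show the whole idea. This is only a sketch: the function names and the motif length bounds (4 to 12 letters) are my own assumptions for illustration, not values taken from the Perl script.

```python
import random

# IUPAC nucleotide codes used as the motif alphabet (same letters as above).
ALPHABET = ["A", "T", "G", "C", "N", "Y", "R", "S",
            "B", "D", "H", "V", "M", "K", "W"]


def random_motif(min_len=4, max_len=12, rng=random):
    """Return a random motif. The length bounds are illustrative assumptions."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def seed_genepool(k, rng=random):
    """Create an initial genepool of k randomly generated motifs."""
    return [random_motif(rng=rng) for _ in range(k)]


genepool = seed_genepool(10)
print(genepool)
```

Variable length motifs fall out naturally here, since each motif draws its own length before its letters.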
The next step is to assess the fitness of the genepool. To do this we'll need to implement a fitness function. Remember that a fitness function takes a solution as input and returns a score: the higher the score, the better the solution meets our needs. In our case, the fitness function will be taking a consensus motif as input and we need to return how overrepresented (or not) that motif is in the test set, compared to the controls.
function fitness(motif) {
    testset_count = count(motif, testset sequences array)
    control_count = count(motif, control sequences array)

    # in what percentage of each set of sequences is the motif found?
    testset_percentage = ( testset_count / number of testset sequences ) * 100
    control_percentage = ( control_count / number of control sequences ) * 100

    return testset_percentage - control_percentage
}

function count(motif, sequences array) {
    count = 0
    foreach sequence in sequences array {
        if motif in sequence then count++
    }
    return count
}
In the pseudocode above, the fitness function calculates the percentage of test set sequences and the percentage of control sequences that contain the motif supplied to the function as input. It then returns the difference between them such that a motif found in all of the test set sequences but none of the controls will be scored 100 and a motif found in all of the controls but none of the test set will be scored -100.
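For anyone who wants to run the fitness function, here is a rough Python equivalent. It assumes that "motif in sequence" means an IUPAC-aware match, so each ambiguity code is expanded into a regular expression character class, and that sequences are upper-case DNA strings. The helper name `motif_to_regex` is made up for this sketch.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Compile a consensus motif into a regex (hypothetical helper)."""
    return re.compile("".join(IUPAC[base] for base in motif))


def count(motif, sequences):
    """Number of sequences containing at least one match to the motif."""
    pattern = motif_to_regex(motif)
    return sum(1 for seq in sequences if pattern.search(seq))


def fitness(motif, testset, controls):
    """Percentage of test sequences with the motif minus that of controls."""
    testset_percentage = 100.0 * count(motif, testset) / len(testset)
    control_percentage = 100.0 * count(motif, controls) / len(controls)
    return testset_percentage - control_percentage


testset = ["GATTACA", "TTGATCA"]
controls = ["CCCCCC", "GGGGGG"]
print(fitness("GATNA", testset, controls))
```

Compiling the pattern once per call to `count`, rather than once per sequence, matters because the fitness function runs for every motif in every generation.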
When the genepool contains a motif with a high enough fitness, the algorithm will end. As endpoints depend on the fitness function that we are using we need to define how high "high enough" is. This will depend on the data, but in our case let's set it to 75, so that the difference between the test set percentage and the control set percentage must be at least 75. Any motif that clears this bar necessarily appears in at least 75% of the test set sequences and in no more than 25% of the control set.
If the genepool does contain any motifs with a fitness of 75 or more, we should finish the algorithm and print out that motif. Otherwise we move on to the next step, which is sorting the motifs in the genepool in order of fitness. This is simply to make it easier for us to pick which motifs we are going to keep for the next generation of solutions and which motifs are going to die off (it's a cruel world, but survival of the fittest rules in GA).
There are lots of different ways of picking solutions to carry on to the next generation. In general it's a bad idea simply to skim the cream from the sorted genepool array - that is, all of the top scoring motifs and none of the motifs that scored poorly. This is because keeping in some of the motifs that aren't particularly good solutions keeps the gene pool diverse and prevents premature convergence (the point where a GA can't get any better because of the genepool it has to work with) on a poor solution. One method of choosing the next generation of the genepool is to pick out a certain number of motifs with probabilities commensurate with their rank in the sorted array: so the top ranking motif has a much higher chance of being picked than the bottom ranking motif.
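Rank-weighted selection can be sketched in Python as below. The linear weighting (the best motif gets weight n, the worst gets weight 1) is just one assumed choice among many, and `select_survivors` is a name invented for this example.

```python
import random


def select_survivors(ranked_pool, n_keep, rng=random):
    """Pick n_keep distinct motifs from a pool that is sorted best-first.

    Weights fall linearly with rank (an assumed weighting), so high ranking
    motifs are favoured but low ranking ones can still survive.
    """
    candidates = list(ranked_pool)
    weights = [len(candidates) - i for i in range(len(candidates))]
    survivors = []
    while candidates and len(survivors) < n_keep:
        # Draw one index with probability proportional to its weight.
        idx = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        survivors.append(candidates.pop(idx))
        weights.pop(idx)
    return survivors


ranked = ["GATNA", "TTRYC", "NNNNA", "CCWGG", "AAAAN"]
print(select_survivors(ranked, 3))
```

Sampling without replacement keeps the survivors distinct, which helps preserve the diversity that the paragraph above argues for.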
We remove all of the motifs not chosen to make up the next generation from the genepool array. To make up for their loss we need to "breed" new solutions by repeatedly picking a pair of motifs in the genepool at random and then producing a child motif by "crossover".
function make_child(parent_motif_a, parent_motif_b) {
    half_a = random half of parent_motif_a
    half_b = random half of parent_motif_b

    return concat(half_a, half_b)
}
Our breeding function picks one half of each parent sequence at random, combines them and returns the result - a child motif. It's not necessary for the two parent motifs to be of the same length or fitness or anything else.
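In Python the crossover might look like this. I'm reading "random half" as "either the first half or the second half, chosen by a coin flip", which is an assumption about the pseudocode. Single-letter parents are passed through whole so that a child is never empty.

```python
import random


def random_half(motif, rng=random):
    """Return the first or the second half of a motif, chosen at random."""
    if len(motif) < 2:
        return motif  # too short to split, so keep it whole
    mid = len(motif) // 2
    return motif[:mid] if rng.random() < 0.5 else motif[mid:]


def make_child(parent_a, parent_b, rng=random):
    """Breed a child motif from one random half of each parent."""
    return random_half(parent_a, rng) + random_half(parent_b, rng)


print(make_child("AAAATTTT", "GGCC"))
```

With these two parents the only possible children are AAAAGG, AAAACC, TTTTGG and TTTTCC.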
Once we've produced a certain number of child solutions, we determine whether or not any of the solutions in the genepool will mutate at random. The chance of any individual solution mutating in any one generation should be small - 2%, say. There's a balance to be struck here between introducing diversity into the genepool - giving the algorithm more to work with - and keeping the genepool static enough for a solution to evolve (as in real life, evolution works over many generations).
If a motif is to be mutated then we should change it somehow. The simplest change is to pick a letter of the sequence at random and to change it (in classical GAs it would be to pick a bit at random and to flip it). This is roughly analogous to a SNP.
That's not the only possibility, though. We could also perform operations analogous to:
- point insertions : insert a single new letter somewhere in the motif
- point deletions : delete a single letter picked at random from the motif
- inversions : select a substring of the motif at random and invert it (so GAC becomes CAG)
- gross deletions : select a substring of the motif at random and delete it
- duplications : select a substring of the motif at random and duplicate it (so GAC becomes GACGAC)
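The point change and the five operations listed above translate fairly directly into Python string slicing. This is a sketch only: the function names, the uniform choice between operators and the fallback to the original motif when a deletion would leave nothing behind are my assumptions. The 2% default rate is the figure suggested earlier.

```python
import random

ALPHABET = "ATGCNYRSBDHVMKW"


def _random_span(motif, rng):
    """Return (start, end) of a random non-empty substring of motif."""
    start = rng.randrange(len(motif))
    end = rng.randrange(start + 1, len(motif) + 1)
    return start, end


def point_substitution(m, rng=random):
    # May pick the same letter again, in which case nothing changes.
    i = rng.randrange(len(m))
    return m[:i] + rng.choice(ALPHABET) + m[i + 1:]


def point_insertion(m, rng=random):
    i = rng.randrange(len(m) + 1)
    return m[:i] + rng.choice(ALPHABET) + m[i:]


def point_deletion(m, rng=random):
    i = rng.randrange(len(m))
    return m[:i] + m[i + 1:]


def inversion(m, rng=random):
    s, e = _random_span(m, rng)
    return m[:s] + m[s:e][::-1] + m[e:]


def gross_deletion(m, rng=random):
    s, e = _random_span(m, rng)
    return m[:s] + m[e:]


def duplication(m, rng=random):
    s, e = _random_span(m, rng)
    return m[:s] + m[s:e] * 2 + m[e:]


OPERATORS = [point_substitution, point_insertion, point_deletion,
             inversion, gross_deletion, duplication]


def maybe_mutate(motif, rate=0.02, rng=random):
    """Mutate a motif with probability `rate`, otherwise return it unchanged."""
    if not motif or rng.random() >= rate:
        return motif
    mutated = rng.choice(OPERATORS)(motif, rng)
    return mutated or motif  # never let a motif be deleted entirely


print(maybe_mutate("GACTTA", rate=1.0))
```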
In any case, once we're finished mutating any motifs unlucky enough to be selected we start the process of assessing fitnesses all over again with our new genepool and the algorithm repeats itself.
There's always the possibility that we'll never reach our endpoint, for whatever reason (the GA reaches a local maximum - that is, it keeps picking solutions that look good in the short term but can't be mutated in the long term to reach a high enough fitness - or perhaps our endpoint is just too high). We should therefore put a cap on the number of generations to run for. Again, this is fairly arbitrary as it depends on the nature of the data that we're working with.
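Putting the endpoint and the generation cap together, the outer loop could be sketched as follows. The names `evolve`, `next_generation`, `target` and `max_generations` are hypothetical, the cap of 500 is arbitrary, and the selection, breeding and mutation steps are assumed to live inside the `next_generation` callable supplied by the caller.

```python
def evolve(genepool, fitness, next_generation, target=75, max_generations=500):
    """Run the GA until a motif reaches `target` or the generation cap is hit.

    fitness(motif) returns a score; next_generation(ranked_pool) returns the
    new genepool. Both are supplied by the caller (assumed interfaces).
    Returns (best_motif, best_score, generations_run).
    """
    best, best_score = None, float("-inf")
    for generation in range(1, max_generations + 1):
        # Sort best-first so the top of the list is this generation's champion.
        ranked = sorted(genepool, key=fitness, reverse=True)
        top_score = fitness(ranked[0])
        if top_score > best_score:
            best, best_score = ranked[0], top_score
        if best_score >= target:
            return best, best_score, generation
        genepool = next_generation(ranked)
    return best, best_score, max_generations


# Toy run: fitness rewards the letter A, and each generation appends one A.
toy_fitness = lambda motif: 25 * motif.count("A")
grow = lambda ranked: [motif + "A" for motif in ranked]
print(evolve(["T", "G"], toy_fitness, grow))
```

Keeping track of the best motif seen so far means that a run which hits the cap still returns something useful.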
What to do with any consensus motifs found by GAs like this one? Well, for one thing you could collect several of them and then use their presence / absence as features for an SVM classifier capable of discriminating between sequences similar to your test set and controls. Or perhaps you just want to find a single consensus motif for a particular binding site, or you want to find overrepresented sequences to assess their biological significance.
That's all there is to it. The interesting thing about genetic algorithms is that the real work isn't in implementing the algorithm per se but in designing the input format, fitness function and in picking the endpoint. The fitness function is particularly important: the suitability of any results that a GA might produce is inextricably linked to the suitability of the fitness function and it gets called once for every solution every generation - so it needs to be fairly quick to calculate.
Hope you find the above useful. Check out the code for more. Feel free to email with any comments, suggestions, questions or changes.
Back in December I posted links to some online games which were designed to get kids interested in bioinformatics (in a roundabout way). Unfortunately, preliminary analysis of new research by the University of London's Institute of Education - via BBC News - shows that a little bit of Flash probably isn't going to cut it.
The research was prompted by the fall in numbers of British pupils taking science courses at school. Apparently the number of pupils taking maths has dropped by 22% between 1991 and 2004, while the number of pupils taking chemistry has dropped by 16%.
11,000 school children were asked for their views on science and scientists. Some of the statistics mentioned in the BBC report:
- 70% of 11-15 year olds said they did not picture scientists as "normal young and attractive men and women"
- 80% think that scientists do "very important work"
- 70% think that scientists work "creatively and imaginatively"
- 40% think that scientists do "boring and repetitive work"
Reasons for not wanting to become a scientist included "Because you would constantly be depressed and tired and not have time for family" and "because they all wear big glasses and white coats and I am female".
At first glance this all seems very depressing. Take a short break from, ahem, working creatively and imaginatively on your very important research to think about what the kids are saying, though, and you realise that: let's face it, it's all fair comment. The only place where the kids went wrong is that 70-80% of them think that the work we do as individual scientists (as opposed to science in general) is very important.
While you (probably) and I are contact-lens wearing *, attractive, normal human beings who maintain a strict separation between work and home and who never get depressed or tired about the many frustrations of a life in science, many other researchers aren't so lucky, as I'm sure you know. My wife works in genetics too, and on the odd occasions that I accompany her to her lab on the weekend while she does whatever wetlab people do with cells and flasks and things there are always, always other people there, toiling away. Even late at night. I'm pretty sure that this is normal for the field. Lab nights out confirm that the kids are right about scientists not always being normal, young and attractive, too (but then appearances are skin deep).
Anyway, I've no doubt that if you repeated this study and replaced "scientist" with "computer programmer" you'd get even more depressing results, but there's no shortage of comp sci graduates. I'd imagine that when applying for university kids go more on aptitude than public perception (though that's not to say that it doesn't count for anything).
Isn't the drop in kids taking hard science just down to pupils having more choice over what to learn? Certainly the list of vocational qualifications in British schools has expanded since my day. I only took biology because it meant I didn't have to do geography...
* This is not actually true, I do wear big glasses.
I'm of two minds about Craig Venter. Undoubtedly he is an egotistical glory hound who, at one point, could have seriously compromised the public human genome project (would computational genomics research today be the same if we all had to go through Celera? Probably not.... I'm thinking subscription based database access... I'm thinking HGMD... I'm thinking the amount of hair pulling and fruitless mouse clicking that was involved the last time I tried to get some information about a set of Celera SNPs off the web).
On the other hand, the man has cojones. He also has a fondness for high-throughput data collection that as a bioinformatician I have to admire.
Anyway, Venter has been in the news again recently. Bio-IT World is carrying a story about Google's involvement with Venter (first seen in the Sunday Times in November last year - Snowdeal picked up on it back then). There's also a story making the rounds about a new collaboration between the Venter Institute and UC San Diego division of the California Institute for Telecommunications and Information Technology (Calit2).
As you may know, right now Craig Venter is circumnavigating the globe on his high tech superyacht Sorcerer II, currently sitting off the coast of Madagascar. In between BBQs on the poop deck and having bikinied lovelies massage suncream onto his sensitive pate he collects litres of seawater, strains all of the living organisms out of them and then sends the resulting crates of freeze dried micro-organisms back to the States to get shotgun sequenced. It's this project that Calit2 is interested in.
As Sorcerer II is collecting vast amounts of information:
Scientists aboard the vessel are identifying about 40,000 new species at every 200-mile stop along their route, [Venter] said
it's anticipated that to store, annotate and analyse all of the sequence data that is going to be generated scientists will need a fair amount of computing power.
The proposed solution is to set up a Grid on top of the National LambdaRail (a high speed academic network that runs over fiber optic lines). Calit2 is calling this a "Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis", or CAMERA for short. Calit2 and the San Diego Supercomputer Centre will provide the hardware while the Venter Institute will contribute sequences and "community developed genome analysis software" to help researchers make sense of the new data.
The infrastructure involved isn't trivial:
the CAMERA complex will have a thousand processors of dedicated local cluster computing and several hundred terabytes of replicated data storage.
Fair enough.
The press release makes for fun reading: bombast from the Venter Institute, the natural predilection for tortured acronyms shared by all Grid researchers and the love PR officers have for making science sound complex have all come together: the whole project is swaddled in fantastically meaningless jargon. For the record, it's not a Grid, it's an environmental metagenomics data storage and computational complex run over the OptIPuter high-performance 'collaboratory'.
Oh, and it'll help cure cancer.
The new resource will greatly enhance [UCSD's] health science researchers' ability to advance the development of new drugs and therapies from the ocean's resources to combat cancer and neurodegenerative and other diseases.
It's vulgar to talk about, it's rude to bring up, it's a question without a good answer. Yet people keep asking (well, googling and reaching here): what's the money like in bioinformatics?
The average salary of bioinformaticians is, I guess, pretty similar to that of bench biologists. According to The Scientist's 2002 salary survey:
the median income for academic positions in bioinformatics is $75,000. This is comparable to the numbers for clinical biologists and slightly better than the numbers for cell biologists.
This certainly seems to be the case in sunny California. The Silicon Valley Metro says that:
Salaries can range from $60,000 at the entry level to more than $100,000 for Ph.D.s. A recent salary survey [..] reported that the average salary in 2003 for life scientists whose primary area of specialization is bioinformatics was $75,845.
Here in the UK I'd suggest from experience that current starting salaries in bioinformatics are around £20,000 (~ $35,000), rising to around £30,000 (~ $53,000) with, say, five or six years experience - in academia, at least.
Add £2k or £3k per year (~ $4,000) to that if you're working for a charity or a research council, or £5k per year (~ $9,000) if you're working in the commercial sector (IT Jobs Watch lists an average salary of £37,500 for the bioinformatics jobs that it has listed in the past few months, most of which seem to be pharma related).
I don't think that having a PhD affects your starting salary much (feel free to correct me if you're an HR guru who knows otherwise), but a PhD (or lack thereof) will affect how quickly your salary reaches a ceiling. A masters degree might help you stand out from other jobseekers, but again it won't affect your salary much.
Those of you who have experience in the IT industry and who are looking to see how a switch to bioinformatics might affect your finances should bear in mind the wise words of James Tisdall when selling your services to biologists:
be prepared for them to have sticker shock when it comes to salaries. Maybe it's getting a little better now, but I've often found that biologists want to pay you about half of what you're worth on the market. Their pay level is just lower than that in computer programming.
My advice? As Lincoln Stein suggests, do it because you love it.
There have been two interesting machine learning papers published in BMC Bioinformatics in the last week, one to do with guessing function from sequence using a fairly complex mix of different algorithms and one using rough sets to determine simple protein structure from amino acid composition (and some derived statistics).
The latter paper, by Cao et al., is interesting in that rough set theory is rarely used in bioinformatics despite the fact that the field is full of exactly the kind of noisy, imprecise datasets that rough sets are designed to handle. One reason for this is possibly the lack of freely available rough set software: Rosetta (site currently down) - which Cao et al. used - has a commercial licence, though there exists a free alternative in RSES from the Institute of Mathematics at Warsaw University.
Rough set theory was introduced in the 1980s by Zdzislaw Pawlak. Rough sets were designed to be used for the classification of imprecise, uncertain or incomplete information. Mathematically it's relatively simple (at least to mathematicians). Let's have a look at a toy example, avoiding Greek letters and funny symbols as much as possible...
Imagine a table containing information about some proteins. Each row represents a different protein and each column contains a different attribute that describes the protein (the attributes could be, say, percentage glutamine, percentage proline and positive charge).
We'll describe six different proteins, each with three attributes (% Gln, % Pro and + Chg) and with an associated structure (all-alpha, all-beta, alpha / beta or alpha + beta).
| id | % Gln | % Pro | + Chg | Structure |
|----|-------|-------|-------|-----------|
| 1 | 12% | 6% | 0.2 | all-a |
| 2 | 12% | 6% | 0.2 | all-a |
| 3 | 8% | 6% | 0.2 | all-b |
| 4 | 12% | 2% | 0.12 | a / b |
| 5 | 12% | 2% | 0.12 | a + b |
| 6 | 12% | 2% | 0.12 | a + b |
Using this training data we want to use rough sets to derive some rules that will enable us to determine the structure of a novel protein given the attributes describing that protein. In rough set speak, the structure will be the decision attribute and % Gln, % Pro and + Chg are the condition attributes.
The first step is to determine equivalence classes. These are groups of objects in which all of the condition attributes are the same for each object and thus cannot be distinguished between. In our example there are three equivalence classes: ids (1 & 2), (3) and (4, 5 & 6) - because the set of ids 1 and 2 share all the same condition attributes, as does the set 4, 5 & 6, etc. We'll name our equivalence classes E1, E2 and E3.
| Equivalence Class | % Gln | % Pro | + Chg |
|-------------------|-------|-------|-------|
| E1 (ids 1 & 2) | 12% | 6% | 0.2 |
| E2 (id 3) | 8% | 6% | 0.2 |
| E3 (ids 4, 5 & 6) | 12% | 2% | 0.12 |
Note that equivalence classes can contain ids that have different decision attributes (i.e. structures).
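Grouping the rows into equivalence classes is a one-liner in spirit: proteins with identical condition attributes land in the same bucket. A minimal Python sketch using the toy table follows (the variable and function names are mine):

```python
from collections import defaultdict

ATTRIBUTES = ("% Gln", "% Pro", "+ Chg")

# The toy table: id -> (condition attribute values, structure).
TABLE = {
    1: (("12%", "6%", "0.2"), "all-a"),
    2: (("12%", "6%", "0.2"), "all-a"),
    3: (("8%", "6%", "0.2"), "all-b"),
    4: (("12%", "2%", "0.12"), "a / b"),
    5: (("12%", "2%", "0.12"), "a + b"),
    6: (("12%", "2%", "0.12"), "a + b"),
}


def equivalence_classes(table):
    """Group ids by their condition attributes, ignoring the structure."""
    groups = defaultdict(list)
    for protein_id, (conditions, _structure) in table.items():
        groups[conditions].append(protein_id)
    return dict(groups)


classes = equivalence_classes(TABLE)
for conditions, ids in classes.items():
    print(dict(zip(ATTRIBUTES, conditions)), ids)
```

The decision attribute is deliberately left out of the grouping key, which is why ids 4, 5 and 6 end up together despite having different structures.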
The next step is to construct a discernability matrix, a chart which in our case will look something like this:
|    | E1 | E2 | E3 |
|----|----|----|----|
| E1 | - | % Gln | % Pro, + Chg |
| E2 | % Gln | - | % Gln, % Pro, + Chg |
| E3 | % Pro, + Chg | % Gln, % Pro, + Chg | - |
The axes are the equivalence classes and the cells contain the condition attributes that differentiate between those classes. For example, the % Gln attribute is the only thing that differentiates between equivalence classes E1 and E2, while all three condition attributes are different between equivalence classes E2 and E3.
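The matrix can be built mechanically by comparing every pair of equivalence classes attribute by attribute. A small Python sketch, with the class values hard-coded from the toy example (the function name is an assumption):

```python
ATTRIBUTES = ("% Gln", "% Pro", "+ Chg")

# Condition attribute values for each equivalence class in the toy example.
CLASSES = {
    "E1": ("12%", "6%", "0.2"),
    "E2": ("8%", "6%", "0.2"),
    "E3": ("12%", "2%", "0.12"),
}


def discernability_matrix(classes, attributes):
    """Map each pair of distinct classes to the attributes that differ."""
    matrix = {}
    for name_a, values_a in classes.items():
        for name_b, values_b in classes.items():
            if name_a != name_b:
                matrix[name_a, name_b] = {
                    attr
                    for attr, x, y in zip(attributes, values_a, values_b)
                    if x != y
                }
    return matrix


matrix = discernability_matrix(CLASSES, ATTRIBUTES)
print(sorted(matrix["E1", "E3"]))
```

The matrix is symmetric, so only half of it strictly needs computing. Both halves are kept here so that each row can be read off directly.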
Using the discernability matrix we can calculate relative discernability functions, which give the minimum set of attributes necessary to differentiate a given class from the others. To do this, simply take each row in turn and concatenate the cells with logical ANDs. Concatenate the attributes within each cell with logical ORs.
The relative discernability functions for our example will be:
- f(E1) = (% Gln) AND (% Pro OR +Chg)
- f(E2) = (% Gln) AND (% Gln OR % Pro OR +Chg)
- f(E3) = (% Pro OR +Chg) AND (% Gln OR % Pro OR +Chg)
Remember that we're looking for the minimum set of attributes necessary to differentiate each class. Take the relative discernability function for E1: to differentiate between E1 and E2 we need to look at the % Gln attribute, but to differentiate between E1 and E3 we could use either % Pro or +Chg.
We're getting close to deriving some rules from our discernability functions. However, it might be the case that not all of the attributes we've been looking at are required. To simplify things we can determine which attributes will come in useful by next calculating the relative reduct.
The relative reduct is calculated by taking the relative discernability functions and removing superfluous attributes. For example, in f(E2) we only really need to look at % Gln to satisfy the whole function. In f(E3) we only need to look at one of either % Pro or +Chg.
The relative reducts for our example, then, are:
- RED(E1) = (% Gln AND % Pro) OR (% Gln AND +Chg)
- RED(E2) = % Gln
- RED(E3) = % Pro OR +Chg
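One way to compute these reducts by machine, rather than by simplifying the Boolean functions by hand, is to look for the smallest sets of attributes that include at least one attribute from every cell in a row. The brute-force Python sketch below does that. It is fine for three attributes but would not scale to large tables, and the function name is made up for the example.

```python
from itertools import combinations

ATTRIBUTES = ("% Gln", "% Pro", "+Chg")

# Each row of the discernability matrix as a list of cells (sets of attributes).
ROWS = {
    "E1": [{"% Gln"}, {"% Pro", "+Chg"}],
    "E2": [{"% Gln"}, {"% Gln", "% Pro", "+Chg"}],
    "E3": [{"% Pro", "+Chg"}, {"% Gln", "% Pro", "+Chg"}],
}


def relative_reducts(cells, attributes):
    """Return the minimal attribute sets that intersect every cell in a row."""
    reducts = []
    # Try small sets first so that any superset of a reduct can be rejected.
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            chosen = set(combo)
            hits_all = all(chosen & cell for cell in cells)
            is_minimal = not any(r <= chosen for r in reducts)
            if hits_all and is_minimal:
                reducts.append(chosen)
    return reducts


for name, cells in ROWS.items():
    print(name, [sorted(r) for r in relative_reducts(cells, ATTRIBUTES)])
```

Each set returned corresponds to one AND-term of the reduct, and the alternatives are ORed together, so E1 yields two two-attribute terms while E3 yields two single-attribute terms.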
Let's derive some rules from our reducts. To do this we need to bind the condition attribute values of the equivalence class from which the reduct originated to the corresponding attributes of the reduct.
For example, all of the proteins making up equivalence class E1 have the condition attributes 12%, 6% and 0.2 for % Gln, % Pro and +Chg respectively. We can feed those values into RED(E1) to derive a relevant rule.
The rules for our example are below. If a rule contained an "or" it was split up into two rules, to keep things simple.
- from RED(E1) : if % Gln = 12% and % Pro = 6% ==> structure = all-a
- from RED(E1) : if % Gln = 12% and +Chg = 0.2 ==> structure = all-a
- from RED(E2) : if % Gln = 8% ==> structure = all-b
- from RED(E3) : if % Pro = 2% ==> structure = ?
- from RED(E3) : if +Chg = 0.12 ==> structure = ?
Note that the last two rules don't specify a structure. This is because equivalence class E3 contained proteins with more than one structure: such classes are called vague classes. A question mark is a bit of a rubbish answer - instead we should return a probability and a class using something called the rough membership function. All the rough membership function does is look at the distribution of different decision attributes in a particular vague class. Looking back at our original table we can see that equivalence class E3 describes three proteins, two of which are a + b and the other a / b. Therefore we could replace the question mark with:
- from RED(E3) : if % Pro = 2% ==> structure = (2/3 chance of a + b, 1/3 chance of a / b)
- from RED(E3) : if +Chg = 0.12 ==> structure = (2/3 chance of a + b, 1/3 chance of a / b)
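The rough membership function amounts to counting structures within a class and dividing by the class size. A tiny Python sketch, using exact fractions so that 2/3 stays 2/3 (the function name is my own):

```python
from collections import Counter
from fractions import Fraction

# Structures of the proteins in equivalence class E3 (ids 4, 5 and 6).
E3_STRUCTURES = ["a / b", "a + b", "a + b"]


def rough_membership(structures):
    """Return the share of each structure within one equivalence class."""
    counts = Counter(structures)
    total = len(structures)
    return {structure: Fraction(n, total) for structure, n in counts.items()}


for structure, share in rough_membership(E3_STRUCTURES).items():
    print(structure, share)
```

For a class that isn't vague, such as E1, the same function simply returns a share of 1 for the single structure present.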
Using the rules we have generated we can determine the structure of novel proteins. Each type of structure will have a lower approximation - the set of proteins which definitely have that structure - an upper approximation - the set of proteins which may possibly have that structure (though there's evidence that they have a different structure) and a boundary region, the set of proteins whose structure cannot be proven either way. Proteins can belong to more than one set.
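Under the standard rough set definitions, the lower approximation of a structure is the union of the equivalence classes that contain only that structure, the upper approximation is the union of the classes that contain it at all, and the boundary region is the difference between the two. A Python sketch over the toy training data (names are illustrative):

```python
# Equivalence classes from the toy example: class -> {protein id: structure}.
CLASSES = {
    "E1": {1: "all-a", 2: "all-a"},
    "E2": {3: "all-b"},
    "E3": {4: "a / b", 5: "a + b", 6: "a + b"},
}


def approximations(classes, structure):
    """Return (lower, upper, boundary) sets of protein ids for a structure."""
    lower, upper = set(), set()
    for members in classes.values():
        labels = set(members.values())
        if labels == {structure}:
            lower |= set(members)  # every member has this structure
        if structure in labels:
            upper |= set(members)  # at least one member has this structure
    return lower, upper, upper - lower


print(approximations(CLASSES, "all-a"))
print(approximations(CLASSES, "a + b"))
```

For all-a the lower and upper approximations coincide and the boundary is empty. For a + b the lower approximation is empty and all of E3 sits in the boundary region, which is what makes E3 a vague class.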
Like everything else in data mining rough sets are relatively simple but dressed up in complicated looking equations and peculiar terms. Hopefully the worked example above is enough to give you a firm grasp of what Cao et al. are doing: to explore rough sets further, check out the homepage of the International Rough Set Society or the RSES software package mentioned at the beginning of this post.
(credit is due to Helge Grenager Solheim, from whom I've copied much of the format of this text, the animated gif is by Michael Hadjimichael).
I'm back from hospital. I don't know what I was expecting, but my biggest problem was boredom; it was pretty much a pain-free experience (there was a lot of discomfort, on the other hand). Despite the media being full of stories about hospital acquired infections, uncaring nurses and falling down hospital buildings I've got nothing but praise for my NHS experience.
Anyway, lots of interesting stuff on other blogs to catch up on - this tongue in cheek post about opting for fraud at Notes from the Biomass, for example (an earlier post there also links to Pierre's bioinformatics blog. Pierre has lots of good posts, like this one about drawing Genbank features in Firefox using XUL and SVG now that 1.5 supports it properly).
Over at Nodal there are lots of comments on the previously mentioned Hopes and Fears thread - open source Endnotes and robot controlling monkeys (via Pedro), that kind of thing.
I'm going into hospital tomorrow to have surgery to fix a problem with my hip, so I won't be posting for the next week or two. There's plenty of reading to be had if you follow some of the links to the right though.
Neil has a post over at Nodalpoint about New Year hopes and fears (in the bioinformatics sense). My two pence are:
Two things that I think'll be hot (in my homo sapiens centric view):
- Human structural genomics - in the chromatin architecture sense - finally getting some high-throughput (sort of) data, from ENCODE if nothing else (oh yeah, ENCODE is next year's HapMap).
- High recall, high precision regulatory element prediction should happen sometime in the next year. Then we'll just be stuck with the same problem as we have with genes i.e. we know where they are, but not what they do...
My hopes are that people will stop writing about protein interaction networks until they get some new datasets and that a community builds around some open source LIMS so that we finally have a fully featured one that is continually supported and patched, etc etc.
Hope everybody had a merry Christmas (if that's appropriate) and a happy new year, in any case.