As bioinformaticians we spend our time solving biological problems through computation (yeah, so occasionally wet lab people help out). There's a flipside to this coin: Natural Computation, the field in which systems inspired by biology are applied to computational problems.
Genetic algorithms, used to quickly find approximate solutions in search and optimization problems, are probably the best known form of natural computation. Inspired by genetics, the idea behind them is relatively simple: you represent solutions by strings of parameters (genomes), which are ranked according to fitness (where fitness measures how good the solution is). Selected genomes - a set biased towards but not exclusively made up of those genomes with the highest fitness - then "breed" to produce a new generation of solutions. Breeding involves recombining the parents to produce child solutions whose genomes contain a mix of both parents' genetic material. In each generation a small amount of mutation also takes place, with parameters randomly changed in some children. In theory, the overall fitness of each generation gradually improves until you end up with a set of good solutions to your problem. GAs are commonly used to solve timetabling problems, amongst other things.
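For the curious, here's roughly what that loop looks like in Python. It's a minimal sketch rather than a reference implementation: the bit-string genome, the tournament selection, the single-point crossover and all of the parameter values are my own assumptions, picked because they're about the simplest choices that match the description above.

```python
import random

def evolve(fitness, genome_length, population_size=30, generations=60,
           mutation_rate=0.02, seed=0):
    """Toy genetic algorithm over bit-string genomes (illustrative only)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]

    def pick_parent():
        # Tournament selection (an assumption): biased towards fitter genomes,
        # but a weak genome can still win if its two rivals are weaker.
        return max(rng.sample(population, 3), key=fitness)

    for _ in range(generations):
        children = []
        for _ in range(population_size):
            mum, dad = pick_parent(), pick_parent()
            cut = rng.randrange(1, genome_length)  # single-point crossover
            child = mum[:cut] + dad[cut:]
            # Mutation: occasionally flip a bit in the child.
            child = [1 - gene if rng.random() < mutation_rate else gene
                     for gene in child]
            children.append(child)
        population = children
    return max(population, key=fitness)

# "OneMax" toy problem: fitness is simply the number of 1s in the genome.
best = evolve(fitness=sum, genome_length=20)
print(best, sum(best))
```

Swapping in a real problem means changing the genome encoding and the fitness function; the select, breed and mutate loop stays the same.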
Artificial immune systems are based loosely on the human immune system, which has all sorts of properties that software engineers like:
The immune system is highly distributed, highly adaptive, self-organizing in nature, maintains a memory of past encounters and has the ability to continually learn about new encounters. AIS research has been applied mostly to feature selection and change or anomaly detection for the last fifteen years - probably because it's an easy metaphor to understand if you're, say, interested in detecting viruses on a computer network. Jason Brownlee at the Swinburne University of Technology has some papers on the subject and a Weka implementation, if you're interested in learning more.
Finally, swarm intelligence leverages the sort of swarm behavior you see in nature - ant colonies, flocks of birds, herds of wildebeest and so on - where complex global behavior emerges from many simple agents interacting locally with one another and with the environment. Ant Colony Optimization is one popular swarm intelligence algorithm, capable of quickly finding good paths through complex graphs (as in the Traveling Salesman problem). ACO works by simulating "ants" walking randomly through the graph. Ants start off at the colony - the start point - and wander around until they find food - the desired end point. They then return to the colony while laying down a pheromone trail, which evaporates over time. If other wandering ants come across a pheromone trail they are likely to follow it to the end point instead of taking a random path through the graph, depending on how strong the pheromone density is. Shorter paths are more likely to have dense pheromone trails and positive feedback eventually leads to all of the ants following the same short path.
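A stripped-down sketch of the ant idea in Python, for illustration only. The tiny graph, the evaporation rate, the 1 / length deposit rule and the function name are all assumptions of mine, and for simplicity the pheromone is laid on the outbound edges once an ant has reached the food, rather than on a simulated return trip.

```python
import random

def ant_colony_path(graph, start, goal, n_ants=20, n_rounds=30,
                    evaporation=0.5, seed=0):
    """Toy ant colony optimisation on a weighted graph (illustrative only).

    graph maps each node to a dict of {neighbour: edge length}.
    """
    rng = random.Random(seed)
    pheromone = {(a, b): 1.0 for a in graph for b in graph[a]}
    best_path, best_length = None, float("inf")

    for _ in range(n_rounds):
        completed = []
        for _ in range(n_ants):
            node, path, length = start, [start], 0.0
            while node != goal:
                options = [n for n in graph[node] if n not in path]
                if not options:  # dead end: this ant gives up
                    break
                # Stronger trails are more likely to be followed.
                weights = [pheromone[(node, n)] for n in options]
                nxt = rng.choices(options, weights=weights)[0]
                length += graph[node][nxt]
                path.append(nxt)
                node = nxt
            if node == goal:
                completed.append((path, length))

        # Trails evaporate over time...
        for edge in pheromone:
            pheromone[edge] *= (1.0 - evaporation)
        # ...and are reinforced by ants that found the food.
        # Shorter paths get a larger deposit (assumed rule: 1 / length).
        for path, length in completed:
            for a, b in zip(path, path[1:]):
                pheromone[(a, b)] += 1.0 / length
            if length < best_length:
                best_path, best_length = path, length
    return best_path, best_length

# Hypothetical four-node graph with made-up edge lengths.
graph = {
    "colony": {"a": 1, "b": 4},
    "a": {"food": 1, "b": 1},
    "b": {"food": 5},
    "food": {},
}
print(ant_colony_path(graph, "colony", "food"))
```

The positive feedback is in the last loop: edges on short routes collect more pheromone per round than they lose to evaporation, so later ants are increasingly likely to pick them.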
Apparently swarm intelligence is relatively popular in the bioinformatics machine learning community (though I can't think of any papers which refer to it: possibly because AI researchers don't publish in genetics journals)...
Related to the previous post - Andy Law at the Roslin Institute (which you may remember as the home of Dolly the Sheep, if nothing else) has a set of astute observations on genetic analysis programs and experiments...
Law's First Law: The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats.
Actually, I went to see Dolly recently (in stuffed form) at the Royal Museum in Edinburgh. It seemed like an undignified end for her: stuck in a corner behind some mouldy monkeys. Shouldn't she get a statue or something?
There's an interesting paper in NAR this month by Fadiel et al about farm animal genomics and informatics. It's worth a look, especially if you spend all your time working with a single genome (amalgamated 8-way conservation scores don't count).
It seems like a tough area to work in. Supposedly there's very little money available for basic research and barnyard biotech companies have a hard time finding venture capital. I guess all the money from the sort of people who are most prepared to invest in science without any short term return - public bodies and charities - gets sucked up by human genetics projects.
Speaking of public bodies, here in the UK the BBSRC - a state funding body for biological sciences - recently concluded a consultation exercise which highlighted three areas of potential: animal health, animal production and animal biology.
Animal health is to do with understanding things like susceptibility to disease and pest resistance, with the aim of developing new drugs, vaccines and diagnostics. Animal biology refers to exploring the possibilities of using animals as models of human disease: pigs and cows, for example, make much better models for obesity than mice do. Further down the line there's also the possibility of xenotransplantation, organ transfer from animals to humans.
Animal production is presumably where biotech executives see big bucks. Finding the QTLs associated with higher quality beef in cattle so that you can design better breeding programs is a simple example. A more "hands-on" approach is to design transgenic animals like the AquaBounty salmon - salmon designed to express growth hormone all year round instead of just during summer months.
The BBSRC report also highlights a skills shortage in the field when it comes to bioinformatics and quantitative genetics (isn't there a quantitative genetics skills shortage everywhere?). Of course, given the aforementioned funding problems that agricultural geneticists face, maybe this isn't surprising.
While doing a literature search on sequence visualization I came across this paper from 1986 by JE Cowin et al. which describes a "new method of representing DNA sequences which combines ease of visual analysis with machine readability".
My computer science background means that unfortunately I tend to treat papers over, say, ten years old rather skeptically (I mean, they're obsolete, aren't they? Basic discoveries are allowed as exceptions). This one, though, has an old skool charm about it.
It's based on the idea that instead of a sequence of letters DNA could be represented as points on staves (like the horizontal lines you see on sheet music). The top line is G, the one below it A, the next T and the bottom one C. In theory this makes the sequence both machine readable by light pen and more amenable to analysis by eye. Also it means you can play your favourite gene on the trombone at parties and get all the girls.
I'm not sure I'd want to look for complex binding motifs or anything by eye but it does let you see purine / pyrimidine tracts, GC rich regions and that sort of thing fairly easily. The paper also notes that in the case of palindromes "the symbols form a pattern of perfect dyad rotational symmetry about an axis perpendicular to the centre line of the stave". Uh huh. I'm actually interested to know what the thing about the light pen is: do the authors mean machine readable from a piece of paper, or the screen, or what? Sadly I lack the necessary technochronological context.
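The stave layout is easy enough to mock up in plain text. This is a sketch of the general idea only, not a port of the program from the paper: the G, A, T, C line order comes from the description above, while the function name and the "o" / "-" symbols are my own invention.

```python
STAVE_ORDER = "GATC"  # top line to bottom line, as described in the post

def draw_stave(sequence):
    """Return one text row per stave line, marking each base with 'o'."""
    sequence = sequence.upper()
    rows = []
    for base in STAVE_ORDER:
        marks = "".join("o" if b == base else "-" for b in sequence)
        rows.append(f"{base} {marks}")
    return "\n".join(rows)

# GAATTC reads the same on both strands, i.e. it is a palindrome
# in the DNA sense.
print(draw_stave("GAATTC"))
```

Turn the picture for GAATTC upside down (rotate it by 180 degrees) and the marks land in the same places, which is the dyad symmetry the paper is talking about. It works because G and C sit on the outer lines and A and T on the inner ones.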
Anyway, the program - in a language that I'm not familiar with but which looks a bit like some sort of more complicated BASIC (feel free to identify it if you can) - was included as an appendix and published along with the paper. Not quite a compendium, but still.
Adhanom Tewolde at the Katholieke Universiteit Leuven in Belgium has put up a nice resource for people interested in getting started with the Weka machine learning platform (which deals with a lot of different machine learning algorithms - see this previous post on the subject) and decision trees in general.
Decision trees are used for all sorts of things in bioinformatics: predicting genetic regulatory response to different experiments in yeast, determining the extent of resistance to antiretroviral drugs in HIV patients and protein annotation, for example.
Adhanom's guide explains how decision trees work, how they are created and what the advantages of using them are.
Simplistically: imagine that you have a training set of data labeled either class A or B. A tree building algorithm will start off with a single node, representing the entire training set. It'll then decide on a "split" which produces two child nodes, each representing a subset of the training data. The goal of each split is to maximize the purity of the child nodes: a node is purest when it only contains one class. Each child node can have further splits, producing child nodes of child nodes... and so on and so forth. Eventually you end up with a tree like structure with a "root" node at the base representing your entire training set (a mix of As and Bs) and lots of pure leaf nodes at the top (some of which are all As and some of which are all Bs, ideally).
You can then feed a different, unlabeled dataset into that tree and by following the splits, classify each element of that dataset on the basis of which leaf node it ends up in.
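To make the "purity" idea a bit more concrete, here is a small Python sketch that finds the best split on a single numeric variable. I'm assuming Gini impurity as the purity measure, which is one common choice; other algorithms (including some of those in Weka) use different criteria, and the function names and data here are made up for illustration.

```python
def gini(labels):
    """Gini impurity: 0.0 for a pure node, 0.5 for an even two-class mix."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_split(values, labels):
    """Find the threshold on one numeric variable giving the purest children."""
    best_threshold, best_impurity = None, float("inf")
    # Skip the largest value: splitting there would leave the right child empty.
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        # Weight each child's impurity by how many samples it holds.
        impurity = (len(left) * gini(left)
                    + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity

# Made-up training set: small values are class A, large values are class B.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ["A", "A", "A", "B", "B", "B"]
print(best_split(values, labels))
```

A full tree builder would call something like best_split on each child node in turn, stopping when a node is pure or too small to split further.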
One advantage of decision trees is that they can easily handle categorical as well as numerical variables. Another is that (depending on the algorithm you use and how many variables are involved) it can be a lot easier to interpret a tree than the workings of a "black box" neural network or an SVM.
Weka contains a number of different tree building algorithms. Another option linked to from the page above, though, is Shih Data Miner, which I hadn't heard of before: it seems to be quite well documented, and feedback is a bit more visual: maybe it would be a good place to start experimenting?
A good link via Notes from the Biomass: an editorial by Pernille Rørth at The EMBO Journal about her work there. It's a nicely written piece, with some particularly interesting sections on rebuttals and on the differences between full time editors and scientists who are editors part time.
Sometimes [rebuttals are] done in a civil and reasonable manner, sometimes not. A few unpleasant letters from authors of rejected manuscript even contain more or less veiled insults and threats.
Hi, and welcome to Tangled Bank #41. Tangled Bank is a biweekly collection of science and medicine related blog posts submitted by their authors (in most cases, anyway). It's a smorgasbord of articles on many different aspects of science and a good way to find blogs that you wouldn't normally come across.
Being a carnival barker is very similar to being a scientist: 18 hour days, crap pay, the only way to survive is to convince credulous marks to part with their money on crapshoots... so it was a no-brainer to volunteer to host TB this fortnight. Hope you enjoy it as much as I enjoyed putting it together.
Astronomy
Let's start at the beginning - the very beginning - with a post from Bad Astronomy: First Light, about the background light from the very first stars (perhaps).
Biology & Genetics
Skipping forward a few eons, the Hairy Museum of Natural History tells us of Scales From Kyrgystan and the enigmatic Longisquama, whose scales have been brought up by some as an example of a transitional structure between scales and feathers. Speaking of feathers, 10,000 Birds has a post about waxwings and why they are so called.
I can't believe that we've been working with mice for so long and yet never realised that they sing to each other (sort of). Science Made Cool has a post about that.
BotanicalGirl, meanwhile, has written about carnivorous plants. I was terrorized as a child by the idea of the Triffids invading my house and eating my family, and this brought back a lot of memories. Thanks BG.
Discovering Biology in a Digital World describes a great learning activity for high school students: Head, Shoulders, Knees and Toes, a hands-on way of discovering that different genes are expressed in different places and at different times.
Here at Flags and Lollipops (what? When you host Tangled Bank you can submit to yourself too) I wrote about how it seems that we humans differ from one another genetically in more ways than we first thought.
Medicine
Are you one of those people who bought Tamiflu from eBay? Been steering clear of pigeons and South East Asian poultry markets recently? If so, buck up! Ruminating Dude asks why are we so bent over bird flu?... pointing out that if we really want something to be worried about, being eaten alive by MRSA while in hospital for a routine operation probably fits the bill.
Thinking for Food, on the other hand, has been hit pretty hard by whooping cough, which sounds pretty nasty - and you thought that you were safe now that you're all grown up.
Cognition
Cognitive Daily has an interesting post about how gut feelings influence memory, describing a recent experiment neatly summed up by the title of the paper written about it: remembering by the seat of your pants (go read the post and see).
The Science of Science
There's a bumper crop of writing about science as a subject in this Tangled Bank. First of all, Mike the Mad Biologist talks about Trust Versus Belief. Is there an equivalence between the "beliefs" of scientists and the beliefs of those Intelligent Design guys?
Speaking of Intelligent Design, Jakobische Rants mounts a Defense of Reasoned Inquiry and the scientific method, and Adventures in Ethics and Science asks if it is a good idea for scientists to give permission to Creation Magazine to reproduce figures and videos for their own purposes. What's more important: sharing your findings with the scientific community or ensuring that your findings aren't misrepresented?
Hsien-Lei over at Genetics and Public Health, meanwhile, has written about the Strong-Inference-Plus experimental paradigm, which sounds particularly interesting to us bioinformaticians (slow moving experiments are sooo last century).
Literary Darwinism
I'd never heard of Literary Darwinism before, and now in come two posts discussing it. As a public service to others, then, here's a page about it to help get a handle on what on earth is going on.
So, on to the posts, both of which, um, have some problems with evolutionary psychology as applied to human culture. Jerry Monaco has submitted Literature as Experience - A Hope for Literary Darwinism, while over at FrinkTank they wonder why the world's intellectuals accept that the most proximate answer to "why do we do that" can be found in evolution.
Engineering
Here in the UK our reputation for mighty feats of engineering suffered a little when we had to close the Millennium Bridge in London two days after it opened (it was wobbling too much). Political Calculations examines the post-mortem study.
Weird Science
 To finish up with, three posts from the fringes of science: the first about the UK again. Apparently British women have the biggest breasts in Europe. The keenest among you may notice that unfortunately the study that "discovered" this doesn't seem to have taken average weight into consideration: the French and Italians are, I'm guessing, just skinnier than us. British men probably have the second biggest breasts in Europe.
Meanwhile, over at Respectful Insolence, Orac knows a lot about the science of tinfoil hats - they're perhaps not such a good defence against alien mind control rays after all.
When I think of an invasive species I tend to think of an insect or a particularly evil microbe. The Japanese seem more worried about giant, frog eating hamsters. The Invasive Species Weblog brings you a relevant pamphlet.
See you next time
That's (almost) all for this fortnight. The next Tangled Bank is at Dogged Blog on the 30th of November - you can send your entries to host@tangledbank.net or to PZ Myers at pzmyers@pharyngula.org.
But we started at the beginning of the universe so it seems fitting that we finish up with the end of the world. A quirk of cyberspace has allowed the Science Musings blog an inside look at how it's all going down...
I recently came across the Open Regulatory Annotation Database (ORegAnno, at not too much of a stretch). Anyway, I haven't delved very deeply into it, but it looks promising. It's a database of regulatory regions obtained from the literature.
Anybody can add new records or fix old ones (there's a validation process in place), it has a SOAP interface and the data and the software that serves up the data are all available under the LGPL. This is what a database should be...
Unfortunately it doesn't seem all that big. I guess it's a new project and time will tell if it develops into a valuable resource or gets forgotten like the hundreds of other all-conquering new databases that spring up every year (usually just before the deadline for NAR's database issue).
Good luck to them, anyway.
The National Human Genome Research Institute (NHGRI) announced recently that it's going to divert some of the throughput of its large scale sequencing program towards medical genetics.
Despite falling costs, large scale sequencing is still far beyond the budget of your everyday researcher, especially those researchers working on obscure Mendelian disorders that commercially minded funders aren't really interested in.
NHGRI is planning on tackling some of those obscure disorders first. If a disease is particularly rare then it's difficult to assemble a big enough family of patients for linkage analysis to be able to pinpoint the gene in a useful way and investigators can be left knowing only that the "broken" gene responsible for the disease lies somewhere in a stretch of the genome many megabases long. NHGRI's approach is to simply step in and sequence the whole stretch in all of the patients, then pass on the data to the relevant researchers.
NHGRI's also planning on looking at unmapped X-linked disorders; they say that there are at least 130 of these, based on data from OMIM. This would involve sequencing all of the exons on the X chromosome from patients afflicted by a particular disorder and looking for the variations that they had in common, on the basis that those variations might be worthy of further study.
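The "look for the variations they have in common" step is, at its simplest, a set intersection. A minimal Python sketch follows. The patient names and variants are entirely made up, and a real analysis would also have to filter out common polymorphisms that turn up in unaffected people.

```python
from functools import reduce

def shared_variants(variants_by_patient):
    """Return the variants that every patient in the mapping carries."""
    variant_sets = [set(v) for v in variants_by_patient.values()]
    if not variant_sets:
        return set()
    return reduce(set.intersection, variant_sets)

# Made-up toy data: (position on X, reference base, observed base).
patients = {
    "patient_1": [(1200, "A", "G"), (5310, "C", "T"), (9001, "G", "A")],
    "patient_2": [(1200, "A", "G"), (5310, "C", "T")],
    "patient_3": [(5310, "C", "T"), (7777, "T", "C")],
}
print(shared_variants(patients))
```

Here only the variant at position 5310 is present in all three toy patients, so it is the one you would flag for a closer look.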
Even more ambitiously, they outline a plan to sequence the loci surrounding and containing genes involved in common complex disorders like epilepsy and diabetes from samples taken from thousands of patients whose physiological parameters have been accurately measured. Variants with medically interesting correlations can then be detected with statistical methods.
Much of the work that they talk about is presumably designed more to drive technology and to generate protocols and methods for similar work in the future, with any potentially positive outcomes (i.e. mapping some disease genes) being simply a bonus. It seems unlikely that NHGRI would invest so much money on a brute force approach in an area like Mendelian disorders otherwise. There'll be a lot of very happy M.D.s whose pet Mendelian trait is just about to benefit from federal funding.
A lot of data is going to be generated from projects like this and the various national Biobanks. A top tip, if you're just finishing up your undergraduate degree (or looking for a career switch): there was never a better time to become a biostatistician.
I'm lazy, so on the off chance that you don't already subscribe to the other blogs listed to the right, here are some links that I thought were interesting recently:
Incidentally, when I check my web logs to see who's linking here (I'm not just lazy, I'm vain, too) I can see the Google queries that lead to this page.
It turns out that people like lollipops far more than they like bioinformatics. How disappointed the person who typed "Where can I find really big lollipops" into Google must've been when they arrived here (were they looking for this?). Most people are obviously searching for lollipop recipes, like the one found here.
I'm thinking of jacking in my job and opening an online sweet store, instead.
If you follow science related news or blogs then you may have noticed an upsurge in the number of stories about large scale variation in the human genome recently. This news feature at Nature (subscription required, unfortunately) talks about why this is.
About 5% of the human genome is made up of segmental duplications - also known as low copy repeats (LCRs). These are stretches of DNA which at some point over the last 35 million years have been duplicated - or triplicated, or quadruplicated... in this sense the name can be a little misleading. They can be up to 400kb long, depending on whose definition you use, and they're interspersed around the genome.
In general, LCRs of a particular sequence which occur on the same chromosome are separated by less than 10Mb of intervening sequence. That intervening sequence is prone to all sorts of abnormal chromosomal rearrangements - the figure to the left (from the Nature article) demonstrates some of the possibilities.
This isn't a new discovery: we've known about structural variation for a while. What has surprised some geneticists, though, is that so many of us have large amounts of structural variation. Structural abnormalities that can be detected cytogenetically - essentially by looking at chromosomes under the microscope - are usually associated with disease and most of the basic research done in the field has been driven by people focused on specific diseases that are caused by chromosomal abnormalities, like DiGeorge syndrome (caused by a large deletion on chromosome 22). This seems to have engendered an unspoken assumption that people with chromosomal variations are invariably afflicted with disease (check out this post at Evolgen for a perspective on this from RPM).
But it turns out that we all have lots of relatively small chromosomal variations rather than one or two major disease causing deletions, duplications or inversions, and the phenotypic effect of all these variations can be subtle. Sharp et al. from the Eichler lab at the University of Washington undertook a study earlier this year of copy number polymorphisms (CNPs) - the "copy number variants" on the figure. They screened a panel of 47 people who came from a variety of ethnic backgrounds and found 119 different regions where copy number variation had occurred - i.e. regions where a particular sequence has been repeated or deleted. 66 of those 119 regions had copy number variation in more than one person but none of the regions were associated with any particular ethnic group, indicating that they were old - that they'd become established in the population before humans started spreading out across the globe.
Remember that these are substantial regions, not single nucleotides. Their results support the conclusion reached by Sebat et al. - who undertook a similar study the year before - that large scale copy number polymorphisms contribute substantially to genomic variation between normal humans.
Some of the duplicated regions highlighted by the Sharp & Sebat studies contain genes, or parts of genes. Gene duplication is a good thing, in evolutionary terms, as duplicates are freed from evolutionary constraints (if one copy mutates and ceases to perform as it should there's a backup waiting in the wings, so the mutant is free to develop new functions over time - or to fall by the wayside). Indeed, many of the genetic differences between humans and other primates are the result of large duplications and deletions.
The phenotypic differences that arise from having different gene copy numbers are a hot topic for investigation, especially given that many of the association studies to try and find the single point mutations influencing particular complex diseases haven't really lived up to the hopes and hype surrounding them. Sebat et al. are currently involved in exploring potential relationships between CNPs and autism, based on a hypothesis that alterations in gene dosage influence many neurological disorders. More famously, Gonzalez et al. recently showed that the number of copies of the CCL3L1 gene you carry influences your susceptibility to HIV / AIDS.
If you're interested, bioinformatics wise, a good place to start is the Eichlerlab's Human Structural Variation database, where the data from the Sharp and Sebat studies (amongst others) has been collected.
A while back I wrote a post about biohacking and Eduardo Kac (the guy who genetically engineered a GFP bunny). This week there's another "art meets genetics" type story in Wired entitled DNA Dose Seeds Living Tombstones.
A concise description of what it's all about can be found on Georg Tremmel's profile at NESTA (a UK funding body):
Georg Tremmel, along with colleague Shiho Fukuhara, plans to grow trees containing the genetic identity of humans. Their innovative coding method allows the encryption of human DNA within a tree's DNA without affecting the resulting tree. It would mean a person’s DNA could live on, with the tree, as a memorial for life, or ‘transgenic tombstone’.
More quotes from that profile:
“Life is DNA. If you can pass your DNA into a tree, you will live on within the tree.”
and, appallingly:
“Implanting grandmother Smith's DNA into an apple tree brings a whole new meaning to the phrase ‘Granny Smith’!”
Georg and Shiho have founded a company / art venture called Biopresence that will create these trees for around $35,000 a pop, according to Wired. The "innovative coding method" referred to comes from Joe Davis, another artist interested in art meeting science:
[in the 1980s] Davis led a quasi-covert operation that recorded the vaginal contractions of ballerinas with the Boston Ballet and other women, then translated this impetus of human conception into text, music, phonetic speech and ultimately into radio signals, which were beamed from MIT's Millstone radar to Epsilon Eridani, Tau Ceti, and two other nearby star systems.
I kid you not. Incidentally, the piece in the Scientific American that I got that paragraph from is some of the worst writing in a science magazine I've ever seen.
Anyway, Joe Davis' "DNA Manifold" idea is to harness the awesome power of codon redundancy; in theory and to an extent you can change the nucleotide composition of a gene without affecting the protein that it encodes. By assigning specific binary strings to specific codons you could shuffle around nucleotides to encode arbitrary data without changing the string of amino acids produced by a gene. So if Biopresence do that, it'll mean the tree will remain unchanged, right? Some people might mention "codon bias" and "unpredictable consequences" at this point, but hey, shut up, spoilsports.
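To see how the codon redundancy trick could work in principle, here's a toy Python sketch. I should stress that this is not Joe Davis' or Biopresence's actual scheme, which I haven't seen. It's an assumption-laden illustration that hides two bits in the third base of a handful of "four-fold degenerate" codons, where any third base gives the same amino acid in the standard genetic code. The bit-to-base mapping and the function names are arbitrary choices of mine.

```python
# Amino acids whose four codons differ only in the third base
# (standard genetic code): any third base gives the same amino acid.
FOURFOLD_PREFIXES = {"GC": "Ala", "GG": "Gly", "CC": "Pro",
                     "AC": "Thr", "GT": "Val"}
THIRD_BASES = "ACGT"  # arbitrary mapping: A=00, C=01, G=10, T=11

def split_codons(coding_sequence):
    """Split a coding sequence into complete three-base codons."""
    usable = len(coding_sequence) - len(coding_sequence) % 3
    return [coding_sequence[i:i + 3] for i in range(0, usable, 3)]

def embed_bits(coding_sequence, bits):
    """Hide an even-length bit string in four-fold degenerate codons."""
    remaining = bits
    out = []
    for codon in split_codons(coding_sequence):
        if codon[:2] in FOURFOLD_PREFIXES and len(remaining) >= 2:
            codon = codon[:2] + THIRD_BASES[int(remaining[:2], 2)]
            remaining = remaining[2:]
        out.append(codon)
    if remaining:
        raise ValueError("not enough four-fold degenerate codons")
    return "".join(out)

def extract_bits(coding_sequence):
    """Read two bits back from every four-fold degenerate codon."""
    return "".join(format(THIRD_BASES.index(c[2]), "02b")
                   for c in split_codons(coding_sequence)
                   if c[:2] in FOURFOLD_PREFIXES)

def translate_fourfold(coding_sequence):
    """Translate using the small table above ('?' for anything else)."""
    return [FOURFOLD_PREFIXES.get(c[:2], "?")
            for c in split_codons(coding_sequence)]

original = "GCAGGTCCCACG"  # Ala Gly Pro Thr
encoded = embed_bits(original, "11001001")
print(encoded, translate_fourfold(encoded), extract_bits(encoded))
```

The nucleotides change but the amino acid string doesn't, which is the whole point. It also shows how little room there is: two bits per suitable codon, before you even start worrying about codon bias.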
I'm also not sure exactly what "the essence of a human being" is, in genetic terms. It's that information that'll be encoded in the trees (the blurb at NESTA says something about "a person's DNA living on" but there's obviously considerably more human DNA - even just the bits with a known purpose - than there is space in a cherry blossom's coding regions). Whatever this essence is, it can be extracted from skin cells for less than $35,000.
But anyway, talking about the science of it is beside the point. Biopresence is an art project - at least I hope it is. The whole point of it is to stimulate discussion and to fire the imagination.
It's still all impractical, metaphysical guff, though.
Via Gene Expression and The Genius, two nice visualisations: one computer rendered and the other obtained through traditional microscopy.
The computer rendered one is this animated tour of RNA interference at Nature. Let me say first that I think it's very cool, it looks great, kudos to Arkitek Studios (who made it). However, like most computer renderings of cell nuclei this one is faintly reminiscent of a space battle (I'm sure that I've seen that RNAi silencing complex model before on Babylon 5). And who knew that Pacman spends his spare time disposing of aberrant mRNA?
The second one is of the H5N1 strain of the avian flu virus, taken by Swedish science photographer Lennart Nilsson. Apparently the blue balls are the virus attacking healthy pink cells. High resolution versions are available at Dagens Nyheter. I don't know what the yellow thing in the right hand photograph is: maybe it's Pacman again, in dire straits.
Via Ars Technica: Amazon announced two new features this morning. The title of the press release ...
Amazon.com Announces Plans for Innovative Digital Book Programs That Will Enable Customers to Purchase Online Access to Any Page, Section, or Chapter of a Book, as Well as the Book in Its Entirety
... says it all, really. In theory it means that you'll be able to buy just the chapters that interest you from, say, scientific textbooks (in practice there's presumably an "opt out" option for publishers worried about profits).
I'd like to see this happen. IT books are expensive - unfortunately that's par for the course for any scientific textbook - but they also become obsolete very quickly. At least this way you're only paying for what you need.
Speaking of which we almost certainly don't need the obligatory man pages and APIs included as appendices in many such tomes. What's with the "my hardback is bigger than yours" school of thought in computer science publishing? You think your readers are incapable of printing things out for themselves?
There's a new Tangled Bank up at the Examining Room of Dr. Charles. Good stuff this fortnight includes funky MacGyver style garage genetics (DNA extraction and electrophoresis using everyday materials - gel boxes made out of lego and plastic wrap, etc.). It's aimed at high school teachers to use in class:
As a bonus, the precipitation of DNA into a ‘snot’ like form adds an added ‘wow’ factor to the activity.
(That should keep them happy.) There's also a discussion of Peter Andolfatto's letter to Nature concerning the evolution of non-coding drosophila DNA at Evolgen and a post by Dr. Andy about the relationship between left-handedness and breast cancer, amongst many other things.
Tangled Bank is a good way to find new science related blogs and there's always at least a few interesting stories - check it out.
And now the blatant plug: F&L is hosting Tangled Bank #41 on the 16th November. If you've got a science or medicine related post that you're particularly proud of, or think might be interesting, or funny, or well written... submit it to me at stew@flagsandlollipops.com or to PZ at pzmyers@pharyngula.org (in which case you should put "Tangled Bank" somewhere in the subject line).
Don't hesitate, don't be shy, don't wonder if your work is good enough—flit right into the bank with the rest of us elaborately constructed forms. This is an egalitarian activity. You do not have to be a Ph.D., you don't have to write articles with ten-syllable words, you don't have to discuss esoteric details. All you have to do is express some enthusiasm for the natural world or encourage study of the same. See the tangledbank.net site for more details and some very rough guidelines.