Flags and Lollipops

Monday, August 01, 2005

Restricted Access & the HGMD

I've used the Human Gene Mutation Database as a data source fairly frequently. In case you haven't come across it before, it does pretty much what it says on the tin - it's a database of various (disease-linked) mutations grouped by gene. If you wanted to get a set of disease-causing SNPs or a list of translocation breakpoints that fall within genes, for example, it'd be great.

The drawback is that there's no easy way to get at the data. On the website, your only option is to search by gene, which returns a list of the mutations that the gene contains. There's no form of advanced search and no condoned way to bulk download the contents of the database (less reputably, there's always wget configured with a time delay).

This is obviously something that the authors have considered. However, in their paper in NAR they mention that:
Since HGMD is partly dependent upon industrial funding and involves considerable editorial work over and above mere literature screening (e.g. to ensure the consistency of nucleotide sequence information, amino acid residue numbering and gene symbol usage), unsolved copyright problems have so far precluded HGMD from being downloadable in its entirety.
It disturbs me slightly that this sort of thing is an issue. I think that's because, unlike in lab-based genetics, bioinformatics resources are usually free: programming languages (Java, Perl), libraries (Bio*, NCBI's API, Seqhound) and data (Ensembl, PubMed abstracts, GNF expression data...). Free is a tricky concept nowadays, of course, but I mean it in the sense that they are usually free to obtain and to use in an academic environment.

Just to be clear, I'm not disparaging the work of the people involved in the HGMD, just the politics behind some of their policies. The fact remains that the HGMD is a good database. It has the potential to be even better, though.

Why not release copyright on this kind of data, or allow researchers to use the relevant information after signing a release to ensure that they stick to your terms and conditions? Restricting access in this way (especially without explanation, unless you've read the relevant part of the paper) surely just annoys scientists. There's no corporate peer pressure anymore; even Celera has given up trying to hoard genomic data that goes out of date by the time you've worked out how to charge for it.

Let open access work for you. The mutations in HGMD are often culled from the literature, and relating them to reference sequences is remarkably difficult. An internal database identifier and "Asn351 to Asp" are only useful if people know which transcript is being talked about. Make the first condition of using HGMD data that any derived analyses be made publicly available too. Presumably the first thing that some people will do is take a Perl script and dbSNP and start mapping. Start including annotation derived from HGMD by places like SNPs3D.
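To make the transcript problem concrete, here's a minimal sketch in Python of the sort of sanity check a mapping script might start with. It parses a description written like "Asn351 to Asp" and reports which candidate protein sequences actually have the stated reference residue at that position. The function names, the isoform names and the toy sequences are all hypothetical, and the description format is assumed from the example above rather than taken from HGMD itself.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Assumed format: "<ref residue><position> to <new residue>", e.g. "Asn351 to Asp".
SUBSTITUTION = re.compile(r"^([A-Z][a-z]{2})(\d+) to ([A-Z][a-z]{2})$")


def parse_substitution(text):
    """Return (ref, position, alt) using one-letter codes and a 1-based position."""
    match = SUBSTITUTION.match(text.strip())
    if match is None:
        raise ValueError("unrecognised substitution: %r" % text)
    ref, pos, alt = match.groups()
    if ref not in THREE_TO_ONE or alt not in THREE_TO_ONE:
        raise ValueError("unknown amino acid code in: %r" % text)
    return THREE_TO_ONE[ref], int(pos), THREE_TO_ONE[alt]


def consistent_transcripts(text, proteins):
    """List the names of proteins whose residue at the stated position matches.

    `proteins` maps a (hypothetical) transcript name to its protein sequence.
    """
    ref, pos, _alt = parse_substitution(text)
    return [
        name
        for name, seq in proteins.items()
        if len(seq) >= pos and seq[pos - 1] == ref
    ]


if __name__ == "__main__":
    # Toy sequences, not real isoforms.
    candidates = {"isoform_a": "MKNLV", "isoform_b": "MKLV"}
    print(consistent_transcripts("Asn3 to Asp", candidates))
```

A match only says that a transcript is consistent with the description, not that it's the one the original paper used; if several isoforms share the residue at that position, the ambiguity remains.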

There's a note of hope in the next paragraph of the paper in NAR.
Once the closer cooperation with publically funded bioinformatics institutions currently envisaged has been put in place, unrestricted access to the database will become possible.
Publically funded bioinformatics institutions? In the UK?

Back to OMIM and dbSNP it is.
