text2genome: Annotating genes and genomes with DNA sequences extracted from open-access biomedical articles

text2genome maps scientific articles to genomic locations in an unusual way: all words that resemble DNA sequences are extracted from a full-text scientific article and its supplementary data files, and are then mapped to public genome sequences. The mapped sequences can then be displayed on genome browser websites and used in data-mining applications.
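The extraction step described above can be illustrated with a minimal sketch. This is not the text2genome implementation: the regular expression, the 16-letter minimum length, and the function names `extract_dna_words` and `locate` are assumptions made for illustration, and the exact-match lookup is only a toy stand-in for mapping against real genome sequences.

```python
import re

# Assumption: a "DNA-like word" is an unbroken run of A, C, G, T or N
# that is at least MIN_LEN letters long. The threshold is illustrative
# and is not the value used by text2genome.
MIN_LEN = 16
DNA_WORD = re.compile(r"\b[ACGTN]{%d,}\b" % MIN_LEN, re.IGNORECASE)


def extract_dna_words(text):
    """Return the upper-cased DNA-like words found in text, in order."""
    return [m.group(0).upper() for m in DNA_WORD.finditer(text)]


def locate(sequence, reference):
    """Toy stand-in for genome mapping: exact-match offset, or -1."""
    return reference.find(sequence)


# Hypothetical article sentence and a made-up reference sequence.
article = (
    "The forward primer 5'-ACGTTGCAAGCTTGCATGCC-3' was used "
    "to amplify the region."
)
reference = "TTTT" + "ACGTTGCAAGCTTGCATGCC" + "GGGG"

hits = extract_dna_words(article)
for seq in hits:
    print(seq, locate(seq, reference))
```

A real pipeline would replace `locate` with an alignment tool that tolerates mismatches and searches whole genomes; the sketch only shows how sequence-like words can be separated from ordinary prose.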

Figure: Example illustrating the idea of text2genome.

The publication describing the text2genome system on open-access publications is: Haeussler, Gerner and Bergman (2011) Annotating genes and genomes with DNA sequences extracted from biomedical articles. Bioinformatics 27:980-6.

Source code for the text2genome application can be found at the project's SourceForge repository.

This website demonstrates how the results from the 2011 article can be used. You can search, browse and download data obtained from running text2genome on more than 150,000 open-access articles from PubMed Central.

Data can be overlaid onto the Ensembl and UCSC genome browsers. For examples, please see the Search page and the links on the Browse page.

Update: The text2genome project is now being extended to cover a larger part of the scientific literature. The extension is carried out by Maximilian Haeussler and David Haussler at the Center for Biomolecular Science and Engineering at the University of California, Santa Cruz, and Casey Bergman at the University of Manchester, UK.

The results of this collaboration are native tracks on the UCSC Genome Browser showing mapped sequences from PubMed Central and ScienceDirect full-text articles. Click here for an example.

For current developments, including progress with non-open-access publishers and extensions to the original text2genome system, see the UCSC Genocoding project.