Call for Papers: PLoS Text Mining Collection

Posted 21 May 2012 — by caseybergman
Category text mining

[For a some background on this initiative, please see this Blog post.  -CMB]

The Public Library of Science (PLoS) seeks submissions in the broad field of text-mining research for a collection to be launched across all of its journals in 2013. All submissions submitted before October 30th, 2012 will be considered for the launch of the collection. Please read the following post for further information on how to submit your article.

The scientific literature is exponentially increasing in size, with thousands of new papers published every day. Few researchers are able to keep track of all new publications, even in their own field, reducing the quality of scholarship and leading to undesirable outcomes like redundant publication. While social media and expert recommendation systems provide partial solutions to the problem of keeping up with the literature, systematically identifying relevant articles and extracting key information from them can only come through automated text-mining technologies.

Research in text mining has made incredible advances over the last decade, driven through community challenges and increasingly sophisticated computational technologies. However, the promise of text mining to accelerate and enhance research largely has not yet been fulfilled, primarily since the vast majority of the published scientific literature is not published under an Open Access model. As Open Access publishing yields an ever-growing archive of unrestricted full-text articles, text mining will play an increasingly important role in drilling down to essential research and data in scientific literature in the 21st century scholarly landscape.

As part of its commitment to realizing the maximal utility of Open Access literature, PLoS is launching a collection of articles dedicated to highlighting the importance of research in the area of text mining. The launch of this Text Mining Collection complements related PLoS Collections on Open Access and Altmetrics (forthcoming), as well as the recent release of the PLoS Application Programming Interface, which provides an open API to PLoS journal content.

As part of this Text Mining Collection, we are making a call for high quality submissions that advance the field of text-mining research, including:

  • New methods for the retrieval or extraction of published scientific facts
  • Large-scale analysis of data extracted from the scientific literature
  • New interfaces for accessing the scientific literature
  • Semantic enrichment of scientific articles
  • Linking the literature to scientific databases
  • Application of text mining to database curation
  • Approaches for integrating text mining into workflows
  • Resources (ontologies, corpora) to improve text mining research

Please note that all submissions submitted before October 30th, 2012 will be considered for the launch of the collection (expected early 2013); submissions after this date will still be considered for the collection, but may not appear in the collection at launch.

Submission Guidelines
If you wish to submit your research to the PLoS Text Mining Collection, please consider the following when preparing your manuscript:

All articles must adhere to the submission guidelines of the PLoS journal to which you submit.
Standard PLoS policies and relevant publication fees apply to all submissions.
Submission to any PLoS journal as part of the Text Mining Collection does not guarantee publication.

When you are ready to submit your manuscript to the collection, please log in to the relevant PLoS manuscript submission system and mention the Collection’s name in your cover letter. This will ensure that the staff is aware of your submission to the Collection. The submission systems can be found on the individual journal websites.

Please contact Samuel Moore ( if you would like further information about how to submit your research to the PLoS Text Mining Collection.

Casey Bergman (University of Manchester)
Lawrence Hunter (University of Colorado-Denver)
Andrey Rzhetsky (University of Chicago)

Cross posted at the PLoS Blog

Compendium of Drosophila Genome Assemblies

Posted 28 Apr 2012 — by caseybergman
Category drosophila, genome bioinformatics, high throughput sequencing

[UPDATE: The UCSD Drosophila Species Stock Center now maintains a list of information about Drosophila genome sequences: The compendium here will no longer be updated. To ensure a single, up-to-date source of information, please send suggested links to the DSSC.]

Since the publication of the Drosophila melanogaster genome, the D. pseudoobscura genome, and the remaining 10 species from the Drosophila 12 Genomes Project, a number of other Drosophila genomes have been sequenced and assembled using whole genome shotgun approaches. Periodically, I find myself needing to find versions of these assemblies from the data producer, NCBI, Flybase, or UCSC, and so to help streamline this process I’ve collected these links together into a single compendium here, along with links to the original stocks used for sequencing. Hopefully this will be of use to others as well.

Data Producer
UCSC Genome Browser
Stock Center
D. melanogaster Berkeley Drosophila Genome Project AABU00000000 dmel dm3 1522
D. simulans Washington University Genome Sequencing Center AAGH00000000 dsim droSim1 n.a.
D. simulans Andolfatto Lab n.a. n.a. n.a. 1380
D. mauritiana Schlotterer Lab n.a. n.a. n.a. E-18912
D. sechellia Broad Institute AAKO00000000 dsec droSec1 1373
D. yakuba Washington University Genome Sequencing Center AAEU00000000 dyak droYak2 1385
D. santomea Andolfatto Lab n.a. n.a. n.a. 2367
D. erecta Agencourt Biosciences AAPQ00000000 dere droEre2 1374
D. ficusphila Baylor College of Medicine Genome Center AFFG00000000 n.a. n.a. 2332
D. eugracilis Baylor College of Medicine Genome Center AFPQ00000000 n.a. n.a. 2350
D. suzukii University of Edinburgh GenePool CAKG00000000 n.a. n.a. n.a
D. biarmipes Baylor College of Medicine Genome Center AFFD00000000 n.a. n.a. 2312
D. takahashii Baylor College of Medicine Genome Center AFFI00000000 n.a. n.a. 2325
D. elegans Baylor College of Medicine Genome Center AFFF00000000 n.a. n.a. 2334
D. rhopaloa Baylor College of Medicine Genome Center AFPP00000000 n.a. n.a. 2327
D. kikkawai Baylor College of Medicine Genome Center AFFH00000000 n.a. n.a. 2330
D. bipectinata Baylor College of Medicine Genome Center AFFE00000000 n.a. n.a. 2329
D. ananassae Agencourt Biosciences AAPP00000000 dana droAna3 1368
D. pseudoobscura Baylor College of Medicine Genome Center AADE00000000 dpse dp4 1277
D. persimilis Broad Institute AAIZ00000000 dper droPer1 1369
D. miranda Bachtrog Lab AJMI00000000 n.a. n.a. n.a.
D. willistoni J. Craig Venter Institute AAQB00000000 dwil droWil1 1375
D. virilis Agencourt Biosciences AANI00000000 dvir droVir3 1371
D. americana Vieira Lab n.a. n.a. n.a. n.a.
D. mojavensis Agencourt Biosciences AAPU00000000 dmoj droMoj3 1370
D. grimshawi Agencourt Biosciences AAPT00000000 dgri droGri2 1225
D. albomicans Wang/Bachtrog Labs ACVV00000000 n.a. n.a. n.a.

Wolbachia and Mitochondrial Genomes from Drosophila melanogaster Resequencing Projects

As part of ongoing efforts to characterize complete genome sequences of microbial symbioints of Drosophila species, the Bergman Lab has been involved in mining complete genomes of the Wolbachia endosymbiont from whole-genome shotgun sequences of D. melanogaster. This work is inspired by Steve Salzberg and colleagues’ pioneering paper in 2005 showing that Wolbachia genomes can be extracted from the whole-genome shotgun sequence assemblies of Drosophila species. We have adapted this technique to utilize short-read next generation sequencing data from population genomics projects in D. melanogaster, together with the reference Wolbachia genome published by Wu et al. (2005).

Currently we have identified infection status and extracted nearly-complete genomes from the two major resequencing efforts in D. melanogaster: the Drosophila Genetic Reference Panel (DGRP) and Drosophila Population Genomics Project (DPGP). We employ a fairly standard BWA/SAMtools  pipeline, with a few tricks that are essential for getting good quality assemblies and consensus sequences. Our first iteration of this pipeline was used to predict infection status in the DGRP strains. These results have been published in the DGRP main paper published in early 2012, and can be found in Supplemental Table 9 of the DGRP paper or more usefully in a comma-separated file at the DGRP website.

We have now applied an update version of our pipeline to both the DGRP and DPGP datasets and have used these “essentially complete” genome sequences to study the recent evolutionary history of Wolbachia in D. melanogaster. In the process of generating these data we received several inquiries about the status of this project so, in the spirit of Open Science that makes genomics such a productive field, we have released the consensus sequences and alignments of 179 Wolbachia and 290 mitochondrial genomes from the DGRP and DPGP prior to publication of our manuscript.

Since we are not the primary data producer for these assemblies, it is not approporiate to employ a data release policy based on the NHGRI guidelines. Instead, we have chosen to release these data under a Creative Commons Attribution 3.0 Unported License. If you use these assemblies or alignments in your projects, please cite the main DGRP and DPGP papers for the raw data and cite the following reference for the mtDNA or Wolbachia assemblies:

Richardson, M.F., L.A. Weinert, J.J. Welch, R.S. Linheiro, M.M. Magwire, F.M. Jiggins & C.M. Bergman (2012) Population genomics of the Wolbachia endosymbiont in Drosophila melanogaster. PLOS Genetics 8:e1003129.

A tar.gz archive contaning a reference-based multiple sequence alignment of release 1.0 of the complete D. melanogaster mtDNA and Wolbachia assemblies can be obtained here. If you have questions about the methods used to produce these assemblies, any associated meta-data from these assemblies, or the content of our manuscript please contact Casey Bergman for details.

Creative Commons License

Customizing a Local UCSC Genome Browser Installation III: Loading Custom Data

Posted 29 Mar 2012 — by timburgis
Category database, drosophila, genome bioinformatics, linux, UCSC genome browser

In the second post of this series we examined the grp and trackDb tables within an organism database in the UCSC genome browser schema, and how they are used to build a UCSC genome browser display by specifying: (i) what sort of data a track contains, (ii) how it should be displayed, and (iii) how tracks should be organized within the display. In this last post of three, we will look at how to load and subsequently display custom data into a UCSC genome browser instance. As an example, we’ll be using a set of data from the modENCODE project formatted for loading into the UCSC browser. For this you’ll need a working browser install and the dm3 organism database (see the first post in this series for details).

An important feature of the genome browser schema that permits easy customization is that you can have multiple grp and trackDb tables within an organism database.  This allows separation of tracks for different purposes, such as trackDb_privateData, trackDb_chipSeq, trackDb_UserName etc. The same goes for grp tables. Thus, you can pull in data from the main UCSC site, leave it untouched, and simply add new data by creating new trackDb and grp tables.

This post outlines the five steps for adding custom data into a local UCSC genome browser:

1) Getting the data into the correct format
2) Building the required loader code
3) Loading the data
4) Configuring the track’s display options
5) Creating the new trackDb and grp tables and configuring the browser to use it.

1) The UCSC browser supports a number of different data formats, which are described at the UCSC format FAQ. Defining where your data comes from, and what sort of data it is (e.g. an alignment, microarray expression data etc.) will go a long way towards determining what sort of file you need to load. In many cases your data will already be in one of the supported formats so the decision will have been made for you. If your data isn’t in one of the supported formats, you’ll need to decide which one is the most appropriate and then reformat the data using a suitable tool, e.g. a perl or bash script. The example data we will be loading is taken from this paper. The supplementary data contains an Excel spreadsheet that we need to reformat into a number of BED files and then load into the UCSC database.

2) Each of the data formats that can be loaded into the browser system has its own loader program. These are found in the same source tree that we used to build the browser CGIs. You will find the source code for the loaders in the Kent source tree, the BED loader is in the directory kent/src/hg/makeDb/hgLoadBed. Assuming you have followed the instructions in the first part of this blog post then to compile the BED loader you simply cd to its directory and type make.

cd kent/src/hg/makeDb/hgLoadBed

The hgLoadBed executable will now be in ~/bin/$MACHTYPE. If you haven’t already it is a good idea to add this directory to your path, add the following line to ~/.bashrc:

export PATH=$PATH:~/bin/$MACHTYPE

The other loader we’ll require for the example custom data included in this blog is hgBbiDbLink:

cd kent/src/hg/makeDb/hgBbiDbLink

3) To load data, e.g. a BED file, the command is simply:

hgLoadBed databaseName tableName pathToBEDfile

The data-loaders supplied with the browser can be divided into two broad groups. The first of these, as represented by hgLoadBed, actually load the data into a table within the organism database. For example, one of the histone-modification datasets looks like this as a BED file:

chr2L   5982    6949    ID=Embryo_0_12h_GAF_ChIP_chip.GAF       5.54
chr2L   65947   68135   ID=Embryo_0_12h_GAF_ChIP_chip.GAF       10.54
chr2L   72137   73344   ID=Embryo_0_12h_GAF_ChIP_chip.GAF       6.55
chr2L   107044  109963  ID=Embryo_0_12h_GAF_ChIP_chip.GAF       8.93
chr2L   127917  130678  ID=Embryo_0_12h_GAF_ChIP_chip.GAF       9.17

When this BED file is loaded using hgLoadBed it produces a table that has the following structure:

DESC MdEncHistonesGAFBG3DccId2651;
| Field      | Type                 | Null | Key | Default | Extra |
| bin        | smallint(5) unsigned | NO   |     | NULL    |       |
| chrom      | varchar(255)         | NO   | MUL | NULL    |       |
| chromStart | int(10) unsigned     | NO   |     | NULL    |       |
| chromEnd   | int(10) unsigned     | NO   |     | NULL    |       |
4 rows in set (0.00 sec)

And contains (results set truncated):

select * from MdEncHistonesGAFBG3DccId2651;
| 754 | chrX     |   22237479 | 22239805 |
| 754 | chrX     |   22255450 | 22257285 |
| 754 | chrX     |   22257996 | 22259538 |
| 754 | chrX     |   22259653 | 22261174 |
| 754 | chrX     |   22264098 | 22265194 |
| 585 | chrYHet  |      84149 |    85164 |
| 585 | chrYHet  |     104864 |   105879 |
| 586 | chrYHet  |     150067 |   151212 |
5104 rows in set (0.06 sec)

The second type of loader doesn’t insert the data into a database table directly. Instead data in formats such as bigWig and bigBed are deposited in a particular location on the filesystem and a table with a single row simply points to them. For example:

desc mdEncReprocCbpAdultMaleTreat;
| Field    | Type         | Null | Key | Default | Extra |
| fileName | varchar(255) | NO   |     | NULL    |       |
1 row in set (0.00 sec)

Which contains:

select * from mdEncReprocCbpAdultMaleTreat;
| fileName                                     |
| /gbdb/dm3/modencode/ |
1 row in set (0.02 sec)

The loader hgBbiDbLink is responsible for producing this simple table that points to the location of the data on the filesystem. When the browser comes to display a track that is specified as bigWig in trackDb’s type field it knows that the table specified in trackDb’s tableName field will give point it at the bigWig file rather than containing the data itself. Supposed we have placed our bigWig file in the /gbdb directory:

cp /gbdb/dm3/modencode/.

Then we must execute hgBbiDbLink as follows:

hgBbiDbLink databaseName desiredTableName pathToDataFile


hgBbiDbLink dm3 mdEncReprocCbpAdultMaleTreat /gbdb/dm3/modencode/

4) Once the data is loaded into the database we need to tell the browser how it should be displayed, this is the role of the trackDb table. We create a new trackDb table for our data using the command hgTrackDb. This is found in the same branch of the source tree and the individual loaders themselves. The way in which tracks and grouped and displayed by the browser is specified in a file which you pass to hgTrackDb called a .ra file, which specifies the contents of the different fields in the trackDb table, as well as the parent-child relationship for composite tracks. The example data provided in this blog post contains a trackDb.ra file that looks like this:

track dnaseLi
shortLabel DNAse Li 2011
longLabel DNAse Li Eisen Biggin Genome Biol 2011
group blog
priority 100
visibility hide
color 200,20,20
altcolor 200,20,20
compositeTrack on
type bed 3

    track dnaseLiStage5
    shortLabel Stage 5
    longLabel Li Dnase Stage 5
    parent dnaseLi
    priority 1

    track dnaseLiStage9
    shortLabel Stage 9
    longLabel Li Dnase Stage 9
    parent dnaseLi
    priority 2

    track dnaseLiStage10
    shortLabel Stage 10
    longLabel Li Dnase Stage 10
    parent dnaseLi
    priority 3

    track dnaseLiStage11
    shortLabel Stage 11
    longLabel Li Dnase Stage 11
    parent dnaseLi
    priority 4

    track dnaseLiStage14
    shortLabel Stage 14
    longLabel Li Dnase Stage 14
    parent dnaseLi
    priority 5

The example data (e.g. /root/ucsc_example_data) comes from 5 different stages of embryonic development so we have 5 BED files. The trackDb.ra file groups these as a composite track so we have the top level track (dnaseLi) which is the parent of the 5 individual tracks. To load these definitions into the database we execute hgTrackDb as follows:

~/bin/i386/hgTrackDb drosophila dm3 trackDb_NEW kent/src/hg/lib/trackDb.sql /root/ucsc_example_data

The complete set of commands to download the example data and load it are is therefore:

tar xzvf blogData.tar.gz
cd ucsc_example_data
~/bin/i386/hgLoadBed dm3 dnaseLiStage5 stage5.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage9 stage9.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage10 stage10.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage11 stage11.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage14 stage14.bed
~/bin/i386/hgTrackDb drosophila dm3 trackDb_NEW kent/src/hg/lib/trackDb.sql /root/ucsc_example_data

5) Tell the browser to use our new trackDb table by editing the hg.conf configuration file in /var/www/cgi-bin and adding the new trackDB table name to the db.trackDb line. The db.TrackDb line takes a comma-separated list of trackDb tables that the browser will display. Assuming that our new trackDb table is called trackDb_NEW then we must edit hg.conf from this:


to this:


You’ll notice in the supplied trackDb.ra file that the tracks belong to a group called ‘blog’. This group doesn’t currently exist in the database. There are two ways we can go about rectifying this. The first is to add blog to the existing grp table like so:

INSERT INTO grp (name,label,priority) VALUES('blog','Example Data',11);

Or you can create an entirely new grp table. The browser handles multiple grp tables in the same fashion as multiple trackDb tables, i.e. they are specified as a comma-separated list in hg.conf. To create a new grp table, execute:

INSERT INTO grp_NEW (name,label,priority) VALUES('blog','Example Data',11);

And then edit hg.conf to add grp_NEW to the db.grp line:


If you now refresh your web browser the cgi scripts should create a new session with the new data tracks displayed as configured in the grp table.

Customizing a Local UCSC Genome Browser Installation II: The UCSC Genome Browser Database Structure

Posted 29 Mar 2012 — by timburgis
Category database, drosophila, genome bioinformatics, linux, UCSC genome browser

This is the second of three posts on how to set up a local mirror of the UCSC genome browser hosting custom data. In the previous post, I covered how to perform a basic installation of the UCSC genome In this post, I’ll provide an overview of the mySQL database structure that serves data to the genome browser using the Bergman Lab mirror ( as an example. In the next post, I focus on how to load custom data into a local mirror.

General databases

hgcentral – this is the central control database of the browser. It holds information relating to each assembly available on the server, organizing them into groups according to their clade and organism. For instance the mammalian clade may contain human and mouse assemblies, and for mouse we may have multiple assemblies (e.g. mm9, mm8 etc.).

The contents of the clade, genomeClade, dbDb and defaultDb tables control the three dropdown menus on the gateway page of the browser. The clade and genomeClade tables have a priority field that controls the order of the dropdown menus (see figure 1).

The defaultDb table specifies which assembly is the default for a particular organism. This is usually the latest assembly e.g. mm9 for mouse. You can alter this if, for whatever reason, you want to use a different assembly as your default:

UPDATE defaultDb SET name = 'mm8' WHERE genome = 'mouse';

Databases for each organism/assembly are listed in the dbDb table within hgcentral (see more below). For example the latest mouse assembly (mm9) has an entry that contains information relating to this build, an example query (not retrieving all the information contained within dbDb) looks like this:

> select name,description,organism,sourceName,taxId from dbDb where name = 'mm9';</pre>
 | name | description | organism | sourceName | taxId |
 | mm9 | July 2007 | Mouse | NCBI Build 37 | 10090 |

hgFixed – this database contains a bunch of human related information, predominately microarray stuff. If you are running a human mirror then you may wish to populate this database. If not, then it can be left empty.

Genome databases

We will now look at the structure of a genome database for a specific organism/assembly using mm9 as an example.

The first table to consider is chromInfo, this simple table specifies the name of each chromosome, its size and the location of its sequence file. Mouse has 19 chromosomes (named chr1 – chr19) plus the two sex chromosomes (chrX and chrY), mitochondrial DNA (chrM) and unmapped clone contigs (chrUn). If we list the tables within our minimal mm9 install we will see that most of the tables can clearly be identified as belonging to a particular chromosome by virtue of their names, e.g. all the tables with names starting chr10_* contain data relating to chromosome 10.

Of the remainder, the two tables of most interest to understanding how the browser works, and leading us towards the ability to load custom data, are the grp and trackDb tables. These two tables control the actual genome browser display, as we discussed above for the hgcentral tables that control the gateway web interface. Not surprisingly, the trackDb table holds information about all the data tracks you can see (see figure 2), and grp tells the browser how all these tracks should be organized (e.g. we may want a separate group for different types of experimental data, another for predictions made using some computational technique etc.).

Figure 2 shows a screenshot of the mm9 browser we developed in the Bergman Lab as part of the CISSTEM project.

Here, we can see that the tracks are grouped into particular types (Mapping & Sequence Tracks, Genes and Gene Prediction tracks etc.) – these are specified in mm9’s grp table and, like the dropdowns on the gateway page, ordered using a priority field

> select * from grp order by priority asc
| name        | label                            | priority |
| user        | Custom Tracks                    |        1 |
| map         | Mapping and Sequencing Tracks    |        2 |
| genes       | Genes and Gene Prediction Tracks |        3 |
| rna         | mRNA and EST Tracks              |        4 |
| regulation  | Expression and Regulation        |        5 |
| compGeno    | Comparative Genomics             |        6 |
| varRep      | Variation and Repeats            |        7 |
| x           | Experimental Tracks              |       10 |
| phenoAllele | Phenotype and Allele             |      4.5 |

The tracks that constitute these groups are specified in the trackDb table, we can see the tracks for a particular group like so:

> select longLabel from trackDb where grp = 'map' order by priority asc;
| longLabel                                                                   |
| Chromosome Bands Based On ISCN Lengths (for Ideogram)                       |
| Chromosome Bands Based On ISCN Lengths                                      |
| Mapability - CRG GEM Alignability of 36mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 40mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 50mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 75mers with no more than 2 mismatches  |
| STS Markers on Genetic and Radiation Hybrid Maps                            |
| Mapability - CRG GEM Alignability of 100mers with no more than 2 mismatches |
| Physical Map Contigs                                                        |
| Assembly from Fragments                                                     |
| Gap Locations                                                               |
| BAC End Pairs                                                               |
| Quantitative Trait Loci From Jackson Laboratory / Mouse Genome Informatics  |
| GC Percent in 5-Base Windows                                                |
| Mapability or Uniqueness of Reference Genome                                |

These two tables provide information to the browser to know which groups to display, which tracks belong to a particular group, and in which order they should be displayed. The remaining fields in the trackDb table fill in the gaps: what sort of data the track contains and how the browser is to display it. I won’t say anymore about trackDb here – we will naturally cover the details of how trackDb functions when we talk about loading custom data in the next post.

You will probably have noticed that there is not an exact equivalence between the query above and the tracks in the browser (e.g. the query returns a number of Mapability tracks whereas there is only a single Mapability dropdown in the browser). This is because some tracks are composite tracks — if you click Mapability in the browser you will see that the multiple tracks are contained within it. Composite tracks are specified in trackDb, we will investigate this in the next part of this blog post.

[This tutorial continues in Part III: Loading Custom Data]

Customizing a Local UCSC Genome Browser Installation I: Introduction and CentOS Install

Posted 29 Mar 2012 — by timburgis
Category database, drosophila, genome bioinformatics, linux, UCSC genome browser

The UCSC Genome Bioinformatics site is widely regarded as one of the most powerful bioinformatics portals for experimental and computational biologists on the web. While the UCSC Genome Browser provides excellent functionality for loading and visualizing custom data tracks, there are cases when your lab may have genome-wide data it wishes to share internally or with the wider scientific community, or when you might want to use the Genome Browser for an assembly that is not in the main or microbial Genome Browser sites. If so then you might need to install a local copy of the genome browser.

A blog post by noborujs on E-notações explains how to install the UCSC genome browser on an Ubuntu system, and the UCSC Genome genome browser team provides a general walk-through on their wiki and an internal presentation. Here I will provide a similar walkthrough for installing it on a CentOS system. The focus of these posts is go beyond the basic installation to explain the structure of the databases that underlie the browser; how these database table are used to create the web front-end, and how to customize a local installation with your own data. Hopefully having a clearer understanding of the database and browser architecture should make the process of loading your own data far easier.

This blog entry has grown to quite a size, so I’ve decided to split it into 3 more manageable parts and you can find a link to the next part at the end of this post.

The 3 parts are as follows:

1) Introduction and CentOS install (this post)
2) The databases & web front-end
3) Loading custom data

The walk-through presented here installs CentOS 5.7 using VirtualBox so you can follow this on your desktop if you have sufficient disk space. Like the Ubuntu walk-through linked to above, this will speed you through the install process with little or no reference to what the databases actually contain, or how they relate to the browser, this information will be included in part 2.

CentOS Installation

• Grab CentOS 5.7 i386 CD iso’s from one of the mirrors. Discs 1-5 are required for this install.
• Start VirtualBox and select ‘New’. Settings chosen:
• Name = CentOS 5.7
• OS = linux
• Version = Red Hat
• Virtual memory of 1GB
• Create a new hard disk with fixed virtual storage of 64GB
• Start the new virtual machine and select the first CentOS iso file as the installation source
• Choose ‘Server GUI’ as the installation type when prompted
• Set the hostname, time etc. and make sure you enable http when asked about services
• When the install prompts you for a different CD, select “Devices -> CD/DVD devices -> Choose a virtual CD/DVD disk file” and choose the relevant ISO file.
• Login as root, run Package Updater, reboot.

UCSC Browser Install

• Install dependencies:

yum install libpng-devel
yum install mysql-devel

• Set environment variables, the following were added to root’s .bashrc:

export MACHTYPE=i386
export MYSQLLIBS="/usr/lib/mysql/libmysqlclient.a -lz"
export MYSQLINC=/usr/include/mysql
export WEBROOT=/var/www

Each time you open a new terminal these environment variables will be set automatically.

• Download the Kent source tree from, unpack it and then build it:

mkdir -p ~/bin/$MACHTYPE # only used when you build other utilities from the source tree
cd kent/src/lib
cd ../jkOwnLib
cd ../hg/lib
cd ..

You will now have all the cgi’s necessary to run the browser in /var/www/cgi-bin and some JavaScript and CSS in /var/www/html. We need to tell SELinux to allow access to these:

restorecon -R -v '/var/www'

• Grab the static web content. First edit kent/src/product/scripts/browserEnvironment.txt to reflect your environment (MACHTYPE, DOCUMENTROOT etc.) then

cd kent/src/product/scripts
./ browserEnvironment.txt

• Create directories to hold temporary files for the browser:

mkdir $WEBROOT/trash
chmod 777 $WEBROOT/trash
ln -s $WEBROOT/trash $WEBROOT/html/trash

• Set up MySQL for the browser. We need an admin user with full access to the databases we’ll be creating later, and a read-only user for the cgi’s.


  ON hgcentral.*
  TO ucsc_admin@localhost
  IDENTIFIED BY 'admin_password';

  ON hgcentral.*
  TO ucsc_browser@localhost
  IDENTIFIED BY 'browser_password';

The above commands will need repeating for each of the databases that we subsequently create.

We also need a third user that has read/write permissions to the hgcentral  database only:

  ON hgcentral.*
  TO ucsc_readwrite@localhost
  IDENTIFIED BY 'readwrite_password';

Note: You should replace the passwords listed here with something sensible!

The installation README (path/to/installation/readme) suggests granting FILE to the admin user, but FILE can only be granted globally (i.e. GRANT FILE ON *.*) , so we must do this as follows:

GRANT FILE ON *.* TO ucsc_admin@localhost;

We now have the code for the browser and a database engine ready to receive some data. As an example, getting a human genome assembly installed requires the following:

• Create the primary gateway database for the browser:

mysql -u ucsc_admin -p  -e "CREATE DATABASE hgcentral"
mysql -u ucsc_admin -p hgcentral < hgcentral.sql

• Create the main configuration file for the browser hg.conf and save it in /var/www/cgi-bin:

cp kent/src/product/ex.hg.conf /var/www/cgi-bin/hg.conf

Then edit /var/www/cgi-bin/hg.conf to reflect the specifics of your installation.

• Admin users should also maintain a copy of this file saved as ~/.hg.conf, since the data loader applications look in your home directory for hg.conf. It is a good idea for ~/.hg.conf to be made private (i.e. only the owner can access it) otherwise your database admin password will be accessible by other users:

cp /var/www/cgi-bin/hg.conf ~/.hg.conf
chmod 600 ~/.hg.conf

When we issue commands to load custom data (see part 3) it is this copy of hg.conf that will supply the necessary database passwords.

The browser is now installed and functional, but will generate errors because the databases specified in the hgcentral database are not there yet. The gateway page needs a minimum human database in order to function even if the browser is being built for the display of other genomes.

To install a human genome database, the minimal set of mySQL tables required within the hg19 database is:
• grp
• trackDb
• hgFindSpec
• chromInfo
• gold – for performance this table is split by chromosome so we need chr*_gold*
• gap – split by chromosome as with gold so we need chr*_gap*

To install minimal hg19:

  ON hg19.*
  TO ucsc_admin@localhost
  IDENTIFIED BY 'admin_password';

  ON hg19.*
  TO ucsc_browser@localhost
  IDENTIFIED BY 'browser_password';

mysql -u ucsc_admin -p -e "CREATE DATABASE hg19"
cd /var/lib/mysql/hg19
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/grp* .
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/trackDb* .
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/hgFindSpec* .
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/chromInfo* .
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/chr*_gold* .
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/chr*_gap* .

The DNA sequence can be downloaded thus:

cd /gbdb
mkdir -p hg19/nib
cd hg19/nib
rsync -avP rsync://*.nib .

Again we will need to tell SELinux to allow the webserver to access these files:

semanage fcontext -a -t httpd_sys_content_t "/gbdb"

Create hgFixed database:

  ON hgFixed.*
  TO ucsc_admin@localhost
  IDENTIFIED BY 'admin_password';

  ON hgFixed.*
  TO ucsc_browser@localhost
  IDENTIFIED BY 'browser_password';

mysql -u ucsc_admin -p -e "create database hgFixed"

The browser now functions properly and we can browse the hg19 assembly.

The 3rd part of this blog post looks at loading custom data, for this we will use some Drosophila melanogaster data taken from the modENCODE project. Therefore we will need repeat the steps we used to mirror the hg18 assembly to produce a local copy of the dm3 database. Alternatively, if you have sufficient disk space on your virtual machine you can grab all the D.melanogaster data like so:

mysql -u ucsc_admin -p -e "create database dm3"
cd /var/lib/mysql/dm3
rsync -avP rsync://* .
cd gbdb
mkdir dm3
cd dm3
rsync -avP rsync://* .

At present these commands will download approximately 30Gb.

[This tutorial continues in Part 2: The UCSC genome browser database structure]

Tutorial: Using the UCSC Genome Browser and Galaxy to study regulatory sequence evolution in Drosophila.

One of the most enjoyable parts of teaching genomics and bioinformatics introducing people to the UCSC Genome Browser and Galaxy systems.  Both systems are powerful, intuitive, reliable and user-friendly services, and lend themselves easily to student practicals, as the good folks at Open Helix have amply demonstrated. Over the last few years, I’ve developed a fairly reliable advanced undergraduate/early graduate teaching practical that uses the UCSC Genome Browser and Galaxy to  study regulatory sequence evolution in Drosophila, based on data I curated a few years back into the Drosophila DNAse I footprint database.  As part of opening up the resources from my group, I thought I would post this tutorial with the hope that someone else can use it for teaching purposes or their own edification. The UCSC half can be done pretty reliably in a 50 minute session, and the Galaxy part is much more variable – some people take 20 min or less and others up to an hour.  Feedback and comments are most welcome.



  • Become familiar with the UCSC Genome Browser.
  • Become familiar with the UCSC Table Browser.
  • Become familiar with Galaxy Analysis Tools and Histories.
  • Study the conservation of transcription factor binding sites in Drosophila.


This lab is an exercise in performing a comparative genomic analysis of transcription factor binding sites (TFBSs) in Drosophila. You will use the UCSC Genome Browser, Table Browser and Galaxy to identify TFBSs that are conserved across multiple Drosophila species. Highly conserved TFBSs are likely to play important roles in Drosophila development. TFBSs that are not conserved may represent regulatory sequences that have contributed to developmental evolution across Drosophila species, or those that simply have been lost as part of the process of TFBSs turnover in a conserved regulatory element. The skill you learn in this lab will be generally applicable to many questions for a wide range of species with genome sequences in the UCSC Genome Database.

Finding the even-skipped region at the UCSC Genome Browser

1) Go to the UCSC Genome Bioinformatics site at

2) Select the “Genome Browser” option from the light blue panel on the left hand side of the page.

3) From the Gateway page, select “Insect”, “D. melanogaster” and “Apr 2004” from the “clade”, “genome” and “assembly” pull-down menus. Take a minute to read the text at the bottom of the page “About the D. melanogaster Apr. 2004 (dm2) assembly”. [Note: it is essential that you make sure you have selected the correct genome assembly at this step.]

4) Enter the gene “eve” (an abbreviated name for the gene even-skipped) into the “position or search term” box and click the “submit” button.

5) From the search results page, select the top link under the “FlyBase Protein-Coding Genes” header taking you to “eve at chr2R:5491054-5492592”. This will take you to a graphical view of the even-skipped gene with boxes indication exons and lines indication introns. (Note: the thick part of the boxes denote the parts of exons are translated into protein, the thinner parts are UTRs)

Customising the UCSC Genome Browser to display transcription factor binding sites in the even-skipped region

1) The Genome Browser page displays information about a selected genomic region and contains several components. They are, from top to bottom: a list of links to complementary tools and resources on the dark blue background; buttons for navigating left and right and zooming in and out on the chromosome; search boxes for jumping to new regions of the genome; a pictoral representation of the region shown on the chromosome; the main display window, comprising several tracks showing different types of information about the genomic region on display; a set of buttons to modify the main display; and several rows of pull-down menus, each controlling the display status of an individual track in the main display window.

2) Click the “hide all” button in the second panel of display controls to remove all tracks from the browser. Then select the “full” option from the drop-down FlyBase Genes menu under “Genes and Gene Prediction Tracks” and click refresh.

3) In the main display window, the top feature in light blue is the 2-exon gene eve. Click on one of the blue exons of the eve gene. This will send you to a detailed page about the eve gene. Scroll down to see the data linked to eve, including information imported from external sources like FlyBase, links to other resources like “Orthologous Genes in Other Species” also at the UCSC genome browser, and links to other resources outside the UCSC Database such as “Protein Domain and Structure Information” at the InterPro database. [Note: you can minimise subsections of this page by clicking the “-” button next to each major heading.]

3) Click the white “Genome Browser” link on the top blue banner to return to the main browser page. Scroll down the page and read the names of the various tracks available for this genome sequence. Sets of tracks are organised into logical groups (e.g. “Genes and Gene Prediction Tracks”). To find out more information about the data in any track, you can click on the blue title above each track’s pulldown menu, which leads to a detail page explaining the track and additional track-specific display controls. [Note: the same detail page can be displayed by clicking on the blue/grey rectangle at the very left of the track display in the main display window.]

4) Click on the blue title for the “FlyReg” track under the “Expression and regulation” heading. Take a minute to read about the FlyReg track and where the data comes from. Set the display mode pull-down menu to “pack” and then click “refresh”. [Note: visit for more information.]

5) Click the “zoom out 10x” button in the top right corner of the page to expand your window to display ten times the current sequence length. Each annotated regulatory element you are seeing in the FlyReg track is an experimentally-validated transcription factor binding site (TFBS) that regulates eve.

6) Click directly on one of the brown TFBSs features in the FlyReg track in the 5′ intergenic region upstream of the eve gene. As with all data in different tracks, clicking on a feature will send you to a detail page about the feature, in this case the TFBS. In the detail page, click on the “Position” link and this will return you to the main Genome Browser window window showing just the coordinates of the TFBS you just selected.

Investigating the conservation of individual TFBSs using the UCSC Genome Browser

1) Select the “full” option from the Conservation drop-down menu under “Comparative genomics” and click refresh.

2) The browser should now be displaying exactly one TFBS and conservation of this TFBS using the “12 Flies, Mosquito, Honeybee, Beetle Multiz Alignments & phastCons Scores” track. Is this TFBS conserved across the Drosophila species? If so, which ones? Is this TFBS conserved in Anopheles gambiae, Tribolium castaneum or Apis mellifera?

3) Click the “zoom out 10x” button twice and select a different TFBS from the FlyReg Track, click on the “Position” link in the detail page and evaluate if this TFBS is conserved and in which species. Now repeat for every TFBS in the genome (only joking!).

4) To see the general correspondence between TFBS and highly conserved sequences, set the pull-down menu to “pack” for the “Most Conserved” Track. Zoom out 10x so you are displaying a few hundred bp. How well do the most highly conserved sequences correspond the TFBSs? Are TFBSs typically longer or shorter than a highly conserved region? Zoom out 10x so you are displaying a ~1 Kbp. Are there more TFBSs than conserved sequences or vice versa? Why might this be the case?

Investigating the conservation of all known TFBSs using the UCSC Table Browser

1) Click the “Tools” menu on the dark blue background at the top of the Genome Browser window and select “Table Browser”. This will send you to an alternative interface to access data in the UCSC Genome Database called the “Table Browser.” The pull-down menus you see here correspond to the same tracks you saw in the Genome Browser.

2) Select “Insect”, “D. melanogaster” and “Apr 2004” from the “clade”, “genome” and “assembly” pull-down menus.

3) Select “Expression and Regulation” and “FlyReg” from the “group” and “track” pull-down menus, respectively.

4) Click the radio button (the circle with a dot indicating which option is selected) next to “genome” as the “region” you will analyse. This will select the whole genome for inclusion in your analysis.

5) Select “Hyperlinks to Genome Browser” from the “output format” pull-down menu and click “get output”.  This will send you to a page with >1,300 hyperlinks that send you to all the annotated TFBS in Drosophila, each of which corresponds to one row in the FlyReg data track. [Note: this is a general method to export data from any track the whole genome or a specific region.]

6) Click the “Tables” link on the dark blue background at the top of the page to return to the Table Browser. We are now going to use the Table Browser to ask “how many TFBS in Drosophila are found in highly conserved sequences?” We will do this by using the Table Browser to overlap all of the TFBS with all of the Most Conserved segments of the genome.

7) Click the “create” button next to the “intersection” option. This will send you to a page where you can set conditions for the overlap analysis of the FlyReg TFBS with the Most Conserved regions.

8 ) Select “Comparative Genomics” and “Most Conserved” from the “group” and “track” pull-down menus. Click the “All FlyReg records that have at least X% overlap with Most Conserved”. Set the X% value to “100” and click “submit” to return to the main Table Browser page.

9) Notice that the Table Browser now shows the “intersection with phastConsElements15way” option is selected. Select “Hyperlinks to Genome Browser” from the “output format” pull-down menu and click “get output”.  This will send you to a page with hyperlinks that send you to all the annotated TFBS in Drosophila that are 100% contained in the Most Conserved regions of the genome. Click on a few links to convince yourself that this analysis has generated the correct results. At this point you have a great result – all fully conserved TFBS in Drosophila – but it is difficult to quantitatively summarize these data in the Table Browser, so let’s move on to Galaxy where analysis like this is a lot easier.

Quantifying TFBS conservation using Galaxy

1) Click the “Tables” link on the dark blue background at the top of the page to return to the Table Browser. Select “BED – browser extensible data” from the “output format” pull-down menu and check the “Galaxy” check box. Now click the “get output” button at the bottom of the page. This will send you to a detail page. Leave all the default options set here and click “Send query to Galaxy”.

2) This will launch a new Galaxy session as an anonymous user. [For more information on Galaxy, visit] The results of this Table Browser query will be executed as a Galaxy “analysis” which will appear in the right hand “History” pane of the Galaxy browser window as “1: UCSC Main on D. melanogaster: flyreg2 (genome)”. While the Table Browser query is running, this analysis will be grey in the history pane, but will turn green when it is completed.

3) When the analysis has completed, you will have created a new dataset numbered “1” in your Galaxy history. Click on the link entitled “1: UCSC Main on D. melanogaster: flyreg2 (genome)” in the history to reveal a dropdown which contains a snapshot of this dataset. [This dataset should contain 533 regions. If it doesn’t stop here and get help before moving on.] Now click on the eye icon to reveal the contents of this data set in the middle pane of the Galaxy window. Use the scroll bar on the right hand side of the middle pane to browse the contents of this data set.

4) Click on the pencil icon to “Edit attributes” of this data set. In the middle pane replace “1: UCSC Main on D. melanogaster: flyreg2 (genome)” in the “Name” text box with something shorter and more descriptive like “conserved TFBS”. Click the “Save” button at the bottom of the middle pane.

5) Now let’s get the complete set of FlyReg TFBS by querying the UCSC Table Browser from inside Galaxy. Click “Get Data” under “Tools” in the left hand pane of the Galaxy page. This will explode a list of options to get data into Galaxy. Click “UCSC Main” which will bring up the Table Browser inside the middle pane of the Galaxy page. Click the “clear” button next to “intersection with phastConsElements15way:”. Make sure the “Galaxy” check box is selected and click “get output” and then click “Send query to Galaxy” on the next page. This will create a new dataset “2: UCSC Main on D. melanogaster: flyreg2 (genome)” which you can rename “all TFBS” as above using the pencil icon. [Note: this dataset should have 1,362 regions in it; if your does not, please stop and ask for help.]

6) You can now perform many different analyses on these datasets using the many “Tools” available in the left hand pane of the Galaxy window. Let’s start by summarizing how many TFBS are present in the “1: conserved TFBS” and “2: all TFBS” datasets. To do this, click the “Statistics” link on the left hand side, which will open up a set of other analysis tools including the “Count” tool. Click on the “Count” tool, and in the middle pane select “1: conserved TFBS” in the “from dataset” menu and click on “c4” to activate the counting to occur on column 4 containing the name of the TFBS in the “1: conserved TFBS” dataset. Repeat this analysis for the “2: all TFBS” dataset. This should lead to two more datasets of 70 and 88 lines, respectively, which you should again rename to something more meaningful than the default values, such as “3: conserved TFBS counts” and 4: all TFBS counts”.

7) Now let’s use the “Join, Subtract and Group->Join two Datasets” tool to join the results of the two previous analyses into one merged dataset. Click “Join, Subtract and Group->Join two Datasets”, select “4: all TFBS counts” in the “Join” drop-down menu, “c2” for the “using column” drop-down menu, “3: conserved TFBS counts” in the “with” drop-down menu and “c2” for the “and column” drop-down menu. Set the “Keep lines of first input that do not join with second input:”, “Keep lines of first input that are incomplete:”, and “Fill empty columns:” drop-down menus to “yes”. Setting the ” Fill empty columns” menu to yes will pull up a few more menus, which should be set as follows: “Only fill unjoined rows:” set to “Yes”; “Fill Columns by:” set to “single Fill Value”, and “Fill value” to “0”. Now click “Execute”. What you have just achieved is one of the trickier basic operations on bioinformatics, and is the underlying process in most relational database queries.  Pat yourself on the back!

8 ) Now let’s try to do some science and ask the question: “what is the proportion of conserved TFBS for each transcription factor?” This will give us some insight into whether TFBS turnover is the same for all TFs or might be higher or lower for some TFs. To do this, use the “Text manipulation->compute” Tool and set the “Add expression” box to “1-((c1-c3)/c1)” and “as a new column to:” to “5: Join two Datasets on data 3 and data 4” and click “Execute”. This will add a new column to the joined dataset with the proportion of conserved TFBS for each transcription factor.

For Further Exploration…

1) Format the output of your last analysis in a more meaningful manner using the “Filter and Sort->Sort” tool.

2) Plot the distribution of the proportion of conserved TFBS per TF using the “Graph/Display Data->Histogram” tool.

3) Go back to the original TFBS datasets and derive new datasets to investigate if different chromosomes have different rates of TFBS evolution?

4) Develop your own question about TFBS evolution, create a custom analysis pipeline in Galaxy and wow us with your findings.

Mapping between SRA IDs and accessing metadata with the Bioconductor SRAdb package

Posted 12 Oct 2011 — by caseybergman
Category drosophila, high throughput sequencing, population genomics, R

NCBI’s Sequence Read Archive (SRA) is one of the primary repositories for next-generation sequencing data.  Submissions to SRA are often complex, with multiple samples from a given submission, or multiple sequencing runs associated with a given sample. As a result, a given SRA submission “envelope” can be associated with database records at several levels, each with corresponding identifiers and nomenclature. Valid levels in the SRA data model include: the study (prefixed with SRP), the experiment (prefixed with SRX), the sample (prefixed with SRS), and the run (prefixed with SRR). Mapping identifiers between the different parts of an SRA submission is important when you need to access meta-data from different parts of a submission or you need to aggregate data across different levels of a submission (e.g. sequencing runs from the same biological sample).

In the past, we’ve been guilty of using PERL to screen-scrape meta-data from the SRA html pages to aggregate accessions from different sequencing runs conducted by the Drosophila Genetic Reference Panel (DGRP) project. However, this is a non-optimal strategy because of changes to the SRA html format that make this approach brittle. Recently, I stumbled across a much better way to map IDs and access meta-data in SRA, using the SRAdb package from Bioconductor. The developers of SRAdb provide a nicely repackaged version of all SRA meta-data in SQLlite that can easily be accessed using a few lines of R.

The following example shows how to map IDs at all levels for SRA accessions from the two major Drosophila resequencing projects the DGRP (SRA study SRP000694) and the Drosophila Population Genomics Project (SRA studies SRP000224 & SRP005599).

# Script to generate SRA metadata for Drosophila #
# melaogaster population genomics projects       #
# Casey M. Bergman (University of Manchester)    #

#install the SRAdb bioconductor package
#download and connect to the SRA SQLlite database
sqlfile <- getSRAdbFile()
sra_connection <- dbConnect(SQLite(), "SRAmetadb.sqlite")
# make linking table between SRP, SRA, SRS, SRX & SRR for DPGP & DGRP projects
conversion <- sraConvert(c("SRP000694","SRP000224", "SRP005599"), sra_con = sra_connection)
# open outfile & print header
outFile <- file("DrosophilaPopulationGenomicData.tsv", "w")
cat("Project\tAccession\tSample\tExperiment\tRun\tSampleAlias\tSampleDescription\n", file=outFile)
# loop through conversion table and look-up meta-data for each sample
for (i in 1:nrow(conversion)) {
sample_metadata_sql <- paste("select sample_alias, description from sample where sample_accession like '", conversion&#91;i,3&#93;, "'", sep="")
sample_metadata_results <- dbGetQuery(sra_connection, sample_metadata_sql)
cat(conversion&#91;i,1&#93;, conversion&#91;i,2&#93;, conversion&#91;i,3&#93;, conversion&#91;i,4&#93;, conversion&#91;i,5&#93;, sample_metadata_results&#91;1,1&#93;, sample_metadata_results&#91;1,2&#93;, "\n", file=outFile, sep = "\t")
#close filehandle

#clean up
system("rm SRAmetadb.sqlite")
&#91;/sourcecode&#93;<div class="wp-git-embed" style="margin-bottom:10px; border:1px solid #CCC; text-align:right; width:99%; margin-top:-13px; font-size:11px; font-style:italic;"><span style="display:inline-block; padding:4px;">getDrosPopGenData.R</span><a style="display:inline-block; padding:4px 6px;" href="" target="_blank">view raw</a><a style="display:inline-block; padding:4px 6px; float:left;" href="" target="_blank">view file on <strong>GitHub</strong></a></div>

This script can be downloaded and run as follows:

$ wget
$ R --no-save &gt; getDrosPopGenSraMetaData.R

For those interested in the actual output of this script please find the DrosophilaPopulationGenomicData.tsv file here.

Credits: Thanks to Pedro Olivares-Chauvet for R syntax help and Jack Zhu for a prompt bug fix in the SRAdb files.

Installing libsequence and analysis tools on OSX

After being relegated to the margins of biology for almost a century, the field of population genetics is now moving closer the the center of mainstream biology in the era of next-generation sequencing (NGS). In addition to providing genome-wide DNA variation data to study classical questions in evolution that are motivated by population genetic theory, NGS now permits population-genetic based techniques like GWAS (genome-wide association studies) and eQTL (expression quantitative trait locus) analysis to allow biologists to identify and map functional regions across the genome.

Handling big data requires industrial strength bioinformatics tool-kits, which are increasingly available for many aspects of genome bioinformatics (e.g. the Kent source tree, SAMtools, BEDtools, etc.). However, for molecular population genetics at the command line, researchers have a more limited palette that can be used for genome-scale data, including: VariScan, the PERL PopGen modules, and the libsequence C++ library.

Motivated by a recent question on BioStar and a request from a colleague to generate a table of polymorphic sites across a bacterial genome, I’ve recently had a play with the last of these — Kevin Thornton’s libsequence and accompanying analysis toolkit — which includes a utility called compute which is a “mini-DNAsp for the Unix command-line.” As ever, getting this installed on OSX was more of a challenge than desired, with a couple of dependencies (including Boost and GSL), but in the end was do-able as follows:

$ wget
$ tar -xvzf boost_1_47_0.tar.gz
$ cd boost_1_47_0
$ ./
$ sudo ./b2 install
$ cd ..

$ wget
$ tar -xvzf libsequence-1.7.3.tar.gz
$ cd libsequence-1.7.3
$ ./configure
$ make
$ sudo make install
$ cd ..

$ wget
$ tar -xvzf gsl-1.15.tar.gz
$ cd gsl-1.15
$ ./configure --disable-shared --disable-dependency-tracking
$ make
$ sudo make install
$ cd ..

$ wget
$ tar -xvzf analysis-0.8.0.tar.gz
$ cd analysis-0.8.0
$ ./configure
$ make
$ sudo make install
$ cd ..

With this recipe, compute and friends should now be happily installed in /usr/local/bin/.

Notes: This protocol was developed on a MacBook Air Intel Core 2 Duo running OSX 10.6.8.

Enhanced by Zemanta

Running Taverna workflows in Galaxy on OSX

Posted 28 Jul 2011 — by caseybergman
Category galaxy, OSX hacks, taverna

Recently I’ve been bitten by the Galaxy bug, primarily because I needed a mechanism this year to supervise final year undergraduate projects of students without a strong background in bioinformatics. This was a great success, since students seem to pick up the interface really easily and I was able to track and comment on their progress explicitly via shared histories and workflows.

Because of this experience, I’ve become much more interested in using workflow systems  to run and manage my bioinformatics pipelines in my research projects rather than relying on READMEs and UNIX shell scripts. Recent news that Kostas Karasavvas from NBIC has developed eGalaxy, a mechanisms to run Taverna 2 workflows using Galaxy is in my view a game-changer for the more widespread use of workflows by practicing bioinformaticians like myself, since it will permit mash-ups between the two main workflow systems and deployment of the large pre-established library of Taverna workflows in myExperiment to be used in a local Galaxy installation.

The easiest way of getting a Taverna workflow running in Galaxy is to search myExperiment for Taverna 2 workflows, and click the “Download Workflow as a Galaxy tool” button in the “Download” section of the page. This will send you to a “Galaxy tool download” page with instruction on how to get the Taverna workflow installed as a tool in Galaxy. The instructions are a bit spare at the moment and require familiarity with installing Galaxy locally and adding tools to a local Galaxy installation. They also only have have installation notes for Debian-based systems, but with the help of Rob Haines from the Taverna team, I’ve been able to get a stable protocol working for OSX as well.

To give a bit more context, Taverna workflows are run in Galaxy is as Ruby scripts that are added to your Galaxy tools directory like any other custom tool.  Executing the Ruby script tool launches a connection to a remote Taverna 2 server, where the workflow is run. Results are then returned back to the Ruby script and thence to Galaxy. Like all Galaxy tools, installing a Taverna tools requires the tool itself (a script or other executable program) and a description of the tools’ inputs/outputs in XML format to be placed in the “tools” directory, plus a notification to Galaxy that the tool exists in the tool_conf.xml file in Galaxy main directory.

The Ruby script generated by myExperiment requires a few ruby packages (aka “gems”) that are installed by the RubyGems. Both Ruby and RubyGems are installed by default on OSX (in /usr/bin) so your kit is nearly complete. The following steps should allow you to run a test Taverna workflow to make sure your configuration is working properly on a OSX 10.6 machine. To help consolidate install notes for the entire process in one place, I’m copying the key steps for a local Galaxy installation here as well.

1) Install Mercurial version control system for OSX from here, and add make sure /usr/local/bin/ is in your path.

2) Checkout the Galaxy codebase using Mercurial in your home directory ($HOME):

$ hg clone

3) Create a Taverna tools directory in your Galaxy distribution:

$ mkdir $HOME/galaxy-dist/tools/tavernaTools

4) Install the RubyGems needed for the Taverna tool to run. The critical gems to install are t2-server (which is needed to connect to the taverna server that runs the workflow) and rubyzip (which is needed for compression of Galaxy results). Installation of t2-server will automatically install the libxml-ruby and hirb gems it is dependent on. libxml-ruby calls on the the libxml2 C XML parser, which is also installed by default on OSX in /usr/include/libxml2/

$ sudo gem install t2-server
$ sudo gem install rubyzip

5) Select a Taverna 2 workflow from myExperiment and download Ruby script and XML file. For testing, use a workflow that does not require any input files, e.g.

6) Paste “” into the “Taverna server URL:” textbox.

7) Click the “Download Galaxy tool” button, e.g. to your Downloads folder.

8) Unzip the Taverna 2 Galaxy tool and move the Ruby script and XML file into your Taverna tools directory, e.g.

$ unzip $HOME/Downloads/
$ mv $HOME/Downloads/fetch_pdb_flatfile_from_rcsb_server_58764_galaxy_tool.xml $HOME/galaxy-dist/tools/tavernaTools
$ mv $HOME/Downloads/fetch_pdb_flatfile_from_rcsb_server_58764_galaxy_tool.rb $HOME/galaxy-dist/tools/tavernaTools

9) Edit your tool_conf.xml file to include a new section for Taverna tools, e.g.

 <section name="Taverna Tools" id="tavernaTools">
    <tool file="tavernaTools/fetch_pdb_flatfile_from_rcsb_server_58764_galaxy_tool.xml" />

10) Start the Galaxy server by running the script:

$ sh $HOME/galaxy-dist/

11) Open in your web browser and you should see a “Taverna Tools” tool heading above the “Get Data” Tools heading, which when clicked on reveals the newly installed “Fetch PDB flatfile from RCSB server” tool. You can now run this Taverna tool like any other analysis tool in Galaxy. For this particular tool, this involves inputting a PDB id and clicking “execute”. Successful completion of the job should return a PDB file in the main Galaxy window, e.g.

In early stages testing this protocol, I selected a different Taverna workflow (916) that did not run out of the box and gave less-than-helpful error messages in Galaxy. Trouble shooting with Rob Haines pinpointed this problem to the Taverna server not being able to execute aspects of this workflow.  When tested on an alternate Taverna server, the workflow did run and completed with expected results. So if you experience a failed attempt at running a Taverna workflow on Galaxy it may not have anything to do with your kit or the workflow in question.

From this initial experience, looking forward (from mid-2011) I’d like to see the eGalaxy system include a mechanism to generate tests for each Taverna tool automatically, either at the time of tool download from myExperiment or as a part of the Galaxy testing system. I’d also like to see a industrial-scale Taverna sever hosted somewhere (preferably by the Galaxy or Taverna teams) so all Taverna tools can be used reliably out of the box on at least one tested server. In any event, I’m now convinced that the eGalaxy project is what it says on the tin, and can only improve with more folks trying it out and contributing feedback.

Notes: This protocol was developed on a MacBook Air Intel Core 2 Duo running OSX 10.6.8. Credits to Rob Haines for help trouble shooting andgiving me a detailed walk through on the mechanics of the Taverna-Galaxy integration as well as to Kostas Karasavvas of NBIC for having the inspiration to intiate the eGalaxy project.