Archive for the ‘genome bioinformatics’ Category

Customizing a Local UCSC Genome Browser Installation II: The UCSC Genome Browser Database Structure

Posted 29 Mar 2012 — by timburgis
Category database, drosophila, genome bioinformatics, linux, UCSC genome browser

This is the second of three posts on how to set up a local mirror of the UCSC genome browser hosting custom data. In the previous post, I covered how to perform a basic installation of the UCSC genome browser. In this post, I'll provide an overview of the MySQL database structure that serves data to the genome browser, using the Bergman Lab mirror (http://genome.smith.man.ac.uk/) as an example. In the next post, I focus on how to load custom data into a local mirror.

General databases

hgcentral – this is the central control database of the browser. It holds information relating to each assembly available on the server, organizing them into groups according to their clade and organism. For instance, the mammalian clade may contain human and mouse assemblies, and for mouse we may have multiple assemblies (e.g. mm9, mm8, etc.).

The contents of the clade, genomeClade, dbDb and defaultDb tables control the three dropdown menus on the gateway page of the browser. The clade and genomeClade tables have a priority field that controls the order of the dropdown menus (see figure 1).
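
For example, you can inspect this ordering directly. The column names below follow the standard hgcentral schema; the exact contents will depend on your own mirror:

> select name, label, priority from clade order by priority;
> select genome, clade, priority from genomeClade order by priority;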

The defaultDb table specifies which assembly is the default for a particular organism. This is usually the latest assembly e.g. mm9 for mouse. You can alter this if, for whatever reason, you want to use a different assembly as your default:

UPDATE defaultDb SET name = 'mm8' WHERE genome = 'mouse';

Databases for each organism/assembly are listed in the dbDb table within hgcentral (see more below). For example, the latest mouse assembly (mm9) has an entry that contains information relating to this build. An example query (not retrieving all the information contained within dbDb) looks like this:

> select name,description,organism,sourceName,taxId from dbDb where name = 'mm9';
+------+-------------+----------+---------------+-------+
| name | description | organism | sourceName    | taxId |
+------+-------------+----------+---------------+-------+
| mm9  | July 2007   | Mouse    | NCBI Build 37 | 10090 |
+------+-------------+----------+---------------+-------+

hgFixed – this database contains a collection of human-related information, predominantly microarray data. If you are running a human mirror then you may wish to populate this database; if not, it can be left empty.

Genome databases

We will now look at the structure of a genome database for a specific organism/assembly using mm9 as an example.

The first table to consider is chromInfo. This simple table specifies the name of each chromosome, its size and the location of its sequence file. Mouse has 19 autosomes (named chr1–chr19) plus the two sex chromosomes (chrX and chrY), mitochondrial DNA (chrM) and unmapped clone contigs (chrUn). If we list the tables within our minimal mm9 install, we will see that most of the tables can clearly be identified as belonging to a particular chromosome by virtue of their names, e.g. all the tables with names starting chr10_* contain data relating to chromosome 10.
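
For example, to see where the browser expects to find a chromosome's sequence (chrom, size and fileName are the standard chromInfo columns; fileName points at the .nib sequence file under /gbdb):

> select chrom, size, fileName from chromInfo where chrom = 'chr1';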

Of the remainder, the two tables of most interest for understanding how the browser works, and leading us towards the ability to load custom data, are the grp and trackDb tables. These two tables control the actual genome browser display, just as the hgcentral tables discussed above control the gateway web interface. Not surprisingly, the trackDb table holds information about all the data tracks you can see (see figure 2), and grp tells the browser how all these tracks should be organized (e.g. we may want a separate group for different types of experimental data, another for predictions made using some computational technique, etc.).

Figure 2 shows a screenshot of the mm9 browser we developed in the Bergman Lab as part of the CISSTEM project.

Here, we can see that the tracks are grouped into particular types (Mapping and Sequencing Tracks, Genes and Gene Prediction Tracks, etc.) – these are specified in mm9's grp table and, like the dropdowns on the gateway page, ordered using a priority field:

> select * from grp order by priority asc;
+-------------+----------------------------------+----------+
| name        | label                            | priority |
+-------------+----------------------------------+----------+
| user        | Custom Tracks                    |        1 |
| map         | Mapping and Sequencing Tracks    |        2 |
| genes       | Genes and Gene Prediction Tracks |        3 |
| rna         | mRNA and EST Tracks              |        4 |
| phenoAllele | Phenotype and Allele             |      4.5 |
| regulation  | Expression and Regulation        |        5 |
| compGeno    | Comparative Genomics             |        6 |
| varRep      | Variation and Repeats            |        7 |
| x           | Experimental Tracks              |       10 |
+-------------+----------------------------------+----------+

The tracks that constitute these groups are specified in the trackDb table; we can see the tracks for a particular group like so:

> select longLabel from trackDb where grp = 'map' order by priority asc;
+-----------------------------------------------------------------------------+
| longLabel                                                                   |
+-----------------------------------------------------------------------------+
| Chromosome Bands Based On ISCN Lengths (for Ideogram)                       |
| Chromosome Bands Based On ISCN Lengths                                      |
| Mapability - CRG GEM Alignability of 36mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 40mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 50mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 75mers with no more than 2 mismatches  |
| STS Markers on Genetic and Radiation Hybrid Maps                            |
| Mapability - CRG GEM Alignability of 100mers with no more than 2 mismatches |
| Physical Map Contigs                                                        |
| Assembly from Fragments                                                     |
| Gap Locations                                                               |
| BAC End Pairs                                                               |
| Quantitative Trait Loci From Jackson Laboratory / Mouse Genome Informatics  |
| GC Percent in 5-Base Windows                                                |
| Mapability or Uniqueness of Reference Genome                                |
+-----------------------------------------------------------------------------+

These two tables provide the information the browser needs to know which groups to display, which tracks belong to a particular group, and in which order they should be displayed. The remaining fields in the trackDb table fill in the gaps: what sort of data the track contains and how the browser is to display it. I won't say any more about trackDb here – we will naturally cover the details of how trackDb functions when we talk about loading custom data in the next post.

You will probably have noticed that there is not an exact equivalence between the query above and the tracks in the browser (e.g. the query returns a number of Mapability tracks whereas there is only a single Mapability dropdown in the browser). This is because some tracks are composite tracks: if you click Mapability in the browser you will see that the multiple tracks are contained within it. Composite tracks are specified in trackDb; we will investigate this in the next part of this blog post.
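
As a rough illustration (the exact settings text varies between tracks and browser versions), composite parents can usually be spotted in trackDb via the settings field:

> select tableName, shortLabel from trackDb where settings like '%compositeTrack%';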

[This tutorial continues in Part III: Loading Custom Data]

Customizing a Local UCSC Genome Browser Installation I: Introduction and CentOS Install

Posted 29 Mar 2012 — by timburgis
Category database, drosophila, genome bioinformatics, linux, UCSC genome browser

The UCSC Genome Bioinformatics site is widely regarded as one of the most powerful bioinformatics portals for experimental and computational biologists on the web. While the UCSC Genome Browser provides excellent functionality for loading and visualizing custom data tracks, there are cases when your lab may have genome-wide data it wishes to share internally or with the wider scientific community, or when you might want to use the Genome Browser for an assembly that is not in the main or microbial Genome Browser sites. If so then you might need to install a local copy of the genome browser.

A blog post by noborujs on E-notações explains how to install the UCSC genome browser on an Ubuntu system, and the UCSC genome browser team provides a general walk-through on their wiki and an internal presentation. Here I will provide a similar walkthrough for installing it on a CentOS system. The focus of these posts is to go beyond the basic installation to explain the structure of the databases that underlie the browser, how these database tables are used to create the web front-end, and how to customize a local installation with your own data. Hopefully, having a clearer understanding of the database and browser architecture will make the process of loading your own data far easier.

This blog entry has grown to quite a size, so I’ve decided to split it into 3 more manageable parts and you can find a link to the next part at the end of this post.

The 3 parts are as follows:

1) Introduction and CentOS install (this post)
2) The databases & web front-end
3) Loading custom data

The walk-through presented here installs CentOS 5.7 using VirtualBox, so you can follow this on your desktop if you have sufficient disk space. Like the Ubuntu walk-through linked to above, this will speed you through the install process with little or no reference to what the databases actually contain or how they relate to the browser; this information is covered in part 2.

CentOS Installation

• Grab the CentOS 5.7 i386 CD ISOs from one of the mirrors. Discs 1-5 are required for this install.
• Start VirtualBox and select ‘New’. Settings chosen:
• Name = CentOS 5.7
• OS = linux
• Version = Red Hat
• Virtual memory of 1GB
• Create a new hard disk with fixed virtual storage of 64GB
• Start the new virtual machine and select the first CentOS iso file as the installation source
• Choose ‘Server GUI’ as the installation type when prompted
• Set the hostname, time etc. and make sure you enable http when asked about services
• When the install prompts you for a different CD, select “Devices -> CD/DVD devices -> Choose a virtual CD/DVD disk file” and choose the relevant ISO file.
• Login as root, run Package Updater, reboot.

UCSC Browser Install

• Install dependencies:

yum install libpng-devel
yum install mysql-devel

• Set environment variables; the following were added to root's .bashrc:

export MACHTYPE=i386
export MYSQLLIBS="/usr/lib/mysql/libmysqlclient.a -lz"
export MYSQLINC=/usr/include/mysql
export WEBROOT=/var/www

Each time you open a new terminal these environment variables will be set automatically.

• Download the Kent source tree from http://hgdownload.cse.ucsc.edu/admin/jksrc.zip, unpack it and then build it:

wget http://hgdownload.cse.ucsc.edu/admin/jksrc.zip
unzip jksrc.zip
mkdir -p ~/bin/$MACHTYPE # only used when you build other utilities from the source tree
cd kent/src/lib
mkdir $MACHTYPE          # output directory for the compiled library
make                     # build the core libraries
cd ../jkOwnLib
make                     # build the BLAT-related libraries
cd ../hg/lib
make                     # build the genome browser libraries
cd ..
make install DESTDIR=$WEBROOT CGI_BIN=/cgi-bin DOCUMENTROOT=$WEBROOT/html # build and install the CGIs

You will now have all the cgi’s necessary to run the browser in /var/www/cgi-bin and some JavaScript and CSS in /var/www/html. We need to tell SELinux to allow access to these:

restorecon -R -v '/var/www'

• Grab the static web content. First edit kent/src/product/scripts/browserEnvironment.txt to reflect your environment (MACHTYPE, DOCUMENTROOT, etc.), then:

cd kent/src/product/scripts
./updateHtml.sh browserEnvironment.txt

• Create directories to hold temporary files for the browser:

mkdir $WEBROOT/trash
chmod 777 $WEBROOT/trash
ln -s $WEBROOT/trash $WEBROOT/html/trash

• Set up MySQL for the browser. We need an admin user with full access to the databases we'll be creating later, and a read-only user for the CGIs.

In MySQL:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP,
ALTER, CREATE TEMPORARY TABLES
  ON hgcentral.*
  TO ucsc_admin@localhost
  IDENTIFIED BY 'admin_password';

GRANT SELECT, CREATE TEMPORARY TABLES
  ON hgcentral.*
  TO ucsc_browser@localhost
  IDENTIFIED BY 'browser_password';

The above commands will need repeating for each of the databases that we subsequently create.

We also need a third user that has read/write permissions to the hgcentral database only:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, ALTER
  ON hgcentral.*
  TO ucsc_readwrite@localhost
  IDENTIFIED BY 'readwrite_password';

Note: You should replace the passwords listed here with something sensible!

The installation README (path/to/installation/readme) suggests granting FILE to the admin user, but FILE can only be granted globally (i.e. GRANT FILE ON *.*), so we must do this as follows:

GRANT FILE ON *.* TO ucsc_admin@localhost;

We now have the code for the browser and a database engine ready to receive some data. As an example, getting a human genome assembly installed requires the following:

• Create the primary gateway database for the browser:

mysql -u ucsc_admin -p  -e "CREATE DATABASE hgcentral"
wget http://hgdownload.cse.ucsc.edu/admin/hgcentral.sql
mysql -u ucsc_admin -p hgcentral < hgcentral.sql

• Create the main configuration file for the browser, hg.conf, and save it in /var/www/cgi-bin:

cp kent/src/product/ex.hg.conf /var/www/cgi-bin/hg.conf

Then edit /var/www/cgi-bin/hg.conf to reflect the specifics of your installation.
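
The key entries are the database connection settings. Here is a minimal sketch using the users created above; the field names follow the ex.hg.conf template and the values are placeholders for your own setup:

db.host=localhost
db.user=ucsc_browser
db.password=browser_password
central.db=hgcentral
central.host=localhost
central.user=ucsc_readwrite
central.password=readwrite_password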

• Admin users should also maintain a copy of this file saved as ~/.hg.conf, since the data loader applications look in your home directory for hg.conf. It is a good idea to make ~/.hg.conf private (i.e. only the owner can access it), otherwise your database admin password will be accessible to other users:

cp /var/www/cgi-bin/hg.conf ~/.hg.conf
chmod 600 ~/.hg.conf

When we issue commands to load custom data (see part 3) it is this copy of hg.conf that will supply the necessary database passwords.

The browser is now installed and functional, but will generate errors because the databases specified in the hgcentral database are not there yet. The gateway page needs a minimum human database in order to function even if the browser is being built for the display of other genomes.

To install a human genome database, the minimal set of MySQL tables required within the hg19 database is:
• grp
• trackDb
• hgFindSpec
• chromInfo
• gold – for performance this table is split by chromosome so we need chr*_gold*
• gap – split by chromosome as with gold so we need chr*_gap*

To install minimal hg19:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP,
ALTER, CREATE TEMPORARY TABLES
  ON hg19.*
  TO ucsc_admin@localhost
  IDENTIFIED BY 'admin_password';

GRANT SELECT, CREATE TEMPORARY TABLES
  ON hg19.*
  TO ucsc_browser@localhost
  IDENTIFIED BY 'browser_password';

mysql -u ucsc_admin -p -e "CREATE DATABASE hg19"
cd /var/lib/mysql/hg19
rsync -avP rsync://hgdownload.cse.ucsc.edu/mysql/hg19/grp* .
rsync -avP rsync://hgdownload.cse.ucsc.edu/mysql/hg19/trackDb* .
rsync -avP rsync://hgdownload.cse.ucsc.edu/mysql/hg19/hgFindSpec* .
rsync -avP rsync://hgdownload.cse.ucsc.edu/mysql/hg19/chromInfo* .
rsync -avP rsync://hgdownload.cse.ucsc.edu/mysql/hg19/chr*_gold* .
rsync -avP rsync://hgdownload.cse.ucsc.edu/mysql/hg19/chr*_gap* .

The DNA sequence can be downloaded thus:

cd /gbdb
mkdir -p hg19/nib
cd hg19/nib
rsync -avP rsync://hgdownload.cse.ucsc.edu/gbdb/hg19/nib/chr*.nib .

Again we will need to tell SELinux to allow the webserver to access these files:

semanage fcontext -a -t httpd_sys_content_t "/gbdb"

Create hgFixed database:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP,
ALTER, CREATE TEMPORARY TABLES
  ON hgFixed.*
  TO ucsc_admin@localhost
  IDENTIFIED BY 'admin_password';

GRANT SELECT, CREATE TEMPORARY TABLES
  ON hgFixed.*
  TO ucsc_browser@localhost
  IDENTIFIED BY 'browser_password';

mysql -u ucsc_admin -p -e "create database hgFixed"

The browser now functions properly and we can browse the hg19 assembly.

The 3rd part of this blog post looks at loading custom data; for this we will use some Drosophila melanogaster data taken from the modENCODE project. Therefore we will need to repeat the steps we used to mirror the hg19 assembly to produce a local copy of the dm3 database. Alternatively, if you have sufficient disk space on your virtual machine, you can grab all the D. melanogaster data like so:

mysql -u ucsc_admin -p -e "create database dm3"
cd /var/lib/mysql/dm3
rsync -avP rsync://hgdownload.cse.ucsc.edu/mysql/dm3/* .
cd /gbdb
mkdir dm3
cd dm3
rsync -avP rsync://hgdownload.cse.ucsc.edu/gbdb/dm3/* .

At present these commands will download approximately 30 GB.

[This tutorial continues in Part 2: The UCSC genome browser database structure]

Tutorial: Using the UCSC Genome Browser and Galaxy to study regulatory sequence evolution in Drosophila.

One of the most enjoyable parts of teaching genomics and bioinformatics is introducing people to the UCSC Genome Browser and Galaxy systems. Both systems are powerful, intuitive, reliable and user-friendly services, and lend themselves easily to student practicals, as the good folks at OpenHelix have amply demonstrated. Over the last few years, I've developed a fairly reliable advanced undergraduate/early graduate teaching practical that uses the UCSC Genome Browser and Galaxy to study regulatory sequence evolution in Drosophila, based on data I curated a few years back into the Drosophila DNase I footprint database. As part of opening up the resources from my group, I thought I would post this tutorial with the hope that someone else can use it for teaching purposes or their own edification. The UCSC half can be done pretty reliably in a 50-minute session; the Galaxy part is much more variable – some people take 20 min or less and others up to an hour. Feedback and comments are most welcome.

Enjoy!

Aims

  • Become familiar with the UCSC Genome Browser.
  • Become familiar with the UCSC Table Browser.
  • Become familiar with Galaxy Analysis Tools and Histories.
  • Study the conservation of transcription factor binding sites in Drosophila.

Introduction

This lab is an exercise in performing a comparative genomic analysis of transcription factor binding sites (TFBSs) in Drosophila. You will use the UCSC Genome Browser, Table Browser and Galaxy to identify TFBSs that are conserved across multiple Drosophila species. Highly conserved TFBSs are likely to play important roles in Drosophila development. TFBSs that are not conserved may represent regulatory sequences that have contributed to developmental evolution across Drosophila species, or ones that have simply been lost as part of the process of TFBS turnover in a conserved regulatory element. The skills you learn in this lab will be generally applicable to many questions for a wide range of species with genome sequences in the UCSC Genome Database.

Finding the even-skipped region at the UCSC Genome Browser

1) Go to the UCSC Genome Bioinformatics site at http://genome.ucsc.edu.

2) Select the “Genome Browser” option from the light blue panel on the left hand side of the page.

3) From the Gateway page, select “Insect”, “D. melanogaster” and “Apr 2004” from the “clade”, “genome” and “assembly” pull-down menus. Take a minute to read the text at the bottom of the page “About the D. melanogaster Apr. 2004 (dm2) assembly”. [Note: it is essential that you make sure you have selected the correct genome assembly at this step.]

4) Enter the gene “eve” (an abbreviated name for the gene even-skipped) into the “position or search term” box and click the “submit” button.

5) From the search results page, select the top link under the "FlyBase Protein-Coding Genes" header, taking you to "eve at chr2R:5491054-5492592". This will take you to a graphical view of the even-skipped gene, with boxes indicating exons and lines indicating introns. (Note: the thick parts of the boxes denote the parts of exons that are translated into protein; the thinner parts are UTRs.)

Customising the UCSC Genome Browser to display transcription factor binding sites in the even-skipped region

1) The Genome Browser page displays information about a selected genomic region and contains several components. They are, from top to bottom: a list of links to complementary tools and resources on the dark blue background; buttons for navigating left and right and zooming in and out on the chromosome; search boxes for jumping to new regions of the genome; a pictorial representation of the region shown on the chromosome; the main display window, comprising several tracks showing different types of information about the genomic region on display; a set of buttons to modify the main display; and several rows of pull-down menus, each controlling the display status of an individual track in the main display window.

2) Click the “hide all” button in the second panel of display controls to remove all tracks from the browser. Then select the “full” option from the drop-down FlyBase Genes menu under “Genes and Gene Prediction Tracks” and click refresh.

3) In the main display window, the top feature in light blue is the 2-exon gene eve. Click on one of the blue exons of the eve gene. This will send you to a detailed page about the eve gene. Scroll down to see the data linked to eve, including information imported from external sources like FlyBase, links to other resources like “Orthologous Genes in Other Species” also at the UCSC genome browser, and links to other resources outside the UCSC Database such as “Protein Domain and Structure Information” at the InterPro database. [Note: you can minimise subsections of this page by clicking the “-” button next to each major heading.]

4) Click the white "Genome Browser" link on the top blue banner to return to the main browser page. Scroll down the page and read the names of the various tracks available for this genome sequence. Sets of tracks are organised into logical groups (e.g. "Genes and Gene Prediction Tracks"). To find out more information about the data in any track, you can click on the blue title above each track's pulldown menu, which leads to a detail page explaining the track and additional track-specific display controls. [Note: the same detail page can be displayed by clicking on the blue/grey rectangle at the very left of the track display in the main display window.]

5) Click on the blue title for the "FlyReg" track under the "Expression and regulation" heading. Take a minute to read about the FlyReg track and where the data comes from. Set the display mode pull-down menu to "pack" and then click "refresh". [Note: visit http://www.flyreg.org/ for more information.]

6) Click the "zoom out 10x" button in the top right corner of the page to expand your window to display ten times the current sequence length. Each annotated regulatory element you are seeing in the FlyReg track is an experimentally-validated transcription factor binding site (TFBS) that regulates eve.

7) Click directly on one of the brown TFBS features in the FlyReg track in the 5′ intergenic region upstream of the eve gene. As with all data in different tracks, clicking on a feature will send you to a detail page about the feature, in this case the TFBS. In the detail page, click on the "Position" link and this will return you to the main Genome Browser window showing just the coordinates of the TFBS you just selected.

Investigating the conservation of individual TFBSs using the UCSC Genome Browser

1) Select the “full” option from the Conservation drop-down menu under “Comparative genomics” and click refresh.

2) The browser should now be displaying exactly one TFBS and conservation of this TFBS using the “12 Flies, Mosquito, Honeybee, Beetle Multiz Alignments & phastCons Scores” track. Is this TFBS conserved across the Drosophila species? If so, which ones? Is this TFBS conserved in Anopheles gambiae, Tribolium castaneum or Apis mellifera?

3) Click the “zoom out 10x” button twice and select a different TFBS from the FlyReg Track, click on the “Position” link in the detail page and evaluate if this TFBS is conserved and in which species. Now repeat for every TFBS in the genome (only joking!).

4) To see the general correspondence between TFBSs and highly conserved sequences, set the pull-down menu to "pack" for the "Most Conserved" track. Zoom out 10x so you are displaying a few hundred bp. How well do the most highly conserved sequences correspond to the TFBSs? Are TFBSs typically longer or shorter than a highly conserved region? Zoom out 10x so you are displaying ~1 kbp. Are there more TFBSs than conserved sequences or vice versa? Why might this be the case?

Investigating the conservation of all known TFBSs using the UCSC Table Browser

1) Click the “Tools” menu on the dark blue background at the top of the Genome Browser window and select “Table Browser”. This will send you to an alternative interface to access data in the UCSC Genome Database called the “Table Browser.” The pull-down menus you see here correspond to the same tracks you saw in the Genome Browser.

2) Select “Insect”, “D. melanogaster” and “Apr 2004” from the “clade”, “genome” and “assembly” pull-down menus.

3) Select “Expression and Regulation” and “FlyReg” from the “group” and “track” pull-down menus, respectively.

4) Click the radio button (the circle with a dot indicating which option is selected) next to “genome” as the “region” you will analyse. This will select the whole genome for inclusion in your analysis.

5) Select "Hyperlinks to Genome Browser" from the "output format" pull-down menu and click "get output". This will send you to a page with >1,300 hyperlinks that send you to all the annotated TFBSs in Drosophila, each of which corresponds to one row in the FlyReg data track. [Note: this is a general method to export data from any track, for the whole genome or a specific region.]

6) Click the “Tables” link on the dark blue background at the top of the page to return to the Table Browser. We are now going to use the Table Browser to ask “how many TFBS in Drosophila are found in highly conserved sequences?” We will do this by using the Table Browser to overlap all of the TFBS with all of the Most Conserved segments of the genome.

7) Click the “create” button next to the “intersection” option. This will send you to a page where you can set conditions for the overlap analysis of the FlyReg TFBS with the Most Conserved regions.

8) Select "Comparative Genomics" and "Most Conserved" from the "group" and "track" pull-down menus. Click the "All FlyReg records that have at least X% overlap with Most Conserved" option. Set the X% value to "100" and click "submit" to return to the main Table Browser page.

9) Notice that the Table Browser now shows the “intersection with phastConsElements15way” option is selected. Select “Hyperlinks to Genome Browser” from the “output format” pull-down menu and click “get output”.  This will send you to a page with hyperlinks that send you to all the annotated TFBS in Drosophila that are 100% contained in the Most Conserved regions of the genome. Click on a few links to convince yourself that this analysis has generated the correct results. At this point you have a great result – all fully conserved TFBS in Drosophila – but it is difficult to quantitatively summarize these data in the Table Browser, so let’s move on to Galaxy where analysis like this is a lot easier.

Quantifying TFBS conservation using Galaxy

1) Click the “Tables” link on the dark blue background at the top of the page to return to the Table Browser. Select “BED – browser extensible data” from the “output format” pull-down menu and check the “Galaxy” check box. Now click the “get output” button at the bottom of the page. This will send you to a detail page. Leave all the default options set here and click “Send query to Galaxy”.

2) This will launch a new Galaxy session as an anonymous user. [For more information on Galaxy, visit http://usegalaxy.org] The results of this Table Browser query will be executed as a Galaxy “analysis” which will appear in the right hand “History” pane of the Galaxy browser window as “1: UCSC Main on D. melanogaster: flyreg2 (genome)”. While the Table Browser query is running, this analysis will be grey in the history pane, but will turn green when it is completed.

3) When the analysis has completed, you will have created a new dataset numbered "1" in your Galaxy history. Click on the link entitled "1: UCSC Main on D. melanogaster: flyreg2 (genome)" in the history to reveal a dropdown which contains a snapshot of this dataset. [This dataset should contain 533 regions. If it doesn't, stop here and get help before moving on.] Now click on the eye icon to reveal the contents of this data set in the middle pane of the Galaxy window. Use the scroll bar on the right hand side of the middle pane to browse the contents of this data set.

4) Click on the pencil icon to “Edit attributes” of this data set. In the middle pane replace “1: UCSC Main on D. melanogaster: flyreg2 (genome)” in the “Name” text box with something shorter and more descriptive like “conserved TFBS”. Click the “Save” button at the bottom of the middle pane.

5) Now let’s get the complete set of FlyReg TFBS by querying the UCSC Table Browser from inside Galaxy. Click “Get Data” under “Tools” in the left hand pane of the Galaxy page. This will explode a list of options to get data into Galaxy. Click “UCSC Main” which will bring up the Table Browser inside the middle pane of the Galaxy page. Click the “clear” button next to “intersection with phastConsElements15way:”. Make sure the “Galaxy” check box is selected and click “get output” and then click “Send query to Galaxy” on the next page. This will create a new dataset “2: UCSC Main on D. melanogaster: flyreg2 (genome)” which you can rename “all TFBS” as above using the pencil icon. [Note: this dataset should have 1,362 regions in it; if your does not, please stop and ask for help.]

6) You can now perform many different analyses on these datasets using the many "Tools" available in the left hand pane of the Galaxy window. Let's start by summarizing how many TFBSs are present in the "1: conserved TFBS" and "2: all TFBS" datasets. To do this, click the "Statistics" link on the left hand side, which will open up a set of analysis tools including the "Count" tool. Click on the "Count" tool, and in the middle pane select "1: conserved TFBS" in the "from dataset" menu and click on "c4" so the counting occurs on column 4, which contains the name of the TFBS in the "1: conserved TFBS" dataset. Repeat this analysis for the "2: all TFBS" dataset. This should lead to two more datasets of 70 and 88 lines, respectively, which you should again rename to something more meaningful than the default values, such as "3: conserved TFBS counts" and "4: all TFBS counts".

7) Now let’s use the “Join, Subtract and Group->Join two Datasets” tool to join the results of the two previous analyses into one merged dataset. Click “Join, Subtract and Group->Join two Datasets”, select “4: all TFBS counts” in the “Join” drop-down menu, “c2” for the “using column” drop-down menu, “3: conserved TFBS counts” in the “with” drop-down menu and “c2” for the “and column” drop-down menu. Set the “Keep lines of first input that do not join with second input:”, “Keep lines of first input that are incomplete:”, and “Fill empty columns:” drop-down menus to “yes”. Setting the ” Fill empty columns” menu to yes will pull up a few more menus, which should be set as follows: “Only fill unjoined rows:” set to “Yes”; “Fill Columns by:” set to “single Fill Value”, and “Fill value” to “0”. Now click “Execute”. What you have just achieved is one of the trickier basic operations on bioinformatics, and is the underlying process in most relational database queries.  Pat yourself on the back!

8 ) Now let’s try to do some science and ask the question: “what is the proportion of conserved TFBS for each transcription factor?” This will give us some insight into whether TFBS turnover is the same for all TFs or might be higher or lower for some TFs. To do this, use the “Text manipulation->compute” Tool and set the “Add expression” box to “1-((c1-c3)/c1)” and “as a new column to:” to “5: Join two Datasets on data 3 and data 4” and click “Execute”. This will add a new column to the joined dataset with the proportion of conserved TFBS for each transcription factor.

For Further Exploration…

1) Format the output of your last analysis in a more meaningful manner using the “Filter and Sort->Sort” tool.

2) Plot the distribution of the proportion of conserved TFBS per TF using the “Graph/Display Data->Histogram” tool.

3) Go back to the original TFBS datasets and derive new datasets to investigate whether different chromosomes have different rates of TFBS evolution.

4) Develop your own question about TFBS evolution, create a custom analysis pipeline in Galaxy and wow us with your findings.

Installing libsequence and analysis tools on OSX

After being relegated to the margins of biology for almost a century, the field of population genetics is now moving closer to the center of mainstream biology in the era of next-generation sequencing (NGS). In addition to providing genome-wide DNA variation data to study classical questions in evolution that are motivated by population genetic theory, NGS now permits population-genetic based techniques like GWAS (genome-wide association studies) and eQTL (expression quantitative trait locus) analysis to allow biologists to identify and map functional regions across the genome.

Handling big data requires industrial strength bioinformatics tool-kits, which are increasingly available for many aspects of genome bioinformatics (e.g. the Kent source tree, SAMtools, BEDtools, etc.). However, for molecular population genetics at the command line, researchers have a more limited palette that can be used for genome-scale data, including: VariScan, the PERL PopGen modules, and the libsequence C++ library.

Motivated by a recent question on BioStar and a request from a colleague to generate a table of polymorphic sites across a bacterial genome, I've recently had a play with the last of these — Kevin Thornton's libsequence and accompanying analysis toolkit — which includes a utility called compute, a "mini-DNAsp for the Unix command-line." As ever, getting this installed on OSX was more of a challenge than desired, with a couple of dependencies (including Boost and GSL), but in the end it was doable as follows:

$ wget http://sourceforge.net/projects/boost/files/boost/1.47.0/boost_1_47_0.tar.gz
$ tar -xvzf boost_1_47_0.tar.gz
$ cd boost_1_47_0
$ ./bootstrap.sh
$ sudo ./b2 install
$ cd ..

$ wget http://molpopgen.org/software/libsequence/libsequence-1.7.3.tar.gz
$ tar -xvzf libsequence-1.7.3.tar.gz
$ cd libsequence-1.7.3
$ ./configure
$ make
$ sudo make install
$ cd ..

$ wget ftp://ftp.gnu.org/gnu/gsl/gsl-1.15.tar.gz
$ tar -xvzf gsl-1.15.tar.gz
$ cd gsl-1.15
$ ./configure --disable-shared --disable-dependency-tracking
$ make
$ sudo make install
$ cd ..

$ wget http://molpopgen.org/software/analysis/analysis-0.8.0.tar.gz
$ tar -xvzf analysis-0.8.0.tar.gz
$ cd analysis-0.8.0
$ ./configure
$ make
$ sudo make install
$ cd ..

With this recipe, compute and friends should now be happily installed in /usr/local/bin/.
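
A quick sanity check (the path assumes the default /usr/local prefix used above):

$ which compute
/usr/local/bin/compute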

Notes: This protocol was developed on a MacBook Air Intel Core 2 Duo running OSX 10.6.8.


On genome coordinate systems and transposable element annotation

[Update: For an extended discussion of the issues in this post, see: Bergman (2012) "A proposal for the reference-based annotation of de novo transposable element insertions" Mobile Genetic Elements 2:51-54]

Before embarking on any genome annotation effort, it is necessary to establish the best way to represent the biological feature under investigation. This post discusses how best to represent the annotation of transposable element (TE) insertions that are mapped to (but not present in) a reference genome sequence (e.g. from mutagenesis or re-sequencing studies), and how the standard coordinate system in use today causes problems for the annotation of TE insertions.

There are two major coordinate systems in genome bioinformatics, which differ primarily in whether they anchor genomic features to ("base coordinate systems") or between ("interbase coordinate systems") nucleotide positions. Most genome annotation portals (e.g. NCBI or Ensembl), bioinformatics software (e.g. BLAST) and annotation file formats (e.g. GFF) use the base coordinate system, which represents a feature starting at the first nucleotide as position 1. In contrast, a growing number of systems (e.g. UCSC, Chado, DAS2) employ the interbase coordinate system, whereby a feature starting at the first nucleotide is represented as position 0. Note that the UCSC genome bioinformatics team actually use both systems, referring to the base coordinate system as "one-based, fully-closed" (used in the UCSC genome browser display) and the interbase coordinate system as "zero-based, half-open" (used in their tools and file formats), leading to a FAQ about this issue by users. The interbase coordinate system is also referred to as "space-based" by some authors.

The differences between base (bottom) and interbase (top) coordinate systems can be visualized in the following diagram (taken from the Chado wiki).

There are several advantages to using the interbase coordinate system, including: (i) the ability to represent features that occur between nucleotides (like a splice site), (ii) simpler arithmetic for computing the length of features (length = end - start) and overlaps (min(end1, end2) - max(start1, start2)), and (iii) more rational conversion of coordinates from the positive to the negative strand (see discussion here).
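
As a small sketch of point (ii), assuming a hypothetical table features with interbase (zero-based, half-open) chromStart/chromEnd columns in the style of the UCSC tables:

-- length requires no +1 correction on interbase coordinates
select chromEnd - chromStart as length from features;

-- bases shared by two features a and b; zero or negative means no overlap
select least(a.chromEnd, b.chromEnd) - greatest(a.chromStart, b.chromStart) as overlap
from features a, features b
where a.name = 'featureA' and b.name = 'featureB';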

So why is the choice of coordinate system important for the annotation of TEs mapped to a reference sequence? The short answer is that TEs (and other insertions) that are not a part of the reference sequence occur between nucleotides in the reference coordinate system, and therefore it is difficult to accurately represent the location of a TE on base coordinates. Nevertheless, base coordinate systems dominate most of genome bioinformatics and are an established framework that one has to work within.

How then should we annotate TE insertions on base coordinates that are mapped precisely to a reference genome? If a TE insertion in reality occurs between positions X and X+1 in a genome, do we annotate the start and end position both at the same nucleotide? If so, do we annotate the start/stop coordinate at position X, or both at position X+1? If we choose to annotate the insertion at position X, then we need to invoke a rule that the TE inserts after nucleotide X. However, this solution breaks down if the insertion is on the negative strand, since we either need to map a negative strand insertion to X+1 or have a different rule for interpreting the placement of the TE on positive and negative strands. Alternatively, do we annotate the TE as starting at X and ending at X+1, attempting to fake interbase coordinates on a base coordinate system, but at face value implying that the TE insertion is not mapped precisely and spans 2 bp in the genome?

After grappling with this issue for some time, it seems that neither of these solutions is sufficient to deal with the complexities of TE insertion and reference mapping. To understand why, we must consider the mechanisms of TE integration and how TE insertions are mapped to the genome. Most TEs create staggered cuts to the genomic DNA that are filled on integration into the genome leading to short target site duplications (TSDs). Most TEs also target a palindromic sequence, and insert randomly with respect to orientation. A new TE insertion is typically mapped to the genome by sequencing a fragment that spans the TE into unique flanking DNA, either by directed (e.g. inverse/linker PCR) or random (e.g. shotgun re-sequencing) approaches. The TE-flank fragment can be obtained from the 5′ or 3′ end of the TE. However, where one places the TE insertion depends on whether one uses the TE-flank from the 5′ or 3′ end and the orientation of the TE insertion in the genome. As shown in the following diagram, for an insertion on the positive strand (>>>), a TE-flank fragment from the 5′ end is annotated to occur at the 3′ end of the TSD (shown in bold), whereas a 3′ TE-flank fragment is placed at the 5′ end of the TSD.  For an insertion on the negative strand (<<<), the opposite effect occurs. In both cases, TE-flank fragments from the 5′ and 3′ end map the TE insertion to different locations in the genome.

Thus, where one chooses to annotate a TE insertion relative to the TSD is dependent on the orientation of the TE insertion and which end is mapped to the genome. As a consequence, both the single-base and two-base representations proposed above are flawed, since TE insertions into the same target site are annotated at two different locations on the positive and negative strand. This issue led us (in retrospect) to misinterpret some aspects of the P-element target site preference in a recent paper, since FlyBase uses a single-base coordinate system to annotate TE insertions.

As an alternative, I propose that the most efficient and accurate way to represent TE insertions mapped to a reference genome on base coordinates is to annotate the span of the TSD and label the orientation of the TE in the strand field. This formulation allows one to bypass having to choose where to locate the TE relative to the TSD (5′ vs. 3′, as is required under the one-base/two-base annotation framework), and can represent insertions into the same target site that occur on different strands. Furthermore, this solution allows one to use both 5′ and 3′ TE-flank information. In fact, the overlap between coordinates from the 5′ and 3′ TE-flank fragments defines the TSD. Finally, this solution requires no prior information about TSD length for a given TE family, and also accommodates TE families that generate variable-length TSDs, since the TSD is annotated on a per-TE basis.
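
To make the proposal concrete, here is a hypothetical example (the coordinates, names and 8-bp TSD length are invented): two TEs inserted into the same target site on opposite strands are annotated on identical TSD-spanning coordinates in BED format, differing only in the strand field:

chr2L  10000  10008  TE_insertion_A  0  +
chr2L  10000  10008  TE_insertion_B  0  -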

The only problem left open by this proposed solution is for TEs that do not create a TSD, which have been reported to exist. Any suggestions for a general solution that also allows for annotation of TE insertions without TSDs would be much appreciated….

Linux packages for UCSC genome source tools

Posted 09 Apr 2010 — by maxhaussler
Category genome bioinformatics, linux, UCSC genome browser

Some time ago I’ve created packages for the UCSC genome source tools for 64Bit Intel/AMD processors.

A .rpm version can be downloaded here; a .deb version can be downloaded here.

On Fedora/RedHat/CentOs, install the rpm with:

sudo yum install ucsc-genome-tools-225-2.x86_64.rpm --nogpgcheck

On Debian/Ubuntu, install the .deb file with:

dpkg -i ucsc-genome-tools_225-1_amd64.deb

Tell me in the comments if there are any problems; I've tested them only on Debian and CentOS.

NCBI Blast Tabular output format fields

Posted 14 Dec 2009 — by maxhaussler
Category BLAST, genome bioinformatics

Certainly, with the new NCBI Blast+ tools, you won't need this anymore, but as long as we are sticking with the old blastall program with its horrible documentation, I keep forgetting the format of the BLAST tabular reports. Tabular format is created when you specify "-m 8". This is the most useful format for parsing BLAST output yourself without having to learn strange libraries like BioPerl, BioJava, BioPython or BioErlang (doesn't this exist yet, Mike?)

So here is the meaning of the fields:

queryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount, queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore

Parsing is then simple:

Python

for line in open("myfile.blast"):
    (queryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount,
     queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore) = line.rstrip("\n").split("\t")

Perl

while (<>) {
    chomp;
    ($queryId, $subjectId, $percIdentity, $alnLength, $mismatchCount, $gapOpenCount,
     $queryStart, $queryEnd, $subjectStart, $subjectEnd, $eVal, $bitScore) = split(/\t/);
}

Creating Customised BioMarts from the Ensembl Database: A Step-by-Step Guide

Posted 15 Jul 2009 — by caseybergman
Category apache, biomart, ensembl, genome bioinformatics
In this blog post, we give a detailed description of how to construct a mirror of the marts from the Ensembl database with custom data added to the existing Ensembl schemas. This enables us to extend the existing established marts — namely Sanger's Ensembl Mart and VEGA Mart — with arbitrary supplementary data relating to genes and genomes.

BioMart is a set of tools for (i) denormalising database schemas in order to optimise database search queries and (ii) accessing the denormalised data via web browsers, Perl scripts or the DAS protocol. Its applications are manifold, since the construction of a mart also provides an easily accessible data source with interactive and programmatic interfaces. A full description of the system can be found in the recent publication on BioMart.

In order to construct a minimal mart and to make it accessible via web browser, the following software needs to be installed:

  • MySQL, PostgreSQL or Oracle database
  • Perl
  • Apache Web-Server and ModPerl
  • BioMart
    • biomart-perl (command line tools for web site creation and DAS server)
    • martj (GUIs supporting the mart construction and maintenance)
With the preliminary software installed and set up, a mart can be created in these simple steps outlined as follows:
  1. Use MartBuilder to create the mart’s data tables
    1. Select relevant tables and databases for denormalising
    2. If necessary, establish relations between tables manually, but MartBuilder usually does a good job of finding relations between tables automatically
    3. Generate and execute the SQL which generates the data tables
  2. Use MartEditor to populate the mart’s meta tables with human readable information
    1. Let MartEditor guess the structure of the meta data
    2. Augment the meta data with sensible titles, labels and descriptions
  3. Run configure.pl of biomart-perl to write the web server’s HTML files and scripts

In the following, we give a detailed step-by-step description on how to create BioMarts from schemas in the Ensembl database and how to extend the marts with custom data. The Ensembl database consists of various schemas for each species. For example, there are schemas for human, cat, mouse and dog, each of which has sets of species-specific tables about genes, their transcripts and translation products. From a naive point of view, a mart combines these schemas and denormalises the schemas' tables into a single new schema — the mart — that can then be used to run optimised search queries.
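
As a hedged sketch of what this denormalisation looks like in practice (the mart table name follows the BioMart naming convention; the joined columns are a simplified subset of the Ensembl core schema):

CREATE TABLE hsapiens_gene_ensembl__gene__main AS
SELECT g.gene_id, g.biotype, t.transcript_id, x.display_label
FROM gene g
JOIN transcript t ON t.gene_id = g.gene_id
LEFT JOIN xref x ON x.xref_id = g.display_xref_id;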


Setting-Up Perl, Libraries and BioMart-Perl

In order to set up a mart, it is necessary to create the mart's web back-end using a Perl script, which itself relies on a number of Perl modules. BioMart runs under both Perl 5.8 and Perl 5.10. First, the following commands should be executed in a shell:

# perl -MCPAN -e shell
cpan[1]> install Bundle::CPAN
cpan[2]> q
# cpan
cpan[1]> install Module::Build
cpan[2]> o conf prefer_installer MB
cpan[3]> o conf commit
cpan[4]> q

Now, the rest of the required modules can be installed by executing cpan x, where x stands for any of the following modules:

  • Spreadsheet::WriteExcel
  • Template::Constants
  • Log::Log4perl
  • Digest::SHA
  • IO::Compress::Gzip
  • Test::Exception
  • Template::Plugin::Number::Format
  • Exception::Class
  • CGI::Session::Driver::db_file
  • Spreadsheet::WriteExcel::Big
  • Number::Format
  • CGI::Session
  • Template
  • Readonly

LibGD — for the dynamic creation of images

For the programmatic creation of images, the installation of LibGD is required; it can be downloaded from http://www.libgd.org/Downloads. The software comes in the form of a source code package, but its build and installation process is straightforward. Within the root directory of the extracted tar.gz (or tar.bz2) package, the following lines should be executed:

# ./configure
# make
# make install

Remark: configure should report that everything is found, with the possible exception of the JPEG library, which may be reported as missing.

Expat

Expat is an XML parser, which is also needed by the BioMart software. It can be downloaded via http://sourceforge.net/project/showfiles.php?group_id=10127&package_id=10780 and installed using the following few lines:

# ./configure
# make
# make install

BioMart-Perl and MartJ

The installation of BioMart-Perl and MartJ is described in the full documentation of BioMart, which can be accessed via http://www.biomart.org/user-docs.pdf. Since there are only a few lines to execute, we repeat here those lines that can also be found in the documentation. First, BioMart-Perl is installed by checking out the software from a CVS server as follows:

# cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/biomart login
Password is CVSUSER
# cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/biomart \
    co -r release-0_7 biomart-perl

This will create a directory biomart-perl in which the BioMart-Perl software and example configuration files reside.

Second, MartJ can be downloaded using the URL ftp://anonymous@ftp.ebi.ac.uk/pub/software/biomart/martj_current/martj-bin.tgz, which then can be extracted by:

tar xzf martj-bin.tgz

Tar will create a directory martj with the programs MartBuilder and MartEditor in the sub-directory bin. Each of the tools can be executed by running the respective shell scripts martbuilder.sh and marteditor.sh.

The construction of a mart using MartBuilder can consume large amounts of memory, and it may happen that the Java program terminates because the imposed limit on the maximum permissible heap memory is exceeded. In that case, it is necessary to edit martbuilder.sh and replace the parameters -Xmx512m and -Xms256m with larger values. We experienced no more memory allocation problems after replacing -Xmx512m with -Xmx2048m.

Mart-Build

All Ensembl and VEGA marts are built using MartBuilder, with the exception of the sequence mart, which is created by using the now otherwise obsolete mart-build Perl scripts. The mart-build scripts can be obtained via Sanger’s CVS server too:

# cvs -d :pserver:cvsuser@cvsro.sanger.ac.uk:/cvsroot/biomart login
Password is CVSUSER
# cvs -d :pserver:cvsuser@cvsro.sanger.ac.uk:/cvsroot/biomart co mart-build

We will describe the actual sequence mart construction in detail below.

Running MySQL Database and Apache Web-Server

Describing how to set up MySQL and Apache's web server could easily fill books, and therefore we do not attempt to describe every little detail about the installation and configuration of those programs here. However, we highlight some details that might provide some ease to the troubled bioinformatics researcher.

MySQL is usually started on larger systems using the wrapper mysqld_safe, whereas on smaller systems it might be sufficient to start the MySQL daemon directly using

mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data --user=username

Below, we will turn towards importing the Ensembl database, which is several hundred gigabytes large. In the default configuration of MySQL, the daemon will cater for database replication needs, i.e. the distribution of data from a master node to several slave nodes. In order to do this, the master node needs to write a 'binary log-file', which keeps track of every action performed on the master node. Unfortunately, this means that every line of data that we import into our database will also get logged (and hence duplicated) in one of the binary log files. For the whole Ensembl database, which is somewhere between 500-800 GB in size, this can mean that a 1 terabyte hard drive might prove to be too small under this default option. As such, it is advisable to turn off binary data-logs — unless you really use database replication — by changing /etc/my.cnf and commenting out the variable log-bin, i.e.

...
#log-bin=mysql-bin
...

The web-server has to be started with its root directory pointing to the biomart-perl installation directory. For example, if the biomart-perl directory was created by CVS in /usr/local, then httpd has to be invoked like

httpd -d /usr/local/biomart-perl

However, this will not result in a functioning mart yet, since it is first necessary to import the Ensembl database, create the mart’s data tables and finally the mart’s meta-data tables. Once these steps are carried out, the web-server can be started. Also, if the mart is edited so that perl bin/configure.pl (see below) has to be re-run, the web-server has to be restarted as well.

Importing Ensembl in MySQL

As mentioned before, the complete Ensembl database is somewhere between 500-800 gigabytes in size, so when we import the data into the database, we want to keep as much disk space free as possible. In the following, we describe how the compressed data files from ftp.ensembl.org can be streamed to the mysqlimport tool without actually uncompressing the data files to the hard drive.

We can obtain an archive file that contains all compressed data files for a particular Ensembl version by ftp. Even though the archive itself is uncompressed, i.e. it is an ordinary tar archive, its contents are gzipped:

# ftp ftp.ensembl.org
ftp> cd /pub
ftp> get release-55.tar

After the download is complete, the current working directory contains the file release-55.tar with the gzip-compressed contents of the respective ‘release-55’ directory on the ftp server. The file can then be extracted by executing

# tar xf release-55.tar

Since the tar file itself is not compressed, the extracted archive will take up the same disk space as the tar file.

The importer mysqlimport itself is not capable of reading data from a stream, so we have to install some additional software that enables us to stream data directly to mysqlimport. We use a custom script, which we present below, and we make use of the Perl module MySQL::Slurp for the actual input stream redirect. Slurp can be installed as follows:

# cpan MySQL::Slurp

The module itself is used in the following Perl code, which should be saved in a file called slurp.pl:

#!/usr/bin/perl
use MySQL::Slurp;

@import_args = ( '-u', 'username', '--password=userpassword' );

my $slurper = MySQL::Slurp->new(
    database => $ARGV[0],
    table    => $ARGV[1],
    method   => 'mysqlimport',
    args     => \@import_args
)->open;

$slurper->slurp();
$slurper->close;

We execute slurp.pl within the following bash script, import.sh, which should be executed in the directory previously created by tar xf release-55.tar. Of course, we assume that slurp.pl is in the PATH environment of the shell, so that it will be picked up and invoked within the script successfully.

#!/bin/bash
for i in * ; do
    echo "CREATE DATABASE $i;" | mysql -u username --password=user_password ;
    gunzip -c $i/*.sql.gz | mysql -u username --password=user_password $i ;
    for j in $i/*.txt.gz ; do
        gunzip -c $j | slurp.pl $i `basename $j .txt.gz` ;
    done ;
done

If only certain databases of species should be imported, then the for i in * ; do statement can be changed, for example, to for i in {homo_sapiens*,mus_musculus*} ; do.

Adding Custom Data

Custom data that is to appear in the final built mart has to be added into the tables of the Ensembl database. Anything related to the Ensembl mart will have to go into the *_core_* schemas of the Ensembl database, whilst data related to the VEGA mart has to be added to the *_vega_* schemas. Since both marts are centred around gene identifiers, custom data can be added by inserting tables into these schemas with a first column labelled gene_id of type INTEGER. Data in the second, third and further columns will then be associated with the respective gene via the gene ID in the first column; the type of the data only depends on how it will later be interpreted in the mart's web-interface.
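
A minimal sketch of such a table (everything except the gene_id column is an invented example):

CREATE TABLE custom_gene_scores (
    gene_id  INTEGER NOT NULL,  -- Ensembl internal gene identifier
    my_score FLOAT,             -- invented custom value to expose in the mart
    my_label VARCHAR(255)       -- invented custom annotation
);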

Building a Mart from Ensembl Schema

From the Ensembl database, we are going to build the marts that are also provided through the BioMart web site. These marts are the following:

  • Ensembl Mart (focusing on genes)
  • (Gene) Feature Mart
  • (Gene) Ontology Mart
  • (Gene) Sequence Mart
  • VEGA Mart (focusing on genes)

For these marts (except for the sequence mart, see below), EBI provides XML-files that can be used with the MartJ tools to build a mart automatically; alternatively, the XML-files can be downloaded here:

  • calculated_data_patch.pl: for adding 3′/5′ UTR and CDS start, end and length data to a sequence mart
  • mart_53.zip: XML-files for creating marts relating to Ensembl version 53
  • mart_55.zip: XML-files for creating marts relating to Ensembl version 55
A (simplified) process of building a mart can be outlined as follows:

  1. create an empty schema for the mart (e.g., vega_mart_55)
  2. run MartBuilder
    1. import EBI’s XML-file
    2. generate SQL
  3. execute the SQL
  4. run MartEditor
    1. transfer the mart’s meta-data (which includes the labelling for the web-interface) from martdb.ensembl.org to the local mart

In the following, we describe the creation of a mart in detail, where we distinguish between preliminary mart construction steps, script-based mart construction steps and MartJ GUI-based mart construction steps.

Preliminary Mart Construction

For all marts, it is necessary to create an empty schema in MySQL first. For the marts we have listed above, the following lines have to be executed:

# echo "CREATE DATABASE ensembl_mart_55;\n" | mysql -u username -p# echo "CREATE DATABASE genomic_features_mart_55;\n" | mysql -u username -p# echo "CREATE DATABASE ontology_mart_55;\n" | mysql -u username -p# echo "CREATE DATABASE sequence_mart_55;\n" | mysql -u username -p# echo "CREATE DATABASE vega_mart_55;\n" | mysql -u username -p

Both the Ensembl mart and the feature mart require empty database schemas called master_schema_55 and master_schema_variation_55, which are used as templates in their mart creation process. These schemas can be generated by extracting the SQL commands of any species database that features core and/or variation schemas:

# gzip -cd microcebus_murinus_core_55_*.sql.gz > master_schema_55.sql
# echo "CREATE DATABASE master_schema_55;" | mysql -u username -p
# cat master_schema_55.sql | mysql -u username -p master_schema_55
# gzip -cd anopheles_gambiae_variation_55_*.sql.gz > master_schema_variation_55.sql
# echo "CREATE DATABASE master_schema_variation_55;" | mysql -u username -p
# cat master_schema_variation_55.sql | mysql -u username -p master_schema_variation_55

Ensembl-, Feature-, Ontology- and VEGA-Mart Construction

Ensembl marts, gene feature marts, gene ontology marts and VEGA marts are built using the MartBuilder and MartEditor applications, following the steps described below. MartBuilder is a tool to create and populate the data behind a mart, whilst MartEditor allows us to edit the meta-information regarding the data tables, which is used by the web server to display meaningful names for filters and attributes. First, we describe the construction of a mart’s data tables with MartBuilder, and second, we go through the process of importing meta-data into said marts.

Each of the mentioned marts is associated with an XML-file that describes the mart’s structure, i.e. the relationships between tables in the database and information about which table columns should be condensed into the mart. The files ensembl_55.xml, features_55.xml, ontology_55.xml and vega_55.xml, which are provided by EBI, cannot be used directly in MartBuilder, since they contain settings that are specific to EBI’s database infrastructure. In order to use these files in your own mart construction process, it is necessary to change occurrences of the hostname, port number, username and password in the XML files, where username and password refer to the MySQL credentials. To change these parameters semi-automatically, the following commands can be used:

cat ensembl_55.xml \
| sed -e 's/outputPort="18001"/outputPort="your_mysql_port"/g' \
| sed -e 's/overrideHost="ens-staging"/overrideHost="your_mysql_host"/g' \
| sed -e 's/overridePort="3306"/overridePort="your_mysql_port"/g' \
| sed -e 's/localhost:43306/your_mysql_host:your_mysql_port/g' \
| sed -e 's/username="joebloggs"/username="username"/g' \
| sed -e 's/password=""/password="your_password"/g' \
> ensembl_55.xml.new
mv ensembl_55.xml.new ensembl_55.xml
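A quick way to check that no EBI-specific settings survived the substitution is to search for the original values (using the placeholders from the command above):

grep -E 'ens-staging|joebloggs|43306|18001' ensembl_55.xml

If this prints nothing, all occurrences have been replaced.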

The XML-files can then be loaded into MartBuilder, one after the other, and the data tables for a mart can be built via the menu selections below. However, MartBuilder relies on MartRunner to execute the SQL commands that it generates, so MartRunner has to be started before MartBuilder is invoked. MartRunner’s executable is located in the same directory as MartBuilder and MartEditor, and can be started with the command line ./martrunner.sh 1500, which sets up the MartRunner server listening on port 1500 for incoming SQL jobs. It is then possible to continue the mart construction in MartBuilder:

  1. File -> Open…: load an XML-file that describes a mart
  2. Mart -> Update all schemas: if additional tables have been added to the source schemas, or the XML-file describes an older version of the mart (for example, XML-files for mart version 53 used with Ensembl >53), then it is necessary to bring the mart description up to date with the actual source tables and their relationships to each other
  3. Schema -> Accept changes: ‘Update all schemas’ may lead to changes in the mart (highlighted in thick green), which have to be explicitly accepted
  4. Mart -> Generate SQL: choose MartRunner and set hostnames and ports accordingly

MartBuilder now opens a separate window that is an interface to MartRunner, where the following actions should be taken:

  1. Select the last entry (should be blue) in the Jobs available column and click on Start job
  2. The construction of the mart is finished when the job entry changes to green. If there is any error during the mart creation, then the entry will appear red.

Sequence marts

Sequence marts are not created with the MartJ tools; instead, the otherwise obsolete mart-build scripts are used (see above). In order to build a sequence mart successfully, the latest GO-term database has to be installed first. It can be downloaded via http://archive.geneontology.org/latest-termdb/go_daily-termdb-tables.tar.gz and then added to the MySQL database using the following commands:

# tar xzf go_daily-termdb-tables.tar.gz
# cd go_daily-termdb-tables
# echo "CREATE DATABASE go_termdb;" | mysql -u username -p
# cat *.sql | mysql -u username -p go_termdb
# mysqlimport -u username -p -L go_termdb *.txt
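To verify that the import succeeded, the tables of the new database can be listed (assuming the same credentials as above):

# echo "SHOW TABLES;" | mysql -u username -p go_termdb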

The data behind the sequence mart can then be generated as follows:

# cd mart-build/scripts/ensembl
# export ENSMARTHOST=your_mysql_host
# export ENSMARTPORT=your_mysql_port
# export ENSHOME=/usr/local/ensembl
# export ENSMARTUSER=username
# export ENSMARTPWD=your_password
# export ENSMARTDRIVER=mysql
# export PATH=.:$PATH
# export PERL5LIB=$PERL5LIB:/usr/local/ensembl/modules
# export PERL5LIB=$PERL5LIB:/usr/local/mart-build/modules
# ./make_all_dna_chunks.pl sequence_mart_55 55 0

In order to add 3′ and 5′ UTR as well as CDS start, end and length data to the sequence mart, it is necessary to run the script calculated_data_patch.pl, which was also provided to us by EBI. The script creates a directory calculated_data, in which a SQL-file is generated for each species containing instructions for adding said data to the sequence mart. Database-specific settings have to be made directly within the script, where the line

$registry->load_registry_from_db( -host => 'host',
                                  -user => 'user' );

should be replaced by appropriate values, for example:

$registry->load_registry_from_db( -host => 'your_host',
                                  -user => 'username',
                                  -pass => 'your_password' );

The extra data is added to the sequence mart by executing:

# cd calculated_data
# for i in *.sql ; do cat $i | mysql -u username --password=your_password ; done

Meta-Data Transfer

Before the web interface to a mart can be generated, it is necessary to annotate the data tables in the mart with human-readable descriptions and, additionally, to provide information about which table columns can be queried and filtered. Since constructing the meta-data from scratch is a lengthy process, we transfer the mart meta-data from martdb.ensembl.org to the local database using MartEditor.

During the meta-data transfer, it is necessary to save the meta-data of a mart on martdb.ensembl.org to the hard drive before it can be uploaded into the local mart. MartEditor saves the meta-data in the form of XML-files, which should not be confused with the XML-files we used to construct the data tables of the mart with MartBuilder.
In MartEditor, the following steps should be carried out in order to transfer the meta-data of an existing mart:

  1. File -> Database Connection: connect to the mart on martdb.ensembl.org
  2. File -> Save All: specify a directory in which the mart’s meta-data is stored in XML files
  3. File -> Database Connection: connect to the same local mart
  4. File -> Upload All: select the previously saved XML files
  5. File -> Database Connection: connect to the mart on martdb.ensembl.org
  6. File -> Import
  7. File -> Database Connection: connect to the local mart
  8. File -> Export
  9. File -> Exit
In case custom tables were added to the database schemas before the mart construction, the meta-data imported from martdb.ensembl.org will obviously not contain information about those attributes and filters. Meta-data for custom data has to be added manually in MartEditor as follows:
  1. File -> Database Connection: connect to the local mart
  2. File -> Update All: synchronise the data- and meta-tables of the mart
  3. File -> Import
  4. in the newly opened window: look for your attributes and filters under ‘FilterPage: new_filters’ and ‘AttributePage: new_attributes’ and move them into existing filter-/attribute-groups or create your own. Then add human-readable descriptions accordingly.
  5. File -> Export
  6. File -> Exit
It is important that the names chosen for an attribute and its respective filter coincide; otherwise, your mart queries will fail with a lengthy error message.

Web-Site Construction

Finally, it is possible to create the mart’s web-site by generating the necessary HTML and script files. First, though, a configuration file conf/martname.xml has to be created in the biomart-perl directory, which contains information about the mart’s name and database connection. The username and password in the XML file should correspond to a user with read-only privileges for the database, since the mart should only be readable through the web-interface, and the database user’s permissions should reflect that for security reasons as well.
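A minimal sketch of creating such a read-only user in MySQL, assuming a hypothetical account name martro and the mart schemas created earlier:

mysql> CREATE USER 'martro'@'localhost' IDENTIFIED BY 'readonly_password';
mysql> GRANT SELECT ON ensembl_mart_55.* TO 'martro'@'localhost';
mysql> GRANT SELECT ON vega_mart_55.* TO 'martro'@'localhost';
mysql> FLUSH PRIVILEGES;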

Given the configuration file martname.xml, we can now execute

perl bin/configure.pl --clean -r conf/martname.xml

The prompted question Do you want to install in API only mode [y/n] [n]: should be answered with ‘no’, which can be done by simply hitting return. After the script has completed, the web-server can be started as follows:

/usr/local/httpd-2.2.11/httpd -d /usr/local/biomart-perl

All BioMarts should now be up and running.
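As a quick sanity check, the web interface can also be probed from the command line. The port below is only an assumption and should match the port configured in the httpd.conf that configure.pl generates:

# curl -I http://localhost:5555/biomart/martview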

Acknowledgements

We would like to thank Rhoda Kinsella and Syed Haider at EBI for advice and support which they kindly provided throughout our introduction to using the BioMart system.

Configuring REannotate Apollo Tiers Files on OSX

Posted 23 Jun 2009 — by caseybergman
Category genome bioinformatics, OSX hacks, transposable elements

Vini Pereira has recently published a nice paper on his REannotate package for defragmenting pieces of dead and nested transposable elements (TEs) detected using RepeatMasker. Defragmentation is an important aspect of accurately annotating and data mining transposable elements (as Hadi Quesneville and I have discussed in our 2007 review article on TE bioinformatics resources).

One of the nice features of REannotate is the ability to output GFF files ready for import into the Apollo genome annotation and curation tool. Although there is excellent documentation for configuring Apollo, and REannotate provides a custom “tiers” file to properly display defragmented TE annotations, I struggled a bit to get these to work together on OSX based on the REannotate documentation. My difficulty arose partly because the location of the “conf/” directory of the Apollo installation on OSX is not explicit, and in the end finding this directory led to two alternative solutions to this problem. Both solutions assume Apollo was installed via the .dmg file.

Solution 1: This solution should work regardless of the location of the conf/ directory and is therefore platform-independent.

$cd $HOME/.apollo
$wget -O ensj.tiers http://www.bioinformatics.org/reannotate/download/REannotate.tiers
Launch Apollo
Import REannotate GFF file and associated FASTA sequence.
Select File -> Save Type Preferences…
Save As: ensj.tiers

Solution 2: This solution is OSX specific, and depends on the location of the conf/ directory.

$cd $HOME/.apollo
$wget -O REannotate.tiers http://www.bioinformatics.org/reannotate/download/REannotate.tiers
$cat /Applications/Apollo.app/Contents/Resources/app/conf/ensj.tiers REannotate.tiers > ensj.tiers
Launch Apollo
Import REannotate GFF file and associated FASTA sequence.

One final note: I find it helpful to “grep -v un-RepeatMasked_sequence” the REannotate GFF files before loading into Apollo to get rid of non-RepeatMasked annotations.
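For example, assuming REannotate wrote its annotations to a (hypothetically named) file output.gff:

$grep -v un-RepeatMasked_sequence output.gff > output.filtered.gff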

Compiling UCSC Source Tree Utilities on OSX

Posted 12 Mar 2009 — by caseybergman
Category genome bioinformatics, OSX hacks, UCSC genome browser

The UCSC genome bioinformatics site is widely regarded as one of the most powerful bioinformatics portals for experimental and computational biologists on the web. The ability to visualize genomics data through the genome browser and to perform data mining through the table browser, coupled with the ability to easily import custom data, permits a large range of genome-wide analyses to be performed with relative ease. One limitation of web-based access to the UCSC genome browser is the inability to automate your own analyses, which has led to the development of systems such as Galaxy that provide ways to record and share your analysis pipeline.

However, for those of us who would rather type than click, another solution is to download the source code (originally developed by Jim Kent) that builds and runs the UCSC genome browser, and integrate the amazing set of stand-alone executables into your own command-line workflows. As concisely summarized by Peter Schattner in an article in PLoS Computational Biology, the Source Tree includes:

“programs for sorting, splitting, or merging fasta sequences; record parsing and data conversion using GenBank, fasta, nib, and blast data formats; sequence alignment; motif searching; hidden Markov model development; and much more. Library subroutines are available for everything from managing C data structures such as linked lists, balanced trees, hashes, and directed graphs to developing routines for SQL, HTML, or CGI code. Additional library functions are available for biological sequence and data manipulation tasks such as reverse complementation, codon and amino acid lookup and sequence translation, as well as functions specifically designed for extracting, loading, and manipulating data in the UCSC Genome Browser Databases.”

Compiling and installing the utilities from the source tree is fairly straightforward on most linux systems, although my earliest attempts to install on a powerpc OSX machine failed several times. The problems related to building some executables against the MySQL libraries, and I never fully sorted them out, but I now have a fairly robust protocol for installation on i386 OSX machines. These instructions are adapted from the general installation notes in kent/src/README.

1) Install MySQL (5.0.27) and MySQL-dev (3.23.58) using fink.

2) Install libpng. [Note: my attempts to do this via Fink were unsuccessful.]
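If Fink fails, one alternative is to build libpng from source; a sketch, assuming a source tarball downloaded from libpng.org (the version number is only an example):

$tar xzf libpng-1.2.34.tar.gz
$cd libpng-1.2.34
$./configure
$make
$sudo make install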

3) Obtain and Make Kent Source Tree Utilities

$wget http://hgdownload.cse.ucsc.edu/admin/jksrc.zip
$unzip jksrc.zip
$mkdir $HOME/bin/i386
$sudo mkdir /usr/local/apache/
$sudo mkdir /usr/local/apache/cgi-bin-yourusername
$sudo chown -R yourusername /usr/local/apache/cgi-bin-yourusername
$sudo mkdir /usr/local/apache/htdocs/
$sudo chown -R yourusername /usr/local/apache/htdocs
$export PATH=$PATH:$HOME/bin/i386

[Note: it is necessary to add the path to bin before making, since some parts of the build require executables that are put there earlier in the build.]

$export MACHTYPE=i386
$export MYSQLLIBS="/sw/lib/mysql/libmysqlclient.a -lz"
$export MYSQLINC=/sw/include/mysql
$cd kent/src/lib
$make
$cd ../jkOwnLib
$make
$cd ..
$make

These instructions should (hopefully) cleanly build the code base that runs a mirror of the UCSC genome browser, as well as the ~600 utilities, including my personal favorite overlapSelect (which I plan to write more about later).
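As a taste of overlapSelect, the following command (with hypothetical BED file names) writes all records in reads.bed that overlap a record in genes.bed to overlapping_reads.bed:

$overlapSelect genes.bed reads.bed overlapping_reads.bed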

Notes: This solution works on a 2.4 GHz Intel Core 2 Duo Macbook running Mac OS 10.5.6 using i686-apple-darwin9-gcc-4.0.1. Thanks go to Max Haeussler for tipping me off to the Source Tree and the power of overlapSelect. This protocol was updated 19 March 2011 and works on the 9 March 2011 UCSC jksrc.zip file.