Archive for the ‘biomart’ Category

Automatically Attaching DAS Tracks to the Ensembl Genome Browser with BioMart Perl 0.7

Posted 09 Mar 2010 — by caseybergman
Category biomart, ensembl

In this blog post, we describe how BioMart Perl 0.7 needs to be patched in order to create self-configuring DAS sources, which can then be automagically attached to the Ensembl Genome Browser using only a single URL.

BioMart is versatile framework that primarily addresses the denormalisation of very large databases in order to optimise query performance, but it also comes with various interfaces which permit an easy access to the optimised data sources (a.k.a the marts). One of the numerous interfaces is a Distributed Annotation System (DAS) server, which permits the integration of marts into remote systems and services. The most comprehensive marts are currently derived from the Ensembl genome databases, for which (among many other choices) the Ensembl Genome Browser provides an easy to use user interface. It is possible to manually attach BioMart DAS tracks to the Ensembl Genome Browser, so that annotation data from marts can be visualised in the browser on a gene and chromosome level of a genome besides the already existing features in the Ensembl databases.

BioMart DAS sources can be manually added as a DAS track in the Ensembl Genome Browser by

  1. clicking on ‘Manage your data’ in the panel to the left-hand side in the browser,
  2. then clicking on ‘Attach DAS’,
  3. entering the URL of the DAS server (e.g.,
  4. clicking ‘Next’,
  5. explicitly selecting the DAS source you want to add,
  6. clicking ‘Next’,
  7. selecting the species this DAS source applies to,
  8. clicking ‘Next’,
  9. selecting the coordinate system the DAS source uses,
  10. clicking ‘Next’,
  11. and finally the configuration dialog needs to be closed.

Here, we provide two patch files, and, which modify BioMart Perl 0.7 in such a way that a mart’s DAS track can be added using a similar URL to:


This URL will instruct the Ensembl Genome Browser to add the DAS track given by the parameter contigviewbottom. The browser in return then confirms the successful operation with the following message:

For this example we have just taken a copy of the Ensembl mart as the DAS source, so the data we are displaying is a mirror image of the already existing genes in the Ensembl database. Nevertheless, the DAS track ‘Your Source’ — the name is configurable: see below — appears now in the genome browser like this:

If we try the same with the Ensembl mart from, i.e. if we provide the DAS track URL above, then the operation will fail and the genome browser will notify us with the message:

Patching BioMart Perl 0.7

In order to apply the patch, simply copy both patch files into the parent directory of your BioMart Perl installation and then carry out the following steps:

computer:legacy joachim$
computer:legacy joachim$ patch --verbose -p1 <
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
|--- old/biomart-perl/bin/ 2009-01-10 13:21:32.000000000 +0000
|+++ new/biomart-perl/bin/ 2010-02-10 20:25:26.000000000 +0000
Patching file biomart-perl/bin/ using Plan A...
Hunk #1 succeeded at 26.
Hunk #2 succeeded at 703.
computer:legacy joachim$ patch --verbose -p1 <
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
|--- old/biomart-perl/bin/ 2010-02-08 11:40:07.000000000 +0000
|+++ new/biomart-perl/bin/ 2010-02-10 20:25:40.000000000 +0000
Patching file biomart-perl/bin/ using Plan A...
Hunk #1 succeeded at 137.
computer:legacy joachim$

Of course, the --verbose parameter can be omitted.

Our patch introduces a few new parameters in biomart-perl/conf/settings.conf:

sourcename=Your Source

Our Perl modifications will try to connect to your MySQL database (can be easily adapted for Oracle and Postgres: change ‘DBI:mysql:’ in and read the contents of ensembl_compara_version. For this, the database settings and the Ensembl version are needed. The parameter sourcename defines simply the string that is displayed as the DAS track’s name in the genome browser.

The Juice

We achieved the implementation of a self-configuring DAS track by modifying the value of the MAPMASTER XML element in BioMart’s DSN listing. For each DAS track, the map master’s URL now points to the Ensembl web-site where specific information about the track’s species and coordinate system are located. Access to the compara database is required in order to retrieve the assembly version of the genome the track is working on. The latter is required to form a proper URL for the MAPMASTER XML element.


Max: DAS insights and Ensembl self-configuring DAS-track knowledge
Joachim: BioMart patches

Creating Customised BioMarts from the Ensembl Database: A Step-by-Step Guide

Posted 15 Jul 2009 — by caseybergman
Category apache, biomart, ensembl, genome bioinformatics
In this blog post, we give a detailed description of constructing a mirror of the marts from the Ensembl database with custom data added to the existing Ensembl schemas. This enables us to extend the existing established marts — namely Sanger‘s Ensembl Mart and VEGA Mart — with arbitrary supplementary relating to genes and genomes.

BioMart is a set of tools for (i) denormalising database schemas in order to optimise database search queries and (ii) a framework for accessing the denormalised data via web browsers, Perl scripts or the DAS protocol. Its applications are manifold since the construction of a mart also provides an easily accessible data source with interactive and programmatic interfaces. A full description of the system can be found in the recent publication on BioMart.

In order to construct a minimal mart and to make it accessible via web browser, the following software needs to be installed:

  • MySQL, PostgreSQL or Oracle database
  • Perl
  • Apache Web-Server and ModPerl
  • BioMart
    • biomart-perl (command line tools for web site creation and DAS server)
    • martj (GUIs supporting the mart construction and maintenance)
With the preliminary software installed and set up, a mart can be created in these simple steps outlined as follows:
  1. Use MartBuilder to create the mart’s data tables
    1. Select relevant tables and databases for denormalising
    2. If necessary, establish relations between tables manually, but MartBuilder does usually a good job finding correlations between tables automatically
    3. Generate and execute the SQL which generates the data tables
  2. Use MartEditor to populate the mart’s meta tables with human readable information
    1. Let MartEditor guess the structure of the meta data
    2. Augment the meta data with sensible titles, labels and descriptions
  3. Run of biomart-perl to write the web server’s HTML files and scripts

In the following, we give a detailed step-by-step description on how to create BioMarts from schemas in the Ensembl database and how to extend the marts with custom data. The Ensembl database consists of various schemas for each species. For example, there are schemas for human, cat, mouse and dog, each of which has sets of species-specific tables about genes, their transcripts and translation products. From a naive point of view, a mart combines these schemas and denormalises the schemas’ tables into a single new schema — the mart — that can then used to run optimised search queries.

Setting-Up Perl, Libraries and BioMart-Perl

In order to set up a mart, it is necessary to create the mart’s web back-end using a Perl script, which itself relies on Perl modules. BioMart runs under both Perl 5.8 and Perl 5.10, where the following commands should be executed in a shell:

# perl -MCPAN -e shellcpan[1]> install Bundle::CPANcpan[2]> q# cpancpan[1]> install Module::Buildcpan[2]> o conf prefer_installer MBcpan[3]> o conf commitcpan[4]> q

Now, the rest of the required modules can be installed by executing cpan x, where x stands for any of the following modules:

  • Spreadsheet::WriteExcel
  • Template::Constants
  • Log::Log4perl
  • Digest::SHA
  • IO::Compress::Gzip
  • Test::Exception
  • Template::Plugin::Number::Format
  • Exception::Class
  • CGI::Session::Driver::db_file
  • Spreadsheet::WriteExcel::Big
  • Number::Format
  • CGI::Session
  • Template
  • Readonly

LibGD — for the dynamic creation of images

For the programmatically creation of images, the installation of LibGD is required, which can be downloaded from The software comes in form of a source code package, but its build and installation process is straightforward. Within the root directory of the extracted tar.gz (or tar.bz2) package, the following lines should be executed:

# ./configure# make# make install

Remark: configure should report that only the JPEG library is missing, but everything else is found


Expat is an XML parser, which is also needed by the BioMart software. It can be downloaded via and installed using the following few lines:

# ./configure# make# make install

BioMart-Perl and MartJ

The installation of BioMart-Perl and MartJ is described in the full documentation of BioMart, which can be accessed via Since there are only a few lines to execute, we repeat those lines that can be also found in the documentation here. First, the installation of BioMart-Perl is carried out by checking out the software from a CVS server as follows:

# cvs -d loginPassword is CVSUSER# cvs -d \co -r release-0_7 biomart-perl

This will create a directory biomart-perl in which the BioMart-Perl software and example configuration files reside.

Second, MartJ can be downloaded using the URL, which then can be extracted by:

tar xzf martj-bin.tgz

Tar will create a directory martj with the programs MartBuilder and MartEditor in the sub-directory bin. Each of the tools can be executed by running the respective shell scripts and

The construction of a mart using MartBuilder can consume large amounts of memory and it may happen that the Java program terminates because the imposed limit on the permissible maximum heap memory is exceeded. In that case, it is necessary to edit and replace the parameters -Xmx512m and -Xms256m with larger values. We experienced no memory allocation problems anymore after replacing -Xmx512m with -Xmx2048m.


All Ensembl and VEGA marts are built using MartBuilder, with the exception of the sequence mart, which is created by using the now otherwise obsolete mart-build Perl scripts. The mart-build scripts can be obtained via Sanger’s CVS server too:

# cvs -d loginPassword is CVSUSER# cvs -d co mart-build

We will describe the actual sequence mart construction in detail below.

Running MySQL Database and Apache Web-Server

Describing how to set up MySQL and Apache’s web-server can easily fill books, and therefore we do not to attempt to describe every little detail about the installation and configuration of those programs here. However, we highlight some interesting details that might provide some ease to the troubled bioinformatics researcher.

MySQL is usually started on larger systems using the wrapper mysqld_safe, whereas on smaller systems it might be sufficient to start the MySQL daemon directly using

mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data --user=username

Below, we will turn towards importing the Ensembl database, which is several hundred gigabytes large. In the default configuration of MySQL, the daemon will cater for database replication needs, i.e. the distribution of data from a master node to several slave nodes. In order to do this, the master node needs to write a ‘binary log-file’, which keeps track of every action performed on the master node. Unfortunately, this means that every line of data that we are importing into our database, will also get logged (and hence duplicated) in one of the binary log files. For the whole Ensembl database, which is somewhat 500-800MB in size, this can mean that a 1 terabyte harddrive might prove to be too small under this default option. As such, it is advisable to turn off binary data-logs — unless you really use database replication — by changing /etc/my.cnf and uncommenting the variable log-bin, i.e.


The web-server has to be started with its root directory pointing to the biomart-perl installation directory. For example, if the biomart-perl directory was created by CVS in /usr/local, then httpd has to be invoked like

httpd -d /usr/local/biomart-perl

However, this will not result in a functioning mart yet, since it is first necessary to import the Ensembl database, create the mart’s data tables and finally the mart’s meta-data tables. Once these steps are carried out, the web-server can be started. Also, if the mart is edited so that perl bin/ (see below) has to be re-run, the web-server has to be restarted as well.

Importing Ensembl in MySQL

As it has been said before, the complete Ensembl database is somewhat between 500-800 megabyte in size, so when we import the data into the database, we want to keep as much disk-space free as possible. In the following, we describe how the compressed data files from can be streamed to the mysqlimport tool, without actually uncompressing the data files to the hard drive.

We can obtain an archive file that contains all compressed data files for a particular Ensembl version by ftp. Even though the archive itself is uncompressed, i.e. it is an ordinary tar archive, its contents are gzipped:

# ftp ftp.ensembl.orgftp> cd /pubftp> get release-55.tar

After the download is complete, the current working directory contains the file release-55.tar with the gzip-compressed contents of the respective ‘release-55’ directory on the ftp server. The file can then be extracted by executing

# tar xf release-55.tar

Since the tar file itself is not compressed, the extracted archive will take up the same disk space as the tar file.

The importer mysqlimport itself is not capable of reading data from a stream, so we have to install some additional software that enables us streaming data directly to mysqlimport. We use a custom script, which we present below, and we make use of the Perl module ‘Slurp’ for the actual input stream redirect. Slurp can be installed as follows:

# cpan MySQL::Slurp

The module itself is used in the following Perl code, which should be saved in a file called

#!/usr/bin/perluse MySQL::Slurp;@import_args = ( '-u', 'username', '--password=userpassword' );my $slurper = MySQL::Slurp->new(database => $ARGV[0],table => $ARGV[1],method => 'mysqlimport',args => \@import_args)->open;$slurper->slurp();$slurper->close;

We execute within the following bash script,, which should be executed in the directory previously created by tar xf release-55.tar. Of course, we assume that is in the PATH environment of the shell, so that it will be picked up and invoked within the script successfully.

#!/bin/bashfor i in * ; doecho "CREATE DATABASE $i;\n" | mysql -u username --password=user_password ;gunzip -c $i/*.sql.gz | mysql -u username --password=user_password $i ;for j in $i/*.txt.gz ; dogunzip -c $j | $i `basename $j .txt.gz` ;done ;done

If only certain databases of species should be imported, then the for i in * ; do statement can be changed, for example, to for i in {homo_sapiens*,mus_musculus*} ; do.

Adding Custom Data

Custom data that is to appear in the final build mart has to be added into the tables of the Ensembl database. Anything related to the Ensembl mart will have to go into the *_code_* schemas of the Ensembl database, whilst data related to the VEGA mart has to be added to the *_vega_* schemas. Since both marts are centered around gene identifiers, custom data can be added by inserting tables into these schemas with a first column labelled gene_id of type INTEGER. Data in the second, third and further columns will then be associated with the respective gene and its gene ID of the first column, where the type of the data only depends on how it later will be interpreted in the mart’s web-interface.

Building a Mart from Ensembl Schema

From the Ensembl database, we are going to build the marts that are also provided through the BioMart web site. These marts are the following:

  • Ensembl Mart (focusing on genes)
  • (Gene) Feature Mart
  • (Gene) Ontology Mart
  • (Gene) Sequence Mart
  • VEGA Mart (focusing on genes)

For these marts (except for the sequence mart, see below), XML-files are provided by EBI that can be used with the MartJ tools to automatically build a mart, or alternatively the XML-files can be downloaded here:

  • for adding 3’/5′ UTR and CDS start, end and length data to a sequence mart
  • XML-files for creating marts relating to Ensembl version 53
  • XML-files for creating marts relating to Ensembl version 55
A (simplified) process of building a mart can be outlined as follows:

  1. create an empty schema for the mart (e.g., vega_mart_55)
  2. run MartBuilder
    1. import EBI’s XML-file
    2. generate SQL
  3. execute the SQL
  4. run MartEditor
    1. transfer the mart’s meta-data (which includes the labelling for the web-interface) from to the local mart

In the following, we describe the creation of a mart in detail, where we distinguish between preliminary mart construction steps, script-based mart construction steps and MartJ GUI-based mart construction steps.

Preliminary Mart Construction

For all marts, it is necessary to create an empty schema in MySQL first. For the marts we have listed above, the following lines have to be executed:

# echo "CREATE DATABASE ensembl_mart_55;\n" | mysql -u username -p# echo "CREATE DATABASE genomic_features_mart_55;\n" | mysql -u username -p# echo "CREATE DATABASE ontology_mart_55;\n" | mysql -u username -p# echo "CREATE DATABASE sequence_mart_55;\n" | mysql -u username -p# echo "CREATE DATABASE vega_mart_55;\n" | mysql -u username -p

Both the Ensembl mart and the feature mart require empty database schemas called master_schema_55 and master_schema_variation_55, which are used as templates in their mart creation process. These schemas can be generated by extracting the SQL commands of any species database that features core and/or variation schemas:

# gzip -cd microcebus_murinus_core_55_*.sql.gz > master_schema_55.sql# echo "CREATE DATABASE master_schema_55;" | mysql -u username -p# cat master_schema_55.sql | mysql -u username -p master_schema_55# gzip -cd anopheles_gambiae_variation_55_*.sql.gz > master_schema_variation_55.sql# echo "CREATE DATABASE master_schema_variation_55;" | mysql -u username -p# cat master_schema_variation_55.sql | mysql -u username -p master_schema_variation_55

Ensembl-, Feature-, Ontology- and VEGA-Mart Construction

Ensembl marts, gene feature marts, gene ontology marts and VEGA marts are built using the MartBuilder and MartEditor applications, following the steps described below. MartBuilder is a tool to create and populate the data behind a mart, whilst MartEditor allows us to edit the meta-information regarding the data tables, which is used by the web server to display meaningful names for filters and attributes. First, we describe the construction of a mart’s data tables with MartBuilder, and second, we go through the process of importing meta-data into said marts.

Each of the mentioned marts is associated with an XML-file that describes the mart’s structure, i.e. the relationships between tables in the database and information about which table columns should be condensed into the mart. The files ensembl_55.xml, features_55.xml, eontology_55.xml and vega_55.xml, which are provided by EBI, cannot be used directly in MartBuilder, since these files contain settings that are specific to EBI’s database infrastructure. In order to use these files in your own mart construction process, it is necessary to change occurrences of the hostname, port number, username and password in the XML files, where the username and password are referring to the MySQL username and password. In order to change this parameters semi-automatically, the following commands can be used,

cat ensembl_55.xml |sed -e 's/outputPort="18001"/outputPort="your_mysql_port"/g' |sed -e 's/overrideHost="ens-staging"/overrideHost="your_mysql_host"/g' |sed -e 's/overridePort="3306"/overridePort="your_mysql_port"/g' |sed -e 's/localhost:43306/your_mysql_host:your_mysql_port/g' |sed -e 's/username="joebloggs"/username="username"/g' |sed -e 's/password=""/password="your_password"/g' > ;mv ensembl_55.xml

The XML-files can then be loaded into MartBuilder, one after the other, and with the following menu selections, the data tables for a mart can be built. However, MartBuilder itself relies on MartRunner to execute the SQL commands that it generates, and therefore, it is necessary to start MartRunner before the invocation of MartBuilder. MartRunner’s executable is located in the same directory as MartBuilder and MartEditor, and can be started using the command line ./ 1500, which sets up the MartRunner server listening on port 1500 for incoming SQL jobs. It is then possible to continue the mart construction in MartBuilder:

  1. File -> Open…: load an XML-file that describes a mart
  2. Mart -> Update all schemas: if additional tables have been added to the source schemas or the XML-file is describing an older version of the mart (for example, XML-files for a mart version 53 whilst using Ensembl >53), then it is necessary to bring the mart description up to date with the actual source tables and their relationships to each other
  3. Schema -> Accept changes: ‘Update all schemas’ may lead to changes in the mart (highlighted in thick green), which have to be explicitly accepted
  4. Mart -> Generate SQL: choose MartRunner and set hostnames and ports accordingly

MartBuilder now opens a separate window that is an interface to MartRunner, where the following actions should be taken:

  1. Select the last entry (should be blue) in the Jobs available column and click on Start job
  2. The construction of the mart is finished when the job entry changes to green. If there is any error during the mart creation, then the entry will appear red.

Sequence marts

Sequence marts are not created using the MartJ tools; instead the otherwise obsolete mart-build scripts are used (see above). In order to successfully build a sequence mart, it is required to install the latest GO-term database first. It can be downloaded via and then added to the MySQL database using the command lines:

# tar xzf go_daily-termdb-tables.tar.gz# cd go_daily-termdb-tables# echo "CREATE DATABASE go_termdb;" | mysql -u username -p# cat *.sql | mysql -u username -p go_termdb# mysqlimport -u username -p -L go_termdb *.txt

The data behind the sequence mart can then be generated as follows:

# cd mart-build/scripts/ensembl# export ENSMARTHOST=your_mysql_host# export ENSMARTPORT=your_mysql_port# export ENSHOME=/usr/local/ensembl# export ENSMARTUSER=username# export ENSMARTPWD=your_password# export ENSMARTDRIVER=mysql# export PATH=.:$PATH# export PERL5LIB=$PERL5LIB:/usr/local/ensembl/modules# export PERL5LIB=$PERL5LIB:/usr/local/mart-build/modules# ./ sequence_mart_55 55 0

In order to add 3′ and 5′ UTR as well as CDS start, end and length to the sequence mart, it is necessary to run the script, which was provided to us by EBI as well. The script will create a directory calculated_data, where for each species a SQL-file is generated that contains instructions for adding said data to the sequence mart. Database specific settings have to be made within the file directly, where the line

$registry->load_registry_from_db( -host => 'host',                     -user => 'user' );

should be replaced by appropriate values, for example:

$registry->load_registry_from_db( -host => 'your_host',                     -user => 'username',                     -pass => 'your_password' );

The extra data is added to the sequence mart by executing:

# cd calculated_data# for i in *.sql ; do cat $i | mysql -u username --password=your_password ; done

Meta-Data Transfer

Before the web interface to a mart can be generated, it is necessary to annotate the data tables in the mart with human readable descriptions, and additionally, provide information about which table columns can be queried and filtered. Since the construction of the meta data from scratch is a lengthy process, we transfer the mart-meta data from to the local database using MartEditor.

During the meta-data transfer, it will become necessary to safe the meta-data from a mart on on the hard-drive, before it can be uploaded into the local mart. MartEditor will save the meta-data in form of XML-files, which should not be confused with the XML-files we used to construct the data of the mart using MartBuilder.
In MartEditor, the following steps should be carried out in order to transfer the meta-data of an existing mart:

  1. File -> Database Connection: connect to the mart on
  2. File -> Save All: specify a directory in which the mart’s meta-data is stored in XML files
  3. File -> Database Connection: connect to the same local mart
  4. File -> Upload All: select the previously saved XML files
  5. File -> Database Connection: connect to the mart on
  6. File -> Import
  7. File -> Database Connection: connect to the local mart
  8. File -> Export
  9. File -> Exit
In case custom tables were added to the database schemas before the mart construction, the imported meta-data from will obviously not contain information about those attributes and filters. Meta-data for custom data has to be added manually in MartEditor as follows:
  1. File -> Database Connection: connect to the local mart
  2. File -> Update All: synchronise the data- and meta-tables of the mart
  3. File -> Import
  4. in the newly opened window: look for your attributes and features under ‘FilterPage: new_filters’ and ‘AttributePage: new_attributes’ and move them into existing filter-/attribute-groups or create your own. Then add human readable descriptions accordingly.
  5. File -> Export
  6. File -> Exit
It is important that the names chosen for an attribute and its respective filter coincide or otherwise your mart queries will fail with a lengthy error message.

Web-Site Construction

Finally, it is possible to create the mart’s web-site by generating the necessary HTML and script files. In the biomart-perl directory, a configuration file conf/martname.xml has to be created first though, which contains information about the mart’s name and database connection. The username and password in the XML file should be corresponding to a user with read-only privileges for the database, since the mart should only be readable through the web-interface and the database user’s permissions should be reflecting that for security issues too.

Given the configuration file martname.xml, we can now execute

perl bin/ --clean -r conf/martname.xml

The prompted question Do you want to install in API only mode [y/n] [n]: should be answered with ‘no’, which can be achieved by simply hitting return. After the script has completed, the web-server can be started as follows:

/usr/local/httpd-2.2.11/httpd -d /usr/local/biomart-perl

All BioMarts should be up and running now..


We would like to thank Rhoda Kinsella and Syed Haider at EBI for advice and support which they kindly provided throughout our introduction to using the BioMart system.