Archive for the ‘apache’ Category

Using Webdav to share files

Posted 17 Oct 2009 — by maxhaussler
Category apache, filesharing

I hate sending around the same text document all the time to the same people. With Google Docs still too buggy for serious work and completely missing reference management, and Etherpad and Zoho Writer not up to the task either, I am searching for a better solution. At school, some people used Microsoft’s SharePoint Server, which lets you open documents over the internet and also save them back to the server, at a hefty price. It uses WebDAV as one of the supported protocols to access the files. WebDAV resembles HTTP, except that it allows browsing folders and saving files to the server. Apache’s WebDAV module is a free implementation. You can keep your documents on a webserver, open them with Microsoft Office or OpenOffice and save them directly back to your webserver. In fact, you can put all sorts of files that you want to share with other people into the WebDAV folder. The result is very similar to the commercial offering of dropbox.com, except that you don’t need to install a client and you are not limited to 2GB.

Here is how I installed WebDAV on my own machine as root (adapted from the Debian website):

  1. Enable the Apache modules that handle DAV and authentication (Debian: “a2enmod dav dav_fs auth_digest”)
  2. Add this to your Apache configuration file (e.g. /etc/apache2/httpd.conf):
    Alias /dav /var/www/dav
    DAVLockDB /var/www/davLock/davLock

    <Directory /var/www/dav>
      AllowOverride None
      Options None
      Order allow,deny
      allow from all
    </Directory>

    <Location /dav>
      DAV On
      DAVMinTimeout 600
      AuthType Digest
      AuthName "webdav"
      AuthUserFile /etc/apache2/webdav-password
      Require valid-user
    </Location>
  3. Create the directories for the files and the lock files: “mkdir /var/www/dav /var/www/davLock”. Change the owner to the Apache user: “chown www-data /var/www/dav /var/www/davLock” to make both writable for the Apache daemon.
  4. Create the user and the password: “htdigest -c /etc/apache2/webdav-password webdav myuser”
  5. Restart apache: /etc/init.d/apache2 restart

You should then be able to access the folder from the Mac OS X Finder via “Go / Connect to Server” by entering your server address like http://myserver.com/dav with the username myuser and the password that you specified with htdigest.
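If you want to check the setup from a shell first, here is a quick test, assuming curl is installed and using the server name and user from above:

# list the DAV folder via a PROPFIND request with digest authentication
# (curl prompts for myuser's password)
curl --digest -u myuser -X PROPFIND -H "Depth: 1" http://myserver.com/dav/

If Apache answers with a multi-status XML document listing the folder contents, WebDAV is working.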

(To make WebDAV work with Windows, see the comments below.)

Creating Customised BioMarts from the Ensembl Database: A Step-by-Step Guide

Posted 15 Jul 2009 — by caseybergman
Category apache, biomart, ensembl, genome bioinformatics
In this blog post, we give a detailed description of constructing a mirror of the marts from the Ensembl database with custom data added to the existing Ensembl schemas. This enables us to extend the existing established marts — namely Sanger’s Ensembl Mart and VEGA Mart — with arbitrary supplementary data relating to genes and genomes.

BioMart is (i) a set of tools for denormalising database schemas in order to optimise database search queries and (ii) a framework for accessing the denormalised data via web browsers, Perl scripts or the DAS protocol. Its applications are manifold, since the construction of a mart also provides an easily accessible data source with interactive and programmatic interfaces. A full description of the system can be found in the recent publication on BioMart.
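To give a flavour of the programmatic access: once a mart’s web interface is running (see below), the same data can be queried over HTTP by posting an XML query to the martservice endpoint. The host, dataset and attribute names in this sketch are placeholders and must be replaced with values from your own mart:

# POST an XML query to the martservice endpoint; the result comes back as TSV
curl -d 'query=<?xml version="1.0" encoding="UTF-8"?><Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="0" count="" datasetConfigVersion="0.6"><Dataset name="hsapiens_gene_ensembl" interface="default"><Attribute name="ensembl_gene_id" /></Dataset></Query>' \
    http://your_web_host/biomart/martservice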

In order to construct a minimal mart and to make it accessible via web browser, the following software needs to be installed:

  • MySQL, PostgreSQL or Oracle database
  • Perl
  • Apache Web-Server and mod_perl
  • BioMart
    • biomart-perl (command line tools for web site creation and DAS server)
    • martj (GUIs supporting the mart construction and maintenance)
With the preliminary software installed and set up, a mart can be created in a few simple steps, outlined as follows:
  1. Use MartBuilder to create the mart’s data tables
    1. Select relevant tables and databases for denormalising
    2. If necessary, establish relations between tables manually, but MartBuilder usually does a good job of finding relations between tables automatically
    3. Generate and execute the SQL which generates the data tables
  2. Use MartEditor to populate the mart’s meta tables with human readable information
    1. Let MartEditor guess the structure of the meta data
    2. Augment the meta data with sensible titles, labels and descriptions
  3. Run configure.pl of biomart-perl to write the web server’s HTML files and scripts

In the following, we give a detailed step-by-step description on how to create BioMarts from schemas in the Ensembl database and how to extend the marts with custom data. The Ensembl database consists of various schemas for each species. For example, there are schemas for human, cat, mouse and dog, each of which has sets of species-specific tables about genes, their transcripts and translation products. From a naive point of view, a mart combines these schemas and denormalises the schemas’ tables into a single new schema — the mart — that can then be used to run optimised search queries.
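To make the effect of denormalisation concrete, the following sketch queries one wide mart table instead of joining gene, transcript and translation tables at query time. Both the table and column names here are only assumptions following BioMart’s dataset__content__main naming convention; the actual names depend on your datasets:

# purely illustrative query against a denormalised mart table
echo "SELECT ensembl_gene_id, ensembl_transcript_id FROM hsapiens_gene_ensembl__transcript__main LIMIT 10;" \
    | mysql -u username -p ensembl_mart_55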


Setting Up Perl, Libraries and BioMart-Perl

In order to set up a mart, it is necessary to create the mart’s web back-end using a Perl script, which itself relies on a number of Perl modules. BioMart runs under both Perl 5.8 and Perl 5.10. To prepare CPAN, the following commands should be executed in a shell:

# perl -MCPAN -e shell
cpan[1]> install Bundle::CPAN
cpan[2]> q
# cpan
cpan[1]> install Module::Build
cpan[2]> o conf prefer_installer MB
cpan[3]> o conf commit
cpan[4]> q

Now, the rest of the required modules can be installed by executing cpan x, where x stands for any of the following modules:

  • Spreadsheet::WriteExcel
  • Template::Constants
  • Log::Log4perl
  • Digest::SHA
  • IO::Compress::Gzip
  • Test::Exception
  • Template::Plugin::Number::Format
  • Exception::Class
  • CGI::Session::Driver::db_file
  • Spreadsheet::WriteExcel::Big
  • Number::Format
  • CGI::Session
  • Template
  • Readonly
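Alternatively, the whole list can be installed with a small shell loop (assuming cpan has been configured as described above and runs non-interactively):

# install every required module in one pass
for m in Spreadsheet::WriteExcel Template::Constants Log::Log4perl Digest::SHA \
         IO::Compress::Gzip Test::Exception Template::Plugin::Number::Format \
         Exception::Class CGI::Session::Driver::db_file Spreadsheet::WriteExcel::Big \
         Number::Format CGI::Session Template Readonly ; do
    cpan $m ;
done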

LibGD — for the dynamic creation of images

For the programmatic creation of images, the installation of LibGD is required, which can be downloaded from http://www.libgd.org/Downloads. The software comes in the form of a source code package, but its build and installation process is straightforward. Within the root directory of the extracted tar.gz (or tar.bz2) package, the following lines should be executed:

# ./configure
# make
# make install

Remark: configure may report that the JPEG library is missing; this is fine as long as everything else is found.

Expat

Expat is an XML parser, which is also needed by the BioMart software. It can be downloaded via http://sourceforge.net/project/showfiles.php?group_id=10127&package_id=10780 and installed using the following few lines:

# ./configure
# make
# make install

BioMart-Perl and MartJ

The installation of BioMart-Perl and MartJ is described in the full documentation of BioMart, which can be accessed via http://www.biomart.org/user-docs.pdf. Since there are only a few lines to execute, we repeat here the lines that can also be found in the documentation. First, BioMart-Perl is installed by checking out the software from a CVS server as follows:

# cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/biomart login
Password is CVSUSER
# cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/biomart \
    co -r release-0_7 biomart-perl

This will create a directory biomart-perl in which the BioMart-Perl software and example configuration files reside.

Second, MartJ can be downloaded from the URL ftp://anonymous@ftp.ebi.ac.uk/pub/software/biomart/martj_current/martj-bin.tgz and then extracted by:

tar xzf martj-bin.tgz

Tar will create a directory martj with the programs MartBuilder and MartEditor in the sub-directory bin. Each of the tools can be executed by running the respective shell scripts martbuilder.sh and marteditor.sh.

The construction of a mart using MartBuilder can consume large amounts of memory, and it may happen that the Java program terminates because it exceeds the imposed limit on the maximum heap size. In that case, it is necessary to edit martbuilder.sh and replace the parameters -Xmx512m and -Xms256m with larger values. We experienced no further memory allocation problems after replacing -Xmx512m with -Xmx2048m.
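For orientation, the edit amounts to changing the heap flags of the java invocation inside martbuilder.sh along the following lines; the remainder of the launch line is a placeholder, since it varies between MartJ releases:

# before (as shipped):
#   java -Xmx512m -Xms256m ... MartBuilder "$@"
# after (for large marts):
#   java -Xmx2048m -Xms256m ... MartBuilder "$@"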

Mart-Build

All Ensembl and VEGA marts are built using MartBuilder, with the exception of the sequence mart, which is created by using the now otherwise obsolete mart-build Perl scripts. The mart-build scripts can be obtained via Sanger’s CVS server too:

# cvs -d :pserver:cvsuser@cvsro.sanger.ac.uk:/cvsroot/biomart login
Password is CVSUSER
# cvs -d :pserver:cvsuser@cvsro.sanger.ac.uk:/cvsroot/biomart co mart-build

We will describe the actual sequence mart construction in detail below.

Running MySQL Database and Apache Web-Server

Describing how to set up MySQL and Apache’s web-server can easily fill books, and therefore we do not attempt to describe every little detail about the installation and configuration of those programs here. However, we highlight some interesting details that might provide some ease to the troubled bioinformatics researcher.

MySQL is usually started on larger systems using the wrapper mysqld_safe, whereas on smaller systems it might be sufficient to start the MySQL daemon directly using

mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data --user=username

Below, we will turn towards importing the Ensembl database, which is several hundred gigabytes large. In the default configuration of MySQL, the daemon will cater for database replication needs, i.e. the distribution of data from a master node to several slave nodes. In order to do this, the master node needs to write a ‘binary log-file’, which keeps track of every action performed on the master node. Unfortunately, this means that every line of data that we import into our database will also get logged (and hence duplicated) in one of the binary log files. For the whole Ensembl database, which is somewhere between 500 and 800 GB in size, this can mean that a 1 terabyte harddrive might prove to be too small under this default option. As such, it is advisable to turn off binary logging — unless you really use database replication — by editing /etc/my.cnf and commenting out the variable log-bin, i.e.

...
#log-bin=mysql-bin
...

The web-server has to be started with its root directory pointing to the biomart-perl installation directory. For example, if the biomart-perl directory was created by CVS in /usr/local, then httpd has to be invoked as follows:

httpd -d /usr/local/biomart-perl

However, this will not result in a functioning mart yet, since it is first necessary to import the Ensembl database, create the mart’s data tables and finally the mart’s meta-data tables. Once these steps are carried out, the web-server can be started. Also, if the mart is edited so that perl bin/configure.pl (see below) has to be re-run, the web-server has to be restarted as well.

Importing Ensembl in MySQL

As mentioned before, the complete Ensembl database is somewhere between 500 and 800 GB in size, so when we import the data into the database, we want to keep as much disk space free as possible. In the following, we describe how the compressed data files from ftp.ensembl.org can be streamed to the mysqlimport tool without actually uncompressing the data files to the hard drive.

We can obtain an archive file that contains all compressed data files for a particular Ensembl version by ftp. Even though the archive itself is uncompressed, i.e. it is an ordinary tar archive, its contents are gzipped:

# ftp ftp.ensembl.org
ftp> cd /pub
ftp> get release-55.tar

After the download is complete, the current working directory contains the file release-55.tar with the gzip-compressed contents of the respective ‘release-55’ directory on the ftp server. The file can then be extracted by executing

# tar xf release-55.tar

Since the tar file itself is not compressed, the extracted archive will take up the same disk space as the tar file.

The importer mysqlimport itself is not capable of reading data from a stream, so we have to install some additional software that enables us to stream data directly to mysqlimport. We use a custom script, which we present below, and we make use of the Perl module MySQL::Slurp for the actual input-stream redirection. It can be installed as follows:

# cpan MySQL::Slurp

The module itself is used in the following Perl code, which should be saved in a file called slurp.pl:

#!/usr/bin/perl

use MySQL::Slurp;

# MySQL credentials passed through to mysqlimport
my @import_args = ( '-u', 'username', '--password=userpassword' );

# import whatever arrives on STDIN into the given database and table
my $slurper = MySQL::Slurp->new(
    database => $ARGV[0],
    table    => $ARGV[1],
    method   => 'mysqlimport',
    args     => \@import_args
)->open;

$slurper->slurp();
$slurper->close;

We execute slurp.pl within the following bash script, import.sh, which should be run in the directory previously created by tar xf release-55.tar. Of course, we assume that slurp.pl is executable and on the shell’s PATH, so that it can be invoked from within the script successfully.

#!/bin/bash

# one sub-directory per schema: create the database, load its table
# definitions, then stream each compressed data file into the matching table
for i in * ; do
    echo "CREATE DATABASE $i;" | mysql -u username --password=user_password ;
    gunzip -c $i/*.sql.gz | mysql -u username --password=user_password $i ;
    for j in $i/*.txt.gz ; do
        gunzip -c $j | slurp.pl $i `basename $j .txt.gz` ;
    done ;
done

If only certain databases of species should be imported, then the for i in * ; do statement can be changed, for example, to for i in {homo_sapiens*,mus_musculus*} ; do.

Adding Custom Data

Custom data that is to appear in the final built mart has to be added to the tables of the Ensembl database. Anything related to the Ensembl mart has to go into the *_core_* schemas of the Ensembl database, whilst data related to the VEGA mart has to be added to the *_vega_* schemas. Since both marts are centred around gene identifiers, custom data can be added by inserting tables into these schemas with a first column labelled gene_id of type INTEGER. Data in the second, third and further columns will then be associated with the respective gene and its gene ID of the first column, where the type of the data only depends on how it will later be interpreted in the mart’s web-interface.
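As a concrete, entirely hypothetical example, a per-gene score could be attached to the human core schema as follows; adjust the schema name to your species and Ensembl release:

# the first column must be gene_id of type INTEGER; further columns carry the custom data
echo "CREATE TABLE my_gene_scores (gene_id INTEGER NOT NULL, my_score FLOAT, PRIMARY KEY (gene_id));" \
    | mysql -u username -p homo_sapiens_core_55_37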

Building a Mart from Ensembl Schema

From the Ensembl database, we are going to build the marts that are also provided through the BioMart web site. These marts are the following:

  • Ensembl Mart (focusing on genes)
  • (Gene) Feature Mart
  • (Gene) Ontology Mart
  • (Gene) Sequence Mart
  • VEGA Mart (focusing on genes)

For these marts (except for the sequence mart, see below), XML-files are provided by EBI that can be used with the MartJ tools to automatically build a mart, or alternatively the XML-files can be downloaded here:

  • calculated_data_patch.pl: for adding 3′/5′ UTR and CDS start, end and length data to a sequence mart
  • mart_53.zip: XML-files for creating marts relating to Ensembl version 53
  • mart_55.zip: XML-files for creating marts relating to Ensembl version 55
A (simplified) process of building a mart can be outlined as follows:

  1. create an empty schema for the mart (e.g., vega_mart_55)
  2. run MartBuilder
    1. import EBI’s XML-file
    2. generate SQL
  3. execute the SQL
  4. run MartEditor
    1. transfer the mart’s meta-data (which includes the labelling for the web-interface) from martdb.ensembl.org to the local mart

In the following, we describe the creation of a mart in detail, where we distinguish between preliminary mart construction steps, script-based mart construction steps and MartJ GUI-based mart construction steps.

Preliminary Mart Construction

For all marts, it is necessary to create an empty schema in MySQL first. For the marts we have listed above, the following lines have to be executed:

# echo "CREATE DATABASE ensembl_mart_55;\n" | mysql -u username -p# echo "CREATE DATABASE genomic_features_mart_55;\n" | mysql -u username -p# echo "CREATE DATABASE ontology_mart_55;\n" | mysql -u username -p# echo "CREATE DATABASE sequence_mart_55;\n" | mysql -u username -p# echo "CREATE DATABASE vega_mart_55;\n" | mysql -u username -p

Both the Ensembl mart and the feature mart require empty database schemas called master_schema_55 and master_schema_variation_55, which are used as templates in their mart creation process. These schemas can be generated by extracting the SQL commands of any species database that features core and/or variation schemas:

# gzip -cd microcebus_murinus_core_55_*.sql.gz > master_schema_55.sql
# echo "CREATE DATABASE master_schema_55;" | mysql -u username -p
# cat master_schema_55.sql | mysql -u username -p master_schema_55
# gzip -cd anopheles_gambiae_variation_55_*.sql.gz > master_schema_variation_55.sql
# echo "CREATE DATABASE master_schema_variation_55;" | mysql -u username -p
# cat master_schema_variation_55.sql | mysql -u username -p master_schema_variation_55

Ensembl-, Feature-, Ontology- and VEGA-Mart Construction

Ensembl marts, gene feature marts, gene ontology marts and VEGA marts are built using the MartBuilder and MartEditor applications, following the steps described below. MartBuilder is a tool to create and populate the data behind a mart, whilst MartEditor allows us to edit the meta-information regarding the data tables, which is used by the web server to display meaningful names for filters and attributes. First, we describe the construction of a mart’s data tables with MartBuilder, and second, we go through the process of importing meta-data into said marts.

Each of the mentioned marts is associated with an XML-file that describes the mart’s structure, i.e. the relationships between tables in the database and information about which table columns should be condensed into the mart. The files ensembl_55.xml, features_55.xml, ontology_55.xml and vega_55.xml, which are provided by EBI, cannot be used directly in MartBuilder, since these files contain settings that are specific to EBI’s database infrastructure. In order to use these files in your own mart construction process, it is necessary to change occurrences of the hostname, port number, username and password in the XML files, where the username and password refer to the MySQL username and password. In order to change these parameters semi-automatically, the following commands can be used:

cat ensembl_55.xml \
    | sed -e 's/outputPort="18001"/outputPort="your_mysql_port"/g' \
    | sed -e 's/overrideHost="ens-staging"/overrideHost="your_mysql_host"/g' \
    | sed -e 's/overridePort="3306"/overridePort="your_mysql_port"/g' \
    | sed -e 's/localhost:43306/your_mysql_host:your_mysql_port/g' \
    | sed -e 's/username="joebloggs"/username="username"/g' \
    | sed -e 's/password=""/password="your_password"/g' > ensembl_55.xml.new ;
mv ensembl_55.xml.new ensembl_55.xml

The XML-files can then be loaded into MartBuilder, one after the other, and with the following menu selections, the data tables for a mart can be built. However, MartBuilder itself relies on MartRunner to execute the SQL commands that it generates, and therefore, it is necessary to start MartRunner before the invocation of MartBuilder. MartRunner’s executable is located in the same directory as MartBuilder and MartEditor, and can be started using the command line ./martrunner.sh 1500, which sets up the MartRunner server listening on port 1500 for incoming SQL jobs. It is then possible to continue the mart construction in MartBuilder:

  1. File -> Open…: load an XML-file that describes a mart
  2. Mart -> Update all schemas: if additional tables have been added to the source schemas or the XML-file is describing an older version of the mart (for example, XML-files for a mart version 53 whilst using Ensembl >53), then it is necessary to bring the mart description up to date with the actual source tables and their relationships to each other
  3. Schema -> Accept changes: ‘Update all schemas’ may lead to changes in the mart (highlighted in thick green), which have to be explicitly accepted
  4. Mart -> Generate SQL: choose MartRunner and set hostnames and ports accordingly

MartBuilder now opens a separate window that is an interface to MartRunner, where the following actions should be taken:

  1. Select the last entry (should be blue) in the Jobs available column and click on Start job
  2. The construction of the mart is finished when the job entry changes to green. If there is any error during the mart creation, then the entry will appear red.

Sequence Marts

Sequence marts are not created using the MartJ tools; instead the otherwise obsolete mart-build scripts are used (see above). In order to successfully build a sequence mart, it is required to install the latest GO-term database first. It can be downloaded via http://archive.geneontology.org/latest-termdb/go_daily-termdb-tables.tar.gz and then added to the MySQL database using the command lines:

# tar xzf go_daily-termdb-tables.tar.gz
# cd go_daily-termdb-tables
# echo "CREATE DATABASE go_termdb;" | mysql -u username -p
# cat *.sql | mysql -u username -p go_termdb
# mysqlimport -u username -p -L go_termdb *.txt

The data behind the sequence mart can then be generated as follows:

# cd mart-build/scripts/ensembl
# export ENSMARTHOST=your_mysql_host
# export ENSMARTPORT=your_mysql_port
# export ENSHOME=/usr/local/ensembl
# export ENSMARTUSER=username
# export ENSMARTPWD=your_password
# export ENSMARTDRIVER=mysql
# export PATH=.:$PATH
# export PERL5LIB=$PERL5LIB:/usr/local/ensembl/modules
# export PERL5LIB=$PERL5LIB:/usr/local/mart-build/modules
# ./make_all_dna_chunks.pl sequence_mart_55 55 0

In order to add 3′ and 5′ UTR as well as CDS start, end and length data to the sequence mart, it is necessary to run the script calculated_data_patch.pl, which was provided to us by EBI as well. The script will create a directory calculated_data, in which an SQL-file is generated for each species containing instructions for adding said data to the sequence mart. Database-specific settings have to be made directly within the script, where the line

$registry->load_registry_from_db( -host => 'host',
                                  -user => 'user' );

should be replaced by appropriate values, for example:

$registry->load_registry_from_db( -host => 'your_host',
                                  -user => 'username',
                                  -pass => 'your_password' );

The extra data is added to the sequence mart by executing:

# cd calculated_data
# for i in *.sql ; do cat $i | mysql -u username --password=your_password ; done

Meta-Data Transfer

Before the web interface to a mart can be generated, it is necessary to annotate the data tables in the mart with human readable descriptions and, additionally, provide information about which table columns can be queried and filtered. Since the construction of the meta-data from scratch is a lengthy process, we transfer the mart meta-data from martdb.ensembl.org to the local database using MartEditor.

During the meta-data transfer, it will become necessary to save the meta-data from a mart on martdb.ensembl.org to the hard-drive before it can be uploaded into the local mart. MartEditor will save the meta-data in the form of XML-files, which should not be confused with the XML-files we used to construct the data of the mart using MartBuilder.
In MartEditor, the following steps should be carried out in order to transfer the meta-data of an existing mart:

  1. File -> Database Connection: connect to the mart on martdb.ensembl.org
  2. File -> Save All: specify a directory in which the mart’s meta-data is stored in XML files
  3. File -> Database Connection: connect to the same local mart
  4. File -> Upload All: select the previously saved XML files
  5. File -> Database Connection: connect to the mart on martdb.ensembl.org
  6. File -> Import
  7. File -> Database Connection: connect to the local mart
  8. File -> Export
  9. File -> Exit
In case custom tables were added to the database schemas before the mart construction, the imported meta-data from martdb.ensembl.org will obviously not contain information about those attributes and filters. Meta-data for custom data has to be added manually in MartEditor as follows:
  1. File -> Database Connection: connect to the local mart
  2. File -> Update All: synchronise the data- and meta-tables of the mart
  3. File -> Import
  4. in the newly opened window: look for your attributes and filters under ‘FilterPage: new_filters’ and ‘AttributePage: new_attributes’ and move them into existing filter-/attribute-groups or create your own. Then add human readable descriptions accordingly.
  5. File -> Export
  6. File -> Exit
It is important that the names chosen for an attribute and its respective filter coincide; otherwise your mart queries will fail with a lengthy error message.

Web-Site Construction

Finally, it is possible to create the mart’s web-site by generating the necessary HTML and script files. First, though, a configuration file conf/martname.xml has to be created in the biomart-perl directory, which contains information about the mart’s name and database connection. The username and password in the XML file should correspond to a user with read-only privileges for the database, since the mart should only be readable through the web-interface and the database user’s permissions should reflect that for security reasons too.
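biomart-perl ships example registry files in its conf directory, which should be taken as the authoritative template. As a rough sketch only — the element and attribute names below follow the 0.7 examples, and every value is a placeholder — the file could be created like this:

cat > conf/martname.xml <<'EOF'
<MartRegistry>
  <MartDBLocation
    name         = "ensembl"
    displayName  = "Ensembl Mart 55"
    databaseType = "mysql"
    host         = "your_mysql_host"
    port         = "3306"
    database     = "ensembl_mart_55"
    schema       = "ensembl_mart_55"
    user         = "readonly_user"
    password     = "your_password"
    visible      = "1"
    default      = "1" />
</MartRegistry>
EOF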

Given the configuration file martname.xml, we can now execute

perl bin/configure.pl --clean -r conf/martname.xml

The question Do you want to install in API only mode [y/n] [n]: that appears during the run should be answered with ‘no’, which can be achieved by simply hitting return. After the script has completed, the web-server can be started as follows:

/usr/local/httpd-2.2.11/httpd -d /usr/local/biomart-perl

All BioMarts should be up and running now.

Acknowledgements

We would like to thank Rhoda Kinsella and Syed Haider at EBI for advice and support which they kindly provided throughout our introduction to using the BioMart system.