Archive for the ‘linux’ Category

Automating ASCP-based submission of NGS data to ENA using expect

Posted 12 Dec 2014 — by caseybergman
Category EBI, filesharing, genome bioinformatics, high throughput sequencing, linux, NCBI

Submitting next-generation sequencing (NGS) data to International Nucleotide Sequence Database Collaboration repositories such as EBI or NCBI is an essential, but time-consuming, phase of many projects in modern biology. Therefore, mechanisms to speed up NGS data submission are a welcome addition to many biologists toolkit.  The European Nucleotide Archive (ENA) provides detailed instructions on how to submit NGS data to their “dropbox” staging area, which is a key part of the overall submission process for an NGS project. This can be done in three ways:

  • Using ENA’s Webin client in your web browser,
  • Using a standard FTP client at the command line, or
  • Using Aspera’s ASCP client at the command line.

For users with many samples to submit, the latter two command line options are clearly preferred methods. While FTP is installed by default on most linux systems, transfer of large NGS data files by FTP is slow, and on some systems (such as ours) FTP is specifically disabled because of security concerns.

In contrast, ASCP is not installed by default on most linux systems, but provides a very fast method to transfer large data files to EBI or NCBI. One of the downsides of using ASCP is that it interactively prompts users for password information for each file transferred. This requires babysitting your ASCP-based command line submission and supplying the same password for each file, thereby undermining much of the automation that a command line solution should provide.

In searching around for solutions to the ASCP-babysitting problem, I stumbled on documentation page entitled “Expect script for automating Aspera uploads to the EBI ENA” written by Robert Davey at the The Genome Analysis Centre. I had never heard of the expect scripting language prior to reading this post, but it provides exactly the solution I was looking for:

[Expect] is used to automate control of interactive applications such as telnet, ftp, passwd, fsck, rlogin, tip, ssh, and others. Expect uses pseudo terminals (Unix) or emulates a console (Windows), starts the target program, and then communicates with it, just as a human would, via the terminal or console interface. (Wikipedia)

Robert’s expect script was a little more complicated than I needed for my purposes, and required a few tweaks to conform to updates to EBI’s ASCP submission process. Without too much trouble, I cobbled together a modified version that solves the automated transfer of any number of data files to the top level of your ENA dropbox:

set fofn [lindex $argv 0]
set dropbox [lindex $argv 1]
set pass [lindex $argv 2]

set files [open $fofn]
set subs [read $files]

set direxist 0
set timeout -1
foreach line [split $subs \n] {
  if { "" != $line } {
    set seqfile [exec basename $line]
    set lst [split $line "/"]
    spawn ascp -QT -l80M -d $line $
    expect "Password:"
    send "$pass\r"
    expect eof

This script requires expect and ASCP to be installed globally on your system, and for the user to provide three arguments:

  • a file of filenames (with the full path) to the files you would like to submit to ENA
  • the ID for your Webin submission account, and
  • the password for your Webin submission account

For example, if you have a directory of gzip’ed fastq files that you would like to submit to ENA, all you would need to submit your files in bulk would be to navigate to that directory and do something like the following:

#download the script above from github
wget --no-check-certificate

#create file of filenames
for k in `ls *gz`; do echo $j/$k; done > fofn

#perform ASCP submission, note: replace with your ENA (Webin ID and Password)
expect ena_submit.exp fofn Webin-ID Webin-Password

Customizing a Local UCSC Genome Browser Installation III: Loading Custom Data

Posted 29 Mar 2012 — by timburgis
Category database, drosophila, genome bioinformatics, linux, UCSC genome browser

In the second post of this series we examined the grp and trackDb tables within an organism database in the UCSC genome browser schema, and how they are used to build a UCSC genome browser display by specifying: (i) what sort of data a track contains, (ii) how it should be displayed, and (iii) how tracks should be organized within the display. In this last post of three, we will look at how to load and subsequently display custom data into a UCSC genome browser instance. As an example, we’ll be using a set of data from the modENCODE project formatted for loading into the UCSC browser. For this you’ll need a working browser install and the dm3 organism database (see the first post in this series for details).

An important feature of the genome browser schema that permits easy customization is that you can have multiple grp and trackDb tables within an organism database.  This allows separation of tracks for different purposes, such as trackDb_privateData, trackDb_chipSeq, trackDb_UserName etc. The same goes for grp tables. Thus, you can pull in data from the main UCSC site, leave it untouched, and simply add new data by creating new trackDb and grp tables.

This post outlines the five steps for adding custom data into a local UCSC genome browser:

1) Getting the data into the correct format
2) Building the required loader code
3) Loading the data
4) Configuring the track’s display options
5) Creating the new trackDb and grp tables and configuring the browser to use it.

1) The UCSC browser supports a number of different data formats, which are described at the UCSC format FAQ. Defining where your data comes from, and what sort of data it is (e.g. an alignment, microarray expression data etc.) will go a long way towards determining what sort of file you need to load. In many cases your data will already be in one of the supported formats so the decision will have been made for you. If your data isn’t in one of the supported formats, you’ll need to decide which one is the most appropriate and then reformat the data using a suitable tool, e.g. a perl or bash script. The example data we will be loading is taken from this paper. The supplementary data contains an Excel spreadsheet that we need to reformat into a number of BED files and then load into the UCSC database.

2) Each of the data formats that can be loaded into the browser system has its own loader program. These are found in the same source tree that we used to build the browser CGIs. You will find the source code for the loaders in the Kent source tree, the BED loader is in the directory kent/src/hg/makeDb/hgLoadBed. Assuming you have followed the instructions in the first part of this blog post then to compile the BED loader you simply cd to its directory and type make.

cd kent/src/hg/makeDb/hgLoadBed

The hgLoadBed executable will now be in ~/bin/$MACHTYPE. If you haven’t already it is a good idea to add this directory to your path, add the following line to ~/.bashrc:

export PATH=$PATH:~/bin/$MACHTYPE

The other loader we’ll require for the example custom data included in this blog is hgBbiDbLink:

cd kent/src/hg/makeDb/hgBbiDbLink

3) To load data, e.g. a BED file, the command is simply:

hgLoadBed databaseName tableName pathToBEDfile

The data-loaders supplied with the browser can be divided into two broad groups. The first of these, as represented by hgLoadBed, actually load the data into a table within the organism database. For example, one of the histone-modification datasets looks like this as a BED file:

chr2L   5982    6949    ID=Embryo_0_12h_GAF_ChIP_chip.GAF       5.54
chr2L   65947   68135   ID=Embryo_0_12h_GAF_ChIP_chip.GAF       10.54
chr2L   72137   73344   ID=Embryo_0_12h_GAF_ChIP_chip.GAF       6.55
chr2L   107044  109963  ID=Embryo_0_12h_GAF_ChIP_chip.GAF       8.93
chr2L   127917  130678  ID=Embryo_0_12h_GAF_ChIP_chip.GAF       9.17

When this BED file is loaded using hgLoadBed it produces a table that has the following structure:

DESC MdEncHistonesGAFBG3DccId2651;
| Field      | Type                 | Null | Key | Default | Extra |
| bin        | smallint(5) unsigned | NO   |     | NULL    |       |
| chrom      | varchar(255)         | NO   | MUL | NULL    |       |
| chromStart | int(10) unsigned     | NO   |     | NULL    |       |
| chromEnd   | int(10) unsigned     | NO   |     | NULL    |       |
4 rows in set (0.00 sec)

And contains (results set truncated):

select * from MdEncHistonesGAFBG3DccId2651;
| 754 | chrX     |   22237479 | 22239805 |
| 754 | chrX     |   22255450 | 22257285 |
| 754 | chrX     |   22257996 | 22259538 |
| 754 | chrX     |   22259653 | 22261174 |
| 754 | chrX     |   22264098 | 22265194 |
| 585 | chrYHet  |      84149 |    85164 |
| 585 | chrYHet  |     104864 |   105879 |
| 586 | chrYHet  |     150067 |   151212 |
5104 rows in set (0.06 sec)

The second type of loader doesn’t insert the data into a database table directly. Instead data in formats such as bigWig and bigBed are deposited in a particular location on the filesystem and a table with a single row simply points to them. For example:

desc mdEncReprocCbpAdultMaleTreat;
| Field    | Type         | Null | Key | Default | Extra |
| fileName | varchar(255) | NO   |     | NULL    |       |
1 row in set (0.00 sec)

Which contains:

select * from mdEncReprocCbpAdultMaleTreat;
| fileName                                     |
| /gbdb/dm3/modencode/ |
1 row in set (0.02 sec)

The loader hgBbiDbLink is responsible for producing this simple table that points to the location of the data on the filesystem. When the browser comes to display a track that is specified as bigWig in trackDb’s type field it knows that the table specified in trackDb’s tableName field will give point it at the bigWig file rather than containing the data itself. Supposed we have placed our bigWig file in the /gbdb directory:

cp /gbdb/dm3/modencode/.

Then we must execute hgBbiDbLink as follows:

hgBbiDbLink databaseName desiredTableName pathToDataFile


hgBbiDbLink dm3 mdEncReprocCbpAdultMaleTreat /gbdb/dm3/modencode/

4) Once the data is loaded into the database we need to tell the browser how it should be displayed, this is the role of the trackDb table. We create a new trackDb table for our data using the command hgTrackDb. This is found in the same branch of the source tree and the individual loaders themselves. The way in which tracks and grouped and displayed by the browser is specified in a file which you pass to hgTrackDb called a .ra file, which specifies the contents of the different fields in the trackDb table, as well as the parent-child relationship for composite tracks. The example data provided in this blog post contains a trackDb.ra file that looks like this:

track dnaseLi
shortLabel DNAse Li 2011
longLabel DNAse Li Eisen Biggin Genome Biol 2011
group blog
priority 100
visibility hide
color 200,20,20
altcolor 200,20,20
compositeTrack on
type bed 3

    track dnaseLiStage5
    shortLabel Stage 5
    longLabel Li Dnase Stage 5
    parent dnaseLi
    priority 1

    track dnaseLiStage9
    shortLabel Stage 9
    longLabel Li Dnase Stage 9
    parent dnaseLi
    priority 2

    track dnaseLiStage10
    shortLabel Stage 10
    longLabel Li Dnase Stage 10
    parent dnaseLi
    priority 3

    track dnaseLiStage11
    shortLabel Stage 11
    longLabel Li Dnase Stage 11
    parent dnaseLi
    priority 4

    track dnaseLiStage14
    shortLabel Stage 14
    longLabel Li Dnase Stage 14
    parent dnaseLi
    priority 5

The example data (e.g. /root/ucsc_example_data) comes from 5 different stages of embryonic development so we have 5 BED files. The trackDb.ra file groups these as a composite track so we have the top level track (dnaseLi) which is the parent of the 5 individual tracks. To load these definitions into the database we execute hgTrackDb as follows:

~/bin/i386/hgTrackDb drosophila dm3 trackDb_NEW kent/src/hg/lib/trackDb.sql /root/ucsc_example_data

The complete set of commands to download the example data and load it are is therefore:

tar xzvf blogData.tar.gz
cd ucsc_example_data
~/bin/i386/hgLoadBed dm3 dnaseLiStage5 stage5.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage9 stage9.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage10 stage10.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage11 stage11.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage14 stage14.bed
~/bin/i386/hgTrackDb drosophila dm3 trackDb_NEW kent/src/hg/lib/trackDb.sql /root/ucsc_example_data

5) Tell the browser to use our new trackDb table by editing the hg.conf configuration file in /var/www/cgi-bin and adding the new trackDB table name to the db.trackDb line. The db.TrackDb line takes a comma-separated list of trackDb tables that the browser will display. Assuming that our new trackDb table is called trackDb_NEW then we must edit hg.conf from this:


to this:


You’ll notice in the supplied trackDb.ra file that the tracks belong to a group called ‘blog’. This group doesn’t currently exist in the database. There are two ways we can go about rectifying this. The first is to add blog to the existing grp table like so:

INSERT INTO grp (name,label,priority) VALUES('blog','Example Data',11);

Or you can create an entirely new grp table. The browser handles multiple grp tables in the same fashion as multiple trackDb tables, i.e. they are specified as a comma-separated list in hg.conf. To create a new grp table, execute:

INSERT INTO grp_NEW (name,label,priority) VALUES('blog','Example Data',11);

And then edit hg.conf to add grp_NEW to the db.grp line:


If you now refresh your web browser the cgi scripts should create a new session with the new data tracks displayed as configured in the grp table.

Customizing a Local UCSC Genome Browser Installation II: The UCSC Genome Browser Database Structure

Posted 29 Mar 2012 — by timburgis
Category database, drosophila, genome bioinformatics, linux, UCSC genome browser

This is the second of three posts on how to set up a local mirror of the UCSC genome browser hosting custom data. In the previous post, I covered how to perform a basic installation of the UCSC genome In this post, I’ll provide an overview of the mySQL database structure that serves data to the genome browser using the Bergman Lab mirror ( as an example. In the next post, I focus on how to load custom data into a local mirror.

General databases

hgcentral – this is the central control database of the browser. It holds information relating to each assembly available on the server, organizing them into groups according to their clade and organism. For instance the mammalian clade may contain human and mouse assemblies, and for mouse we may have multiple assemblies (e.g. mm9, mm8 etc.).

The contents of the clade, genomeClade, dbDb and defaultDb tables control the three dropdown menus on the gateway page of the browser. The clade and genomeClade tables have a priority field that controls the order of the dropdown menus (see figure 1).

The defaultDb table specifies which assembly is the default for a particular organism. This is usually the latest assembly e.g. mm9 for mouse. You can alter this if, for whatever reason, you want to use a different assembly as your default:

UPDATE defaultDb SET name = 'mm8' WHERE genome = 'mouse';

Databases for each organism/assembly are listed in the dbDb table within hgcentral (see more below). For example the latest mouse assembly (mm9) has an entry that contains information relating to this build, an example query (not retrieving all the information contained within dbDb) looks like this:

> select name,description,organism,sourceName,taxId from dbDb where name = 'mm9';</pre>
 | name | description | organism | sourceName | taxId |
 | mm9 | July 2007 | Mouse | NCBI Build 37 | 10090 |

hgFixed – this database contains a bunch of human related information, predominately microarray stuff. If you are running a human mirror then you may wish to populate this database. If not, then it can be left empty.

Genome databases

We will now look at the structure of a genome database for a specific organism/assembly using mm9 as an example.

The first table to consider is chromInfo, this simple table specifies the name of each chromosome, its size and the location of its sequence file. Mouse has 19 chromosomes (named chr1 – chr19) plus the two sex chromosomes (chrX and chrY), mitochondrial DNA (chrM) and unmapped clone contigs (chrUn). If we list the tables within our minimal mm9 install we will see that most of the tables can clearly be identified as belonging to a particular chromosome by virtue of their names, e.g. all the tables with names starting chr10_* contain data relating to chromosome 10.

Of the remainder, the two tables of most interest to understanding how the browser works, and leading us towards the ability to load custom data, are the grp and trackDb tables. These two tables control the actual genome browser display, as we discussed above for the hgcentral tables that control the gateway web interface. Not surprisingly, the trackDb table holds information about all the data tracks you can see (see figure 2), and grp tells the browser how all these tracks should be organized (e.g. we may want a separate group for different types of experimental data, another for predictions made using some computational technique etc.).

Figure 2 shows a screenshot of the mm9 browser we developed in the Bergman Lab as part of the CISSTEM project.

Here, we can see that the tracks are grouped into particular types (Mapping & Sequence Tracks, Genes and Gene Prediction tracks etc.) – these are specified in mm9’s grp table and, like the dropdowns on the gateway page, ordered using a priority field

> select * from grp order by priority asc
| name        | label                            | priority |
| user        | Custom Tracks                    |        1 |
| map         | Mapping and Sequencing Tracks    |        2 |
| genes       | Genes and Gene Prediction Tracks |        3 |
| rna         | mRNA and EST Tracks              |        4 |
| regulation  | Expression and Regulation        |        5 |
| compGeno    | Comparative Genomics             |        6 |
| varRep      | Variation and Repeats            |        7 |
| x           | Experimental Tracks              |       10 |
| phenoAllele | Phenotype and Allele             |      4.5 |

The tracks that constitute these groups are specified in the trackDb table, we can see the tracks for a particular group like so:

> select longLabel from trackDb where grp = 'map' order by priority asc;
| longLabel                                                                   |
| Chromosome Bands Based On ISCN Lengths (for Ideogram)                       |
| Chromosome Bands Based On ISCN Lengths                                      |
| Mapability - CRG GEM Alignability of 36mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 40mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 50mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 75mers with no more than 2 mismatches  |
| STS Markers on Genetic and Radiation Hybrid Maps                            |
| Mapability - CRG GEM Alignability of 100mers with no more than 2 mismatches |
| Physical Map Contigs                                                        |
| Assembly from Fragments                                                     |
| Gap Locations                                                               |
| BAC End Pairs                                                               |
| Quantitative Trait Loci From Jackson Laboratory / Mouse Genome Informatics  |
| GC Percent in 5-Base Windows                                                |
| Mapability or Uniqueness of Reference Genome                                |

These two tables provide information to the browser to know which groups to display, which tracks belong to a particular group, and in which order they should be displayed. The remaining fields in the trackDb table fill in the gaps: what sort of data the track contains and how the browser is to display it. I won’t say anymore about trackDb here – we will naturally cover the details of how trackDb functions when we talk about loading custom data in the next post.

You will probably have noticed that there is not an exact equivalence between the query above and the tracks in the browser (e.g. the query returns a number of Mapability tracks whereas there is only a single Mapability dropdown in the browser). This is because some tracks are composite tracks — if you click Mapability in the browser you will see that the multiple tracks are contained within it. Composite tracks are specified in trackDb, we will investigate this in the next part of this blog post.

[This tutorial continues in Part III: Loading Custom Data]

Customizing a Local UCSC Genome Browser Installation I: Introduction and CentOS Install

Posted 29 Mar 2012 — by timburgis
Category database, drosophila, genome bioinformatics, linux, UCSC genome browser

The UCSC Genome Bioinformatics site is widely regarded as one of the most powerful bioinformatics portals for experimental and computational biologists on the web. While the UCSC Genome Browser provides excellent functionality for loading and visualizing custom data tracks, there are cases when your lab may have genome-wide data it wishes to share internally or with the wider scientific community, or when you might want to use the Genome Browser for an assembly that is not in the main or microbial Genome Browser sites. If so then you might need to install a local copy of the genome browser.

A blog post by noborujs on E-notações explains how to install the UCSC genome browser on an Ubuntu system, and the UCSC Genome genome browser team provides a general walk-through on their wiki and an internal presentation. Here I will provide a similar walkthrough for installing it on a CentOS system. The focus of these posts is go beyond the basic installation to explain the structure of the databases that underlie the browser; how these database table are used to create the web front-end, and how to customize a local installation with your own data. Hopefully having a clearer understanding of the database and browser architecture should make the process of loading your own data far easier.

This blog entry has grown to quite a size, so I’ve decided to split it into 3 more manageable parts and you can find a link to the next part at the end of this post.

The 3 parts are as follows:

1) Introduction and CentOS install (this post)
2) The databases & web front-end
3) Loading custom data

The walk-through presented here installs CentOS 5.7 using VirtualBox so you can follow this on your desktop if you have sufficient disk space. Like the Ubuntu walk-through linked to above, this will speed you through the install process with little or no reference to what the databases actually contain, or how they relate to the browser, this information will be included in part 2.

CentOS Installation

• Grab CentOS 5.7 i386 CD iso’s from one of the mirrors. Discs 1-5 are required for this install.
• Start VirtualBox and select ‘New’. Settings chosen:
• Name = CentOS 5.7
• OS = linux
• Version = Red Hat
• Virtual memory of 1GB
• Create a new hard disk with fixed virtual storage of 64GB
• Start the new virtual machine and select the first CentOS iso file as the installation source
• Choose ‘Server GUI’ as the installation type when prompted
• Set the hostname, time etc. and make sure you enable http when asked about services
• When the install prompts you for a different CD, select “Devices -> CD/DVD devices -> Choose a virtual CD/DVD disk file” and choose the relevant ISO file.
• Login as root, run Package Updater, reboot.

UCSC Browser Install

• Install dependencies:

yum install libpng-devel
yum install mysql-devel

• Set environment variables, the following were added to root’s .bashrc:

export MACHTYPE=i386
export MYSQLLIBS="/usr/lib/mysql/libmysqlclient.a -lz"
export MYSQLINC=/usr/include/mysql
export WEBROOT=/var/www

Each time you open a new terminal these environment variables will be set automatically.

• Download the Kent source tree from, unpack it and then build it:

mkdir -p ~/bin/$MACHTYPE # only used when you build other utilities from the source tree
cd kent/src/lib
cd ../jkOwnLib
cd ../hg/lib
cd ..

You will now have all the cgi’s necessary to run the browser in /var/www/cgi-bin and some JavaScript and CSS in /var/www/html. We need to tell SELinux to allow access to these:

restorecon -R -v '/var/www'

• Grab the static web content. First edit kent/src/product/scripts/browserEnvironment.txt to reflect your environment (MACHTYPE, DOCUMENTROOT etc.) then

cd kent/src/product/scripts
./ browserEnvironment.txt

• Create directories to hold temporary files for the browser:

mkdir $WEBROOT/trash
chmod 777 $WEBROOT/trash
ln -s $WEBROOT/trash $WEBROOT/html/trash

• Set up MySQL for the browser. We need an admin user with full access to the databases we’ll be creating later, and a read-only user for the cgi’s.


  ON hgcentral.*
  TO ucsc_admin@localhost
  IDENTIFIED BY 'admin_password';

  ON hgcentral.*
  TO ucsc_browser@localhost
  IDENTIFIED BY 'browser_password';

The above commands will need repeating for each of the databases that we subsequently create.

We also need a third user that has read/write permissions to the hgcentral  database only:

  ON hgcentral.*
  TO ucsc_readwrite@localhost
  IDENTIFIED BY 'readwrite_password';

Note: You should replace the passwords listed here with something sensible!

The installation README (path/to/installation/readme) suggests granting FILE to the admin user, but FILE can only be granted globally (i.e. GRANT FILE ON *.*) , so we must do this as follows:

GRANT FILE ON *.* TO ucsc_admin@localhost;

We now have the code for the browser and a database engine ready to receive some data. As an example, getting a human genome assembly installed requires the following:

• Create the primary gateway database for the browser:

mysql -u ucsc_admin -p  -e "CREATE DATABASE hgcentral"
mysql -u ucsc_admin -p hgcentral < hgcentral.sql

• Create the main configuration file for the browser hg.conf and save it in /var/www/cgi-bin:

cp kent/src/product/ex.hg.conf /var/www/cgi-bin/hg.conf

Then edit /var/www/cgi-bin/hg.conf to reflect the specifics of your installation.

• Admin users should also maintain a copy of this file saved as ~/.hg.conf, since the data loader applications look in your home directory for hg.conf. It is a good idea for ~/.hg.conf to be made private (i.e. only the owner can access it) otherwise your database admin password will be accessible by other users:

cp /var/www/cgi-bin/hg.conf ~/.hg.conf
chmod 600 ~/.hg.conf

When we issue commands to load custom data (see part 3) it is this copy of hg.conf that will supply the necessary database passwords.

The browser is now installed and functional, but will generate errors because the databases specified in the hgcentral database are not there yet. The gateway page needs a minimum human database in order to function even if the browser is being built for the display of other genomes.

To install a human genome database, the minimal set of mySQL tables required within the hg19 database is:
• grp
• trackDb
• hgFindSpec
• chromInfo
• gold – for performance this table is split by chromosome so we need chr*_gold*
• gap – split by chromosome as with gold so we need chr*_gap*

To install minimal hg19:

  ON hg19.*
  TO ucsc_admin@localhost
  IDENTIFIED BY 'admin_password';

  ON hg19.*
  TO ucsc_browser@localhost
  IDENTIFIED BY 'browser_password';

mysql -u ucsc_admin -p -e "CREATE DATABASE hg19"
cd /var/lib/mysql/hg19
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/grp* .
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/trackDb* .
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/hgFindSpec* .
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/chromInfo* .
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/chr*_gold* .
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/chr*_gap* .

The DNA sequence can be downloaded thus:

cd /gbdb
mkdir -p hg19/nib
cd hg19/nib
rsync -avP rsync://*.nib .

Again we will need to tell SELinux to allow the webserver to access these files:

semanage fcontext -a -t httpd_sys_content_t "/gbdb"

Create hgFixed database:

  ON hgFixed.*
  TO ucsc_admin@localhost
  IDENTIFIED BY 'admin_password';

  ON hgFixed.*
  TO ucsc_browser@localhost
  IDENTIFIED BY 'browser_password';

mysql -u ucsc_admin -p -e "create database hgFixed"

The browser now functions properly and we can browse the hg19 assembly.

The 3rd part of this blog post looks at loading custom data, for this we will use some Drosophila melanogaster data taken from the modENCODE project. Therefore we will need repeat the steps we used to mirror the hg18 assembly to produce a local copy of the dm3 database. Alternatively, if you have sufficient disk space on your virtual machine you can grab all the D.melanogaster data like so:

mysql -u ucsc_admin -p -e "create database dm3"
cd /var/lib/mysql/dm3
rsync -avP rsync://* .
cd gbdb
mkdir dm3
cd dm3
rsync -avP rsync://* .

At present these commands will download approximately 30Gb.

[This tutorial continues in Part 2: The UCSC genome browser database structure]

Iterating over a list of tuples in bash

Posted 21 Jul 2010 — by maxhaussler
Category bash, linux

I’ve shown previously how to rename many files on the command line. That solution required a 2nd textfile in the format oldname<space>newname<return>.

Sometimes it is easier if there is no external file and we simply want to iterate over a list that is part of the script. The command to use in this case is called set, which will split on the character in the variable IFS (=internal field separator?).


for i in a,b c,d ; do IFS=","; set $i; echo $1 $2; done

Will output:

a b
c d

Speeding-Up WordNet::Similarity

Posted 10 Jun 2010 — by caseybergman
Category linux, text mining

One nifty Perl module is the WordNet::Similarity module, which provides a number of similarity measures between terms found in WordNet. WordNet::Similarity can be either used from the shell via the command, or alternatively, it can be run as a server by starting The latter has the advantage that WordNet will not be loaded into memory each time a measurement is taken, which speeds up queries drastically.

Does it really?

Unfortunately, the current implementation of the server allows us only to make one query per opened TCP connection, before the socket is closed again. I also experienced an unexplained grace time that passes before a server process actually finishes, which becomes a significant bottleneck when performing a lot of queries.

I am providing here a tiny patch for the file that abandons said limitations. With the patch applied, multiple queries can be made per TCP connection and there is no delay between them. The small changes that I have made speed-up the querying process by an unbelievable factor 10. Below you can find the little Ruby script that I used to measure the time needed for the original version of the server (running on port 30000) and the new version of the server (running on port 31134).


require 'socket'

left_words = [ 'dinosaur', 'elephant', 'horse', 'zebra', 'lion',
  'tiger', 'dog', 'cat', 'mouse' ]
right_words = [ 'lemur', 'salamander', 'gecko', 'chameleon',
  'lizard', 'iguana' ]

# Original implementation
puts 'Original implementation'
puts '-----------------------'
original_start =
left_words.each { |left_word|
  right_words.each { |right_word|
    socket ='localhost', 30000)
    socket.puts("r #{left_word}#n#1 #{right_word}#n#1 lesk\r\n\r\n")
    response = socket.gets.chomp

    redo if response == 'busy'

    measure = response.split.last
    puts "#{left_word} compared to #{right_word}: #{measure}"
original_stop =

# New implementation
puts ''
puts 'New implementation'
puts '------------------'
new_start =
socket ='localhost', 31134)
left_words.each { |left_word|
  right_words.each { |right_word|
    socket.puts("r #{left_word}#n#1 #{right_word}#n#1 lesk\r\n\r\n")
    response = socket.gets.chomp

    measure = response.split.last
    puts "#{left_word} compared to #{right_word}: #{measure}"
# Let the server close the socket,
# or otherwise the child process may loop forever.
# socket.close
new_stop =

puts ''
puts 'Time required'
puts '-------------'
puts 'Original implementation: ' <<
  (original_stop.to_i - original_start.to_i).to_s <<
puts 'New implementation: ' <<
  (new_stop.to_i - new_start.to_i).to_s <<

The new implementation has an additional command called ‘e‘ that can be used to close the socket to the server. The actual patch of can be found here: similarity_server.patch

Python unicode and the problem with the terminal

Posted 24 May 2010 — by maxhaussler
Category linux, python

There are tons of tutorials on how to use Unicode in Python. They often fail to mention the problem of terminals. This actually applies to most programming languages that I know. If you find this boring now, I can fully understand, but once you’re obliged to work with Unicode (as we do as we’re mining scientific articles here), these things get important as they can cost a lot of time to fix and concern all places where data is input and output. You can read Unicode data from the keyboard and files into variables, no problem here.

For output, basically, it’s simple:

  • To convert a normal string (with Unicode in it) to a real unicode string (a different data type), use string.decode(“utf8”) (this is counter-intuitive to me: why is the method not called “toUnicode” or “encode” ??).
  • To convert from unicode to a normal string use string.encode(“utf8”)

If you have a normal string with unicode characters in it, you cannot print it to a terminal or write it to a file, as the default encoding of terminals and files is “ASCII” in python. It will lead to a “UnicodeDecodeError: Can’t decode byte at position xxx”, as the terminal is ASCII and string is supposed to be ASCII as well, but contains the special unicode characters. Python does not know how to display these strange characters on this terminal. There are two solutions:

  • For the terminal, tell Python that you are actually using a Unicode -capable terminal, as described here . Open files with the right “codec” as described here.
  • Alternatively:  Run the “.decode(“utf8”) method on all strings and they will display like this: u’\u0150sz’ which is not great, but at least there is no error messages anymore.

If you tell python that your terminal accepts Unicode, make sure that this is true: OSX terminal does by default, gnome-terminal does (look in Terminal – Encoding), xterm does not but can be set  to display it.

I know that you’re very eager to try this out. But if you want to test this now in your terminal, remember that if you cut-and-paste this word Ősz into your python interpreter or a source code file, although it is obvious to you that this is unicode, Python does not know what you’re thinking. You cannot write

print "Ősz"

as, again, this is supposed to be 0-127 ASCII string but the first letter is a code >127. So you either have to write u’Ősz’ or use the decode function again:

print 'Ősz'.decode("utf8")

Translate filenames

Posted 19 Apr 2010 — by maxhaussler
Category bash, linux

Sometimes you have to apply a set of Unix-commands based on data from a textfile. A very common example is a list a files and in addition a table (textfile) with lines containing the data on how to renames filenames to new filenames, like (oldFilename,newFilename). I always forgot how to code this, so here is an example shell script that accepts one parameter (the textfile) and renames files accordingly in the current directory:

cat $FILE | while read old new
    mv $old $new

Linux packages for UCSC genome source tools

Posted 09 Apr 2010 — by maxhaussler
Category genome bioinformatics, linux, UCSC genome browser

Some time ago I’ve created packages for the UCSC genome source tools for 64Bit Intel/AMD processors.

A .rpm version can be downloaded here; a .deb version can be downloaded here.

On Fedora/RedHat/CentOs, install the rpm with:

sudo yum install ucsc-genome-tools-225-2.x86_64.rpm --nogpgcheck

On Debian/Ubuntu, install the .deb file with:

dpkg -i ucsc-genome-tools_225-1_amd64.deb

Tell me in the comments if there are any problems, I’ve tested them only on Debian and CentOS.

Add headers to a file (add line to the beginning of a file)

Posted 25 Nov 2009 — by pedro
Category bash, linux

Let’s suppose you want to add your favourite line (STRING) to the first line of an existing file, which includes escaped characters such as \t or \s.

Entered your STRING with the following syntax and pass it the line editor Ex:

Here an example
$ echo -e “1i\nthis is my favourite line which can handle\ttabs\n.\nwq” | ex -s myfile

This command will add the text “this is my favourite line which can handle tabs”