Preview of the Dmel-stock-keeping database “flystockdb”

Posted 09 Nov 2010 — by caseybergman
Category bergman lab, database, drosophila

We are in the process of developing a database for keeping track of Drosophila melanogaster (fruit fly) stocks; it is ingenuously named “flystockdb” and will eventually be publicly available and open to anyone.

Key features of flystockdb:

  • web-interface (HTML + JavaScript)
  • rights-management (publicly-, privately-shared stocks)
  • software-aided entry of genotypes and genomic features

The central access point to flystockdb is its web interface, which permits users to conveniently connect to the centralised fly-stock database from anywhere. No installation is necessary, and data safety is ensured by the secured, rights-managed flystockdb server.

A fine-grained rights management allows stocks to be shared publicly with everyone, shared with designated flystockdb users, or kept private and unshared. “Groups” can be used to build communal stock collections, to serve as repositories for private stock supplies, or anything in between.

Automatic recognition and linkage of whole genotypes — but also of single genotypic features — via FlyBase simplifies and aids the entry of fly stocks. Genes, alleles, balancers and even P-elements are sorted onto the right chromosomes by contextually aware algorithms that decompose and analyse user input on the fly.
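As a toy illustration of the kind of decomposition involved (this is not the flystockdb implementation, and the genotype string is made up), a Drosophila genotype written with semicolon-separated chromosome fields can be split into per-chromosome components:

# Toy sketch: split a semicolon-delimited D. melanogaster genotype string
# into per-chromosome fields (X, 2nd, 3rd, 4th). Illustration only, not
# the flystockdb code.
def split_genotype(genotype):
    chromosomes = ["X", "2", "3", "4"]
    fields = [f.strip() for f in genotype.split(";")]
    return dict(zip(chromosomes, fields))

print(split_genotype("w[1118]; Sp/CyO; TM2/TM6B"))
# {'X': 'w[1118]', '2': 'Sp/CyO', '3': 'TM2/TM6B'} (key order may vary)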

A preview of flystockdb is available via a screencast:

We anticipate going live in mid-2011.

On genome coordinate systems and transposable element annotation

[Update: For an extended discussion of the issues in this post, see: Bergman (2012) “A proposal for the reference-based annotation of de novo transposable element insertions.” Mobile Genetic Elements 2:51-54]

Before embarking on any genome annotation effort, it is necessary to establish the best way to represent the biological feature under investigation. This post discusses how best to represent the annotation of transposable element (TE) insertions that are mapped to (but not present in) a reference genome sequence (e.g. from mutagenesis or re-sequencing studies), and how the standard coordinate system in use today complicates this task.

There are two major coordinate systems in genome bioinformatics, which differ primarily in whether they anchor genomic features to (“base coordinate systems”) or between (“interbase coordinate systems”) nucleotide positions. Most genome annotation portals (e.g. NCBI or Ensembl), bioinformatics software (e.g. BLAST) and annotation file formats (e.g. GFF) use the base coordinate system, which represents a feature starting at the first nucleotide as position 1. In contrast, a growing number of systems (e.g. UCSC, Chado, DAS2) employ the interbase coordinate system, whereby a feature starting at the first nucleotide is represented as position 0. Note that the UCSC genome bioinformatics team actually uses both systems, referring to the base coordinate system as “one-based, fully-closed” (used in the UCSC genome browser display) and to the interbase coordinate system as “zero-based, half-open” (used in their tools and file formats), which has led to a FAQ on the issue for users. The interbase coordinate system is also referred to as “space-based” by some authors.

The differences between the base (bottom) and interbase (top) coordinate systems can be visualized in the following diagram (taken from the Chado wiki).

There are several advantages to using the interbase coordinate system, including: (i) the ability to represent features that occur between nucleotides (like a splice site), (ii) simpler arithmetic for computing the length of features (length = end - start) and overlaps (the overlapping interval runs from max(start1,start2) to min(end1,end2)) and (iii) more rational conversion of coordinates from the positive to the negative strand (see discussion here).
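As a minimal sketch of point (ii), here is what interval arithmetic looks like on interbase (zero-based, half-open) coordinates:

# Minimal sketch of interval arithmetic on interbase (zero-based,
# half-open) coordinates, where a feature covering the first nucleotide
# is written as [0, 1).

def length(start, end):
    # No "+ 1" correction is needed, unlike base (one-based, fully-closed)
    # coordinates, where length = end - start + 1.
    return end - start

def overlap(start1, end1, start2, end2):
    # Length of the overlapping interval, or 0 if the features are disjoint.
    return max(0, min(end1, end2) - max(start1, start2))

print(length(0, 100))            # 100 bp feature starting at the first base
print(overlap(0, 100, 90, 150))  # 10 bp overlap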

So why is the choice of coordinate system important for the annotation of TEs mapped to a reference sequence? The short answer is that TEs (and other insertions) that are not a part of the reference sequence occur between nucleotides in the reference coordinate system, and therefore it is difficult to accurately represent the location of a TE on base coordinates. Nevertheless, base coordinate systems dominate most of genome bioinformatics and are an established framework that one has to work within.

How then should we annotate TE insertions on base coordinates that are mapped precisely to a reference genome? If a TE insertion in reality occurs between positions X and X+1 in a genome, do we annotate the start and end position at the same nucleotide? If so, do we place both coordinates at position X, or both at position X+1? If we choose to annotate the insertion at position X, then we need to invoke a rule that the TE inserts after nucleotide X. However, this solution breaks down if the insertion is on the negative strand, since we either need to map a negative-strand insertion to X+1 or have a different rule for interpreting the placement of the TE on positive and negative strands. Alternatively, do we annotate the TE as starting at X and ending at X+1, attempting to fake interbase coordinates on a base coordinate system, but at face value implying that the TE insertion is not mapped precisely and spans 2 bp in the genome?

After grappling with this issue for some time, it seems that neither of these solutions is sufficient to deal with the complexities of TE insertion and reference mapping. To understand why, we must consider the mechanisms of TE integration and how TE insertions are mapped to the genome. Most TEs create staggered cuts in the genomic DNA that are filled in on integration, leading to short target site duplications (TSDs). Most TEs also target a palindromic sequence, and insert randomly with respect to orientation. A new TE insertion is typically mapped to the genome by sequencing a fragment that spans the TE into unique flanking DNA, either by directed (e.g. inverse/linker PCR) or random (e.g. shotgun re-sequencing) approaches. The TE-flank fragment can be obtained from the 5′ or 3′ end of the TE. However, where one places the TE insertion depends on whether one uses the TE-flank fragment from the 5′ or 3′ end and on the orientation of the TE insertion in the genome. As shown in the following diagram, for an insertion on the positive strand (>>>), a TE-flank fragment from the 5′ end is annotated to occur at the 3′ end of the TSD (shown in bold), whereas a 3′ TE-flank fragment is placed at the 5′ end of the TSD. For an insertion on the negative strand (<<<), the opposite effect occurs. In both cases, TE-flank fragments from the 5′ and 3′ ends map the TE insertion to different locations in the genome.

Thus, where one chooses to annotate a TE insertion relative to the TSD depends on the orientation of the TE insertion and on which end is mapped to the genome. As a consequence, both the single-base and two-base representations proposed above are flawed, since TE insertions into the same target site are annotated at two different locations on the positive and negative strands. This issue led us (in retrospect) to misinterpret some aspects of the P-element target site preference in a recent paper, since FlyBase uses a single-base coordinate system to annotate TE insertions.

As an alternative, I propose that the most efficient and accurate way to represent TE insertions mapped to a reference genome on base coordinates is to annotate the span of the TSD and label the orientation of the TE in the strand field. This formulation allows one to bypass having to choose where to locate the TE relative to the TSD (5′ vs. 3′, as is required under the one-base/two-base annotation framework), and can represent insertions into the same target site that occur on different strands. Furthermore, this solution allows one to use both 5′ and 3′ TE-flank information; in fact, the overlap between the coordinates from the 5′ and 3′ TE-flank fragments defines the TSD. Finally, this solution requires no prior information about TSD length for a given TE family, and also accommodates TE families that generate variable-length TSDs, since the TSD is annotated on a per-TE basis.
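As a minimal sketch of this proposal (the coordinates and record fields below are made up for illustration), the TSD is taken as the overlap of the genomic intervals implied by the 5′ and 3′ TE-flank fragments, and the TE orientation goes in the strand field of an otherwise standard base-coordinate record:

# Minimal sketch (made-up coordinates): annotate a TE insertion as the
# span of its TSD on base coordinates, with the TE orientation in the
# strand field. The TSD is the overlap of the intervals covered by the
# genomic alignments of the 5' and 3' TE-flank fragments.

def tsd_span(flank5, flank3):
    # Each flank mapping is a (start, end) interval on base coordinates;
    # the TSD is their intersection.
    start = max(flank5[0], flank3[0])
    end = min(flank5[1], flank3[1])
    return start, end

start, end = tsd_span(flank5=(980, 1007), flank3=(1000, 1030))
strand = "-"  # orientation of the TE itself, not of the flank fragment
print("chr2L\tmapping\tTE_insertion\t%d\t%d\t.\t%s\t.\tID=example_insertion"
      % (start, end, strand))
# an 8 bp TSD spanning positions 1000-1007, annotated on the minus strand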

The only problem left open by this proposed solution is for TEs that do not create a TSD, which have been reported to exist. Any suggestions for a general solution that also allows for annotation of TE insertions without TSDs would be much appreciated….

Iterating over a list of tuples in bash

Posted 21 Jul 2010 — by maxhaussler
Category bash, linux

I’ve shown previously how to rename many files on the command line. That solution required a second text file in the format oldname<space>newname<return>.

Sometimes it is easier if there is no external file and we simply want to iterate over a list that is part of the script. The command to use in this case is set: with the field separator variable IFS set to the right character, set $i splits the value of $i and assigns the pieces to the positional parameters $1, $2, etc. Note that changing IFS affects the rest of the script, so it is worth restoring or unsetting it after the loop.

Example:

for i in a,b c,d ; do IFS=","; set $i; echo $1 $2; done

Will output:

a b
c d

Determining the length of the S-phase from double-labeled cell counts

Posted 18 Jun 2010 — by maxhaussler
Category cell biology, cell cycle, developmental biology

A colleague of mine was recently grappling with the question of how to determine the length of the S-phase of a population of cells, which attracted my attention since the problem is quite similar to problems we have in bioinformatics.

You might know that dyes exist that label cells that are currently in S-phase (BrdU, IdU, radioactive labels, etc.), but since cells divide all the time and the cells in a dish are usually not synchronised, one cannot simply add the label and then wait for it to go on and off, as the cells have to be killed to make the label visible. One could synchronise all cells to start their cell cycle at the same time and kill some every second to find out when the label comes on, but this is a huge amount of work.

And unnecessary: As it turns out, only two basic observations are required to infer the length of S phase. As an analogy, imagine all traffic lights of a town going red and green at different times. There is a technique that allows you to obtain the duration of the green phase from just two aerial images of the whole town, taken at a fixed interval. It took us some time to understand how you can deduce the duration of an event based on just two observations of the whole population.

We came across this in a paper by Martynoga et al. and, by following the literature, found that the idea goes back at least to Wimber and Quastler in the early 60s. It requires that two different labels are available; let’s call them green and red, as they actually are in Martynoga et al. (and yes, the analogy with the traffic lights breaks down here, as there are no dyes for traffic lights that stick). The first injection of the dye into the cell population will mark all cells that are currently in S-phase green, and the second injection will mark all cells that are in S-phase at the end of the experiment red. Then the cells are killed and the staining is performed.

The first injection marks cells “green only”; the second marks cells red, but these will actually be “red and green”: there should be no “red only” cells, as the time between the two injections is set short enough that no cell can go into a new cell cycle between the two injections. So we only have “green only” and “red and green” cells, and we can count them.

What is marked “red/green” is the number of cells that are in S-phase at the timepoint of the second injection. (Although in Martynoga et al. the labelling agent is present for 30 min, they also explain that the dye needs 30 minutes to get into the cells. So killing the cells after 30 min effectively amounts to pulse-labelling at a single timepoint: only the cells that are in S-phase exactly 30 min before being killed will be marked.)

What is marked “green only” is the number of cells that have left S-phase between the two injections. (This counts all cells that exit S-phase during this interval, as cells that exit S-phase before the first injection will not be marked at all, and cells that exit S-phase after the second injection will be red/green.)

There are two basic ideas that just have to be combined to determine the length of S-phase:

a) At any time, the same number of cells is in S-phase. If I have 100 cells, an S-phase of 4 hours and a cell cycle time of 10 hours, there will always be 40 cells in S-phase, or more generally, 40% of all cells. Therefore, proportion of cells in S-phase = length of S-phase / length of cell cycle, or PS = Ts / Tc

b) Over any period of time, cells constantly leave S-phase at a steady rate. If I have 100 cells and a total cell cycle time of 10 hours, 10 cells will leave S-phase each hour; after four hours, 40 cells will have left S-phase. Therefore, proportion of cells leaving S-phase = time between injections / length of cell cycle, or PL = Ti / Tc

Dividing one formula by the other gives PS / PL = Ts / Ti. You can rearrange this to solve for the length of S-phase: Ts = (PS / PL) * Ti
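As a minimal worked example (the cell counts are made up), in Python:

# Minimal worked example with made-up counts: estimate the length of
# S-phase from double-labelled cell counts, as described above.

red_green = 40    # cells in S-phase at the second injection ("red and green")
green_only = 20   # cells that left S-phase between the injections
total = 200       # total number of cells counted
Ti = 2.0          # hours between the two injections

PS = red_green / float(total)    # proportion of cells in S-phase
PL = green_only / float(total)   # proportion of cells that left S-phase

Ts = PS / PL * Ti                # length of S-phase in hours
Tc = Ts / PS                     # cell cycle length, from PS = Ts / Tc

print("S-phase: %.1f h, cell cycle: %.1f h" % (Ts, Tc))
# S-phase: 4.0 h, cell cycle: 20.0 h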

Perhaps this will be helpful one day for someone who stumbles over this via Google.

Speeding-Up WordNet::Similarity

Posted 10 Jun 2010 — by caseybergman
Category linux, text mining

One nifty Perl module is WordNet::Similarity, which provides a number of similarity measures between terms found in WordNet. WordNet::Similarity can either be used from the shell via the command similarity.pl, or it can be run as a server by starting similarity_server.pl. The latter has the advantage that WordNet is not loaded into memory each time a measurement is taken, which speeds up queries drastically.

Does it really?

Unfortunately, the current implementation of the server allows only one query per opened TCP connection before the socket is closed again. I also experienced an unexplained grace period before a server process actually finishes, which becomes a significant bottleneck when performing a lot of queries.

I am providing here a tiny patch for the similarity_server.pl file that removes these limitations. With the patch applied, multiple queries can be made per TCP connection and there is no delay between them. The small changes that I have made speed up the querying process by a factor of 10. Below you can find the little Ruby script that I used to measure the time needed for the original version of the server (running on port 30000) and the new version of the server (running on port 31134).

#!/usr/bin/ruby

require 'socket'

left_words = [ 'dinosaur', 'elephant', 'horse', 'zebra', 'lion',
  'tiger', 'dog', 'cat', 'mouse' ]
right_words = [ 'lemur', 'salamander', 'gecko', 'chameleon',
  'lizard', 'iguana' ]

# Original implementation
puts 'Original implementation'
puts '-----------------------'
original_start = Time.new
left_words.each { |left_word|
  right_words.each { |right_word|
    socket = TCPsocket.open('localhost', 30000)
    socket.puts("r #{left_word}#n#1 #{right_word}#n#1 lesk\r\n\r\n")
    response = socket.gets.chomp
    socket.close

    redo if response == 'busy'

    measure = response.split.last
    puts "#{left_word} compared to #{right_word}: #{measure}"
  }
}
original_stop = Time.new

# New implementation
puts ''
puts 'New implementation'
puts '------------------'
new_start = Time.new
socket = TCPsocket.open('localhost', 31134)
left_words.each { |left_word|
  right_words.each { |right_word|
    socket.puts("r #{left_word}#n#1 #{right_word}#n#1 lesk\r\n\r\n")
    response = socket.gets.chomp

    measure = response.split.last
    puts "#{left_word} compared to #{right_word}: #{measure}"
  }
}
socket.puts("e\r\n\r\n")
# Let the server close the socket,
# or otherwise the child process may loop forever.
# socket.close
new_stop = Time.new

puts ''
puts 'Time required'
puts '-------------'
puts 'Original implementation: ' <<
  (original_stop.to_i - original_start.to_i).to_s <<
  's'
puts 'New implementation: ' <<
  (new_stop.to_i - new_start.to_i).to_s <<
  's'

The new implementation has an additional command called ‘e’ that can be used to close the socket to the server. The actual patch of similarity_server.pl can be found here: similarity_server.patch

Python unicode and the problem with the terminal

Posted 24 May 2010 — by maxhaussler
Category linux, python

There are tons of tutorials on how to use Unicode in Python, but they often fail to mention the problem of terminals. This actually applies to most programming languages that I know. If you find this boring now, I can fully understand, but once you’re obliged to work with Unicode (as we are, since we’re mining scientific articles here), these things become important: they can cost a lot of time to fix and concern every place where data is input or output. You can read Unicode data from the keyboard and from files into variables; no problem there.

For output, basically, it’s simple:

  • To convert a normal (byte) string containing UTF-8 data to a real unicode string (a different data type), use string.decode(“utf8”) (this is counter-intuitive to me: why is the method not called “toUnicode” or “encode”?).
  • To convert from a unicode string back to a normal (byte) string, use string.encode(“utf8”).

If you have a string with unicode characters in it, printing it to a terminal or writing it to a file can fail, as the default encoding Python assumes for terminals and files is “ascii”. This leads to an error such as “UnicodeEncodeError: ‘ascii’ codec can’t encode character … in position xxx” (or a UnicodeDecodeError when byte and unicode strings get mixed): the terminal is assumed to be ASCII, the string is supposed to be ASCII as well, but it contains the special unicode characters, and Python does not know how to display them on this terminal. There are two solutions:

  • For the terminal, tell Python that you are actually using a Unicode-capable terminal, as described here. Open files with the right “codec”, as described here.
  • Alternatively: run the .decode(“utf8”) method on all strings and they will display like this: u’\u0150sz’, which is not great, but at least there are no error messages anymore.

If you tell Python that your terminal accepts Unicode, make sure that this is true: the OS X terminal does by default, gnome-terminal does (look in Terminal > Encoding), and xterm does not by default but can be configured to display it.
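As a minimal Python 2 sketch of the first approach (the filename is made up), writing through an explicit UTF-8 codec instead of relying on the ASCII default:

# -*- coding: utf-8 -*-
# Minimal Python 2 sketch: write unicode strings through an explicit
# UTF-8 codec instead of relying on the ASCII default. The filename is
# made up for illustration.
import codecs
import sys

text = u'Ősz'  # a real unicode string

# Files: open with an explicit codec, then write unicode objects directly.
out = codecs.open('example.txt', 'w', encoding='utf-8')
out.write(text + u'\n')
out.close()

# Terminal: encode explicitly so printing does not fall back to ASCII.
sys.stdout.write(text.encode('utf-8') + '\n')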

I know that you’re very eager to try this out. But if you want to test this now in your terminal, remember that if you cut-and-paste this word Ősz into your python interpreter or a source code file, although it is obvious to you that this is unicode, Python does not know what you’re thinking. You cannot write

print "Ősz"

as, again, this is supposed to be a 0-127 ASCII string but the first letter has a code >127. So you either have to write u’Ősz’ or use the decode function again (in a source code file you will also need an encoding declaration like # -*- coding: utf-8 -*- at the top):

print 'Ősz'.decode("utf8")

The “real” impact factor of Nucleic Acids Research is ~5.6

Posted 27 Apr 2010 — by maxhaussler
Category scientometrics, text mining

Nucleic Acids Research is a respected journal, publishing articles about e.g. restriction enzymes and DNA analysis. Twice a year they have a “special issue” with updates on databases and bioinformatics tools on the internet. These short “method papers” usually just summarise what has been added to the database/tool and rarely report research results directly. But they do attract a lot of citations: people who are using a certain website or software tool are expected to cite the corresponding method article in Nucleic Acids Research.

Is this practice increasing the impact factor of this journal? Certainly, but by how much? It turned out that answering this question takes only 45 minutes of web searching and a one-line program (or an Excel formula).

The 2008 impact factor is calculated based on articles from 2006 and 2007. So I downloaded the citation data from scopus.com and, by dividing the number of citations (14957) by the number of articles (2260), arrived at an impact factor of 6.61 (the official impact factor is 6.87; this is probably because my list copied from Scopus includes some articles which are considered “not citable” by ISI Thomson, or because Scopus has less data than Thomson). The 444 articles in special issues attracted 4750 citations. If NAR did not have special issues, its impact factor would therefore be around 5.6 (a bit higher, perhaps 5.8, due to the non-citable issue).
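The arithmetic, using the numbers above:

# The arithmetic behind the estimate, using the citation counts above.
citations, articles = 14957, 2260          # all NAR articles, 2006-2007
web_citations, web_articles = 4750, 444    # special (database/web server) issue articles

print("overall: %.2f" % (citations / float(articles)))                 # about 6.6
print("without special issues: %.2f" %
      ((citations - web_citations) / float(articles - web_articles)))  # about 5.6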

The data from Scopus therefore gives everyone the possibility to quantify how much methods papers, reviews or original research contribute to the impact factor of a journal.

How to redo this: go to scopus.com, search for “srctitle(nucleic acids research)”, select “Nucleic Acids Research” and Year: 2006, click limit, click “Citation tracker” and export as a text file. Repeat the same thing for 2007. Convert the text files with Excel to tab-separated files and run the following one-liner on them:

cat 2006.txt 2007.txt | grep -v rratum | gawk 'BEGIN {FS="\t";}
/^200[0-9]/ {articles+=1; cites+=$11; if ($7=="") { webArticles+=1; webCites+=$11;}}
END {print "articles:"articles; print "citations:"cites; print "impact factor:" (cites/articles);
print "web/db articles:"webArticles; print "web/db article citations "webCites;
print "impact factor without web/db issue:" ((cites - webCites)/(articles- webArticles));}'

Note: a previous version of this post estimated the impact factor to be 4.5. This was wrong, as I forgot to remove the number of articles in the special issues from the total number of articles. I am very sorry for this bug.

Translate filenames

Posted 19 Apr 2010 — by maxhaussler
Category bash, linux

Sometimes you have to apply a set of Unix commands based on data from a text file. A very common example is a list of files plus a table (text file) whose lines describe how to rename old filenames to new filenames, like (oldFilename newFilename). I always forget how to code this, so here is an example shell script that accepts one parameter (the text file) and renames files accordingly in the current directory:

#!/bin/bash
# Read (oldname newname) pairs, one per line, from the file given as the
# first argument and rename the files accordingly.
FILE=$1
cat "$FILE" | while read old new
do
    mv "$old" "$new"
done

Linux packages for UCSC genome source tools

Posted 09 Apr 2010 — by maxhaussler
Category genome bioinformatics, linux, UCSC genome browser

Some time ago I created packages of the UCSC genome source tools for 64-bit Intel/AMD processors.

A .rpm version can be downloaded here; a .deb version can be downloaded here.

On Fedora/RedHat/CentOs, install the rpm with:

sudo yum install ucsc-genome-tools-225-2.x86_64.rpm --nogpgcheck

On Debian/Ubuntu, install the .deb file with:

sudo dpkg -i ucsc-genome-tools_225-1_amd64.deb

Tell me in the comments if there are any problems; I’ve only tested the packages on Debian and CentOS.

Automatically Attaching DAS Tracks to the Ensembl Genome Browser with BioMart Perl 0.7

Posted 09 Mar 2010 — by caseybergman
Category biomart, ensembl

In this blog post, we describe how BioMart Perl 0.7 needs to be patched in order to create self-configuring DAS sources, which can then be automagically attached to the Ensembl Genome Browser using only a single URL.

BioMart is a versatile framework that primarily addresses the denormalisation of very large databases in order to optimise query performance, but it also comes with various interfaces which permit easy access to the optimised data sources (a.k.a. the marts). One of the numerous interfaces is a Distributed Annotation System (DAS) server, which permits the integration of marts into remote systems and services. The most comprehensive marts are currently derived from the Ensembl genome databases, for which (among many other choices) the Ensembl Genome Browser provides an easy-to-use interface. It is possible to manually attach BioMart DAS tracks to the Ensembl Genome Browser, so that annotation data from marts can be visualised in the browser at the gene and chromosome level of a genome, alongside the already existing features in the Ensembl databases.

BioMart DAS sources can be manually added as a DAS track in the Ensembl Genome Browser by

  1. clicking on ‘Manage your data’ in the panel to the left-hand side in the browser,
  2. then clicking on ‘Attach DAS’,
  3. entering the URL of the DAS server (e.g. http://www.biomart.org/biomart/das/default__hsapiens_gene_ensembl__ensembl_das_chr),
  4. clicking ‘Next’,
  5. explicitly selecting the DAS source you want to add,
  6. clicking ‘Next’,
  7. selecting the species this DAS source applies to,
  8. clicking ‘Next’,
  9. selecting the coordinate system the DAS source uses,
  10. clicking ‘Next’,
  11. and finally the configuration dialog needs to be closed.

Here, we provide two patch files, configureBioMart.pl.patch and ConfBuilder.pm.patch, which modify BioMart Perl 0.7 in such a way that a mart’s DAS track can be added using a URL similar to:

  • http://www.ensembl.org/Homo_sapiens/Location/View?r=6:133017695-133161157;contigviewbottom=das:http://your_host/biomart/das/default__hsapiens_gene_ensembl__ensembl_das_chr

This URL will instruct the Ensembl Genome Browser to add the DAS track given by the parameter contigviewbottom. The browser then confirms the successful operation with the following message:

For this example we have just taken a copy of the Ensembl mart as the DAS source, so the data we are displaying is a mirror image of the already existing genes in the Ensembl database. Nevertheless, the DAS track ‘Your Source’ — the name is configurable: see below — appears now in the genome browser like this:

If we try the same with the Ensembl mart from www.biomart.org, i.e. if we provide the DAS track URL http://www.biomart.org/biomart/das/default__hsapiens_gene_ensembl__ensembl_das_chr above, then the operation will fail and the genome browser will notify us with the message:

Patching BioMart Perl 0.7

In order to apply the patch, simply copy both patch files into the parent directory of your BioMart Perl installation and then carry out the following steps:

computer:legacy joachim$ ls
ConfBuilder.pm.patch        biomart-perl        configureBioMart.pl.patch
computer:legacy joachim$ patch --verbose -p1 < ConfBuilder.pm.patch
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|--- old/biomart-perl/bin/ConfBuilder.pm 2009-01-10 13:21:32.000000000 +0000
|+++ new/biomart-perl/bin/ConfBuilder.pm 2010-02-10 20:25:26.000000000 +0000
--------------------------
Patching file biomart-perl/bin/ConfBuilder.pm using Plan A...
Hunk #1 succeeded at 26.
Hunk #2 succeeded at 703.
done
computer:legacy joachim$ patch --verbose -p1 < configureBioMart.pl.patch
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|--- old/biomart-perl/bin/configureBioMart.pl 2010-02-08 11:40:07.000000000 +0000
|+++ new/biomart-perl/bin/configureBioMart.pl 2010-02-10 20:25:40.000000000 +0000
--------------------------
Patching file biomart-perl/bin/configureBioMart.pl using Plan A...
Hunk #1 succeeded at 137.
done
computer:legacy joachim$

Of course, the --verbose parameter can be omitted.

Our patch introduces a few new parameters in biomart-perl/conf/settings.conf:

[databaseSettings]
username=mysql_username
password=mysql_password
host=mysql_host
port=mysql_port
[ensemblDAS]
sourcename=Your Source
version=56

Our Perl modifications will try to connect to your MySQL database (this can easily be adapted for Oracle and Postgres: change ‘DBI:mysql:’ in ConfBuilder.pm) and read the contents of the ensembl_compara_<version> database, which is why the database settings and the Ensembl version are needed. The parameter sourcename simply defines the string that is displayed as the DAS track’s name in the genome browser.

The Juice

We achieved the self-configuring DAS track by modifying the value of the MAPMASTER XML element in BioMart’s DSN listing. For each DAS track, the map master’s URL now points to the Ensembl web site where specific information about the track’s species and coordinate system is located. Access to the compara database is required in order to retrieve the assembly version of the genome the track is based on, which in turn is needed to form a proper URL for the MAPMASTER XML element.

Contributors

Max: DAS insights and Ensembl self-configuring DAS-track knowledge
Joachim: BioMart patches