Automating ASCP-based submission of NGS data to ENA using expect

Posted 12 Dec 2014 — by caseybergman
Category EBI, filesharing, genome bioinformatics, high throughput sequencing, linux, NCBI

Submitting next-generation sequencing (NGS) data to International Nucleotide Sequence Database Collaboration repositories such as EBI or NCBI is an essential, but time-consuming, phase of many projects in modern biology. Therefore, mechanisms to speed up NGS data submission are a welcome addition to many biologists’ toolkits. The European Nucleotide Archive (ENA) provides detailed instructions on how to submit NGS data to their “dropbox” staging area, which is a key part of the overall submission process for an NGS project. This can be done in three ways:

  • Using ENA’s Webin client in your web browser,
  • Using a standard FTP client at the command line, or
  • Using Aspera’s ASCP client at the command line.

For users with many samples to submit, the latter two command line options are clearly the preferred methods. While FTP is installed by default on most Linux systems, transfer of large NGS data files by FTP is slow, and on some systems (such as ours) FTP is disabled altogether because of security concerns.

In contrast, ASCP is not installed by default on most Linux systems, but provides a very fast method to transfer large data files to EBI or NCBI. One of the downsides of using ASCP is that it interactively prompts users for a password for each file transferred. This requires babysitting your ASCP-based command line submission and supplying the same password for each file, thereby undermining much of the automation that a command line solution should provide.

In searching around for solutions to the ASCP-babysitting problem, I stumbled on a documentation page entitled “Expect script for automating Aspera uploads to the EBI ENA” written by Robert Davey at The Genome Analysis Centre. I had never heard of the expect scripting language prior to reading this post, but it provides exactly the solution I was looking for:

[Expect] is used to automate control of interactive applications such as telnet, ftp, passwd, fsck, rlogin, tip, ssh, and others. Expect uses pseudo terminals (Unix) or emulates a console (Windows), starts the target program, and then communicates with it, just as a human would, via the terminal or console interface. (Wikipedia)

Robert’s expect script was a little more complicated than I needed for my purposes, and required a few tweaks to conform to updates to EBI’s ASCP submission process. Without too much trouble, I cobbled together a modified version that solves the automated transfer of any number of data files to the top level of your ENA dropbox:

set fofn [lindex $argv 0]
set dropbox [lindex $argv 1]
set pass [lindex $argv 2]

set files [open $fofn]
set subs [read $files]

set timeout -1
foreach line [split $subs \n] {
  if { "" != $line } {
    # upload each file to the top level of the ENA Webin dropbox via ENA's Aspera host
    spawn ascp -QT -l80M -d $line $dropbox@webin.ebi.ac.uk:.
    expect "Password:"
    send "$pass\r"
    expect eof
  }
}

This script requires expect and ASCP to be installed globally on your system, and for the user to provide three arguments:

  • a file of filenames (with the full path) to the files you would like to submit to ENA
  • the ID for your Webin submission account, and
  • the password for your Webin submission account

For example, if you have a directory of gzip’ed fastq files that you would like to submit to ENA, all you need to do to submit your files in bulk is navigate to that directory and do something like the following:

#download the script above from github
wget --no-check-certificate

#create file of filenames
for k in `ls *gz`; do echo $PWD/$k; done > fofn

#perform ASCP submission, note: replace with your ENA (Webin ID and Password)
expect ena_submit.exp fofn Webin-ID Webin-Password

High Coverage PacBio Shotgun Sequences Aligned to the D. melanogaster Genome

Shortly after we released our pilot dataset of Drosophila melanogaster PacBio genomic sequences in late July 2013, we were contacted by Edwin Hauw and Jane Landolin from PacBio to collaborate on the collection of a larger dataset for D. melanogaster with longer reads and higher depth of coverage. Analysis of our pilot dataset revealed significant differences between the version of the ISO1 (y; cn, bw, sp) strain we obtained from the Bloomington Drosophila Stock Center (strain 2057) and the D. melanogaster reference genome sequence. Therefore, we contacted Susan Celniker and Roger Hoskins at the Berkeley Drosophila Genome Project (BDGP) at the Lawrence Berkeley National Laboratory in order to use the same subline of ISO1 that was used to produce the official BDGP reference assemblies from Release 1 in 2000 to Release 5 in 2007. Sue and Charles Yu generated high-quality genomic DNA from a CsCl preparation of ~2000 BDGP ISO1 adult males collected by Bill Fisher, and provided this DNA in mid-November 2013 to Kristi Kim at PacBio, who handled library preparation and sequencing, which was completed in early December 2013. Since then, Jane and I have worked on formatting and QC’ing the data for release. These data have just been publicly released without restriction on the PacBio blog, where you can find a description of the library preparation, statistics on the raw data collected, and links to preliminary de novo whole genome assemblies using the Celera assembler and PacBio’s newly-released FALCON assembler. Direct links to the raw, preassembled and assembled data can be found on the PacBio “Drosophila sequence and assembly” DevNet GitHub wiki page.

Here we provide an alignment of this high-coverage D. melanogaster PacBio dataset to the Release 5 genome sequence and some initial observations based on this reference alignment. Raw, uncorrected reads were mapped using blasr (-bestn 1 -nproc 12 -minPctIdentity 80) and converted to .bam format using samtools. Since reads were mapped to the Release 5 genome, they can be conveniently visualized using the UCSC Genome Browser by loading a BAM file of the mapped reads as a custom track. To browse this data directly on the European mirror of the UCSC Genome Browser click here, or alternatively paste the following line into the “Paste URLs or data:” box on the “Add Custom Tracks” page of the Release 5 (dm3) browser:

track type=bam bigDataUrl= name=dm3PacBio

As shown in the following figure, mapped read depth is roughly Poisson distributed with a mean coverage of ~90x-95x on the autosomal arms and ~45x coverage for the X-chromosome. A two-fold reduction in read depth on the X relative to the autosomes is expected in a DNA sample derived only from heterogametic XY males. It is important to calculate mapped read depth empirically post hoc (not just from theoretical a priori expectations), since the actual genome size of D. melanogaster is not precisely known and varies between males and females, with an upper estimate of the total male genome size (autosomes + X + Y) being ~220 Mb. Thus, we conclude the coverage of this dataset is of sufficient depth to generate long-range de novo assemblies of the D. melanogaster genome using PacBio data only, as confirmed by the preliminary Celera and FALCON assemblies.


The following browser screenshots show representative regions of how this high coverage PacBio dataset maps to the Release 5 reference genome:

  • A typical region of unique sequence (around the eve locus). Notice that read coverage is relatively uniform across the genome and roughly equal on the forward (blue) and reverse (red) strands. Also notice that some reads are long enough to span the entire distance between neighboring genes, suggesting that unassembled PacBio reads are now long enough to establish syntenic relationships between neighboring genes in Drosophila.
  • A typical region containing a transposable element (TE) insertion (in the ds locus), in this case an 8,126 bp roo LTR retrotransposon (FBti0019098). Substantial numbers of mapped reads align across the junction region of the TE and its unique flanking regions. However, there is an excess of shorter reads inside the boundaries of the TE, which are likely to be incorrectly mapped because they are shorter than the length of the active TE sequence and have mapping quality values of ~0. With suitable read length and mapping quality filters, the unique locations of TE insertions in the euchromatin should be identifiable using uncorrected PacBio reads.


  • One of the few remaining and persistent gaps in the euchromatic portion of the reference sequence that can be filled using PacBio sequences. This gap remains in the upcoming Release 6 reference sequence (Sue Celniker and Roger Hoskins, personal communication). This shows that blasr can align PacBio reads through smaller sequence gaps that are represented by padded N’s; however, the visualization of such reads may sometimes be flawed (i.e. by artificially displaying a gap in the mapped read that arises from a lack of alignment to the ambiguous nucleotides in the gap).


Below we provide a reproducible transcript of the code used to (i) set up our working environment, (ii) download and unpack this D. melanogaster PacBio dataset, (iii) generate the BAM file of mapped reads released here, and (iv) generate coverage density plots for the major chromosome arms shown above. Part of the rationale for providing these scripts is to showcase how simple it is to install the entire PacBio smrtanalysis toolkit, which is the easiest way we’ve found to install the blasr long-read aligner (plus get a large number of other goodies for processing PacBio data like pacBioToCA, pbalign, quiver, etc.).

These scripts are written to be used by someone with no prior knowledge of PacBio data or tools, and can be run on an Oracle Grid Engine powered Linux cluster running Scientific Linux (release 6.1). They should be easily adapted for use on any Linux platform supported by PacBio’s smrtanalysis toolkit; to run on a single machine, just execute the commands inside the “quote marks” on the lines with qsub statements.

(i) First, let’s download the Release 5 reference genome plus the smrtanalysis toolkit:


# get dm3 and reformat into single fasta file
echo "getting dm3 and reformatting into single fasta file"
wget -q
tar -xvzf chromFa.tar.gz
cat chr*.fa > dm3.fa
rm chromFa.tar.gz chr*.fa

#install PacBio smrtanalysis suite (including blasr PacBio long-read mapping engine)
echo "installing PacBio smrtanalysis suite"
bash --extract-only --rootdir ./
(setupDrosPacBioEnv.sh on GitHub)

If you want to have the smrtanalysis toolkit in your environment permanently, you’ll need to add the following magic line to your .profile:

source install/smrtanalysis-

(ii) Next, let’s get the D. melanogaster PacBio data and unpack it:


#get compressed .tar archives of D. melanogaster PacBio runs
echo "get tar archives of PacBio runs"
qsub -b y -cwd -N wget_Dro1_24NOV2013_398.tgz "wget"
qsub -b y -cwd -N wget_Dro2_25NOV2013_399.tgz "wget"
qsub -b y -cwd -N wget_Dro3_26NOV2013_400.tgz "wget"
qsub -b y -cwd -N wget_Dro4_28NOV2013_401.tgz "wget"
qsub -b y -cwd -N wget_Dro5_29NOV2013_402.tgz "wget"
qsub -b y -cwd -N wget_Dro6_1DEC2013_403.tgz  "wget"

#get md5sums for PacBio .tar archives
echo "getting md5sums for PacBio tar archives"
qsub -b y -cwd -N wget_Dro1_24NOV2013_398.md5 "wget"
qsub -b y -cwd -N wget_Dro2_25NOV2013_399.md5 "wget"
qsub -b y -cwd -N wget_Dro3_26NOV2013_400.md5 "wget"
qsub -b y -cwd -N wget_Dro4_28NOV2013_401.md5 "wget"
qsub -b y -cwd -N wget_Dro5_29NOV2013_402.md5 "wget"
qsub -b y -cwd -N wget_Dro6_1DEC2013_403.md5  "wget"

#verify .tar archives have been downloaded properly
echo "verifying archives have been downloaded properly"
qsub -b y -cwd -N md5_Dro1_24NOV2013_398.md5 -hold_jid wget_Dro1_24NOV2013_398.tgz,wget_Dro1_24NOV2013_398.md5 "md5sum -c Dro1_24NOV2013_398.md5 > Dro1_24NOV2013_398.md5.checkresult"
qsub -b y -cwd -N md5_Dro2_25NOV2013_399.md5 -hold_jid wget_Dro2_25NOV2013_399.tgz,wget_Dro2_25NOV2013_399.md5 "md5sum -c Dro2_25NOV2013_399.md5 > Dro2_25NOV2013_399.md5.checkresult"
qsub -b y -cwd -N md5_Dro3_26NOV2013_400.md5 -hold_jid wget_Dro3_26NOV2013_400.tgz,wget_Dro3_26NOV2013_400.md5 "md5sum -c Dro3_26NOV2013_400.md5 > Dro3_26NOV2013_400.md5.checkresult"
qsub -b y -cwd -N md5_Dro4_28NOV2013_401.md5 -hold_jid wget_Dro4_28NOV2013_401.tgz,wget_Dro4_28NOV2013_401.md5 "md5sum -c Dro4_28NOV2013_401.md5 > Dro4_28NOV2013_401.md5.checkresult"
qsub -b y -cwd -N md5_Dro5_29NOV2013_402.md5 -hold_jid wget_Dro5_29NOV2013_402.tgz,wget_Dro5_29NOV2013_402.md5 "md5sum -c Dro5_29NOV2013_402.md5 > Dro5_29NOV2013_402.md5.checkresult"
qsub -b y -cwd -N md5_Dro6_1DEC2013_403.md5  -hold_jid wget_Dro6_1DEC2013_403.tgz,wget_Dro6_1DEC2013_403.md5  "md5sum -c Dro6_1DEC2013_403.md5  > Dro6_1DEC2013_403.md5.checkresult"

#extract PacBio runs
echo "extracting PacBio runs"
qsub -b y -cwd -N tar_Dro1_24NOV2013_398.tgz -hold_jid wget_Dro1_24NOV2013_398.tgz "tar -xvzf Dro1_24NOV2013_398.tgz"
qsub -b y -cwd -N tar_Dro2_25NOV2013_399.tgz -hold_jid wget_Dro2_25NOV2013_399.tgz "tar -xvzf Dro2_25NOV2013_399.tgz"
qsub -b y -cwd -N tar_Dro3_26NOV2013_400.tgz -hold_jid wget_Dro3_26NOV2013_400.tgz "tar -xvzf Dro3_26NOV2013_400.tgz"
qsub -b y -cwd -N tar_Dro4_28NOV2013_401.tgz -hold_jid wget_Dro4_28NOV2013_401.tgz "tar -xvzf Dro4_28NOV2013_401.tgz"
qsub -b y -cwd -N tar_Dro5_29NOV2013_402.tgz -hold_jid wget_Dro5_29NOV2013_402.tgz "tar -xvzf Dro5_29NOV2013_402.tgz"
qsub -b y -cwd -N tar_Dro6_1DEC2013_403.tgz  -hold_jid wget_Dro6_1DEC2013_403.tgz  "tar -xvzf Dro6_1DEC2013_403.tgz"

#delete .tar archives and md5 files
qsub -V -b y -cwd -N rm_tar_md -hold_jid tar_Dro1_24NOV2013_398.tgz,tar_Dro2_25NOV2013_399.tgz,tar_Dro3_26NOV2013_400.tgz,tar_Dro4_28NOV2013_401.tgz,tar_Dro5_29NOV2013_402.tgz,tar_Dro6_1DEC2013_403.tgz "rm Dro*.tgz Dro*.md5"
(getDrosPacBioData.sh on GitHub)

(iii) Once this download is complete, we’ll map data from all of the cells to the Release 5 reference and merge the output into a single file of mapped reads (.bam), retaining unmapped reads (.unaligned):


#map .bax.h5 files from individual PacBio cells to Release 5 using blasr
echo "mapping .bax.h5 files to Release 5"
for input in `ls Dro*/*/*/*bax.h5`
do
 sample=`basename $input .bax.h5`
 output1=${sample}.sam
 output2=${sample}.unaligned
 samples+=("sort_${sample}")
 qsub -l cores=12 -V -b y -cwd -N blasr_${sample} "install/smrtanalysis- $input dm3.fa -bestn 1 -nproc 12 -minPctIdentity 80 -sam -out $output1 -unaligned $output2"
 qsub -V -b y -cwd -N sort_${sample} -hold_jid blasr_${sample} "install/smrtanalysis- view -bS $output1 | install/smrtanalysis- sort - $sample"
done

#merge bam files from individual cells and index merged bam file
echo "merging and indexing bam files"
holdlist=`printf -- '%s,' "${samples[@]}"`
qsub -V -b y -cwd -N merge_pacbio -hold_jid $holdlist "install/smrtanalysis- merge dm3PacBio.bam m131*bam"
qsub -V -b y -cwd -N index_pacbio -hold_jid merge_pacbio "install/smrtanalysis- index dm3PacBio.bam"

#merge unmapped reads and remove .sam, .bam and .unaligned files from individual cells
echo "merging unmapped reads and cleaning up"
qsub -V -b y -cwd -N merge_unmapped -hold_jid index_pacbio "cat m131*unaligned > dm3PacBio.unaligned"
qsub -V -b y -cwd -N rm_cell_files -hold_jid merge_unmapped "rm m131*unaligned m131*bam m131*sam"

(mapDrosPacBioData.sh on GitHub)

(iv) Last, let’s generate the coverage plot shown above using the genome coverage utility in Aaron Quinlan’s BedTools package, R and imagemagick:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from dm3.chromInfo" > dm3.genome
bedtools genomecov -ibam dm3PacBio.bam -g dm3.genome > dm3.genome.coverage
R --no-save < dm3PacBioCoverage.R
convert -density 300 -quality 100 dm3PacBioCoverage.pdf dm3PacBioCoverage.jpg
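As a sanity check on the coverage estimates quoted above, mean depth per chromosome can be computed directly from the bedtools genomecov histogram output (columns: chrom, depth, count of bases at that depth, chrom size, fraction). This is a minimal sketch run on a made-up two-row histogram; point it at dm3.genome.coverage to get the real per-arm means:

```shell
# toy two-row genomecov histogram (chrom, depth, count, chrom_size, fraction)
printf 'chr2L\t90\t50\t100\t0.5\nchr2L\t100\t50\t100\t0.5\n' > toy.coverage

# mean depth per chromosome = sum(depth * count) / chrom_size
# (the "genome" summary rows emitted by bedtools genomecov are skipped)
awk '$1 != "genome" { sum[$1] += $2 * $3; size[$1] = $4 }
     END { for (c in size) printf "%s\t%.1f\n", c, sum[c] / size[c] }' toy.coverage
```

Replace toy.coverage with dm3.genome.coverage to verify the ~90x-95x autosome and ~45x X-chromosome figures.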

Credits: Many thanks to Edwin, Kristi, Jane and others at PacBio for providing this gift to the genomics community, to Sue, Bill and Charles at BDGP for collecting the flies and isolating the DNA used in this project, and to Sue and Roger for their contribution to the design and analysis of the experiment. Thanks also to Jane, Jason Chin, Edwin, Roger, Sue and Danny Miller for comments on the draft of this blog post.

Error-Corrected PacBio Sequences for the D. melanogaster Reference Strain

Posted 06 Nov 2013 — by caseybergman
Category drosophila, genome bioinformatics, high throughput sequencing

Using PacBio and Illumina whole genome shotgun sequences we recently released for the D. melanogaster reference strain, Sergey Koren and Adam Phillippy at the University of Maryland have run their pacBioToCA method to generate a dataset of error-corrected PacBio reads, which they have kindly made available here for re-use without restriction. This pilot dataset does not provide high enough coverage for a whole genome assembly, so none was attempted. Nevertheless, both the raw and error-corrected datasets should be of use to better understand the nature of PacBio data and the pacBioToCA pipeline as applied to Drosophila genomes.

The 2057_PacBio_corrected.tgz archive contains the following files:

  • pacbio.blasr.spec – the specification file used for the pacBioToCA run.
  • corrected.trim.fastq.bz2 – the error-corrected PacBio reads.
  • corrected.trim.lens – a file containing columns with corrected read id, read length, running total BP, running mean, and running coverage assuming 140Mbp genome size.
  • corrected.trim.names – a look-up table to map read IDs in the pacBioToCA output to the original PacBio input IDs.
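Since the .lens file is cumulative, the final running coverage of the corrected reads can be pulled from its last row. The toy file below just illustrates the five-column layout described above (the values are invented):

```shell
# hypothetical .lens rows: read id, length, running total bp, running mean, running coverage
printf 'read1\t1000\t1000\t1000.0\t0.01\nread2\t2000\t3000\t1500.0\t0.02\n' > toy.lens

# the final running coverage is the last value in column 5
tail -n 1 toy.lens | awk '{ print $5 }'
```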

Sergey also sent along a quick summary of the correction:

The correction used the latest CA from the repository (as of 10/15/13). The max number of mappings each Illumina sequence could have was set to 20, the repeat separation was set to 10. The genome size was set to 180Mbp. BLASR was used to align the Illumina sequences to PacBio sequences for correction. The input for correction was 673 Mbp of PacBio data (max: 8,231, mean: 1,587) which corresponds to approximately 4.8X of a 140Mbp genome. The correction produced 469Mbp (max: 6,883, mean: 1,410) or 3.35X. The throughput was approximately 69%. To estimate the accuracy of the data, it was mapped back to the D. melanogaster reference. The sequences average 99.85% identity. Approximately 10% of the sequences mapped in more than a single piece to the reference. Some of these mappings were due to short indel differences between the reference and the reads or N’s in the reference. Some sequences mapped across large distances in the reference. I did not confirm if these split mappings were supported by the uncorrected reads or not.
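A quick awk one-liner (a sketch using the Mbp totals quoted above and the assumed 140 Mbp genome size) reproduces Sergey’s coverage figures:

```shell
# Mbp of reads divided by a 140 Mbp genome gives the quoted fold coverage
awk 'BEGIN { printf "input: %.1fX  corrected: %.2fX\n", 673 / 140, 469 / 140 }'
```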

We thank both Sergey and Adam for taking the time to run their pacBioToCA pipeline and making these data available to the community, and hope these data are of use to others in their research.

Hosting Custom Tracks for the UCSC Genome Browser on Dropbox

Posted 17 Sep 2013 — by caseybergman
Category filesharing, genome bioinformatics, UCSC genome browser

One of the most powerful features of the UCSC Genome Browser is the ability to analyze your own genome annotation data as Custom Tracks at the main site. Custom Track files can be uploaded via several mechanisms: by pasting into a text box, loading a file through your web browser, or placing a file on a web-accessible server and pasting in a URL to the remote file. Loading custom tracks via remote URLs is the preferred method for large data files, and is mandatory for file types that require an associated “index” file such as the BAM file format commonly used in next-generation sequence analysis.

When developing a new pipeline, debugging a script or doing preliminary data analysis, you may frequently need to submit slightly different versions of your custom tracks to the UCSC browser. When using the remote URL function to load your data, this may require moving your files to a web accessible directory many times during the course of a day. This is not a big problem when working on a machine that is running a web server. But when working on a machine that is not configured to be a web server (e.g. most laptops), this often requires copying your files to another machine, which is a tiresome step in visualizing or analyzing your data.

During a recent visit to the lab, Danny Miller suggested using Dropbox to host custom tracks to solve this problem. Dropbox provides a very easy way to generate files on a local machine that sync automatically to a remote web-accessible server with effectively no effort by the user. Using Dropbox to host your UCSC custom tracks turns out to be a potentially very effective solution, and it looks like several others have stumbled on the same idea. In testing this solution, we have found that there are a few gotchas that one should be aware of to make this protocol work as easily as it should.

First, in July 2012 Dropbox changed their mechanism of making files web-accessible and no longer provide a “Public Folder” for new accounts. This was done ostensibly to improve security, since files in a Public Folder can be indexed by Google if they are linked to. Instead, now any file in your Dropbox folder can be made web accessible using the “Share Link” function. This mechanism for providing URLs to the UCSC Genome Browser is non-optimal for two reasons.

The first problem with the Share Link function is that the URL automatically generated by Dropbox cannot be read by the UCSC Genome Browser. For example, the link generated to the file “test.bed” in my Dropbox folder is “”, which gives an “Unrecognized format line 1” error when pasted into the UCSC Browser. If you just want to load a single custom track to the UCSC Browser using Dropbox, this can easily be fixed by simply replacing “www.dropbox” in the URL generated by Dropbox with “dl.dropboxusercontent”. In this example, the corrected path to the file would be “”, which can be loaded by the UCSC Genome Browser automatically.
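For a single file, this hostname substitution is a one-line sed. The Share Link below is a made-up example URL, not a real one:

```shell
# hypothetical Share Link; only the hostname needs rewriting for UCSC
url="https://www.dropbox.com/s/abc123/test.bed"
echo "$url" | sed 's/www\.dropbox/dl.dropboxusercontent/'
```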

The second problem with using the “Share Link” function for posting custom tracks to UCSC is that the URL generated for each file using this function is unique. This is a problem for two reasons: (i) you need to share and modify links for each file you upload and (ii) neither you nor the UCSC Genome Browser are able to anticipate what the path would be for multiple custom track files. This is problematic for custom track file formats that require an associated index file, which is assumed by the UCSC Genome Browser to be in the same directory as your custom track, with the appropriate extension. Since Dropbox makes a unique path to the index file, even if shared, there is no way for the Genome Browser to know where it is. Both of these issues can be solved by using the Public Folder function of your Dropbox account, rather than the Share Links function.

The good news is that Dropbox does still make the Public Folder function available for older accounts, and for newer accounts you can activate a Public Folder using this “secret” link. By placing your custom tracks in your Public Folder, you now have a stable base URL for providing files to the UCSC Genome Browser that does not require editing the URL, does not require explicitly sharing files, and can be anticipated by you and the browser. Following on from the example above, the link to the “test.bed” file inside a “custom_tracks” directory in my Dropbox Public Folder would be “” (your dropboxuserid will be a long integer). Thus, if you are using Dropbox to host many custom tracks or files that require an index file, the Public Folder option is the way to go.

There are a couple of caveats to using Dropbox to host custom tracks. The first is that you are limited to the size of your Dropbox allocation, which you can increase by paying for it or by inviting new users to Dropbox. UPDATE: According to Dave Tang (see comment below), if you use Dropbox to host large BigWig files you may get blocked by Dropbox. The second is that any link you share or anything in your Public Folder is public, so any stable links to your files from webpages may ultimately be indexed by Google. Since there is no option to password-protect shared files in the Public Folder of your Dropbox account, users looking for free options to sync large, password-protected custom track files with indexes to remote web-accessible servers should look into BitBucket, which has no filesize limit (unlike GitHub).

Credits: Danny Miller (Stowers Institute) had the original inspiration for this solution, and Michael Nelson (U. of Manchester) helped identify pros/cons of the various Dropbox hosting options.

A Reference “Hologenome” for D. melanogaster

Reference genomes are crucial for many applications in next-generation sequencing (NGS) analyses, especially in whole genome resequencing studies that form the basis of population genomics. Typically, one uses an off-the-shelf reference genome assembly for your organism of interest (e.g. human, Drosophila, Arabidopsis) obtained from the NCBI, UCSC or Ensembl genome databases. However, single-organism reference genomes neglect the reality that most organisms live in symbiosis with a large number of microbial species. When performing a whole-genome shotgun sequencing experiment on a macro-organism like D. melanogaster, it is inevitable that some proportion of its symbionts will also be sequenced, especially endosymbionts that live intracellularly like Wolbachia. To reduce mismapping and generate materials for population genomics of symbionts and their hosts, it is therefore appropriate to map to the genome of the species of interest as well as to the genomes of symbiotic microbial species, where available. However, reference genomes are currently stored on a per-species basis and the “hologenome” for any organism does not yet exist in a readily accessible form.

In my limited experience, attempting to construct such a “hologenome” can be a very tedious and potentially error-prone process, since reference genomes for model metazoan species and microbes are not always available in the same location or format. In gearing up for the second iteration of our work on the population genomics of microbial symbionts of D. melanogaster, I have decided to script the construction of a D. melanogaster reference hologenome from all microbial symbionts associated with D. melanogaster whose genomes are publicly available, which I thought may be of use to others. The code for this process is as follows and should work on any UNIX-based machine.


# Script to construct reference "hologenome"   #
# for D. melanogaster and associated microbes  #
# Casey M. Bergman (University of Manchester)  #

# download D. melanogaster (dm3) reference genome from UCSC genome database
# unpack dm3, remove composite mtDNA, extract mtDNA from chrU & combine into
# one .fa file excluding chrU & chrUextra

wget -q
tar -xvzf chromFa.tar.gz
rm chrM.fa
echo ">chrM_iso1" > chrM.fa
faFrag chrU.fa 5288527 5305749 stdout | grep -v chrU >> chrM.fa
cat chr2L.fa chr2LHet.fa chr2R.fa chr2RHet.fa chr3L.fa chr3LHet.fa chr3R.fa chr3RHet.fa chr4.fa chrX.fa chrXHet.fa chrYHet.fa chrM.fa > dm3.fa 
rm chromFa.tar.gz chr2L.fa chr2LHet.fa chr2R.fa chr2RHet.fa chr3L.fa chr3LHet.fa chr3R.fa chr3RHet.fa chr4.fa chrX.fa chrXHet.fa chrYHet.fa chrU.fa chrUextra.fa chrM.fa

# download yeast (sacCer3) reference genome from UCSC genome database
# unpack and rename sacCer3 chroms & combine into one .fa file

wget -q
tar -xvzf chromFa.tar.gz
for i in chr*fa; do cat $i | sed 's/>chr/>sacCer3_chr/g' >> sacCer3.fa; done
rm chromFa.tar.gz chrI.fa chrII.fa chrIII.fa chrIV.fa chrIX.fa chrM.fa chrV.fa chrVI.fa chrVII.fa chrVIII.fa chrX.fa chrXI.fa chrXII.fa chrXIII.fa chrXIV.fa chrXV.fa chrXVI.fa

# download genomes for microbes known to be associated with D. melanogaster
# from NCBI, combine multi-contig draft assemblies into single fasta file 
# per species & rename fasta headers to human readable form

# Wolbachia pipientis
wget -q
echo ">w_pipientis" >> w_pipientis.fa
grep -v \> AE017196.fna >> w_pipientis.fa
rm AE017196.fna

# Pseudomonas entomophila
wget -q
echo ">p_entomophila" >> p_entomophila.fa
grep -v \> NC_008027.fna >> p_entomophila.fa
rm NC_008027.fna

# Commensalibacter intestini
wget -q
tar -xvzf AGFR00000000.contig.fna.tgz
cat AGFR01*fna | sed 's/>.*|/>/g' | sed 's/> />/' | awk -F, '{ print $1; }' | sed 's/ /_/g' | sed 's/>Commensalibacter_intestini_A911_74_/>c_intestini_/g' > c_intestini.fa
rm AGFR01*fna AGFR00000000.contig.fna.tgz
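To see what the header-rewriting pipeline above actually does, here it is applied to a single hypothetical NCBI-style defline (the gi number and contig name are invented for illustration):

```shell
# demo of the defline rewrite on one made-up NCBI-style header
echo '>gi|123|gb|AGFR01000001.1| Commensalibacter intestini A911_74_1, whole genome shotgun sequence' \
  | sed 's/>.*|/>/g' | sed 's/> />/' | awk -F, '{ print $1; }' | sed 's/ /_/g' \
  | sed 's/>Commensalibacter_intestini_A911_74_/>c_intestini_/g'
```

The greedy first sed strips everything up to the last “|”, the awk drops the descriptive suffix after the comma, and the final sed shortens the species prefix to a compact contig name.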

# Acetobacter pomorum
wget -q
tar -xvzf AEUP00000000.contig.fna.tgz
cat AEUP01*fna | sed 's/>.*|/>/g' | sed 's/> />/' | awk -F, '{ print $1; }' | sed 's/ /_/g' | sed 's/>Acetobacter_pomorum_DM001_Contig00/>a_pomorum_/g' > a_pomorum.fa
rm AEUP01*fna AEUP00000000.contig.fna.tgz

# Gluconobacter morbifer
wget -q
tar -xvzf AGQV00000000.contig.fna.tgz
cat AGQV01*fna | sed 's/>.*|/>/g' | sed 's/> />/' | awk -F, '{ print $1; }' | sed 's/ /_/g' | sed 's/>Gluconobacter_morbifer_G707_75_/>g_morbifer_/g'  > g_morbifer.fa
rm AGQV01*fna AGQV00000000.contig.fna.tgz

# Providencia burhodogranariea
wget -q
tar -xvzf AKKL00000000.contig.fna.tgz
cat AKKL01*fna | sed 's/>.*|/>/g' | sed 's/> />/' | awk -F, '{ print $1; }' | sed 's/ /_/g' | sed 's/Providencia_burhodogranariea_DSM_19968_contig000/p_burhodogranariea_/g' > p_burhodogranariea.fa
rm AKKL01*fna AKKL00000000.contig.fna.tgz

# download genomes for microbes known to be associated with D. melanogaster
# from Ensembl & rename fasta headers to human readable form

# Providencia alcalifaciens
wget -q
gunzip Providencia_alcalifaciens_dmel2.GCA_000314875.2.19.dna.toplevel.fa.gz
cat Providencia_alcalifaciens_dmel2.GCA_000314875.2.19.dna.toplevel.fa | sed 's/dna.*//g' | sed 's/>contig000/>p_alcalifaciens_/g' > p_alcalifaciens.fa
rm Providencia_alcalifaciens_dmel2.GCA_000314875.2.19.dna.toplevel.fa*

# Providencia rettgeri
wget -q
gunzip Providencia_rettgeri_dmel1.GCA_000314835.1.19.dna.toplevel.fa.gz
cat Providencia_rettgeri_dmel1.GCA_000314835.1.19.dna.toplevel.fa | sed 's/dna.*//g' | sed 's/>Contig/>p_rettgeri_/g' > p_rettgeri.fa
rm Providencia_rettgeri_dmel1.GCA_000314835.1.19.dna.toplevel.fa* 

# Enterococcus faecalis
wget -q
gunzip Enterococcus_faecalis_fly1.GCA_000157415.1.19.dna.toplevel.fa.gz
cat Enterococcus_faecalis_fly1.GCA_000157415.1.19.dna.toplevel.fa | sed 's/dna.*//g' | sed 's/>GG/>e_faecalis_GG/g' > e_faecalis.fa
rm Enterococcus_faecalis_fly1.GCA_000157415.1.19.dna.toplevel.fa*

# create D. melanogaster "hologenome" 
# from individual species fasta files
cat dm3.fa sacCer3.fa w_pipientis.fa p_entomophila.fa c_intestini.fa a_pomorum.fa g_morbifer.fa p_burhodogranariea.fa p_alcalifaciens.fa p_rettgeri.fa e_faecalis.fa > dm3_hologenome.fa
(makeDmelHologenome.sh on GitHub)

Finally, some tools (e.g. SAMtools faidx) require all reference genome sequences to have the same number of characters per line, but different genome databases use different line widths, so the file above will have heterogeneous character counts for different species. To fix this, I use fasta_formatter from the FASTX toolkit to convert dm3_hologenome.fa into a file with fixed line lengths. To download and run this script, with the conversion to fixed line lengths, execute the following:

$ wget
$ sh
$ fasta_formatter -i dm3_hologenome.fa -o dm3_hologenome_v1.fa -w 50
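If fasta_formatter is not installed, the same fixed-width re-wrapping can be sketched in awk. This is a rough stand-in shown on a toy fasta, not the FASTX tool itself:

```shell
# toy fasta with ragged sequence line lengths
printf '>seq1\nACGTACGTAC\nGT\n' > toy.fa

# join sequence lines per record, then re-emit them wrapped at w characters
awk -v w=50 '
  function flush(  i) { for (i = 1; i <= length(seq); i += w) print substr(seq, i, w); seq = "" }
  /^>/ { flush(); print; next }
  { seq = seq $0 }
  END { flush() }
' toy.fa
```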

Credits: Thanks go to Douda Bensasson for some SED tips during the construction of this script.

PacBio Whole Genome Shotgun Sequences for the D. melanogaster Reference Strain

Posted 31 Jul 2013 — by caseybergman
Category drosophila, genome bioinformatics, high throughput sequencing

As part of a collaboration with Danny Miller and Scott Hawley from the Stowers Institute, we have generated whole genome shotgun sequences using PacBio RS technology for the Drosophila melanogaster y; cn, bw, sp strain (Bloomington 2057), the same strain that was also used to assemble the D. melanogaster reference genome sequence. We’ve been meaning to release these data to the community since we got the data in April, but have been waylaid by teaching commitments and a spate of recent server problems.  Prompted by Danny’s visit to the Bergman Lab over the last two weeks and the generous release by Dmitri Petrov’s lab of a data set of Illumina long reads using the Moleculo technology for the same strain of D. melanogaster, we’ve finally gotten around to bundling these D. melanogaster PacBio sequences for general release. We’re hoping that the availability of both PacBio and Moleculo long-read data for the same strain that has one of the highest quality reference genomes for any species will allow the genomics community to investigate directly the pros/cons of each of these new exciting technologies.

These PacBio sequence data were generated by the University of Michigan DNA Sequencing Core facility on a PacBio RS using DNA from 10 adult males. Flies were starved for 4 hours to reduce microbial DNA contaminants before freezing at -80°C, and DNA was prepped using the Qiagen DNeasy Blood & Tissue Kit (catalog number 69504). Six SMRT cells were used for sequencing 5 kb fragment libraries on two Long Read mode runs: 1 cell on the first run (Run53) and 5 cells on the second run (Run55). Excluding the circular consensus (CCS) reads, and combining data from the S1 and S2 files for each of the six cells, we obtained 1,357,183,439 bp of raw DNA sequence from this experiment, or roughly 7.5x coverage of the 180 Mb male D. melanogaster genome.
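The coverage figure above is easy to check: total raw base pairs divided by the genome size, using the numbers quoted in the text (the 180 Mb male genome size is an approximation).

```python
# Sanity check of the quoted ~7.5x coverage: raw bases / genome size.
total_bp = 1_357_183_439   # raw PacBio bases from the six SMRT cells
genome_bp = 180_000_000    # approximate male D. melanogaster genome size
coverage = total_bp / genome_bp
print(f"{coverage:.1f}x")  # 7.5x
```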

A ~63 Gb tar.gz of the entire PacBio long read dataset including .fasta, .fastq, and .h5 files can be found here. To help newcomers to PacBio data (like us) get over the hurdle of installing the PacBio long-read aligner blasr, we are also distributing CentOS 6/Scientific Linux 64-bit and source RPMs for blasr, made by Peter Briggs in the University of Manchester Bioinformatics Core Facility.

We have also generated 100 bp paired-end Illumina data from the same stock, to aid with error correction of the PacBio long reads. These data were generated by the Stowers Institute Molecular Biology Core facility. As above, genomic DNA was prepared from 10 starved, adult males using the Qiagen DNeasy Blood & Tissue Kit. 1 µg of DNA was fragmented to 250 base pair (bp) fragments using a Covaris S220 sonicator (Covaris Inc.) by adjusting the treatment time to 85 seconds. Following the manufacturer’s directions, short fragment libraries were made using the KAPA Library Preparation Kit (KAPA Biosystems, Cat. No. KK8201) and Bioo Scientific NEXTflex™ DNA Barcodes (Bioo Scientific, Cat. No. 514104). The resulting libraries were purified using the Agencourt AMPure XP system (Beckman Coulter, Cat. No. A63880), then quantified using a Bioanalyzer (Agilent Technologies) and a Qubit Fluorometer (Life Technologies). The library was pooled with several other strains, re-quantified and run in high output mode on six 100 bp paired-end lanes of an Illumina HiSeq 2000 instrument, using HiSeq Control Software and Real-Time Analysis (RTA). CASAVA-1.8.2 was run to demultiplex reads and generate FASTQ files.

The Illumina 100 bp paired-end dataset in .fastq format can be found at ENA under accession ERX645969. Additional Illumina data for the ISO1 reference strain have been generated previously by Chuck Langley’s lab, either from mixed adult males and females (SRX040484) or from haploid embryos (SRX040485, SRX040486, SRX040491), which could be used to supplement the PacBio error correction.

As with previous unpublished data we have released from the Bergman Lab, we have chosen to release these genomic data under a Creative Commons CC-BY license, which requires only that you credit the originators of the work as specified below.  However, we hope that users of these data respect the established model of genomic data release under the Ft. Lauderdale agreement that is traditionally honored for major sequencing centers.  Until the paper describing these genomic data is published, please cite:

  • Miller, D.E., C.B. Smith, R.S. Hawley and C.M. Bergman (2013) PacBio Whole Genome Shotgun Sequences for the D. melanogaster Reference Strain.

Visualizing Biological Data Using the SVGmap Browser

Posted 14 Mar 2013 — by caseybergman
Category data mining, drosophila

Early in 2012, Nuria Lopez-Bigas’ Biomedical Genomics Group published a paper in Bioinformatics describing a very interesting tool for visualizing biological data in a spatial context called SVGmap. The basic idea behind SVGMap is (like most good ideas) quite straightforward – to plot numerical data on a pre-defined image to give biological context to the data in an easy-to-interpret visual form.

To do this, SVGmap takes as input an image in Scalable Vector Graphics (SVG) format where elements of the image are tagged with an identifier, plus a table of numerical data with values assigned to the same identifiers as the elements of the image. SVGMap then integrates these files using either a graphical user interface that runs in a standard web browser or a command line interface that runs in your terminal, allowing the user to display color-coded numerical data on the original image. The overall framework of SVGMap is shown below in an image taken from a post on the Biomedical Genomics Group blog.
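The identifier-matching idea at the heart of SVGmap can be illustrated with a short Python sketch. This is not SVGmap’s actual code; the element ids, values and crude two-color scale below are made up for illustration.

```python
# Toy version of the SVGmap idea: color SVG elements whose id appears in
# a table of numerical values (crude two-sided scale: red >= 0, blue < 0).
import xml.etree.ElementTree as ET

def color_by_value(svg_text, values, low="#0000ff", high="#ff0000"):
    root = ET.fromstring(svg_text)
    for elem in root.iter():
        eid = elem.get("id")
        if eid in values:
            elem.set("fill", high if values[eid] >= 0 else low)
    return ET.tostring(root, encoding="unicode")

svg = '<svg><rect id="wing"/><rect id="eye"/></svg>'
out = color_by_value(svg, {"wing": 0.25, "eye": -0.3})
print(out)
```

A real implementation would map each value onto a continuous color scale rather than a two-color threshold, which is what SVGmap’s “linear two sided” scale option does.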

Overview of the SVGmap

We’ve been using SVGMap over the last year to visualize tissue-specific gene expression data in Drosophila melanogaster from the FlyAtlas project, which comes as one of the pre-configured “experiments” in the SVGMap web application.

More recently, we’ve also been using the source distribution of SVGMap to display information about the insertion preferences of transposable elements in a tissue-specific context, which has required installing and configuring a local instance of SVGMap and running it via the browser. The documentation for SVGMap is good enough to do this on your own, but it took us a while to get a working instance the first time around. We ran into the same issues again the second time, so I thought I’d write up my notes for future reference and to help others get SVGMap up and running as fast as possible.

Install and Launch the SVGMap executable

Open a terminal and execute the following commands:

cd svgmap-server
sh &

This will install and launch the SVGMap executable which runs as a web application on your machine on port 8095.

Configure your “experiment”

The combination of an image and data source is considered an “experiment” in the SVGmap framework.  Here we are going to use one of the SVG images (of the adult Dipteran body plan adapted from Creative Commons Licensed images in Wikipedia found here and here) provided by SVGmap plus a new data source from our transposable element insertion site analysis as an example.

1) Type http://localhost:8095 into your web browser and press “Return”. You will now see a local version of SVGmap running on your machine, identical to the version of SVGmap hosted at: but with no data source installed.

2) Click “Browse” tab at top of page, which will take you to another page where you can “Select the experiment to browse”. Skip this for now, since we are still in the process of configuration.

3) Click “configure browser” in the upper right hand corner of the page. This will take you to the “Experiment configuration area”.

4) Log in with the following credentials and click “Accept”:

  • Login: svgmap
  • Password: svgmap

5) This will bring up an option to add a new source in the Experiment configuration area. Click “New Source”.

6) This will send you to the “Add new experiment” page, where you can enter the name of the new Experiment (e.g. “Test Experiment”).

7) Next, select the test SVG file of the Dipteran body plan by clicking “Browse…” next to the “Search SVG file” box, navigating to the examples/Drosophila_expression/ directory in your svgmap installation directory and selecting the file named “housefly_anatomy_bnw.svg”.

8) Next we need to load in the file containing the data you would like to visualize. The formatting specification of the datafile for SVGmap is documented here. In our simple test case, we only want to visualize a single set of numerical values over different Dipteran tissues using our test data file which you can obtain as follows:


To load this test datafile, click “Browse…” next to the “Data file” dialogue box, navigate to the location of the test data file you just downloaded and select it.

9) Finally, click “Save and configure.”

10) This will send you to the “Edit experiment configuration” page, where you can select how to display your numerical data. In our test case, select “ratio” for the “Image option values” (it’s the only option, but it needs to be selected) and “linear two sided” for the Scale option. Make sure the image radio button is selected, then click “Accept”.

Visualize your Data

Your new experiment has now been loaded and configured and you are ready to browse your data in the context of your image.

1) Click the “Browse” tab at top of page.

2) Select the experiment you just created from the drop-down menu (e.g. “Test Experiment”) and click “view”.

3) Enter a “Keyword(probe)” for the data you want to visualize, which in our case is called “TE” and click “Search”. This should bring up an image of a fly, color-coded according to your data with a scale bar below indicating the color associated with a particular numerical value of your data.

4) You can now adjust the range and color code for your scale by clicking the “edit” button next to “linear two sided scale”. For example, if you want a symmetrical scale that brackets the full range of our test TE data, click the radio button for “” and set Min to -0.3 and Max to 0.3, scroll down to the bottom of the pop-up box and click “OK”. Then click “Search” again to update your image, and you should see something that looks like the following screenshot:


5) Finally, you can export the image by clicking “Export to file” in the bottom right hand corner of the page. This will export a zipfile of the entire experiment, including the original input SVG file and a slightly modified version of your data file, plus the labeled SVG file and scale bar as a separate JPG file.

There you have it. You’ve now successfully installed SVGmap and configured a customized experiment for visualizing biological data on an SVG image, in this case the Dipteran adult body plan. In the next post, I’ll give a run-through of how to use SVGmap at the command line, and introduce a stripped-down R implementation of the SVGmap approach, called FlyFig, that we have developed in the Bergman Lab to visualize numerical data on the Dipteran body plan.

Compendium of Drosophila Microbial Symbiont Genome Assemblies

Posted 25 Jan 2013 — by caseybergman
Category drosophila, genome bioinformatics, microbes

Following on from a recent post where we compiled genome assemblies for species in the Drosophila genus, here we bring together the growing list of genome assemblies for microbial symbionts of Drosophila species. Please add information about any other Drosophila symbiont genome assemblies that might be missing from this table in the comments below.

Species Strain Host Assembly Producer Accession
Wolbachia pipientis wMel D. melanogaster TIGR AE017196
Wolbachia pipientis wMelPop D. melanogaster O’Neill Lab AQQE00000000
Wolbachia pipientis wAu D. simulans Sinkins Lab LK055284
Wolbachia pipientis wSim D. simulans Salzberg Lab AAGC00000000
Wolbachia pipientis wRi D. simulans Andersson Lab NC_012416
Wolbachia pipientis wHa D. simulans Andersson Lab NC_021089
Wolbachia pipientis wNo D. simulans Andersson Lab NC_021084
Wolbachia pipientis wSuzi D. suzukii Anfora Lab CAOU02000001
Wolbachia pipientis wAna D. ananassae Salzberg Lab AAGB00000000
Wolbachia pipientis n.a. D. willistoni Salzberg Lab AAQP00000000
Wolbachia pipientis wRec D. recens Bordenstein Lab JQAM00000000
Pseudomonas entomophila L48 D. melanogaster Genoscope NC_008027
Commensalibacter intestini A911 D. melanogaster Lee Lab AGFR00000000
Acetobacter pomorum DM001 D. melanogaster Lee Lab AEUP00000000
Gluconobacter morbifer G707 D. melanogaster Lee Lab AGQV00000000
Providencia sneebia DSM 19967 D. melanogaster Lazzaro Lab AKKN00000000
Providencia rettgeri Dmel1 D. melanogaster Lazzaro Lab AJSB00000000
Providencia alcalifaciens Dmel2 D. melanogaster Lazzaro Lab AKKM00000000
Providencia burhodogranariea DSM 19968 D. melanogaster Lazzaro Lab AKKL00000000
Lactobacillus brevis EW D. melanogaster Lee Lab AUTD00000000
Lactobacillus plantarum WJL D. melanogaster Lee Lab AUTE00000000
Lactobacillus plantarum DmelCS_001 D. melanogaster Douglas Lab JOJT00000000
Lactobacillus fructivorans DmelCS_002 D. melanogaster Douglas Lab JOJZ00000000
Lactobacillus brevis DmelCS_003 D. melanogaster Douglas Lab JOKA00000000
Acetobacter pomorum DmelCS_004 D. melanogaster Douglas Lab JOKL00000000
Acetobacter malorum DmelCS_005 D. melanogaster Douglas Lab JOJU00000000
Acetobacter tropicalis DmelCS_006 D. melanogaster Douglas Lab JOKM00000000
Lactococcus lactis Bpl1 D. melanogaster Douglas Lab JRFX00000000
Enterococcus faecalis Fly1 D. melanogaster Broad Institute ACAR00000000

Formatting Figures for PLoS ONE with Illustrator on OSX

Posted 30 Aug 2012 — by caseybergman
Category OSX hacks

We in the Bergman Lab are big supporters of the Public Library of Science, and have increasingly been submitting papers to PLOS ONE over the last few years because we believe this journal represents the true values of science: openness, technical rigor and objectivity. Because PLOS ONE uses a streamlined production process, their author guidelines are very strict, with article formatting responsibilities falling on the author that would traditionally be handled with the help of a copy editor.

One area of PLOS ONE article formatting that I have found particularly difficult in the past is meeting the exact figure specifications that pass the automated checks of the PLOS ONE Editorial Manager system.  Personally, I find the guidelines for figure preparation for PLOS ONE to be somewhat bewildering. In my first few submissions, I wasted substantial time uploading files that failed the automated checks and/or received nerve-wracking requests to change figure formats at the post-acceptance stage, where I did not get another chance to look at the manuscript before it went live. After some trial and error, I have gotten my head around what actually works to prepare figures for a trouble-free PLOS ONE submission. So to save a headache and speed up the publication process for one or more scientists out there (and so I have access to these notes out of the office), I’ve typed up a protocol that should work for preparing PLOS ONE-ready figures using Illustrator on OSX.

1. Prepare your figure in your favorite software (R, Illustrator, etc.) and Save.

2. Open/Import into Illustrator.

3. In Illustrator, under the “File” menu, select “Export…”. You will now see a window entitled “Export”.

4. Select “TIFF (tif)” from the “Format” dropdown menu. You should see something like this:

5. Click “Export”. You will now see a window entitled “TIFF Options”.

6. Set the “Color Model” drop-down menu to “RGB”.

7. Click the “Other” radio button for the “Resolution” setting and set to 500 dpi.

8. Select the “Anti-Aliasing” check-box.

9. Select the “LZW Compression” check-box. At this point you should see a screen something like this:

10. Click “OK”.

You should now have a 500 dpi .tif file that is ready to upload with minimal (to no) complaints by the PLOS ONE Editorial Manager, and hopefully your next open-access manuscript will be speeding to publication soon.

Notes: This recipe was developed using an Illustrator 11.0.0 on a MacBook Air running OSX 10.6.8. Please don’t laugh at my ancient software – I find upgrading is the enemy of efficiency.

Release of 20 European Drosophila melanogaster genomes

Posted 17 Jul 2012 — by caseybergman
Category drosophila, high throughput sequencing, population genomics

[Update: These genome sequences (and 30 others) have been deposited in ENA under project accession ERP009059. If you use these data in your work, please cite as Bergman & Haddrill (2015) Strain-specific and pooled genome sequences for populations of Drosophila melanogaster from three continents. F1000Research 2015, 4:31.]

Background: As part of ongoing efforts to characterize genetic diversity in the nuclear and cytoplasmic genomes of D. melanogaster, the Haddrill and Bergman Labs have collaborated to sequence the complete genomes of 20 D. melanogaster isofemale strains collected by Penny Haddrill in Montpellier, France in August 2010. These 20 genomes represent a random sample of the full collection described by Haddrill and Verspoor (2011), which also describes microsatellite variation data for these strains.

Following on from the very generous early release of D. melanogaster genomes by major resequencing efforts in Drosophila, we have decided to follow suit and release these genomes prior to publication to maximise their utility by the wider research community and prevent unnecessary duplication of effort. One major aim for our sequencing of a reasonably high number of strains from this European population is to provide a complementary dataset to help interpret the larger samples of North American and (predominantly) African strains from the Drosophila Genetic Reference Panel and Drosophila Population Genomics Project, respectively. For more on the philosophy behind why we have made the decision to release these data early, please see this blog post on genomic data release by individual labs in the next-generation sequencing era.

Methods: Genomic DNA was prepared by Penny Haddrill for each isofemale line by pooling fifty females, snap freezing them in liquid nitrogen, and extracting DNA using a standard phenol-chloroform extraction protocol with ethanol and ammonium acetate precipitation. 500 bp short-insert libraries were constructed and 91 bp paired-end reads were generated using an Illumina HiSeq 2000 to an estimated coverage of ~50x per strain by BGI-Hong Kong. Basic QC on reads was performed by BGI, and mapping to the Wolbachia genome following the protocol in Richardson et al. (submitted) confirmed the same infection status as determined by PCR in Haddrill and Verspoor (2011).
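As a back-of-the-envelope check on the sequencing effort per strain, the number of 91 bp read pairs implied by ~50x coverage can be worked out directly (the ~180 Mb genome size is an assumption; the other numbers come from the text):

```python
# Rough read-pair count implied by ~50x coverage with 91 bp paired-end reads.
genome_bp = 180_000_000   # approximate D. melanogaster genome size (assumed)
read_len = 91             # bp per read; each pair contributes 2 * 91 bp
target_cov = 50
pairs = target_cov * genome_bp / (2 * read_len)
print(f"~{pairs / 1e6:.0f} million read pairs per strain")  # ~49 million
```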

Conditions for use: The Haddrill and Bergman labs intend to use these data to study patterns of genetic diversity in the nuclear and cytoplasmic genomes, to estimate the ratio of diversity on the X chromosome relative to the autosomes, to detect signatures of both positive and negative selection in the nuclear and cytoplasmic genomes, and to investigate the impact of variation in recombination rate around the genome.

We have decided to release these genomic data under a Creative Commons CC-BY license, which requires only that you credit the originators of the work as specified below.  However, we hope that users of these data respect the established model of genomic data release under the Ft. Lauderdale agreement that is traditionally honored for major sequencing centers.  Until the paper describing these genomes is published, please cite these data as:

  • Haddrill, P. and C.M. Bergman (2012) 20 Drosophila melanogaster genomes from Montpellier, France.

[Update: These genome sequences have now been published. If you use these data in your work, please cite as Bergman & Haddrill (2015) Strain-specific and pooled genome sequences for populations of Drosophila melanogaster from three continents. F1000Research 2015, 4:31.]

The data: Gzipped Illumina fastq files for forward and reverse paired reads can be downloaded at the following locations. A script to download all files in serial can be found below the table.

[Update: These genome sequences (and 30 others) have been deposited in ENA under project accession ERP009059. If you use these data in your work, please cite accession numbers ERR705945-ERR705964 in your work.]