Archive for the ‘high throughput sequencing’ Category

Automating ASCP-based submission of NGS data to ENA using expect

Posted 12 Dec 2014 — by caseybergman
Category EBI, filesharing, genome bioinformatics, high throughput sequencing, linux, NCBI

Submitting next-generation sequencing (NGS) data to International Nucleotide Sequence Database Collaboration repositories such as EBI or NCBI is an essential, but time-consuming, phase of many projects in modern biology. Therefore, mechanisms to speed up NGS data submission are a welcome addition to many biologists toolkit.  The European Nucleotide Archive (ENA) provides detailed instructions on how to submit NGS data to their “dropbox” staging area, which is a key part of the overall submission process for an NGS project. This can be done in three ways:

  • Using ENA’s Webin client in your web browser,
  • Using a standard FTP client at the command line, or
  • Using Aspera’s ASCP client at the command line.

For users with many samples to submit, the latter two command line options are clearly preferred methods. While FTP is installed by default on most linux systems, transfer of large NGS data files by FTP is slow, and on some systems (such as ours) FTP is specifically disabled because of security concerns.

In contrast, ASCP is not installed by default on most linux systems, but provides a very fast method to transfer large data files to EBI or NCBI. One of the downsides of using ASCP is that it interactively prompts users for password information for each file transferred. This requires babysitting your ASCP-based command line submission and supplying the same password for each file, thereby undermining much of the automation that a command line solution should provide.

In searching around for solutions to the ASCP-babysitting problem, I stumbled on documentation page entitled “Expect script for automating Aspera uploads to the EBI ENA” written by Robert Davey at the The Genome Analysis Centre. I had never heard of the expect scripting language prior to reading this post, but it provides exactly the solution I was looking for:

[Expect] is used to automate control of interactive applications such as telnet, ftp, passwd, fsck, rlogin, tip, ssh, and others. Expect uses pseudo terminals (Unix) or emulates a console (Windows), starts the target program, and then communicates with it, just as a human would, via the terminal or console interface. (Wikipedia)

Robert’s expect script was a little more complicated than I needed for my purposes, and required a few tweaks to conform to updates to EBI’s ASCP submission process. Without too much trouble, I cobbled together a modified version that solves the automated transfer of any number of data files to the top level of your ENA dropbox:

set fofn [lindex $argv 0]
set dropbox [lindex $argv 1]
set pass [lindex $argv 2]

set files [open $fofn]
set subs [read $files]

set direxist 0
set timeout -1
foreach line [split $subs \n] {
  if { "" != $line } {
    set seqfile [exec basename $line]
    set lst [split $line "/"]
    spawn ascp -QT -l80M -d $line $
    expect "Password:"
    send "$pass\r"
    expect eof

This script requires expect and ASCP to be installed globally on your system, and for the user to provide three arguments:

  • a file of filenames (with the full path) to the files you would like to submit to ENA
  • the ID for your Webin submission account, and
  • the password for your Webin submission account

For example, if you have a directory of gzip’ed fastq files that you would like to submit to ENA, all you would need to submit your files in bulk would be to navigate to that directory and do something like the following:

#download the script above from github
wget --no-check-certificate

#create file of filenames
for k in `ls *gz`; do echo $j/$k; done > fofn

#perform ASCP submission, note: replace with your ENA (Webin ID and Password)
expect ena_submit.exp fofn Webin-ID Webin-Password

High Coverage PacBio Shotgun Sequences Aligned to the D. melanogaster Genome

Shortly after we released our pilot dataset of Drosophila melanogaster PacBio genomic sequences in late July 2013, we were contacted by Edwin Hauw and Jane Landolin from PacBio to collaborate on the collection of a larger dataset for D. melanogaster with longer reads and higher depth of coverage. Analysis of our pilot dataset revealed significant differences between the version of the ISO1 (y; cn, bw, sp) strain we obtained from the Bloomington Drosophila Stock Center (strain 2057) and the D. melanogaster reference genome sequence. Therefore, we contacted Susan Celniker and Roger Hoskins at the Berkeley Drosophila Genome Project (BDGP) at the Lawrence Berkeley National Laboratory in order to use the same subline of ISO1 that has been used to produce the official BDGP reference assemblies from Release 1 in 2000 to Release 5 in 2007. Sue and Charles Yu generated high-quality genomic DNA from a CsCl preparation of ~2000 BDGP ISO1 adult males collected by Bill Fisher, and provided this DNA to Kristi Kim at PacBio in mid-November 2013 who handled library preparation and sequencing, which was completed early December 2013. Since then Jane and I have worked on formatting and QC’ing the data for release. These data have just been publicly released without restriction on the PacBio blog, where you can find a description of the library preparation, statistics on the raw data collected and links to preliminary de novo whole genome assemblies using the Celera assembler and PacBio’s newly-released FALCON assembler. Direct links to the raw, preassembled and assembled data can be found on the PacBio “Drosophila sequence and assembly” DevNet GitHub wiki page.

Here we provide an alignment of this high-coverage D. melanogaster PacBio dataset to the Release 5 genome sequence and some initial observations based on this reference alignment. Raw, uncorrected reads were mapped using blasr (-bestn 1 -nproc 12 -minPctIdentity 80) and converted to .bam format using samtools. Since reads were mapped to the the Release 5 genome, they can be conveniently visualized using the UCSC Genome Browser by loading a BAM file of the mapped reads as a custom track. To browse this data directly on the European mirror of the UCSC Genome Browser click here, or alternatively paste the following line into the “Paste URLs or data:” box on the “Add Custom Tracks” page of the Release 5 (dm3) browser:

track type=bam bigDataUrl= name=dm3PacBio

As shown in the following figure, mapped read depth is roughly Poisson distributed with a mean coverage of ~90x-95x on the autosomal arms and ~45x coverage for the X-chromosome. A two-fold reduction in read depth on the X relative to the autosomes is expected in a DNA sample only from heterogametic XY males. Mapped read depth is important to calculate empirically post hoc (not just based on theoretical a priori expectations) since the actual genome size of D. melanogaster is not precisely known and varies between males and females, with an upper estimate of the total male genome size (autosomes + X + Y) being ~220 Mb. Thus, we conclude the coverage of this dataset is of sufficient depth to generate long-range de novo assemblies of the D. melanogaster genome using PacBio data only, as confirmed by the preliminary Celera and FALCON assemblies.


The following browser screenshots show representative regions of how this high coverage PacBio dataset maps to the Release 5 reference genome:

  • A typical region of unique sequence (around the eve locus). Notice that read coverage is relatively uniform across the genome and roughly equal on the forward (blue) and reverse (red) strands. Also notice that some reads are long enough to span the entire distance between neighboring genes, suggesting unassembled PacBio reads are now long enough to establish syntenic relationships for neighboring genes in Drosophila from unassembled reads.
  • A typical region containing a transposable element (TE) insertion (in the ds locus), in this case a 8126 bp roo LTR retrotransposon (FBti0019098). Substantial numbers of mapped reads align across the junction region of the TE and its unique flanking regions. However, there is an excess of shorter reads inside the boundaries of the TE, which are likely to be incorrectly mapped because they are shorter than the length of the active TE sequence and have mapping quality values of ~0. With suitable read length and mapping quality filters, the unique location of TE insertions in the euchromatin should be identifiable using uncorrected PacBio reads.


  • One of the few remaining and persistent gaps in the euchromatic portion of the reference sequence that can be filled using PacBio sequences. This gap remains in the upcoming Release 6 reference sequence (Sue Celniker and Roger Hoskins, personal communication). This shows blasr can align PacBio reads through smaller sequence gaps that are represented by padded N’s, however the visualization of such reads may sometimes be flawed (i.e. by artificially displaying a gap in the mapped read that arises from a lack of alignment to the ambiguous nucleotides in the gap).


Below we provide a reproducible transcript of the code used to (i) setup our working environment, (ii) download and unpack this D. melanogaster PacBio dataset, (iii) generate the BAM file of mapped reads released here, and (iv) generate coverage density plots for the major chromosome arms shown above. Part of the rationale for providing these scripts is to showcase how simple it is to install the entire PacBio smrtanalysis toolkit, which is the easiest way we’ve found to install the blasr long-read aligner (plus get a large number of other goodies for processing PacBio data like pacBioToCA, pbalign, quiver, etc).

These scripts are written to be used by someone with no prior knowledge of PacBio data or tools, and can be run on a Oracle Grid Engine powered linux cluster running Scientific Linux (release 6.1).  They should be easily adapted for use on any linux platform supported by PacBio’s smrtanalysis toolkit, and to run on a single machine just execute the commands inside the “quote marks” on the lines with qsub statements.

(i) First, let’s download the Release 5 reference genome plus the smrtanalysis toolkit:


# get dm3 and reformat into single fasta file
echo "getting dm3 and reformatting into single fasta file"
wget -q
tar -xvzf chromFa.tar.gz
cat chr*.fa > dm3.fa
rm chromFa.tar.gz chr*.fa

#install PacBio smrtanalysis suite (including blasr PacBio long-read mapping engine)
echo "installing PacBio smrtanalysis suite"
bash --extract-only --rootdir ./
setupDrosPacBioEnv.shview rawview file on GitHub

If you want to have the smrtanalysis tooklit in your environment permanently, you’ll need to add the following magic line to your .profile:

source install/smrtanalysis-

(ii) Next, let’s get the D. melanogaster PacBio data and unpack it:


#get compressed .tar archives of D. melanogaster PacBio runs
echo "get tar archives of PacBio runs"
qsub -b y -cwd -N wget_Dro1_24NOV2013_398.tgz "wget"
qsub -b y -cwd -N wget_Dro2_25NOV2013_399.tgz "wget"
qsub -b y -cwd -N wget_Dro3_26NOV2013_400.tgz "wget"
qsub -b y -cwd -N wget_Dro4_28NOV2013_401.tgz "wget"
qsub -b y -cwd -N wget_Dro5_29NOV2013_402.tgz "wget"
qsub -b y -cwd -N wget_Dro6_1DEC2013_403.tgz  "wget"

#get md5sums for PacBio .tar archives
echo "getting md5sums for PacBio tar archives"
qsub -b y -cwd -N wget_Dro1_24NOV2013_398.md5 "wget"
qsub -b y -cwd -N wget_Dro2_25NOV2013_399.md5 "wget"
qsub -b y -cwd -N wget_Dro3_26NOV2013_400.md5 "wget"
qsub -b y -cwd -N wget_Dro4_28NOV2013_401.md5 "wget"
qsub -b y -cwd -N wget_Dro5_29NOV2013_402.md5 "wget"
qsub -b y -cwd -N wget_Dro6_1DEC2013_403.md5  "wget"

#verify .tar archives have been downloaded properly
echo "verifying archives have been downloaded properly"
qsub -b y -cwd -N md5_Dro1_24NOV2013_398.md5 -hold_jid wget_Dro1_24NOV2013_398.tgz,wget_Dro1_24NOV2013_398.md5 "md5sum -c Dro1_24NOV2013_398.md5 > Dro1_24NOV2013_398.md5.checkresult"
qsub -b y -cwd -N md5_Dro2_25NOV2013_399.md5 -hold_jid wget_Dro2_25NOV2013_399.tgz,wget_Dro2_25NOV2013_399.md5 "md5sum -c Dro2_25NOV2013_399.md5 > Dro2_25NOV2013_399.md5.checkresult"
qsub -b y -cwd -N md5_Dro3_26NOV2013_400.md5 -hold_jid wget_Dro3_26NOV2013_400.tgz,wget_Dro3_26NOV2013_400.md5 "md5sum -c Dro3_26NOV2013_400.md5 > Dro3_26NOV2013_400.md5.checkresult"
qsub -b y -cwd -N md5_Dro4_28NOV2013_401.md5 -hold_jid wget_Dro4_28NOV2013_401.tgz,wget_Dro4_28NOV2013_401.md5 "md5sum -c Dro4_28NOV2013_401.md5 > Dro4_28NOV2013_401.md5.checkresult"
qsub -b y -cwd -N md5_Dro5_29NOV2013_402.md5 -hold_jid wget_Dro5_29NOV2013_402.tgz,wget_Dro5_29NOV2013_402.md5 "md5sum -c Dro5_29NOV2013_402.md5 > Dro5_29NOV2013_402.md5.checkresult"
qsub -b y -cwd -N md5_Dro6_1DEC2013_403.md5  -hold_jid wget_Dro6_1DEC2013_403.tgz,wget_Dro6_1DEC2013_403.md5  "md5sum -c Dro6_1DEC2013_403.md5  > Dro6_1DEC2013_403.md5.checkresult"

#extract PacBio runs
echo "extracting PacBio runs"
qsub -b y -cwd -N tar_Dro1_24NOV2013_398.tgz -hold_jid wget_Dro1_24NOV2013_398.tgz "tar -xvzf Dro1_24NOV2013_398.tgz"
qsub -b y -cwd -N tar_Dro2_25NOV2013_399.tgz -hold_jid wget_Dro2_25NOV2013_399.tgz "tar -xvzf Dro2_25NOV2013_399.tgz"
qsub -b y -cwd -N tar_Dro3_26NOV2013_400.tgz -hold_jid wget_Dro3_26NOV2013_400.tgz "tar -xvzf Dro3_26NOV2013_400.tgz"
qsub -b y -cwd -N tar_Dro4_28NOV2013_401.tgz -hold_jid wget_Dro4_28NOV2013_401.tgz "tar -xvzf Dro4_28NOV2013_401.tgz"
qsub -b y -cwd -N tar_Dro5_29NOV2013_402.tgz -hold_jid wget_Dro5_29NOV2013_402.tgz "tar -xvzf Dro5_29NOV2013_402.tgz"
qsub -b y -cwd -N tar_Dro6_1DEC2013_403.tgz  -hold_jid wget_Dro6_1DEC2013_403.tgz  "tar -xvzf Dro6_1DEC2013_403.tgz"

#delete .tar archives and md5 files
qsub -V -b y -cwd -N rm_tar_md -hold_jid tar_Dro1_24NOV2013_398.tgz,tar_Dro2_25NOV2013_399.tgz,tar_Dro3_26NOV2013_400.tgz,tar_Dro4_28NOV2013_401.tgz,tar_Dro5_29NOV2013_402.tgz,tar_Dro6_1DEC2013_403.tgz "rm Dro*.tgz Dro*.md5"
getDrosPacBioData.shview rawview file on GitHub

(iii) Once this download is complete, we’ll map data from all of the cells to the Release 5 reference and merge the output into a single file of mapped (.bam), retaining unmapped (.unaligned) reads:


#map .bax.h5 files from individual PacBio cells to Release 5 using blasr
echo "mapping .bax.h5 files to Release 5"
for input in `ls Dro*/*/*/*bax.h5` 
 sample=`basename $input .bax.h5`
 qsub -l cores=12 -V -b y -cwd -N blasr_${sample} "install/smrtanalysis- $input dm3.fa -bestn 1 -nproc 12 -minPctIdentity 80 -sam -out $output1 -unaligned $output2"
 qsub -V -b y -cwd -N sort_${sample} -hold_jid blasr_${sample} "install/smrtanalysis- view -bS $output1 | install/smrtanalysis- sort - $sample"

#merge bam files from individual cells and index merged bam file
echo "merging and indexing bam files"
holdlist=`printf -- '%s,' "${samples[@]}"`
qsub -V -b y -cwd -N merge_pacbio -hold_jid $holdlist "install/smrtanalysis- merge $output $input"
qsub -V -b y -cwd -N index_pacbio -hold_jid merge_pacbio "install/smrtanalysis- index $output"

#merge unmapped reads and remove .sam, .bam and .unaligned files from individual cells
echo "merging unmapped reads and cleaning up"
qsub -V -b y -cwd -N merge_unmapped -hold_jid index_pacbio "cat m131*unaligned > dm3PacBio.unaligned"
qsub -V -b y -cwd -N rm_cell_files -hold_jid merge_unmapped "rm m131*unaligned m131*bam m131*sam"

mapDrosPacBioData.shview rawview file on GitHub

(iv) Last, let’s generate the coverage plot shown above using the genome coverage utility in Aaron Quinlan’s BedTools package, R and imagemagick:

mysql --user=genome -A -e "select chrom, size from dm3.chromInfo" > dm3.genome
bedtools genomecov -ibam dm3PacBio.bam -g dm3.genome > dm3.genome.coverage
R --no-save < dm3PacBioCoverage.R
convert -density 300 -quality 100 dm3PacBioCoverage.pdf dm3PacBioCoverage.jpg

Credits: Many thanks to Edwin, Kristi, Jane and others at PacBio for providing this gift to the genomics community, to Sue, Bill and Charles at BDGP for collecting the flies and isolating the DNA used in this project, and to Sue and Roger for their contribution to the design and analysis of the experiment. Thanks also to Jane, Jason Chin, Edwin, Roger, Sue and Danny Miller for comments on the draft of this blog post.

Error-Corrected PacBio Sequences for the D. melanogaster Reference Strain

Posted 06 Nov 2013 — by caseybergman
Category drosophila, genome bioinformatics, high throughput sequencing

Using PacBio and Illumina whole genome shotgun sequences we recently released for the D. melanogaster reference strain, Sergey Koren and Adam Phillippy at the University of Maryland have recently run their pacBioToCA method to generate a dataset of error-corrected PacBio reads for this dataset, which the have kindly made available here for re-use without restriction. This pilot data set is not at high enough coverage and thus a whole genome assembly was not attempted. Nevertheless, both the raw and error-corrected datasets should be of use to better understand the nature of PacBio data and the pacBioToCA pipeline as applied to Drosophila genomes.

The 2057_PacBio_corrected.tgz archive contains the following files:

  • pacbio.blasr.spec – the specification file used for the pacBioToCA run.
  • corrected.trim.fastq.bz2 – the error-corrected PacBio reads.
  • corrected.trim.lens – a file containing columns with corrected read id, read length, running total BP, running mean, and running coverage assuming 140Mbp genome size.
  • corrected.trim.names – a look-up table to map read IDs in the pacBioToCA output to the original PacBio input IDs.

Sergey also sent along a quick summary of the correction:

The correction used the latest CA from the repository (as of 10/15/13). The max number of mappings each Illumina sequence could have was set to 20, the repeat separation was set to 10. The genome size was set to 180Mbp. BLASR was used to align the Illumina sequences to PacBio sequences for correction. The input for correction was 673 Mbp of PacBio data (max: 8,231, mean: 1,587) which corresponds to approximately 4.8X of a 140Mbp genome. The correction produced 469Mbp (max: 6,883, mean: 1,410) or 3.35X. The throughput was approximately 69%. To estimate, the accuracy of the data, it was mapped back to the reference D. melanogaster. The sequences average 99.85% identity. Approximately 10% of the sequences mapped in more than a single piece to the reference. Some of these mappings were due to short indel differences between the reference and the reads or N’s in the reference. Some sequences mapped across large distances in the reference. I did not confirm if these split mappings were supported by the uncorrected reads or not.

We thank both Sergey and Adam for taking the time to run their pacBioToCA pipeline and making these data available to the community, and hope these data are of use to others in their research.

A Reference “Hologenome” for D. melanogaster

Reference genomes are crucial for many applications in next-generation sequencing (NGS) analyses, especially in whole genome resquencing studies that form the basis of population genomics. Typically, one uses an off-the-shelf reference genome assembly for your organism of interest (e.g. human, Drosophila, Arabidopsis) obtained from the NCBI, UCSC or Ensembl genome databases. However, single-organism reference genomes neglect the reality that most organisms live in symbiosis with a large number of microbial species. When performing a whole-genome shotgun sequencing experiment from a macro-organisms like D. melanogaster, it is inevitable that some proportion of its symbionts will also be sequenced, especially endosymbionts that live intracellularly like Wolbachia. To reduce mismapping and generate materials for population genomics of symbionts and their hosts, it is therefore approporiate to map to the genome of the species of interest as well as to the genomes of symbiotic microbial species, where available. However, reference genomes are currently stored on a per-species basis and the “hologenome” for any organism does not yet exist in a readily accessible form.

In my limited experience, attempting to construct such a “hologenome” can be very a tedious and potentially error-prone process since reference genomes for model metazoan species and microbes are not always available in the same location or format. In gearing up for the second iteration of our work on the population genomics of microbial symbionts of D. melanogaster, I have decided to script the construction of a D. melanogaster reference hologenome for all microbial symbionts associated with D. melanogaster whose genomes are publicly available, which I would thought may be of use to others. The code for this process is as follows and should work on any UNIX-based machine.

makeDmelHologenome.shview rawview file on GitHub

Finally, since some tools (e.g. SAMtools faidx) require all reference genome sequences to have the same number of characters per line, but since different genome databases use different numbers of characters per line the file above will have heterogeneous character counts for different species. To fix this, I use fasta_formatter from the FASTX toolkit to convert the dm3_hologenome.fa into a file with fixed character lengths. To download and run this script, with the conversion to fixed line lengths, execute the following:

$ wget
$ sh
$ fasta_formatter -i dm3_hologenome.fa -o dm3_hologenome_v1.fa -w 50

Credits: Thanks go to Douda Bensasson for some SED tips during the construction of this script.

PacBio Whole Genome Shotgun Sequences for the D. melanogaster Reference Strain

Posted 31 Jul 2013 — by caseybergman
Category drosophila, genome bioinformatics, high throughput sequencing

As part of a collaboration with Danny Miller and Scott Hawley from the Stowers Institute, we have generated whole genome shotgun sequences using PacBio RS technology for the Drosophila melanogaster y; cn, bw, sp strain (Bloomington 2057), the same strain that was also used to assemble the D. melanogaster reference genome sequence. We’ve been meaning to release these data to the community since we got the data in April, but have been waylaid by teaching commitments and a spate of recent server problems.  Prompted by Danny’s visit to the Bergman Lab over the last two weeks and the generous release by Dmitri Petrov’s lab of a data set of Illumina long reads using the Moleculo technology for the same strain of D. melanogaster, we’ve finally gotten around to bundling these D. melanogaster PacBio sequences for general release. We’re hoping that the availability of both PacBio and Moleculo long-read data for the same strain that has one of the highest quality reference genomes for any species will allow the genomics community to investigate directly the pros/cons of each of these new exciting technologies.

These PacBio sequence data were generated by the University of Michigan DNA Sequencing Core facility on a PacBio RS using DNA from 10 adult males. Flies were starved for 4 hours to reduce microbial DNA contaminants before freezing at -80oC and prepped using the Qiagen DNeasy Blood & Tissue Kit (catalog number 69504). Six SMRT cells were used for sequencing 5 Kb fragment libraries on two Long Read mode runs: 1 cell on the first run (Run53) and 5 cells on the second run (Run55). Excluding the circular consensus (CCS) reads, and combining data from the S1 and S2 files for each of the six cells, we obtained 1,357,183,439 bp of raw DNA sequence from this experiment or roughly 7.5x coverage of the 180 Mb male D. melanogaster genome.

A ~63 Gb tar.gz of the entire PacBio long read dataset including .fasta, .fastq, and .h5 files can be found here. To help newcomers to PacBio data (like us) get over the hurdle of installing the PacBio long-read aligner blasr, we are also distributing CentOS 6/Scientific Linux 64-bit and source RPMs for blasr, made by Peter Briggs in the University of Manchester Bioinformatics Core Facility.

We have also generated 100bp paired-end Illumina data from the same stock, to aid with error-correction of the PacBio long reads. These data were generated by the Stowers Institute Molecular Biology Core facility. As above, genomic DNA was prepared from 10 starved, adult males using the Qiagen DNeasy Blood & Tissue Kit. 1ug of DNA from each was fragmented using a Covaris S220 sonicator (Covaris Inc.) to 250 base pair (bp) fragments by adjusting the treatment time to 85 seconds. Following manufacturer’s directions, short fragment libraries were made using the KAPA Library Preparation Kits (KAPA Biosystems, Cat. No. KK8201) and Bioo Scientific NEXTflex™ DNA Barcodes (Bioo Scientific, Cat. No. 514104). The resulting libraries were purified using Agencourt AMPure XP system (Beckman Coulter, Cat. No. A63880), then quantified using a Bioanalyzer (Agilent Technologies) and a Qubit Fluorometer (Life Technologies).  The library was pooled with several other strains, re-quantified and run as high output mode on six 100 bp paired-end lanes on an Illumina HiSeq 2000 instrument, using HiSeq Control Software and Real-Time Analysis (RTA) version Secondary Analysis version CASAVA-1.8.2 was run to demultiplex reads and generate FASTQ files.

The Illumina 100 bp paired-end dataset in .fastq format can be at ENA under accession ERX645969. Additional Illumina data for the ISO1 reference strain has been generated previously by Chuck Langley’s lab, either from mixed adult males and females (SRX040484) or haploid embryos (SRX040485, SRX040486, SRX040491) that could be used to supplement the PacBio error correction.

As with previous unpublished data we have released from the Bergman Lab, we have chosen to release these genomic data under a Creative Commons CC-BY license, which requires only that you credit the originators of the work as specified below.  However, we hope that users of these data respect the established model of genomic data release under the Ft. Lauderdale agreement that is traditionally honored for major sequencing centers.  Until the paper describing these genomic data is published, please cite:

  • Miller, D.E., C.B. Smith, R.S. Hawley and C.M. Bergman (2013) PacBio Whole Genome Shotgun Sequences for the D. melanogaster Reference Strain.

Release of 20 European Drosophila melanogaster genomes

Posted 17 Jul 2012 — by caseybergman
Category drosophila, high throughput sequencing, population genomics

[Update: These genome sequences (and 30 others) have been deposited in ENA under project accession ERP009059. If you use these data in your work, please cite as Bergman & Haddrill (2015) Strain-specific and pooled genome sequences for populations of Drosophila melanogaster from three continents. F1000Research 2015, 4:31.]

Background: As part of an ongoing efforts to characterize genetic diversity in the nuclear and cytoplasmic genomes of D. melanogaster, the Haddrill and Bergman Labs have collaborated to sequence the complete genomes of 20 D. melanogaster isofemale strains collected by Penny Haddrill in Montpellier, France in August 2010. These 20 genomes represent a random sample of the full collection described by Haddrill and Vespoor (2011), which also describes microsatellite variation data for these strains.

Following on from the very generous early release of D. melanogaster genomes by major resequencing efforts in Drosophila, we have decided to follow suit and release these genomes prior to publication to maximise their utility by the wider research community and prevent unnecessary duplication of effort. One major aim for our sequencing of a reasonably high number of strains from this European population is to provide a complementary dataset to help interpret the larger samples of North American and (predominantly) African strains from the Drosophila Genetic Reference Panel and Drosophila Population Genomics Project, respectively. For more on the philosophy behind why we have made the decision to release these data early, please see this blog post on genomic data release by individual labs in the next-generation sequencing era.

Methods: Genomic DNA was prepared by Penny Haddrill for each isofemale line by pooling fifty females, snap freezing them in liquid nitrogen, extracting DNA using a standard phenol-chloroform extraction protocol with ethanol and ammonium acetate precipitation. 500 bp short-insert libraries were constructed and 91 bp paired-end reads were generated using an Illumina HiSeq 2000 to an estimated coverage of ~50x per strain by BGI-Hong Kong. Basic QC on reads was performed by BGI and mapping to the Wolbachia genome following the protocol in Richardson et al. (submitted) confirmed the same infection status for as determined by PCR in Haddrill and Vespoor (2011).

Conditions for use: The Haddrill and Bergman labs intend to use these data to study patterns of genetic diversity in the nuclear and cytoplasmic genomes, to estimate the ratio of diversity on the X chromosome relative to the autosomes, to detect signatures of both positive and negative selection in the nuclear and cytoplasmic genomes, and investigate the impact of variation in recombination rate around the genome.

We have decidede to release these genomic data under a Creative Commons CC-BY license, which requires only that you credit the originators of the work as specified below.  However, we hope that users of these data respect the established model of genomic data release under the Ft. Lauderdale agreement that is traditionally honored for major sequencing centers.  Until the paper describing these genomes is published, please cite these data as:

  • Haddrill, P. and C.M. Bergman (2012) 20 Drosophila melanogaster genomes from Montpellier, France.

[Update: These genome sequence have now been published. If you use these data in your work, please cite as Bergman & Haddrill (2015) Strain-specific and pooled genome sequences for populations of Drosophila melanogaster from three continents. F1000Research 2015, 4:31.]

The data: Gzipped Illumina fastq files for forward and reverse paired reads can be downloaded at the following locations. A script to dowload all files in serial can be found below the table.

[Update: These genome sequences (and 30 others) have been deposited in ENA under project accession ERP009059. If you use these data in your work, please cite accession numbers ERR705945-ERR705964 in your work.]

Compendium of Drosophila Genome Assemblies

Posted 28 Apr 2012 — by caseybergman
Category drosophila, genome bioinformatics, high throughput sequencing

[UPDATE: The UCSD Drosophila Species Stock Center now maintains a list of information about Drosophila genome sequences: The compendium here will no longer be updated. To ensure a single, up-to-date source of information, please send suggested links to the DSSC.]

Since the publication of the Drosophila melanogaster genome, the D. pseudoobscura genome, and the remaining 10 species from the Drosophila 12 Genomes Project, a number of other Drosophila genomes have been sequenced and assembled using whole genome shotgun approaches. Periodically, I find myself needing to find versions of these assemblies from the data producer, NCBI, Flybase, or UCSC, and so to help streamline this process I’ve collected these links together into a single compendium here, along with links to the original stocks used for sequencing. Hopefully this will be of use to others as well.

Data Producer
UCSC Genome Browser
Stock Center
D. melanogaster Berkeley Drosophila Genome Project AABU00000000 dmel dm3 1522
D. simulans Washington University Genome Sequencing Center AAGH00000000 dsim droSim1 n.a.
D. simulans Andolfatto Lab n.a. n.a. n.a. 1380
D. mauritiana Schlotterer Lab n.a. n.a. n.a. E-18912
D. sechellia Broad Institute AAKO00000000 dsec droSec1 1373
D. yakuba Washington University Genome Sequencing Center AAEU00000000 dyak droYak2 1385
D. santomea Andolfatto Lab n.a. n.a. n.a. 2367
D. erecta Agencourt Biosciences AAPQ00000000 dere droEre2 1374
D. ficusphila Baylor College of Medicine Genome Center AFFG00000000 n.a. n.a. 2332
D. eugracilis Baylor College of Medicine Genome Center AFPQ00000000 n.a. n.a. 2350
D. suzukii University of Edinburgh GenePool CAKG00000000 n.a. n.a. n.a
D. biarmipes Baylor College of Medicine Genome Center AFFD00000000 n.a. n.a. 2312
D. takahashii Baylor College of Medicine Genome Center AFFI00000000 n.a. n.a. 2325
D. elegans Baylor College of Medicine Genome Center AFFF00000000 n.a. n.a. 2334
D. rhopaloa Baylor College of Medicine Genome Center AFPP00000000 n.a. n.a. 2327
D. kikkawai Baylor College of Medicine Genome Center AFFH00000000 n.a. n.a. 2330
D. bipectinata Baylor College of Medicine Genome Center AFFE00000000 n.a. n.a. 2329
D. ananassae Agencourt Biosciences AAPP00000000 dana droAna3 1368
D. pseudoobscura Baylor College of Medicine Genome Center AADE00000000 dpse dp4 1277
D. persimilis Broad Institute AAIZ00000000 dper droPer1 1369
D. miranda Bachtrog Lab AJMI00000000 n.a. n.a. n.a.
D. willistoni J. Craig Venter Institute AAQB00000000 dwil droWil1 1375
D. virilis Agencourt Biosciences AANI00000000 dvir droVir3 1371
D. americana Vieira Lab n.a. n.a. n.a. n.a.
D. mojavensis Agencourt Biosciences AAPU00000000 dmoj droMoj3 1370
D. grimshawi Agencourt Biosciences AAPT00000000 dgri droGri2 1225
D. albomicans Wang/Bachtrog Labs ACVV00000000 n.a. n.a. n.a.

Wolbachia and Mitochondrial Genomes from Drosophila melanogaster Resequencing Projects

As part of ongoing efforts to characterize complete genome sequences of microbial symbioints of Drosophila species, the Bergman Lab has been involved in mining complete genomes of the Wolbachia endosymbiont from whole-genome shotgun sequences of D. melanogaster. This work is inspired by Steve Salzberg and colleagues’ pioneering paper in 2005 showing that Wolbachia genomes can be extracted from the whole-genome shotgun sequence assemblies of Drosophila species. We have adapted this technique to utilize short-read next generation sequencing data from population genomics projects in D. melanogaster, together with the reference Wolbachia genome published by Wu et al. (2005).

Currently we have identified infection status and extracted nearly-complete genomes from the two major resequencing efforts in D. melanogaster: the Drosophila Genetic Reference Panel (DGRP) and Drosophila Population Genomics Project (DPGP). We employ a fairly standard BWA/SAMtools  pipeline, with a few tricks that are essential for getting good quality assemblies and consensus sequences. Our first iteration of this pipeline was used to predict infection status in the DGRP strains. These results have been published in the DGRP main paper published in early 2012, and can be found in Supplemental Table 9 of the DGRP paper or more usefully in a comma-separated file at the DGRP website.

We have now applied an update version of our pipeline to both the DGRP and DPGP datasets and have used these “essentially complete” genome sequences to study the recent evolutionary history of Wolbachia in D. melanogaster. In the process of generating these data we received several inquiries about the status of this project so, in the spirit of Open Science that makes genomics such a productive field, we have released the consensus sequences and alignments of 179 Wolbachia and 290 mitochondrial genomes from the DGRP and DPGP prior to publication of our manuscript.

Since we are not the primary data producer for these assemblies, it is not approporiate to employ a data release policy based on the NHGRI guidelines. Instead, we have chosen to release these data under a Creative Commons Attribution 3.0 Unported License. If you use these assemblies or alignments in your projects, please cite the main DGRP and DPGP papers for the raw data and cite the following reference for the mtDNA or Wolbachia assemblies:

Richardson, M.F., L.A. Weinert, J.J. Welch, R.S. Linheiro, M.M. Magwire, F.M. Jiggins & C.M. Bergman (2012) Population genomics of the Wolbachia endosymbiont in Drosophila melanogaster. PLOS Genetics 8:e1003129.

A tar.gz archive contaning a reference-based multiple sequence alignment of release 1.0 of the complete D. melanogaster mtDNA and Wolbachia assemblies can be obtained here. If you have questions about the methods used to produce these assemblies, any associated meta-data from these assemblies, or the content of our manuscript please contact Casey Bergman for details.

Creative Commons License

Mapping between SRA IDs and accessing metadata with the Bioconductor SRAdb package

Posted 12 Oct 2011 — by caseybergman
Category drosophila, high throughput sequencing, population genomics, R

NCBI’s Sequence Read Archive (SRA) is one of the primary repositories for next-generation sequencing data.  Submissions to SRA are often complex, with multiple samples from a given submission, or multiple sequencing runs associated with a given sample. As a result, a given SRA submission “envelope” can be associated with database records at several levels, each with corresponding identifiers and nomenclature. Valid levels in the SRA data model include: the study (prefixed with SRP), the experiment (prefixed with SRX), the sample (prefixed with SRS), and the run (prefixed with SRR). Mapping identifiers between the different parts of an SRA submission is important when you need to access meta-data from different parts of a submission or you need to aggregate data across different levels of a submission (e.g. sequencing runs from the same biological sample).

In the past, we’ve been guilty of using PERL to screen-scrape meta-data from the SRA html pages to aggregate accessions from different sequencing runs conducted by the Drosophila Genetic Reference Panel (DGRP) project. However, this is a non-optimal strategy because of changes to the SRA html format that make this approach brittle. Recently, I stumbled across a much better way to map IDs and access meta-data in SRA, using the SRAdb package from Bioconductor. The developers of SRAdb provide a nicely repackaged version of all SRA meta-data in SQLlite that can easily be accessed using a few lines of R.

The following example shows how to map IDs at all levels for SRA accessions from the two major Drosophila resequencing projects the DGRP (SRA study SRP000694) and the Drosophila Population Genomics Project (SRA studies SRP000224 & SRP005599).

# Script to generate SRA metadata for Drosophila #
# melaogaster population genomics projects       #
# Casey M. Bergman (University of Manchester)    #

#install the SRAdb bioconductor package
#download and connect to the SRA SQLlite database
sqlfile <- getSRAdbFile()
sra_connection <- dbConnect(SQLite(), "SRAmetadb.sqlite")
# make linking table between SRP, SRA, SRS, SRX & SRR for DPGP & DGRP projects
conversion <- sraConvert(c("SRP000694","SRP000224", "SRP005599"), sra_con = sra_connection)
# open outfile & print header
outFile <- file("DrosophilaPopulationGenomicData.tsv", "w")
cat("Project\tAccession\tSample\tExperiment\tRun\tSampleAlias\tSampleDescription\n", file=outFile)
# loop through conversion table and look-up meta-data for each sample
for (i in 1:nrow(conversion)) {
sample_metadata_sql <- paste("select sample_alias, description from sample where sample_accession like '", conversion&#91;i,3&#93;, "'", sep="")
sample_metadata_results <- dbGetQuery(sra_connection, sample_metadata_sql)
cat(conversion&#91;i,1&#93;, conversion&#91;i,2&#93;, conversion&#91;i,3&#93;, conversion&#91;i,4&#93;, conversion&#91;i,5&#93;, sample_metadata_results&#91;1,1&#93;, sample_metadata_results&#91;1,2&#93;, "\n", file=outFile, sep = "\t")
#close filehandle

#clean up
system("rm SRAmetadb.sqlite")
&#91;/sourcecode&#93;<div class="wp-git-embed" style="margin-bottom:10px; border:1px solid #CCC; text-align:right; width:99%; margin-top:-13px; font-size:11px; font-style:italic;"><span style="display:inline-block; padding:4px;">getDrosPopGenData.R</span><a style="display:inline-block; padding:4px 6px;" href="" target="_blank">view raw</a><a style="display:inline-block; padding:4px 6px; float:left;" href="" target="_blank">view file on <strong>GitHub</strong></a></div>

This script can be downloaded and run as follows:

$ wget
$ R --no-save &gt; getDrosPopGenSraMetaData.R

For those interested in the actual output of this script please find the DrosophilaPopulationGenomicData.tsv file here.

Credits: Thanks to Pedro Olivares-Chauvet for R syntax help and Jack Zhu for a prompt bug fix in the SRAdb files.

Installing libsequence and analysis tools on OSX

After being relegated to the margins of biology for almost a century, the field of population genetics is now moving closer the the center of mainstream biology in the era of next-generation sequencing (NGS). In addition to providing genome-wide DNA variation data to study classical questions in evolution that are motivated by population genetic theory, NGS now permits population-genetic based techniques like GWAS (genome-wide association studies) and eQTL (expression quantitative trait locus) analysis to allow biologists to identify and map functional regions across the genome.

Handling big data requires industrial strength bioinformatics tool-kits, which are increasingly available for many aspects of genome bioinformatics (e.g. the Kent source tree, SAMtools, BEDtools, etc.). However, for molecular population genetics at the command line, researchers have a more limited palette that can be used for genome-scale data, including: VariScan, the PERL PopGen modules, and the libsequence C++ library.

Motivated by a recent question on BioStar and a request from a colleague to generate a table of polymorphic sites across a bacterial genome, I’ve recently had a play with the last of these — Kevin Thornton’s libsequence and accompanying analysis toolkit — which includes a utility called compute which is a “mini-DNAsp for the Unix command-line.” As ever, getting this installed on OSX was more of a challenge than desired, with a couple of dependencies (including Boost and GSL), but in the end was do-able as follows:

$ wget
$ tar -xvzf boost_1_47_0.tar.gz
$ cd boost_1_47_0
$ ./
$ sudo ./b2 install
$ cd ..

$ wget
$ tar -xvzf libsequence-1.7.3.tar.gz
$ cd libsequence-1.7.3
$ ./configure
$ make
$ sudo make install
$ cd ..

$ wget
$ tar -xvzf gsl-1.15.tar.gz
$ cd gsl-1.15
$ ./configure --disable-shared --disable-dependency-tracking
$ make
$ sudo make install
$ cd ..

$ wget
$ tar -xvzf analysis-0.8.0.tar.gz
$ cd analysis-0.8.0
$ ./configure
$ make
$ sudo make install
$ cd ..

With this recipe, compute and friends should now be happily installed in /usr/local/bin/.

Notes: This protocol was developed on a MacBook Air Intel Core 2 Duo running OSX 10.6.8.

Enhanced by Zemanta