Archive for the ‘drosophila’ Category

High Coverage PacBio Shotgun Sequences Aligned to the D. melanogaster Genome

Shortly after we released our pilot dataset of Drosophila melanogaster PacBio genomic sequences in late July 2013, we were contacted by Edwin Hauw and Jane Landolin from PacBio to collaborate on the collection of a larger dataset for D. melanogaster with longer reads and higher depth of coverage. Analysis of our pilot dataset revealed significant differences between the version of the ISO1 (y; cn, bw, sp) strain we obtained from the Bloomington Drosophila Stock Center (strain 2057) and the D. melanogaster reference genome sequence. Therefore, we contacted Susan Celniker and Roger Hoskins at the Berkeley Drosophila Genome Project (BDGP) at the Lawrence Berkeley National Laboratory in order to use the same subline of ISO1 that has been used to produce the official BDGP reference assemblies from Release 1 in 2000 to Release 5 in 2007. Sue and Charles Yu generated high-quality genomic DNA from a CsCl preparation of ~2000 BDGP ISO1 adult males collected by Bill Fisher, and provided this DNA to Kristi Kim at PacBio in mid-November 2013 who handled library preparation and sequencing, which was completed early December 2013. Since then Jane and I have worked on formatting and QC’ing the data for release. These data have just been publicly released without restriction on the PacBio blog, where you can find a description of the library preparation, statistics on the raw data collected and links to preliminary de novo whole genome assemblies using the Celera assembler and PacBio’s newly-released FALCON assembler. Direct links to the raw, preassembled and assembled data can be found on the PacBio “Drosophila sequence and assembly” DevNet GitHub wiki page.

Here we provide an alignment of this high-coverage D. melanogaster PacBio dataset to the Release 5 genome sequence and some initial observations based on this reference alignment. Raw, uncorrected reads were mapped using blasr (-bestn 1 -nproc 12 -minPctIdentity 80) and converted to .bam format using samtools. Since reads were mapped to the the Release 5 genome, they can be conveniently visualized using the UCSC Genome Browser by loading a BAM file of the mapped reads as a custom track. To browse this data directly on the European mirror of the UCSC Genome Browser click here, or alternatively paste the following line into the “Paste URLs or data:” box on the “Add Custom Tracks” page of the Release 5 (dm3) browser:

track type=bam bigDataUrl= name=dm3PacBio

As shown in the following figure, mapped read depth is roughly Poisson distributed with a mean coverage of ~90x-95x on the autosomal arms and ~45x coverage for the X-chromosome. A two-fold reduction in read depth on the X relative to the autosomes is expected in a DNA sample only from heterogametic XY males. Mapped read depth is important to calculate empirically post hoc (not just based on theoretical a priori expectations) since the actual genome size of D. melanogaster is not precisely known and varies between males and females, with an upper estimate of the total male genome size (autosomes + X + Y) being ~220 Mb. Thus, we conclude the coverage of this dataset is of sufficient depth to generate long-range de novo assemblies of the D. melanogaster genome using PacBio data only, as confirmed by the preliminary Celera and FALCON assemblies.


The following browser screenshots show representative regions of how this high coverage PacBio dataset maps to the Release 5 reference genome:

  • A typical region of unique sequence (around the eve locus). Notice that read coverage is relatively uniform across the genome and roughly equal on the forward (blue) and reverse (red) strands. Also notice that some reads are long enough to span the entire distance between neighboring genes, suggesting unassembled PacBio reads are now long enough to establish syntenic relationships for neighboring genes in Drosophila from unassembled reads.
  • A typical region containing a transposable element (TE) insertion (in the ds locus), in this case a 8126 bp roo LTR retrotransposon (FBti0019098). Substantial numbers of mapped reads align across the junction region of the TE and its unique flanking regions. However, there is an excess of shorter reads inside the boundaries of the TE, which are likely to be incorrectly mapped because they are shorter than the length of the active TE sequence and have mapping quality values of ~0. With suitable read length and mapping quality filters, the unique location of TE insertions in the euchromatin should be identifiable using uncorrected PacBio reads.


  • One of the few remaining and persistent gaps in the euchromatic portion of the reference sequence that can be filled using PacBio sequences. This gap remains in the upcoming Release 6 reference sequence (Sue Celniker and Roger Hoskins, personal communication). This shows blasr can align PacBio reads through smaller sequence gaps that are represented by padded N’s, however the visualization of such reads may sometimes be flawed (i.e. by artificially displaying a gap in the mapped read that arises from a lack of alignment to the ambiguous nucleotides in the gap).


Below we provide a reproducible transcript of the code used to (i) setup our working environment, (ii) download and unpack this D. melanogaster PacBio dataset, (iii) generate the BAM file of mapped reads released here, and (iv) generate coverage density plots for the major chromosome arms shown above. Part of the rationale for providing these scripts is to showcase how simple it is to install the entire PacBio smrtanalysis toolkit, which is the easiest way we’ve found to install the blasr long-read aligner (plus get a large number of other goodies for processing PacBio data like pacBioToCA, pbalign, quiver, etc).

These scripts are written to be used by someone with no prior knowledge of PacBio data or tools, and can be run on a Oracle Grid Engine powered linux cluster running Scientific Linux (release 6.1).  They should be easily adapted for use on any linux platform supported by PacBio’s smrtanalysis toolkit, and to run on a single machine just execute the commands inside the “quote marks” on the lines with qsub statements.

(i) First, let’s download the Release 5 reference genome plus the smrtanalysis toolkit:


# get dm3 and reformat into single fasta file
echo "getting dm3 and reformatting into single fasta file"
wget -q
tar -xvzf chromFa.tar.gz
cat chr*.fa > dm3.fa
rm chromFa.tar.gz chr*.fa

#install PacBio smrtanalysis suite (including blasr PacBio long-read mapping engine)
echo "installing PacBio smrtanalysis suite"
bash --extract-only --rootdir ./
setupDrosPacBioEnv.shview rawview file on GitHub

If you want to have the smrtanalysis tooklit in your environment permanently, you’ll need to add the following magic line to your .profile:

source install/smrtanalysis-

(ii) Next, let’s get the D. melanogaster PacBio data and unpack it:


#get compressed .tar archives of D. melanogaster PacBio runs
echo "get tar archives of PacBio runs"
qsub -b y -cwd -N wget_Dro1_24NOV2013_398.tgz "wget"
qsub -b y -cwd -N wget_Dro2_25NOV2013_399.tgz "wget"
qsub -b y -cwd -N wget_Dro3_26NOV2013_400.tgz "wget"
qsub -b y -cwd -N wget_Dro4_28NOV2013_401.tgz "wget"
qsub -b y -cwd -N wget_Dro5_29NOV2013_402.tgz "wget"
qsub -b y -cwd -N wget_Dro6_1DEC2013_403.tgz  "wget"

#get md5sums for PacBio .tar archives
echo "getting md5sums for PacBio tar archives"
qsub -b y -cwd -N wget_Dro1_24NOV2013_398.md5 "wget"
qsub -b y -cwd -N wget_Dro2_25NOV2013_399.md5 "wget"
qsub -b y -cwd -N wget_Dro3_26NOV2013_400.md5 "wget"
qsub -b y -cwd -N wget_Dro4_28NOV2013_401.md5 "wget"
qsub -b y -cwd -N wget_Dro5_29NOV2013_402.md5 "wget"
qsub -b y -cwd -N wget_Dro6_1DEC2013_403.md5  "wget"

#verify .tar archives have been downloaded properly
echo "verifying archives have been downloaded properly"
qsub -b y -cwd -N md5_Dro1_24NOV2013_398.md5 -hold_jid wget_Dro1_24NOV2013_398.tgz,wget_Dro1_24NOV2013_398.md5 "md5sum -c Dro1_24NOV2013_398.md5 > Dro1_24NOV2013_398.md5.checkresult"
qsub -b y -cwd -N md5_Dro2_25NOV2013_399.md5 -hold_jid wget_Dro2_25NOV2013_399.tgz,wget_Dro2_25NOV2013_399.md5 "md5sum -c Dro2_25NOV2013_399.md5 > Dro2_25NOV2013_399.md5.checkresult"
qsub -b y -cwd -N md5_Dro3_26NOV2013_400.md5 -hold_jid wget_Dro3_26NOV2013_400.tgz,wget_Dro3_26NOV2013_400.md5 "md5sum -c Dro3_26NOV2013_400.md5 > Dro3_26NOV2013_400.md5.checkresult"
qsub -b y -cwd -N md5_Dro4_28NOV2013_401.md5 -hold_jid wget_Dro4_28NOV2013_401.tgz,wget_Dro4_28NOV2013_401.md5 "md5sum -c Dro4_28NOV2013_401.md5 > Dro4_28NOV2013_401.md5.checkresult"
qsub -b y -cwd -N md5_Dro5_29NOV2013_402.md5 -hold_jid wget_Dro5_29NOV2013_402.tgz,wget_Dro5_29NOV2013_402.md5 "md5sum -c Dro5_29NOV2013_402.md5 > Dro5_29NOV2013_402.md5.checkresult"
qsub -b y -cwd -N md5_Dro6_1DEC2013_403.md5  -hold_jid wget_Dro6_1DEC2013_403.tgz,wget_Dro6_1DEC2013_403.md5  "md5sum -c Dro6_1DEC2013_403.md5  > Dro6_1DEC2013_403.md5.checkresult"

#extract PacBio runs
echo "extracting PacBio runs"
qsub -b y -cwd -N tar_Dro1_24NOV2013_398.tgz -hold_jid wget_Dro1_24NOV2013_398.tgz "tar -xvzf Dro1_24NOV2013_398.tgz"
qsub -b y -cwd -N tar_Dro2_25NOV2013_399.tgz -hold_jid wget_Dro2_25NOV2013_399.tgz "tar -xvzf Dro2_25NOV2013_399.tgz"
qsub -b y -cwd -N tar_Dro3_26NOV2013_400.tgz -hold_jid wget_Dro3_26NOV2013_400.tgz "tar -xvzf Dro3_26NOV2013_400.tgz"
qsub -b y -cwd -N tar_Dro4_28NOV2013_401.tgz -hold_jid wget_Dro4_28NOV2013_401.tgz "tar -xvzf Dro4_28NOV2013_401.tgz"
qsub -b y -cwd -N tar_Dro5_29NOV2013_402.tgz -hold_jid wget_Dro5_29NOV2013_402.tgz "tar -xvzf Dro5_29NOV2013_402.tgz"
qsub -b y -cwd -N tar_Dro6_1DEC2013_403.tgz  -hold_jid wget_Dro6_1DEC2013_403.tgz  "tar -xvzf Dro6_1DEC2013_403.tgz"

#delete .tar archives and md5 files
qsub -V -b y -cwd -N rm_tar_md -hold_jid tar_Dro1_24NOV2013_398.tgz,tar_Dro2_25NOV2013_399.tgz,tar_Dro3_26NOV2013_400.tgz,tar_Dro4_28NOV2013_401.tgz,tar_Dro5_29NOV2013_402.tgz,tar_Dro6_1DEC2013_403.tgz "rm Dro*.tgz Dro*.md5"
getDrosPacBioData.shview rawview file on GitHub

(iii) Once this download is complete, we’ll map data from all of the cells to the Release 5 reference and merge the output into a single file of mapped (.bam), retaining unmapped (.unaligned) reads:


#map .bax.h5 files from individual PacBio cells to Release 5 using blasr
echo "mapping .bax.h5 files to Release 5"
for input in `ls Dro*/*/*/*bax.h5` 
 sample=`basename $input .bax.h5`
 qsub -l cores=12 -V -b y -cwd -N blasr_${sample} "install/smrtanalysis- $input dm3.fa -bestn 1 -nproc 12 -minPctIdentity 80 -sam -out $output1 -unaligned $output2"
 qsub -V -b y -cwd -N sort_${sample} -hold_jid blasr_${sample} "install/smrtanalysis- view -bS $output1 | install/smrtanalysis- sort - $sample"

#merge bam files from individual cells and index merged bam file
echo "merging and indexing bam files"
holdlist=`printf -- '%s,' "${samples[@]}"`
qsub -V -b y -cwd -N merge_pacbio -hold_jid $holdlist "install/smrtanalysis- merge $output $input"
qsub -V -b y -cwd -N index_pacbio -hold_jid merge_pacbio "install/smrtanalysis- index $output"

#merge unmapped reads and remove .sam, .bam and .unaligned files from individual cells
echo "merging unmapped reads and cleaning up"
qsub -V -b y -cwd -N merge_unmapped -hold_jid index_pacbio "cat m131*unaligned > dm3PacBio.unaligned"
qsub -V -b y -cwd -N rm_cell_files -hold_jid merge_unmapped "rm m131*unaligned m131*bam m131*sam"

mapDrosPacBioData.shview rawview file on GitHub

(iv) Last, let’s generate the coverage plot shown above using the genome coverage utility in Aaron Quinlan’s BedTools package, R and imagemagick:

mysql --user=genome -A -e "select chrom, size from dm3.chromInfo" > dm3.genome
bedtools genomecov -ibam dm3PacBio.bam -g dm3.genome > dm3.genome.coverage
R --no-save < dm3PacBioCoverage.R
convert -density 300 -quality 100 dm3PacBioCoverage.pdf dm3PacBioCoverage.jpg

Credits: Many thanks to Edwin, Kristi, Jane and others at PacBio for providing this gift to the genomics community, to Sue, Bill and Charles at BDGP for collecting the flies and isolating the DNA used in this project, and to Sue and Roger for their contribution to the design and analysis of the experiment. Thanks also to Jane, Jason Chin, Edwin, Roger, Sue and Danny Miller for comments on the draft of this blog post.

Error-Corrected PacBio Sequences for the D. melanogaster Reference Strain

Posted 06 Nov 2013 — by caseybergman
Category drosophila, genome bioinformatics, high throughput sequencing

Using PacBio and Illumina whole genome shotgun sequences we recently released for the D. melanogaster reference strain, Sergey Koren and Adam Phillippy at the University of Maryland have recently run their pacBioToCA method to generate a dataset of error-corrected PacBio reads for this dataset, which the have kindly made available here for re-use without restriction. This pilot data set is not at high enough coverage and thus a whole genome assembly was not attempted. Nevertheless, both the raw and error-corrected datasets should be of use to better understand the nature of PacBio data and the pacBioToCA pipeline as applied to Drosophila genomes.

The 2057_PacBio_corrected.tgz archive contains the following files:

  • pacbio.blasr.spec – the specification file used for the pacBioToCA run.
  • corrected.trim.fastq.bz2 – the error-corrected PacBio reads.
  • corrected.trim.lens – a file containing columns with corrected read id, read length, running total BP, running mean, and running coverage assuming 140Mbp genome size.
  • corrected.trim.names – a look-up table to map read IDs in the pacBioToCA output to the original PacBio input IDs.

Sergey also sent along a quick summary of the correction:

The correction used the latest CA from the repository (as of 10/15/13). The max number of mappings each Illumina sequence could have was set to 20, the repeat separation was set to 10. The genome size was set to 180Mbp. BLASR was used to align the Illumina sequences to PacBio sequences for correction. The input for correction was 673 Mbp of PacBio data (max: 8,231, mean: 1,587) which corresponds to approximately 4.8X of a 140Mbp genome. The correction produced 469Mbp (max: 6,883, mean: 1,410) or 3.35X. The throughput was approximately 69%. To estimate, the accuracy of the data, it was mapped back to the reference D. melanogaster. The sequences average 99.85% identity. Approximately 10% of the sequences mapped in more than a single piece to the reference. Some of these mappings were due to short indel differences between the reference and the reads or N’s in the reference. Some sequences mapped across large distances in the reference. I did not confirm if these split mappings were supported by the uncorrected reads or not.

We thank both Sergey and Adam for taking the time to run their pacBioToCA pipeline and making these data available to the community, and hope these data are of use to others in their research.

A Reference “Hologenome” for D. melanogaster

Reference genomes are crucial for many applications in next-generation sequencing (NGS) analyses, especially in whole genome resquencing studies that form the basis of population genomics. Typically, one uses an off-the-shelf reference genome assembly for your organism of interest (e.g. human, Drosophila, Arabidopsis) obtained from the NCBI, UCSC or Ensembl genome databases. However, single-organism reference genomes neglect the reality that most organisms live in symbiosis with a large number of microbial species. When performing a whole-genome shotgun sequencing experiment from a macro-organisms like D. melanogaster, it is inevitable that some proportion of its symbionts will also be sequenced, especially endosymbionts that live intracellularly like Wolbachia. To reduce mismapping and generate materials for population genomics of symbionts and their hosts, it is therefore approporiate to map to the genome of the species of interest as well as to the genomes of symbiotic microbial species, where available. However, reference genomes are currently stored on a per-species basis and the “hologenome” for any organism does not yet exist in a readily accessible form.

In my limited experience, attempting to construct such a “hologenome” can be very a tedious and potentially error-prone process since reference genomes for model metazoan species and microbes are not always available in the same location or format. In gearing up for the second iteration of our work on the population genomics of microbial symbionts of D. melanogaster, I have decided to script the construction of a D. melanogaster reference hologenome for all microbial symbionts associated with D. melanogaster whose genomes are publicly available, which I would thought may be of use to others. The code for this process is as follows and should work on any UNIX-based machine.


# Script to construct reference "hologenome"   #
# for D. melanogaster and associated microbes  #
# Casey M. Bergman (University of Manchester)  #

# download D. melanogaster (dm3) reference genome from UCSC genome database
# unpack dm3, remove composite mtDNA, extract mtDNA from chrU & combine into
# one .fa file excluding chrU & chrUextra

wget -q
tar -xvzf chromFa.tar.gz
rm chrM.fa
echo ">chrM_iso1" > chrM.fa
faFrag chrU.fa 5288527 5305749 stdout | grep -v chrU >> chrM.fa
cat chr2L.fa chr2LHet.fa chr2R.fa chr2RHet.fa chr3L.fa chr3LHet.fa chr3R.fa chr3RHet.fa chr4.fa chrX.fa chrXHet.fa chrYHet.fa chrM.fa > dm3.fa 
rm chromFa.tar.gz chr2L.fa chr2LHet.fa chr2R.fa chr2RHet.fa chr3L.fa chr3LHet.fa chr3R.fa chr3RHet.fa chr4.fa chrX.fa chrXHet.fa chrYHet.fa chrU.fa chrUextra.fa chrM.fa

# download yeast (sacCer3) reference genome from UCSC genome database
# unpack and rename sacCer3 chroms & combine into one .fa file

wget -q
tar -xvzf chromFa.tar.gz
for i in chr*fa; do cat $i | sed 's/>chr/>sacCer3_chr/g' >> sacCer3.fa; done
rm chromFa.tar.gz chrI.fa chrII.fa chrIII.fa chrIV.fa chrIX.fa chrM.fa chrV.fa chrVI.fa chrVII.fa chrVIII.fa chrX.fa chrXI.fa chrXII.fa chrXIII.fa chrXIV.fa chrXV.fa chrXVI.fa

# download genomes for microbes known to be associated with D. melanogaster
# from NCBI, combine multi-contig draft assemblies into single fasta file 
# per species & rename fasta headers to human readable form

# Wolbachia pipientis
wget -q
echo ">w_pipientis" >> w_pipientis.fa
grep -v \> AE017196.fna >> w_pipientis.fa
rm AE017196.fna

# Pseudomonas entomophila
wget -q
echo ">p_entomophila" >> p_entomophila.fa
grep -v \> NC_008027.fna >> p_entomophila.fa
rm NC_008027.fna

# Commensalibacter intestini
wget -q
tar -xvzf AGFR00000000.contig.fna.tgz
cat AGFR01*fna | sed 's/>.*|/>/g' | sed 's/> />/' | awk -F, '{ print $1; }' | sed 's/ /_/g' | sed 's/>Commensalibacter_intestini_A911_74_/>c_intestini_/g' > c_intestini.fa
rm AGFR01*fna AGFR00000000.contig.fna.tgz

# Acetobacter pomorum
wget -q
tar -xvzf AEUP00000000.contig.fna.tgz
cat AEUP01*fna | sed 's/>.*|/>/g' | sed 's/> />/' | awk -F, '{ print $1; }' | sed 's/ /_/g' | sed 's/>Acetobacter_pomorum_DM001_Contig00/>a_pomorum_/g' > a_pomorum.fa
rm AEUP01*fna AEUP00000000.contig.fna.tgz

# Gluconobacter morbifer
wget -q
tar -xvzf AGQV00000000.contig.fna.tgz
cat AGQV01*fna | sed 's/>.*|/>/g' | sed 's/> />/' | awk -F, '{ print $1; }' | sed 's/ /_/g' | sed 's/>Gluconobacter_morbifer_G707_75_/>g_morbifer_/g'  > g_morbifer.fa
rm AGQV01*fna AGQV00000000.contig.fna.tgz

# Providencia burhodogranariea
wget -q
tar -xvzf AKKL00000000.contig.fna.tgz
cat AKKL01*fna | sed 's/>.*|/>/g' | sed 's/> />/' | awk -F, '{ print $1; }' | sed 's/ /_/g' | sed 's/Providencia_burhodogranariea_DSM_19968_contig000/p_burhodogranariea_/g' > p_burhodogranariea.fa
rm AKKL01*fna AKKL00000000.contig.fna.tgz

# download genomes for microbes known to be associated with D. melanogaster
# from Ensembl & rename fasta headers to human readable form

# Providencia alcalifaciens
wget -q
gunzip Providencia_alcalifaciens_dmel2.GCA_000314875.2.19.dna.toplevel.fa.gz
cat Providencia_alcalifaciens_dmel2.GCA_000314875.2.19.dna.toplevel.fa | sed 's/dna.*//g' | sed 's/>contig000/>p_alcalifaciens_/g' > p_alcalifaciens.fa
rm Providencia_alcalifaciens_dmel2.GCA_000314875.2.19.dna.toplevel.fa*

# Providencia rettgeri
wget -q
gunzip Providencia_rettgeri_dmel1.GCA_000314835.1.19.dna.toplevel.fa.gz
cat Providencia_rettgeri_dmel1.GCA_000314835.1.19.dna.toplevel.fa | sed 's/dna.*//g' | sed 's/>Contig/>p_rettgeri_/g' > p_rettgeri.fa
rm Providencia_rettgeri_dmel1.GCA_000314835.1.19.dna.toplevel.fa* 

# Enterococcus faecalis
wget -q
gunzip Enterococcus_faecalis_fly1.GCA_000157415.1.19.dna.toplevel.fa.gz
cat Enterococcus_faecalis_fly1.GCA_000157415.1.19.dna.toplevel.fa | sed 's/dna.*//g' | sed 's/>GG/>e_faecalis_GG/g' > e_faecalis.fa
rm Enterococcus_faecalis_fly1.GCA_000157415.1.19.dna.toplevel.fa*

# create D. melanogaster "hologenome" 
# from individual species fasta files
cat dm3.fa sacCer3.fa w_pipientis.fa p_entomophila.fa c_intestini.fa a_pomorum.fa g_morbifer.fa p_burhodogranariea.fa p_alcalifaciens.fa p_rettgeri.fa e_faecalis.fa > dm3_hologenome.fa
makeDmelHologenome.shview rawview file on GitHub

Finally, since some tools (e.g. SAMtools faidx) require all reference genome sequences to have the same number of characters per line, but since different genome databases use different numbers of characters per line the file above will have heterogeneous character counts for different species. To fix this, I use fasta_formatter from the FASTX toolkit to convert the dm3_hologenome.fa into a file with fixed character lengths. To download and run this script, with the conversion to fixed line lengths, execute the following:

$ wget
$ sh
$ fasta_formatter -i dm3_hologenome.fa -o dm3_hologenome_v1.fa -w 50

Credits: Thanks go to Douda Bensasson for some SED tips during the construction of this script.

PacBio Whole Genome Shotgun Sequences for the D. melanogaster Reference Strain

Posted 31 Jul 2013 — by caseybergman
Category drosophila, genome bioinformatics, high throughput sequencing

As part of a collaboration with Danny Miller and Scott Hawley from the Stowers Institute, we have generated whole genome shotgun sequences using PacBio RS technology for the Drosophila melanogaster y; cn, bw, sp strain (Bloomington 2057), the same strain that was also used to assemble the D. melanogaster reference genome sequence. We’ve been meaning to release these data to the community since we got the data in April, but have been waylaid by teaching commitments and a spate of recent server problems.  Prompted by Danny’s visit to the Bergman Lab over the last two weeks and the generous release by Dmitri Petrov’s lab of a data set of Illumina long reads using the Moleculo technology for the same strain of D. melanogaster, we’ve finally gotten around to bundling these D. melanogaster PacBio sequences for general release. We’re hoping that the availability of both PacBio and Moleculo long-read data for the same strain that has one of the highest quality reference genomes for any species will allow the genomics community to investigate directly the pros/cons of each of these new exciting technologies.

These PacBio sequence data were generated by the University of Michigan DNA Sequencing Core facility on a PacBio RS using DNA from 10 adult males. Flies were starved for 4 hours to reduce microbial DNA contaminants before freezing at -80oC and prepped using the Qiagen DNeasy Blood & Tissue Kit (catalog number 69504). Six SMRT cells were used for sequencing 5 Kb fragment libraries on two Long Read mode runs: 1 cell on the first run (Run53) and 5 cells on the second run (Run55). Excluding the circular consensus (CCS) reads, and combining data from the S1 and S2 files for each of the six cells, we obtained 1,357,183,439 bp of raw DNA sequence from this experiment or roughly 7.5x coverage of the 180 Mb male D. melanogaster genome.

A ~63 Gb tar.gz of the entire PacBio long read dataset including .fasta, .fastq, and .h5 files can be found here. To help newcomers to PacBio data (like us) get over the hurdle of installing the PacBio long-read aligner blasr, we are also distributing CentOS 6/Scientific Linux 64-bit and source RPMs for blasr, made by Peter Briggs in the University of Manchester Bioinformatics Core Facility.

We have also generated 100bp paired-end Illumina data from the same stock, to aid with error-correction of the PacBio long reads. These data were generated by the Stowers Institute Molecular Biology Core facility. As above, genomic DNA was prepared from 10 starved, adult males using the Qiagen DNeasy Blood & Tissue Kit. 1ug of DNA from each was fragmented using a Covaris S220 sonicator (Covaris Inc.) to 250 base pair (bp) fragments by adjusting the treatment time to 85 seconds. Following manufacturer’s directions, short fragment libraries were made using the KAPA Library Preparation Kits (KAPA Biosystems, Cat. No. KK8201) and Bioo Scientific NEXTflex™ DNA Barcodes (Bioo Scientific, Cat. No. 514104). The resulting libraries were purified using Agencourt AMPure XP system (Beckman Coulter, Cat. No. A63880), then quantified using a Bioanalyzer (Agilent Technologies) and a Qubit Fluorometer (Life Technologies).  The library was pooled with several other strains, re-quantified and run as high output mode on six 100 bp paired-end lanes on an Illumina HiSeq 2000 instrument, using HiSeq Control Software and Real-Time Analysis (RTA) version Secondary Analysis version CASAVA-1.8.2 was run to demultiplex reads and generate FASTQ files.

The Illumina 100 bp paired-end dataset in .fastq format can be at ENA under accession ERX645969. Additional Illumina data for the ISO1 reference strain has been generated previously by Chuck Langley’s lab, either from mixed adult males and females (SRX040484) or haploid embryos (SRX040485, SRX040486, SRX040491) that could be used to supplement the PacBio error correction.

As with previous unpublished data we have released from the Bergman Lab, we have chosen to release these genomic data under a Creative Commons CC-BY license, which requires only that you credit the originators of the work as specified below.  However, we hope that users of these data respect the established model of genomic data release under the Ft. Lauderdale agreement that is traditionally honored for major sequencing centers.  Until the paper describing these genomic data is published, please cite:

  • Miller, D.E., C.B. Smith, R.S. Hawley and C.M. Bergman (2013) PacBio Whole Genome Shotgun Sequences for the D. melanogaster Reference Strain.

Visualizing Biological Data Using the SVGmap Browser

Posted 14 Mar 2013 — by caseybergman
Category data mining, drosophila

Early in 2012, Nuria Lopez-BigasBiomedical Genomics Group published a paper in Bioinformatics describing a very interesting tool for visualizing biological data in a spatial context called SVGmap. The basic idea behind SVGMap is (like most good ideas) quite straightforward – to plot numerical data on a pre-defined image to give biological context to the data in an easy-to-interpret visual form.

To do this, SVGmap takes as input an image in Scalable Vector Graphics (SVG) format where elements of the image are tagged with an identifier, plus a table of numerical data with values assigned to the same identifier as in the elements of the image. SVGMap then integrates these files using either a graphical user interface that runs in standard web browser or a command line interface application that runs in your terminal, allowing the user to display color-coded numerical data on the original image. The overall framework of SVGMap is shown below in an image taken from a post on the Biomedical Genomics Group blog.

Overview of the SVGmap

We’ve been using SVGMap over the last year to visualize tissue-specific gene expression data in Drosophila melanogaster from the FlyAtlas project, which comes as one of the pre-configured “experiments” in the SVGMap web application.

More recently, we’ve been also using the source distribution of SVGMap to display information about the insertion preferences of transposable elements in a tissue-specific context, which as required installing and configuring a local instance of SVGMap and run it via the browser. The documentation for SVGMap is good enough to do this on your own, but it took a while for us to get a working instance the first time around. We ran into the same issues again the second time, so I thought I write up my notes for future reference and to help others get SVGMap up and running as fast as possible.

Install and Launch the SVGMap exectuable

Open a terminal and execute the following commands:

cd svgmap-server
sh &

This will install and launch the SVGMap executable which runs as a web application on your machine on port 8095.

Configure your “experiment”

The combination of an image and data source is considered an “experiment” in the SVGmap framework.  Here we are going to use one of the SVG images (of the adult Dipteran body plan adapted from Creative Commons Licensed images in Wikipedia found here and here) provided by SVGmap plus a new data source from our transposable elment insertion site analysis as an example.

1) Type http://localhost:8095 into your web browser and click “Return”. You will now see a local version of SVGmap running on your machine which is identical to the version of SVGmap hosted at: but with nodatasource installed.

2) Click “Browse” tab at top of page, which will take you to another page where you can “Select the experiment to browse”. Skip this for now, since we are still in the process of configuration.

3) Click “configure browser” in the upper right hand corner of the page. This will take you to the “Experiment configuration area”.

4) Log in with the following credentials and click “Accept”:

  • Login: svgmap
  • Password: svgmap

5) This will bring up an option to add a new source in the Experiment configuration area. Click “New Source”.

6) This will send you to the “Add new experiment page”, where you can enter name of the new Experiment (e.g. “Test Experiment”).

7) Next, select the test SVG file of the Dipteran body plan by clicking “Browse…” next to the “Search SVG file” box and navigating to the examples/Drosophila_expression/ directory in your svmap installation directory and selecting the file named “housefly_anatomy_bnw.svg”.

8) Next we need to load in the file containing the data you would like to visualize. The formatting specification of the datafile for SVGmap is documented here. In our simple test case, we only want to visualize a single set of numerical values over different Dipteran tissues using our test data file which you can obtain as follows:


To load this test datafile, click “Browse…” next to the “Data file” dialogue box, navigate to the location of the test data file you just downloaded and select it.

9) Finally, click “Save and configure.”

10) This will send you to the “Edit experiment configuration” page, where you can select how to display your numerical data. In our test case, select “ratio” for the “Image option values” (it’s the only option, but it needs to be selected) and “linear two sided” for the Scale option. Make sure image radio button is selected, then click “Accept”.

Visualize your Data

Your new experiment has now been loaded and configured and you are ready to browse your data in the context of your image.

1) Click the “Browse” tab at top of page.

2) Select the experiment you just created from drop down menu (e.g. “Test Experiment”) and click “view”.

3) Enter a “Keyword(probe)” for the data you want to visualize, which in our case is called “TE” and click “Search”. This should bring up an image of a fly, color-coded according to your data with a scale bar below indicating the color associated with a particular numerical value of your data.

4) You can now adjust the range and color code for your scale by clicking the “edit” button next to “linear two sided scale”. For example, if you want a symmetrical scale that brackets the full range of our test TE data, click the radio button for “” and set Min to -0.3 and Max to 0.3, scroll down to the bottom of the pop-up box and click “OK”. Then click “Search” again to update your image, and you should see something that looks like the following screenshot:


5) Finally, you can export the image by clicking “Export to file” in the bottom right hand corner of the page. This will export at zipfile of the entire experiment, including the original input SVG file and a slightly modified version of you data file, plus the labeled SVG file and scale bar as a separate JPG file.

There you have it. You’ve now successfully installed SVGmap and configured a customized experiment for visualizing biological data on a SVG image, in this case the Dipteran adult body plan. In the next post, I’ll give a run through of how to use SVGmap at the command line, and introduce a stripped down R implementation of the SVGmap approach, called FlyFig, we have developed in the Bergman Lab to visualize numerical data on the Dipteran body plan.

Compendium of Drosophila Microbial Symbiont Genome Assemblies

Posted 25 Jan 2013 — by caseybergman
Category drosophila, genome bioinformatics, microbes

Following on from a recent post where we compiled genome assemblies for species in the Drosophila genus, here we bring together the growing list of genome assemblies for microbial symbionts of Drosophila species. Please add information about any other Drosophila symbiont genome sequences assemblies that might be missing from this table in the comments below.

Assembly Producer
Wolbachia pipientis wMel D. melanogaster TIGR AE017196
Wolbachia pipientis wMelPop D. melanogaster O’Neill Lab AQQE00000000
Wolbachia pipientis wAu D. simulans Sinkins Lab LK055284
Wolbachia pipientis wSim D. simulans Salzberg Lab AAGC00000000
Wolbachia pipientis wRi D. simulans Andersson Lab NC_012416
Wolbachia pipientis wHa D. simulans Andersson Lab NC_021089
Wolbachia pipientis wNo D. simulans Andersson Lab NC_021084
Wolbachia pipientis wSuzi D. suzukii Anfora Lab CAOU02000001
Wolbachia pipientis wAna D. annanasae Salzberg Lab AAGB00000000
Wolbachia pipientis n.a. D. willistoni Salzberg Lab AAQP00000000
Wolbachia pipientis wRec D. recens Bordenstein Lab JQAM00000000
Pseudomonas entomophila L48 D. melanogaster Genoscope NC_008027
Commensalibacter intestini A911 D. melanogaster Lee Lab AGFR00000000
Acetobacter pomorum DM001 D. melanogaster Lee Lab AEUP00000000
Gluconobacter morbifer G707 D. melanogaster Lee Lab AGQV00000000
Providencia sneebia DSM 19967 D. melanogaster Lazzaro Lab AKKN00000000
Providencia rettgeri Dmel1 D. melanogaster Lazzaro Lab AJSB00000000
Providencia alcalifaciens Dmel2 D. melanogaster Lazzaro Lab AKKM00000000
Providencia burhodogranariea DSM 19968 D. melanogaster Lazzaro Lab AKKL00000000
Lactobacillus brevis EW D. melanogaster Lee Lab AUTD00000000
Lactobacillus plantarum WJL D. melanogaster Lee Lab AUTE00000000
Lactobacillus plantarum DmelCS_001 D. melanogaster Douglas Lab JOJT00000000
Lactobacillus fructivorans DmelCS_002 D. melanogaster Douglas Lab JOJZ00000000
Lactobacillus brevis DmelCS_003 D. melanogaster Douglas Lab JOKA00000000
Acetobacter pomorum
DmelCS_004 D. melanogaster Douglas Lab JOKL00000000
Acetobacter malorum DmelCS_005 D. melanogaster Douglas Lab JOJU00000000
Acetobacter tropicalis DmelCS_006 D. melanogaster Douglas Lab JOKM00000000
Lactococcus lactis Bpl1 D. melanogaster Douglas Lab JRFX00000000
Enterococcus faecalis Fly1 D. melanogaster Broad Institute ACAR00000000

Release of 20 European Drosophila melanogaster genomes

Posted 17 Jul 2012 — by caseybergman
Category drosophila, high throughput sequencing, population genomics

[Update: These genome sequences (and 30 others) have been deposited in ENA under project accession ERP009059. If you use these data in your work, please cite as Bergman & Haddrill (2015) Strain-specific and pooled genome sequences for populations of Drosophila melanogaster from three continents. F1000Research 2015, 4:31.]

Background: As part of an ongoing efforts to characterize genetic diversity in the nuclear and cytoplasmic genomes of D. melanogaster, the Haddrill and Bergman Labs have collaborated to sequence the complete genomes of 20 D. melanogaster isofemale strains collected by Penny Haddrill in Montpellier, France in August 2010. These 20 genomes represent a random sample of the full collection described by Haddrill and Vespoor (2011), which also describes microsatellite variation data for these strains.

Following on from the very generous early release of D. melanogaster genomes by major resequencing efforts in Drosophila, we have decided to follow suit and release these genomes prior to publication to maximise their utility by the wider research community and prevent unnecessary duplication of effort. One major aim for our sequencing of a reasonably high number of strains from this European population is to provide a complementary dataset to help interpret the larger samples of North American and (predominantly) African strains from the Drosophila Genetic Reference Panel and Drosophila Population Genomics Project, respectively. For more on the philosophy behind why we have made the decision to release these data early, please see this blog post on genomic data release by individual labs in the next-generation sequencing era.

Methods: Genomic DNA was prepared by Penny Haddrill for each isofemale line by pooling fifty females, snap freezing them in liquid nitrogen, extracting DNA using a standard phenol-chloroform extraction protocol with ethanol and ammonium acetate precipitation. 500 bp short-insert libraries were constructed and 91 bp paired-end reads were generated using an Illumina HiSeq 2000 to an estimated coverage of ~50x per strain by BGI-Hong Kong. Basic QC on reads was performed by BGI and mapping to the Wolbachia genome following the protocol in Richardson et al. (submitted) confirmed the same infection status for as determined by PCR in Haddrill and Vespoor (2011).

Conditions for use: The Haddrill and Bergman labs intend to use these data to study patterns of genetic diversity in the nuclear and cytoplasmic genomes, to estimate the ratio of diversity on the X chromosome relative to the autosomes, to detect signatures of both positive and negative selection in the nuclear and cytoplasmic genomes, and investigate the impact of variation in recombination rate around the genome.

We have decidede to release these genomic data under a Creative Commons CC-BY license, which requires only that you credit the originators of the work as specified below.  However, we hope that users of these data respect the established model of genomic data release under the Ft. Lauderdale agreement that is traditionally honored for major sequencing centers.  Until the paper describing these genomes is published, please cite these data as:

  • Haddrill, P. and C.M. Bergman (2012) 20 Drosophila melanogaster genomes from Montpellier, France.

[Update: These genome sequence have now been published. If you use these data in your work, please cite as Bergman & Haddrill (2015) Strain-specific and pooled genome sequences for populations of Drosophila melanogaster from three continents. F1000Research 2015, 4:31.]

The data: Gzipped Illumina fastq files for forward and reverse paired reads can be downloaded at the following locations. A script to dowload all files in serial can be found below the table.

[Update: These genome sequences (and 30 others) have been deposited in ENA under project accession ERP009059. If you use these data in your work, please cite accession numbers ERR705945-ERR705964 in your work.]

Compendium of Drosophila Genome Assemblies

Posted 28 Apr 2012 — by caseybergman
Category drosophila, genome bioinformatics, high throughput sequencing

[UPDATE: The UCSD Drosophila Species Stock Center now maintains a list of information about Drosophila genome sequences: The compendium here will no longer be updated. To ensure a single, up-to-date source of information, please send suggested links to the DSSC.]

Since the publication of the Drosophila melanogaster genome, the D. pseudoobscura genome, and the remaining 10 species from the Drosophila 12 Genomes Project, a number of other Drosophila genomes have been sequenced and assembled using whole genome shotgun approaches. Periodically, I find myself needing to find versions of these assemblies from the data producer, NCBI, Flybase, or UCSC, and so to help streamline this process I’ve collected these links together into a single compendium here, along with links to the original stocks used for sequencing. Hopefully this will be of use to others as well.

Data Producer
UCSC Genome Browser
Stock Center
D. melanogaster Berkeley Drosophila Genome Project AABU00000000 dmel dm3 1522
D. simulans Washington University Genome Sequencing Center AAGH00000000 dsim droSim1 n.a.
D. simulans Andolfatto Lab n.a. n.a. n.a. 1380
D. mauritiana Schlotterer Lab n.a. n.a. n.a. E-18912
D. sechellia Broad Institute AAKO00000000 dsec droSec1 1373
D. yakuba Washington University Genome Sequencing Center AAEU00000000 dyak droYak2 1385
D. santomea Andolfatto Lab n.a. n.a. n.a. 2367
D. erecta Agencourt Biosciences AAPQ00000000 dere droEre2 1374
D. ficusphila Baylor College of Medicine Genome Center AFFG00000000 n.a. n.a. 2332
D. eugracilis Baylor College of Medicine Genome Center AFPQ00000000 n.a. n.a. 2350
D. suzukii University of Edinburgh GenePool CAKG00000000 n.a. n.a. n.a
D. biarmipes Baylor College of Medicine Genome Center AFFD00000000 n.a. n.a. 2312
D. takahashii Baylor College of Medicine Genome Center AFFI00000000 n.a. n.a. 2325
D. elegans Baylor College of Medicine Genome Center AFFF00000000 n.a. n.a. 2334
D. rhopaloa Baylor College of Medicine Genome Center AFPP00000000 n.a. n.a. 2327
D. kikkawai Baylor College of Medicine Genome Center AFFH00000000 n.a. n.a. 2330
D. bipectinata Baylor College of Medicine Genome Center AFFE00000000 n.a. n.a. 2329
D. ananassae Agencourt Biosciences AAPP00000000 dana droAna3 1368
D. pseudoobscura Baylor College of Medicine Genome Center AADE00000000 dpse dp4 1277
D. persimilis Broad Institute AAIZ00000000 dper droPer1 1369
D. miranda Bachtrog Lab AJMI00000000 n.a. n.a. n.a.
D. willistoni J. Craig Venter Institute AAQB00000000 dwil droWil1 1375
D. virilis Agencourt Biosciences AANI00000000 dvir droVir3 1371
D. americana Vieira Lab n.a. n.a. n.a. n.a.
D. mojavensis Agencourt Biosciences AAPU00000000 dmoj droMoj3 1370
D. grimshawi Agencourt Biosciences AAPT00000000 dgri droGri2 1225
D. albomicans Wang/Bachtrog Labs ACVV00000000 n.a. n.a. n.a.

Wolbachia and Mitochondrial Genomes from Drosophila melanogaster Resequencing Projects

As part of ongoing efforts to characterize complete genome sequences of microbial symbioints of Drosophila species, the Bergman Lab has been involved in mining complete genomes of the Wolbachia endosymbiont from whole-genome shotgun sequences of D. melanogaster. This work is inspired by Steve Salzberg and colleagues’ pioneering paper in 2005 showing that Wolbachia genomes can be extracted from the whole-genome shotgun sequence assemblies of Drosophila species. We have adapted this technique to utilize short-read next generation sequencing data from population genomics projects in D. melanogaster, together with the reference Wolbachia genome published by Wu et al. (2005).

Currently we have identified infection status and extracted nearly-complete genomes from the two major resequencing efforts in D. melanogaster: the Drosophila Genetic Reference Panel (DGRP) and Drosophila Population Genomics Project (DPGP). We employ a fairly standard BWA/SAMtools  pipeline, with a few tricks that are essential for getting good quality assemblies and consensus sequences. Our first iteration of this pipeline was used to predict infection status in the DGRP strains. These results have been published in the DGRP main paper published in early 2012, and can be found in Supplemental Table 9 of the DGRP paper or more usefully in a comma-separated file at the DGRP website.

We have now applied an update version of our pipeline to both the DGRP and DPGP datasets and have used these “essentially complete” genome sequences to study the recent evolutionary history of Wolbachia in D. melanogaster. In the process of generating these data we received several inquiries about the status of this project so, in the spirit of Open Science that makes genomics such a productive field, we have released the consensus sequences and alignments of 179 Wolbachia and 290 mitochondrial genomes from the DGRP and DPGP prior to publication of our manuscript.

Since we are not the primary data producer for these assemblies, it is not approporiate to employ a data release policy based on the NHGRI guidelines. Instead, we have chosen to release these data under a Creative Commons Attribution 3.0 Unported License. If you use these assemblies or alignments in your projects, please cite the main DGRP and DPGP papers for the raw data and cite the following reference for the mtDNA or Wolbachia assemblies:

Richardson, M.F., L.A. Weinert, J.J. Welch, R.S. Linheiro, M.M. Magwire, F.M. Jiggins & C.M. Bergman (2012) Population genomics of the Wolbachia endosymbiont in Drosophila melanogaster. PLOS Genetics 8:e1003129.

A tar.gz archive contaning a reference-based multiple sequence alignment of release 1.0 of the complete D. melanogaster mtDNA and Wolbachia assemblies can be obtained here. If you have questions about the methods used to produce these assemblies, any associated meta-data from these assemblies, or the content of our manuscript please contact Casey Bergman for details.

Creative Commons License

Customizing a Local UCSC Genome Browser Installation III: Loading Custom Data

Posted 29 Mar 2012 — by timburgis
Category database, drosophila, genome bioinformatics, linux, UCSC genome browser

In the second post of this series we examined the grp and trackDb tables within an organism database in the UCSC genome browser schema, and how they are used to build a UCSC genome browser display by specifying: (i) what sort of data a track contains, (ii) how it should be displayed, and (iii) how tracks should be organized within the display. In this last post of three, we will look at how to load and subsequently display custom data into a UCSC genome browser instance. As an example, we’ll be using a set of data from the modENCODE project formatted for loading into the UCSC browser. For this you’ll need a working browser install and the dm3 organism database (see the first post in this series for details).

An important feature of the genome browser schema that permits easy customization is that you can have multiple grp and trackDb tables within an organism database.  This allows separation of tracks for different purposes, such as trackDb_privateData, trackDb_chipSeq, trackDb_UserName etc. The same goes for grp tables. Thus, you can pull in data from the main UCSC site, leave it untouched, and simply add new data by creating new trackDb and grp tables.

This post outlines the five steps for adding custom data into a local UCSC genome browser:

1) Getting the data into the correct format
2) Building the required loader code
3) Loading the data
4) Configuring the track’s display options
5) Creating the new trackDb and grp tables and configuring the browser to use it.

1) The UCSC browser supports a number of different data formats, which are described at the UCSC format FAQ. Defining where your data comes from, and what sort of data it is (e.g. an alignment, microarray expression data etc.) will go a long way towards determining what sort of file you need to load. In many cases your data will already be in one of the supported formats so the decision will have been made for you. If your data isn’t in one of the supported formats, you’ll need to decide which one is the most appropriate and then reformat the data using a suitable tool, e.g. a perl or bash script. The example data we will be loading is taken from this paper. The supplementary data contains an Excel spreadsheet that we need to reformat into a number of BED files and then load into the UCSC database.

2) Each of the data formats that can be loaded into the browser system has its own loader program. These are found in the same source tree that we used to build the browser CGIs. You will find the source code for the loaders in the Kent source tree, the BED loader is in the directory kent/src/hg/makeDb/hgLoadBed. Assuming you have followed the instructions in the first part of this blog post then to compile the BED loader you simply cd to its directory and type make.

cd kent/src/hg/makeDb/hgLoadBed

The hgLoadBed executable will now be in ~/bin/$MACHTYPE. If you haven’t already it is a good idea to add this directory to your path, add the following line to ~/.bashrc:

export PATH=$PATH:~/bin/$MACHTYPE

The other loader we’ll require for the example custom data included in this blog is hgBbiDbLink:

cd kent/src/hg/makeDb/hgBbiDbLink

3) To load data, e.g. a BED file, the command is simply:

hgLoadBed databaseName tableName pathToBEDfile

The data-loaders supplied with the browser can be divided into two broad groups. The first of these, as represented by hgLoadBed, actually load the data into a table within the organism database. For example, one of the histone-modification datasets looks like this as a BED file:

chr2L   5982    6949    ID=Embryo_0_12h_GAF_ChIP_chip.GAF       5.54
chr2L   65947   68135   ID=Embryo_0_12h_GAF_ChIP_chip.GAF       10.54
chr2L   72137   73344   ID=Embryo_0_12h_GAF_ChIP_chip.GAF       6.55
chr2L   107044  109963  ID=Embryo_0_12h_GAF_ChIP_chip.GAF       8.93
chr2L   127917  130678  ID=Embryo_0_12h_GAF_ChIP_chip.GAF       9.17

When this BED file is loaded using hgLoadBed it produces a table that has the following structure:

DESC MdEncHistonesGAFBG3DccId2651;
| Field      | Type                 | Null | Key | Default | Extra |
| bin        | smallint(5) unsigned | NO   |     | NULL    |       |
| chrom      | varchar(255)         | NO   | MUL | NULL    |       |
| chromStart | int(10) unsigned     | NO   |     | NULL    |       |
| chromEnd   | int(10) unsigned     | NO   |     | NULL    |       |
4 rows in set (0.00 sec)

And contains (results set truncated):

select * from MdEncHistonesGAFBG3DccId2651;
| 754 | chrX     |   22237479 | 22239805 |
| 754 | chrX     |   22255450 | 22257285 |
| 754 | chrX     |   22257996 | 22259538 |
| 754 | chrX     |   22259653 | 22261174 |
| 754 | chrX     |   22264098 | 22265194 |
| 585 | chrYHet  |      84149 |    85164 |
| 585 | chrYHet  |     104864 |   105879 |
| 586 | chrYHet  |     150067 |   151212 |
5104 rows in set (0.06 sec)

The second type of loader doesn’t insert the data into a database table directly. Instead data in formats such as bigWig and bigBed are deposited in a particular location on the filesystem and a table with a single row simply points to them. For example:

desc mdEncReprocCbpAdultMaleTreat;
| Field    | Type         | Null | Key | Default | Extra |
| fileName | varchar(255) | NO   |     | NULL    |       |
1 row in set (0.00 sec)

Which contains:

select * from mdEncReprocCbpAdultMaleTreat;
| fileName                                     |
| /gbdb/dm3/modencode/ |
1 row in set (0.02 sec)

The loader hgBbiDbLink is responsible for producing this simple table that points to the location of the data on the filesystem. When the browser comes to display a track that is specified as bigWig in trackDb’s type field it knows that the table specified in trackDb’s tableName field will give point it at the bigWig file rather than containing the data itself. Supposed we have placed our bigWig file in the /gbdb directory:

cp /gbdb/dm3/modencode/.

Then we must execute hgBbiDbLink as follows:

hgBbiDbLink databaseName desiredTableName pathToDataFile


hgBbiDbLink dm3 mdEncReprocCbpAdultMaleTreat /gbdb/dm3/modencode/

4) Once the data is loaded into the database we need to tell the browser how it should be displayed, this is the role of the trackDb table. We create a new trackDb table for our data using the command hgTrackDb. This is found in the same branch of the source tree and the individual loaders themselves. The way in which tracks and grouped and displayed by the browser is specified in a file which you pass to hgTrackDb called a .ra file, which specifies the contents of the different fields in the trackDb table, as well as the parent-child relationship for composite tracks. The example data provided in this blog post contains a trackDb.ra file that looks like this:

track dnaseLi
shortLabel DNAse Li 2011
longLabel DNAse Li Eisen Biggin Genome Biol 2011
group blog
priority 100
visibility hide
color 200,20,20
altcolor 200,20,20
compositeTrack on
type bed 3

    track dnaseLiStage5
    shortLabel Stage 5
    longLabel Li Dnase Stage 5
    parent dnaseLi
    priority 1

    track dnaseLiStage9
    shortLabel Stage 9
    longLabel Li Dnase Stage 9
    parent dnaseLi
    priority 2

    track dnaseLiStage10
    shortLabel Stage 10
    longLabel Li Dnase Stage 10
    parent dnaseLi
    priority 3

    track dnaseLiStage11
    shortLabel Stage 11
    longLabel Li Dnase Stage 11
    parent dnaseLi
    priority 4

    track dnaseLiStage14
    shortLabel Stage 14
    longLabel Li Dnase Stage 14
    parent dnaseLi
    priority 5

The example data (e.g. /root/ucsc_example_data) comes from 5 different stages of embryonic development so we have 5 BED files. The trackDb.ra file groups these as a composite track so we have the top level track (dnaseLi) which is the parent of the 5 individual tracks. To load these definitions into the database we execute hgTrackDb as follows:

~/bin/i386/hgTrackDb drosophila dm3 trackDb_NEW kent/src/hg/lib/trackDb.sql /root/ucsc_example_data

The complete set of commands to download the example data and load it are is therefore:

tar xzvf blogData.tar.gz
cd ucsc_example_data
~/bin/i386/hgLoadBed dm3 dnaseLiStage5 stage5.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage9 stage9.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage10 stage10.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage11 stage11.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage14 stage14.bed
~/bin/i386/hgTrackDb drosophila dm3 trackDb_NEW kent/src/hg/lib/trackDb.sql /root/ucsc_example_data

5) Tell the browser to use our new trackDb table by editing the hg.conf configuration file in /var/www/cgi-bin and adding the new trackDB table name to the db.trackDb line. The db.TrackDb line takes a comma-separated list of trackDb tables that the browser will display. Assuming that our new trackDb table is called trackDb_NEW then we must edit hg.conf from this:


to this:


You’ll notice in the supplied trackDb.ra file that the tracks belong to a group called ‘blog’. This group doesn’t currently exist in the database. There are two ways we can go about rectifying this. The first is to add blog to the existing grp table like so:

INSERT INTO grp (name,label,priority) VALUES('blog','Example Data',11);

Or you can create an entirely new grp table. The browser handles multiple grp tables in the same fashion as multiple trackDb tables, i.e. they are specified as a comma-separated list in hg.conf. To create a new grp table, execute:

INSERT INTO grp_NEW (name,label,priority) VALUES('blog','Example Data',11);

And then edit hg.conf to add grp_NEW to the db.grp line:


If you now refresh your web browser the cgi scripts should create a new session with the new data tracks displayed as configured in the grp table.