Archive for the ‘genome bioinformatics’ Category

Automating ASCP-based submission of NGS data to ENA using expect

Posted 12 Dec 2014 — by caseybergman
Category EBI, filesharing, genome bioinformatics, high throughput sequencing, linux, NCBI

Submitting next-generation sequencing (NGS) data to International Nucleotide Sequence Database Collaboration repositories such as EBI or NCBI is an essential, but time-consuming, phase of many projects in modern biology. Mechanisms to speed up NGS data submission are therefore a welcome addition to many biologists’ toolkits. The European Nucleotide Archive (ENA) provides detailed instructions on how to submit NGS data to their “dropbox” staging area, which is a key part of the overall submission process for an NGS project. This can be done in three ways:

  • Using ENA’s Webin client in your web browser,
  • Using a standard FTP client at the command line, or
  • Using Aspera’s ASCP client at the command line.

For users with many samples to submit, the latter two command line options are clearly the preferred methods. While FTP is installed by default on most linux systems, transfer of large NGS data files by FTP is slow, and on some systems (such as ours) FTP is specifically disabled because of security concerns.

In contrast, ASCP is not installed by default on most linux systems, but provides a very fast method to transfer large data files to EBI or NCBI. One of the downsides of using ASCP is that it interactively prompts users for password information for each file transferred. This requires babysitting your ASCP-based command line submission and supplying the same password for each file, thereby undermining much of the automation that a command line solution should provide.

In searching around for solutions to the ASCP-babysitting problem, I stumbled on a documentation page entitled “Expect script for automating Aspera uploads to the EBI ENA” written by Robert Davey at The Genome Analysis Centre. I had never heard of the expect scripting language prior to reading this post, but it provides exactly the solution I was looking for:

[Expect] is used to automate control of interactive applications such as telnet, ftp, passwd, fsck, rlogin, tip, ssh, and others. Expect uses pseudo terminals (Unix) or emulates a console (Windows), starts the target program, and then communicates with it, just as a human would, via the terminal or console interface. (Wikipedia)

Robert’s expect script was a little more complicated than I needed for my purposes, and required a few tweaks to conform to updates to EBI’s ASCP submission process. Without too much trouble, I cobbled together a modified version that solves the automated transfer of any number of data files to the top level of your ENA dropbox:

#!/usr/bin/expect

# usage: expect ena_submit.exp <file-of-filenames> <Webin-ID> <Webin-password>
set fofn    [lindex $argv 0]
set dropbox [lindex $argv 1]
set pass    [lindex $argv 2]

# read the file of filenames
set files [open $fofn]
set subs  [read $files]
close $files

# disable expect's 10-second default timeout, since large transfers take a while
set timeout -1

foreach line [split $subs \n] {
  if { "" != $line } {
    # launch ascp for this file and answer its interactive password prompt
    spawn ascp -QT -l80M -d $line $dropbox@webin.ebi.ac.uk:.
    expect "Password:"
    send "$pass\r"
    expect eof
  }
}

This script requires expect and ASCP to be installed globally on your system, and for the user to provide three arguments:

  • a file of filenames (with the full path) to the files you would like to submit to ENA
  • the ID for your Webin submission account, and
  • the password for your Webin submission account

For example, if you have a directory of gzip’ed fastq files that you would like to submit to ENA, all you need to do to submit your files in bulk is navigate to that directory and run something like the following:

#download the script above from github
wget --no-check-certificate https://raw.githubusercontent.com/bergmanlab/blogscripts/master/ena_submit.exp

#create file of filenames
j=`pwd`
for k in `ls *gz`; do echo $j/$k; done > fofn

#perform ASCP submission; note: replace Webin-ID and Webin-Password with your ENA credentials
expect ena_submit.exp fofn Webin-ID Webin-Password
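One caveat: building the file of filenames by parsing `ls` output, as above, breaks if any filename contains spaces. A more robust sketch of the same fofn-building step, using find (shown for a hypothetical directory of .gz files), would be:

```shell
# write absolute paths of all gzip'ed files in the current directory
# to fofn, one per line, without parsing ls output
find "$PWD" -maxdepth 1 -name '*.gz' | sort > fofn
```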

High Coverage PacBio Shotgun Sequences Aligned to the D. melanogaster Genome

Shortly after we released our pilot dataset of Drosophila melanogaster PacBio genomic sequences in late July 2013, we were contacted by Edwin Hauw and Jane Landolin from PacBio to collaborate on the collection of a larger dataset for D. melanogaster with longer reads and higher depth of coverage. Analysis of our pilot dataset revealed significant differences between the version of the ISO1 (y; cn, bw, sp) strain we obtained from the Bloomington Drosophila Stock Center (strain 2057) and the D. melanogaster reference genome sequence. Therefore, we contacted Susan Celniker and Roger Hoskins at the Berkeley Drosophila Genome Project (BDGP) at the Lawrence Berkeley National Laboratory in order to use the same subline of ISO1 that has been used to produce the official BDGP reference assemblies from Release 1 in 2000 to Release 5 in 2007. Sue and Charles Yu generated high-quality genomic DNA from a CsCl preparation of ~2000 BDGP ISO1 adult males collected by Bill Fisher, and provided this DNA in mid-November 2013 to Kristi Kim at PacBio, who handled library preparation and sequencing, which was completed in early December 2013. Since then, Jane and I have worked on formatting and QC’ing the data for release. These data have just been publicly released without restriction on the PacBio blog, where you can find a description of the library preparation, statistics on the raw data collected, and links to preliminary de novo whole genome assemblies using the Celera assembler and PacBio’s newly-released FALCON assembler. Direct links to the raw, preassembled and assembled data can be found on the PacBio “Drosophila sequence and assembly” DevNet GitHub wiki page.

Here we provide an alignment of this high-coverage D. melanogaster PacBio dataset to the Release 5 genome sequence and some initial observations based on this reference alignment. Raw, uncorrected reads were mapped using blasr (-bestn 1 -nproc 12 -minPctIdentity 80) and converted to .bam format using samtools. Since reads were mapped to the Release 5 genome, they can be conveniently visualized using the UCSC Genome Browser by loading a BAM file of the mapped reads as a custom track. To browse this data directly on the European mirror of the UCSC Genome Browser click here, or alternatively paste the following line into the “Paste URLs or data:” box on the “Add Custom Tracks” page of the Release 5 (dm3) browser:

track type=bam bigDataUrl=http://bergmanlab.genetics.uga.edu/data/tracks/dm3/dm3PacBio.bam name=dm3PacBio

As shown in the following figure, mapped read depth is roughly Poisson distributed, with a mean coverage of ~90-95x on the autosomal arms and ~45x on the X chromosome. A two-fold reduction in read depth on the X relative to the autosomes is expected for a DNA sample derived only from heterogametic XY males. Mapped read depth is important to calculate empirically post hoc (not just from theoretical a priori expectations), since the actual genome size of D. melanogaster is not precisely known and varies between males and females, with an upper estimate of the total male genome size (autosomes + X + Y) being ~220 Mb. Thus, we conclude the coverage of this dataset is of sufficient depth to generate long-range de novo assemblies of the D. melanogaster genome using PacBio data only, as confirmed by the preliminary Celera and FALCON assemblies.

[Figure: read depth distributions for the major chromosome arms]

The following browser screenshots show representative regions of how this high coverage PacBio dataset maps to the Release 5 reference genome:

  • A typical region of unique sequence (around the eve locus). Notice that read coverage is relatively uniform across the region and roughly equal on the forward (blue) and reverse (red) strands. Also notice that some reads span the entire distance between neighboring genes, suggesting that unassembled PacBio reads are now long enough to establish syntenic relationships between neighboring genes in Drosophila.
[Figure: PacBio reads mapped across the eve region]
  • A typical region containing a transposable element (TE) insertion (in the ds locus), in this case an 8,126 bp roo LTR retrotransposon (FBti0019098). Substantial numbers of mapped reads align across the junction between the TE and its unique flanking regions. However, there is an excess of shorter reads inside the boundaries of the TE, which are likely to be incorrectly mapped because they are shorter than the active TE sequence and have mapping quality values of ~0. With suitable read length and mapping quality filters, the unique locations of TE insertions in the euchromatin should be identifiable using uncorrected PacBio reads.

[Figure: PacBio reads mapped across the roo insertion in the ds locus]

  • One of the few remaining and persistent gaps in the euchromatic portion of the reference sequence, which can be filled using PacBio sequences. This gap remains in the upcoming Release 6 reference sequence (Sue Celniker and Roger Hoskins, personal communication). This shows that blasr can align PacBio reads through smaller sequence gaps that are represented by padded N’s; however, the visualization of such reads may sometimes be flawed (e.g. by artificially displaying a gap in the mapped read that arises from a lack of alignment to the ambiguous nucleotides in the gap).

[Figure: PacBio reads spanning a gap in the reference sequence]

Below we provide a reproducible transcript of the code used to (i) setup our working environment, (ii) download and unpack this D. melanogaster PacBio dataset, (iii) generate the BAM file of mapped reads released here, and (iv) generate coverage density plots for the major chromosome arms shown above. Part of the rationale for providing these scripts is to showcase how simple it is to install the entire PacBio smrtanalysis toolkit, which is the easiest way we’ve found to install the blasr long-read aligner (plus get a large number of other goodies for processing PacBio data like pacBioToCA, pbalign, quiver, etc).

These scripts are written to be used by someone with no prior knowledge of PacBio data or tools, and can be run on an Oracle Grid Engine powered linux cluster running Scientific Linux (release 6.1). They should be easily adapted for use on any linux platform supported by PacBio’s smrtanalysis toolkit; to run on a single machine, just execute the commands inside the “quote marks” on the lines with qsub statements.
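For the single-machine case, pulling the quoted command out of a qsub line can itself be scripted. A rough sed sketch (assuming, as in the scripts in this post, that each qsub line wraps exactly one double-quoted command):

```shell
# extract the double-quoted command from a qsub submission line, so a
# cluster script can be replayed serially on a single machine
line='qsub -b y -cwd -N wget_job "wget https://example.com/data.tgz"'
echo "$line" | sed -n 's/^qsub .*"\(.*\)"$/\1/p'
# → wget https://example.com/data.tgz
```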

(i) First, let’s download the Release 5 reference genome plus the smrtanalysis toolkit:

#!/bin/bash

# get dm3 and reformat into single fasta file
echo "getting dm3 and reformatting into single fasta file"
wget -q http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/chromFa.tar.gz
tar -xvzf chromFa.tar.gz
cat chr*.fa > dm3.fa
rm chromFa.tar.gz chr*.fa

#install PacBio smrtanalysis suite (including blasr PacBio long-read mapping engine)
echo "installing PacBio smrtanalysis suite"
# the download link does not preserve the installer filename, so set it explicitly
wget -O smrtanalysis-2.1.1-centos-6.3.run http://programs.pacificbiosciences.com/l/1652/2013-11-05/2tqk4f
bash smrtanalysis-2.1.1-centos-6.3.run --extract-only --rootdir ./
rm smrtanalysis-2.1.1-centos-6.3.run
setupDrosPacBioEnv.sh (view file on GitHub)

If you want to have the smrtanalysis toolkit in your environment permanently, you’ll need to add the following magic line to your .profile:

source install/smrtanalysis-2.1.1.128549/etc/setup.sh
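To avoid adding this line more than once across repeated installs, the append can be guarded (a sketch; adjust the path to wherever you extracted the toolkit):

```shell
# append the smrtanalysis setup line to a profile file only if it is not
# already present, so repeated installs don't duplicate it
profile="$HOME/.profile"
setup='source install/smrtanalysis-2.1.1.128549/etc/setup.sh'
grep -qxF "$setup" "$profile" 2>/dev/null || echo "$setup" >> "$profile"
```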

(ii) Next, let’s get the D. melanogaster PacBio data and unpack it:

#!/bin/bash

#get compressed .tar archives of D. melanogaster PacBio runs
echo "get tar archives of PacBio runs"
qsub -b y -cwd -N wget_Dro1_24NOV2013_398.tgz "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro1_24NOV2013_398.tgz"
qsub -b y -cwd -N wget_Dro2_25NOV2013_399.tgz "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro2_25NOV2013_399.tgz"
qsub -b y -cwd -N wget_Dro3_26NOV2013_400.tgz "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro3_26NOV2013_400.tgz"
qsub -b y -cwd -N wget_Dro4_28NOV2013_401.tgz "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro4_28NOV2013_401.tgz"
qsub -b y -cwd -N wget_Dro5_29NOV2013_402.tgz "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro5_29NOV2013_402.tgz"
qsub -b y -cwd -N wget_Dro6_1DEC2013_403.tgz  "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro6_1DEC2013_403.tgz"

#get md5sums for PacBio .tar archives
echo "getting md5sums for PacBio tar archives"
qsub -b y -cwd -N wget_Dro1_24NOV2013_398.md5 "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro1_24NOV2013_398.md5"
qsub -b y -cwd -N wget_Dro2_25NOV2013_399.md5 "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro2_25NOV2013_399.md5"
qsub -b y -cwd -N wget_Dro3_26NOV2013_400.md5 "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro3_26NOV2013_400.md5"
qsub -b y -cwd -N wget_Dro4_28NOV2013_401.md5 "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro4_28NOV2013_401.md5"
qsub -b y -cwd -N wget_Dro5_29NOV2013_402.md5 "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro5_29NOV2013_402.md5"
qsub -b y -cwd -N wget_Dro6_1DEC2013_403.md5  "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro6_1DEC2013_403.md5"

#verify .tar archives have been downloaded properly
echo "verifying archives have been downloaded properly"
qsub -b y -cwd -N md5_Dro1_24NOV2013_398.md5 -hold_jid wget_Dro1_24NOV2013_398.tgz,wget_Dro1_24NOV2013_398.md5 "md5sum -c Dro1_24NOV2013_398.md5 > Dro1_24NOV2013_398.md5.checkresult"
qsub -b y -cwd -N md5_Dro2_25NOV2013_399.md5 -hold_jid wget_Dro2_25NOV2013_399.tgz,wget_Dro2_25NOV2013_399.md5 "md5sum -c Dro2_25NOV2013_399.md5 > Dro2_25NOV2013_399.md5.checkresult"
qsub -b y -cwd -N md5_Dro3_26NOV2013_400.md5 -hold_jid wget_Dro3_26NOV2013_400.tgz,wget_Dro3_26NOV2013_400.md5 "md5sum -c Dro3_26NOV2013_400.md5 > Dro3_26NOV2013_400.md5.checkresult"
qsub -b y -cwd -N md5_Dro4_28NOV2013_401.md5 -hold_jid wget_Dro4_28NOV2013_401.tgz,wget_Dro4_28NOV2013_401.md5 "md5sum -c Dro4_28NOV2013_401.md5 > Dro4_28NOV2013_401.md5.checkresult"
qsub -b y -cwd -N md5_Dro5_29NOV2013_402.md5 -hold_jid wget_Dro5_29NOV2013_402.tgz,wget_Dro5_29NOV2013_402.md5 "md5sum -c Dro5_29NOV2013_402.md5 > Dro5_29NOV2013_402.md5.checkresult"
qsub -b y -cwd -N md5_Dro6_1DEC2013_403.md5  -hold_jid wget_Dro6_1DEC2013_403.tgz,wget_Dro6_1DEC2013_403.md5  "md5sum -c Dro6_1DEC2013_403.md5  > Dro6_1DEC2013_403.md5.checkresult"

#extract PacBio runs
echo "extracting PacBio runs"
qsub -b y -cwd -N tar_Dro1_24NOV2013_398.tgz -hold_jid wget_Dro1_24NOV2013_398.tgz "tar -xvzf Dro1_24NOV2013_398.tgz"
qsub -b y -cwd -N tar_Dro2_25NOV2013_399.tgz -hold_jid wget_Dro2_25NOV2013_399.tgz "tar -xvzf Dro2_25NOV2013_399.tgz"
qsub -b y -cwd -N tar_Dro3_26NOV2013_400.tgz -hold_jid wget_Dro3_26NOV2013_400.tgz "tar -xvzf Dro3_26NOV2013_400.tgz"
qsub -b y -cwd -N tar_Dro4_28NOV2013_401.tgz -hold_jid wget_Dro4_28NOV2013_401.tgz "tar -xvzf Dro4_28NOV2013_401.tgz"
qsub -b y -cwd -N tar_Dro5_29NOV2013_402.tgz -hold_jid wget_Dro5_29NOV2013_402.tgz "tar -xvzf Dro5_29NOV2013_402.tgz"
qsub -b y -cwd -N tar_Dro6_1DEC2013_403.tgz  -hold_jid wget_Dro6_1DEC2013_403.tgz  "tar -xvzf Dro6_1DEC2013_403.tgz"

#delete .tar archives and md5 files
qsub -V -b y -cwd -N rm_tar_md -hold_jid tar_Dro1_24NOV2013_398.tgz,tar_Dro2_25NOV2013_399.tgz,tar_Dro3_26NOV2013_400.tgz,tar_Dro4_28NOV2013_401.tgz,tar_Dro5_29NOV2013_402.tgz,tar_Dro6_1DEC2013_403.tgz "rm Dro*.tgz Dro*.md5"
getDrosPacBioData.sh (view file on GitHub)
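Once the md5sum jobs finish, the .checkresult files can be scanned in one pass: grep -L lists any result file that does not contain an OK line, so no output means every archive downloaded intact. A small sketch using mock result files:

```shell
# list any md5 check result that did not report OK
# (mock .checkresult files stand in for the real qsub output)
cd "$(mktemp -d)"
echo 'Dro1_24NOV2013_398.tgz: OK' > Dro1_24NOV2013_398.md5.checkresult
echo 'Dro2_25NOV2013_399.tgz: FAILED' > Dro2_25NOV2013_399.md5.checkresult
grep -L ': OK' *.checkresult
# → Dro2_25NOV2013_399.md5.checkresult
```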

(iii) Once this download is complete, we’ll map data from all of the cells to the Release 5 reference and merge the output into a single BAM file of mapped reads, retaining unmapped reads in a separate (.unaligned) file:

#!/bin/bash

#map .bax.h5 files from individual PacBio cells to Release 5 using blasr
echo "mapping .bax.h5 files to Release 5"
samples=()
for input in `ls Dro*/*/*/*bax.h5` 
do
 sample=`basename $input .bax.h5`
 output1="${sample}.sam"
 output2="${sample}.unaligned"
 qsub -l cores=12 -V -b y -cwd -N blasr_${sample} "install/smrtanalysis-2.1.1.128549/analysis/bin/blasr $input dm3.fa -bestn 1 -nproc 12 -minPctIdentity 80 -sam -out $output1 -unaligned $output2"
 qsub -V -b y -cwd -N sort_${sample} -hold_jid blasr_${sample} "install/smrtanalysis-2.1.1.128549/analysis/bin/samtools view -bS $output1 | install/smrtanalysis-2.1.1.128549/analysis/bin/samtools sort - $sample"
 samples+=(sort_${sample})
done

#merge bam files from individual cells and index merged bam file
echo "merging and indexing bam files"
holdlist=`printf -- '%s,' "${samples[@]}"`
input="m131*_p0*.bam"
output="dm3PacBio.bam"
qsub -V -b y -cwd -N merge_pacbio -hold_jid $holdlist "install/smrtanalysis-2.1.1.128549/analysis/bin/samtools merge $output $input"
qsub -V -b y -cwd -N index_pacbio -hold_jid merge_pacbio "install/smrtanalysis-2.1.1.128549/analysis/bin/samtools index $output"

#merge unmapped reads and remove .sam, .bam and .unaligned files from individual cells
echo "merging unmapped reads and cleaning up"
qsub -V -b y -cwd -N merge_unmapped -hold_jid index_pacbio "cat m131*unaligned > dm3PacBio.unaligned"
qsub -V -b y -cwd -N rm_cell_files -hold_jid merge_unmapped "rm m131*unaligned m131*bam m131*sam"


mapDrosPacBioData.sh (view file on GitHub)

(iv) Last, let’s generate the coverage plot shown above using the genome coverage utility in Aaron Quinlan’s BedTools package, R and imagemagick:

wget http://bergmanlab.genetics.uga.edu/data/tracks/dm3/dm3PacBio.bam
wget https://raw.github.com/bergmanlab/blogscripts/master/dm3PacBioCoverage.R
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "select chrom, size from dm3.chromInfo" > dm3.genome
bedtools genomecov -ibam dm3PacBio.bam -g dm3.genome > dm3.genome.coverage
R --no-save < dm3PacBioCoverage.R
convert -density 300 -quality 100 dm3PacBioCoverage.pdf dm3PacBioCoverage.jpg
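The dm3.genome.coverage histogram can also be summarized numerically. bedtools genomecov emits one line per chromosome/depth pair (chrom, depth, bases at that depth, chromosome size, fraction), so mean depth per chromosome is a weighted average; a sketch on a toy two-line histogram:

```shell
# mean mapped depth per chromosome from a bedtools genomecov histogram:
# sum(depth * bases_at_depth) / chromosome_size (toy input for illustration)
awk '{sum[$1] += $2 * $3; size[$1] = $4}
     END {for (c in sum) printf "%s\t%.1f\n", c, sum[c] / size[c]}' <<EOF
chr2L	90	100	200	0.5
chr2L	100	100	200	0.5
EOF
# → chr2L	95.0
```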

Credits: Many thanks to Edwin, Kristi, Jane and others at PacBio for providing this gift to the genomics community, to Sue, Bill and Charles at BDGP for collecting the flies and isolating the DNA used in this project, and to Sue and Roger for their contribution to the design and analysis of the experiment. Thanks also to Jane, Jason Chin, Edwin, Roger, Sue and Danny Miller for comments on the draft of this blog post.

Error-Corrected PacBio Sequences for the D. melanogaster Reference Strain

Posted 06 Nov 2013 — by caseybergman
Category drosophila, genome bioinformatics, high throughput sequencing

Using the PacBio and Illumina whole genome shotgun sequences we recently released for the D. melanogaster reference strain, Sergey Koren and Adam Phillippy at the University of Maryland have run their pacBioToCA method to generate a dataset of error-corrected PacBio reads, which they have kindly made available here for re-use without restriction. This pilot dataset is not at high enough coverage for a whole genome assembly, so no assembly was attempted. Nevertheless, both the raw and error-corrected datasets should be of use to better understand the nature of PacBio data and the pacBioToCA pipeline as applied to Drosophila genomes.

The 2057_PacBio_corrected.tgz archive contains the following files:

  • pacbio.blasr.spec – the specification file used for the pacBioToCA run.
  • corrected.trim.fastq.bz2 – the error-corrected PacBio reads.
  • corrected.trim.lens – a file containing columns with corrected read id, read length, running total BP, running mean, and running coverage assuming 140Mbp genome size.
  • corrected.trim.names – a look-up table to map read IDs in the pacBioToCA output to the original PacBio input IDs.

Sergey also sent along a quick summary of the correction:

The correction used the latest CA from the repository (as of 10/15/13). The max number of mappings each Illumina sequence could have was set to 20, the repeat separation was set to 10. The genome size was set to 180Mbp. BLASR was used to align the Illumina sequences to PacBio sequences for correction. The input for correction was 673 Mbp of PacBio data (max: 8,231, mean: 1,587) which corresponds to approximately 4.8X of a 140Mbp genome. The correction produced 469Mbp (max: 6,883, mean: 1,410) or 3.35X. The throughput was approximately 69%. To estimate the accuracy of the data, it was mapped back to the D. melanogaster reference. The sequences average 99.85% identity. Approximately 10% of the sequences mapped in more than a single piece to the reference. Some of these mappings were due to short indel differences between the reference and the reads or N’s in the reference. Some sequences mapped across large distances in the reference. I did not confirm if these split mappings were supported by the uncorrected reads or not.

We thank both Sergey and Adam for taking the time to run their pacBioToCA pipeline and making these data available to the community, and hope these data are of use to others in their research.

Hosting Custom Tracks for the UCSC Genome Browser on Dropbox

Posted 17 Sep 2013 — by caseybergman
Category filesharing, genome bioinformatics, UCSC genome browser

One of the most powerful features of the UCSC Genome Browser is the ability to analyze your own genome annotation data as Custom Tracks at the main site. Custom Track files can be uploaded via several mechanisms: by pasting into a text box, loading a file through your web browser, or placing a file on a web-accessible server and pasting in a URL to the remote file. Loading custom tracks via remote URLs is the preferred method for large data files, and is mandatory for file types that require an associated “index” file such as the BAM file format commonly used in next-generation sequence analysis.

When developing a new pipeline, debugging a script or doing preliminary data analysis, you may frequently need to submit slightly different versions of your custom tracks to the UCSC browser. When using the remote URL function to load your data, this may require moving your files to a web accessible directory many times during the course of a day. This is not a big problem when working on a machine that is running a web server. But when working on a machine that is not configured as a web server (e.g. most laptops), this often requires copying your files to another machine, which is a tiresome step in visualizing or analyzing your data.

During a recent visit to the lab, Danny Miller suggested using Dropbox to host custom tracks to solve this problem. Dropbox provides a very easy way to generate files on a local machine that sync automatically to a remote web-accessible server with effectively no effort by the user. Using Dropbox to host your UCSC custom tracks turns out to be a potentially very effective solution, and it looks like several others have stumbled on the same idea. In testing this solution, we have found a few gotchas that one should be aware of to make this protocol work as smoothly as it should.

First, in July 2012 Dropbox changed their mechanism of making files web-accessible and no longer provide a “Public Folder” for new accounts. This was done ostensibly to improve security, since files in a Public Folder can be indexed by Google if they are linked to. Instead, now any file in your Dropbox folder can be made web accessible using the “Share Link” function. This mechanism for providing URLs to the UCSC Genome Browser is non-optimal for two reasons.

The first problem with the Share Link function is that the URL automatically generated by Dropbox cannot be read by the UCSC Genome Browser. For example, the link generated to the file “test.bed” in my Dropbox folder is “https://www.dropbox.com/s/7sjfbknsqhq6xfw/test.bed”, which gives an “Unrecognized format line 1” error when pasted into the UCSC Browser. If you just want to load a single custom track using Dropbox, this can easily be fixed by replacing “www.dropbox” in the URL generated by Dropbox with “dl.dropboxusercontent”. In this example, the corrected path to the file would be “https://dl.dropboxusercontent.com/s/7sjfbknsqhq6xfw/test.bed”, which can be loaded by the UCSC Genome Browser automatically.
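This host substitution is easy to script if you generate many share links; a small sed sketch (the hash in the example URL is, of course, account-specific):

```shell
# rewrite a Dropbox "Share Link" URL into the direct-download form
# that the UCSC Genome Browser can fetch
url='https://www.dropbox.com/s/7sjfbknsqhq6xfw/test.bed'
echo "$url" | sed 's#//www\.dropbox\.com/#//dl.dropboxusercontent.com/#'
# → https://dl.dropboxusercontent.com/s/7sjfbknsqhq6xfw/test.bed
```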

The second problem with using the “Share Link” function for posting custom tracks to UCSC is that the URL generated for each file is unique. This is a problem for two reasons: (i) you need to share and modify links for each file you upload, and (ii) neither you nor the UCSC Genome Browser can anticipate the path to multiple custom track files. This is problematic for custom track file formats that require an associated index file, which the UCSC Genome Browser assumes to be in the same directory as your custom track, with the appropriate extension. Since Dropbox makes a unique path to the index file, even if shared, there is no way for the Genome Browser to know where it is. Both of these issues can be solved by using the Public Folder function of your Dropbox account rather than the Share Links function.

The good news is that Dropbox does still make the Public Folder function available for older accounts, and for newer accounts you can activate a Public Folder using this “secret” link. By placing your custom tracks in your Public Folder, you have a stable base URL for providing files to the UCSC Genome Browser that does not require editing the URL, does not require explicitly sharing files, and can be anticipated by both you and the browser. Following on from the example above, the link to the “test.bed” file inside a “custom_tracks” directory in my Dropbox Public Folder would be “https://dl.dropboxusercontent.com/u/dropboxuserid/custom_tracks/test.bed” (your dropboxuserid will be a long integer). Thus, if you are using Dropbox to host many custom tracks or files that require an index file, the Public Folder option is the way to go.

There are a couple of caveats to using Dropbox to host custom tracks. The first is that you are limited by the size of your Dropbox allocation, which you can increase by paying for it or by inviting new users to Dropbox. UPDATE: According to Dave Tang (see comment below), if you use Dropbox to host large BigWig files you may get blocked by Dropbox. The second is that any link you share, or anything in your Public Folder, is public, so any stable links to your files from webpages may ultimately be indexed by Google. Since there is no option to password protect shared files in the Public Folder of your Dropbox account, users looking for free options to sync large, password-protected custom track files with indexes to remote web-accessible servers should look into BitBucket, which has no filesize limit (unlike GitHub).

Credits: Danny Miller (Stowers Institute) had the original inspiration for this solution, and Michael Nelson (U. of Manchester) helped identify pros/cons of the various Dropbox hosting options.

A Reference “Hologenome” for D. melanogaster

Reference genomes are crucial for many applications in next-generation sequencing (NGS) analyses, especially in the whole genome resequencing studies that form the basis of population genomics. Typically, one uses an off-the-shelf reference genome assembly for your organism of interest (e.g. human, Drosophila, Arabidopsis) obtained from the NCBI, UCSC or Ensembl genome databases. However, single-organism reference genomes neglect the reality that most organisms live in symbiosis with a large number of microbial species. When performing a whole-genome shotgun sequencing experiment on a macro-organism like D. melanogaster, it is inevitable that some proportion of its symbionts will also be sequenced, especially endosymbionts that live intracellularly, like Wolbachia. To reduce mismapping and generate materials for population genomics of symbionts and their hosts, it is therefore appropriate to map to the genome of the species of interest as well as to the genomes of symbiotic microbial species, where available. However, reference genomes are currently stored on a per-species basis, and the “hologenome” for any organism does not yet exist in a readily accessible form.

In my limited experience, attempting to construct such a “hologenome” can be a very tedious and potentially error-prone process, since reference genomes for model metazoan species and microbes are not always available in the same location or format. In gearing up for the second iteration of our work on the population genomics of microbial symbionts of D. melanogaster, I have decided to script the construction of a D. melanogaster reference hologenome from all microbial symbionts associated with D. melanogaster whose genomes are publicly available, which I thought may be of use to others. The code for this process is as follows and should work on any UNIX-based machine.
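The heart of the script is simply concatenating each symbiont genome onto the host genome, optionally prefixing the symbiont fasta headers so that mapped reads can later be assigned to a species. A minimal sketch with hypothetical, tiny input files:

```shell
# build a toy "hologenome": host genome plus a symbiont genome, with the
# symbiont's fasta headers tagged by species so mapped reads are traceable
cd "$(mktemp -d)"
printf '>chr2L\nACGT\n' > dm3.fa           # host genome (hypothetical tiny example)
printf '>contig1\nTTAA\n' > wolbachia.fa   # symbiont genome (hypothetical)
cp dm3.fa dm3_hologenome.fa
sed 's/^>/>wolbachia|/' wolbachia.fa >> dm3_hologenome.fa
grep '^>' dm3_hologenome.fa
# → >chr2L
# → >wolbachia|contig1
```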

makeDmelHologenome.sh (view file on GitHub)

Finally, some tools (e.g. SAMtools faidx) require all reference genome sequences to have the same number of characters per line, but different genome databases use different line lengths, so the file above will have heterogeneous character counts for different species. To fix this, I use fasta_formatter from the FASTX toolkit to convert dm3_hologenome.fa into a file with fixed line lengths. To download and run this script, with the conversion to fixed line lengths, execute the following:

$ wget https://raw.githubusercontent.com/bergmanlab/blogscripts/master/makeDmelHologenome.sh
$ sh makeDmelHologenome.sh
$ fasta_formatter -i dm3_hologenome.fa -o dm3_hologenome_v1.fa -w 50
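If the FASTX toolkit is not installed, the same fixed-width reflow can be approximated in awk (a sketch, shown here with width 5 on a toy record rather than 50):

```shell
# reflow fasta sequence lines to a fixed width (here 5) while passing
# headers through untouched; fasta_formatter -w does the same job
awk -v w=5 '
  /^>/ { flushseq(); print; next }
  { seq = seq $0 }
  END { flushseq() }
  function flushseq() {
    while (length(seq) > w) { print substr(seq, 1, w); seq = substr(seq, w + 1) }
    if (seq != "") print seq
    seq = ""
  }' <<EOF
>seq1
ABCDEFG
HI
EOF
# → >seq1
# → ABCDE
# → FGHI
```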

Credits: Thanks go to Douda Bensasson for some SED tips during the construction of this script.

PacBio Whole Genome Shotgun Sequences for the D. melanogaster Reference Strain

Posted 31 Jul 2013 — by caseybergman
Category drosophila, genome bioinformatics, high throughput sequencing

As part of a collaboration with Danny Miller and Scott Hawley from the Stowers Institute, we have generated whole genome shotgun sequences using PacBio RS technology for the Drosophila melanogaster y; cn, bw, sp strain (Bloomington 2057), the same strain that was also used to assemble the D. melanogaster reference genome sequence. We’ve been meaning to release these data to the community since we received them in April, but have been waylaid by teaching commitments and a spate of recent server problems. Prompted by Danny’s visit to the Bergman Lab over the last two weeks, and by the generous release from Dmitri Petrov’s lab of a dataset of Illumina long reads using the Moleculo technology for the same strain of D. melanogaster, we’ve finally gotten around to bundling these D. melanogaster PacBio sequences for general release. We’re hoping that the availability of both PacBio and Moleculo long-read data for the same strain, which has one of the highest quality reference genomes of any species, will allow the genomics community to directly investigate the pros/cons of each of these exciting new technologies.

These PacBio sequence data were generated by the University of Michigan DNA Sequencing Core facility on a PacBio RS using DNA from 10 adult males. Flies were starved for 4 hours to reduce microbial DNA contaminants before freezing at -80°C, and DNA was prepped using the Qiagen DNeasy Blood & Tissue Kit (catalog number 69504). Six SMRT cells were used for sequencing 5 kb fragment libraries on two Long Read mode runs: 1 cell on the first run (Run53) and 5 cells on the second run (Run55). Excluding the circular consensus (CCS) reads, and combining data from the S1 and S2 files for each of the six cells, we obtained 1,357,183,439 bp of raw DNA sequence from this experiment, or roughly 7.5x coverage of the 180 Mb male D. melanogaster genome.

A ~63 Gb tar.gz of the entire PacBio long read dataset including .fasta, .fastq, and .h5 files can be found here. To help newcomers to PacBio data (like us) get over the hurdle of installing the PacBio long-read aligner blasr, we are also distributing CentOS 6/Scientific Linux 64-bit and source RPMs for blasr, made by Peter Briggs in the University of Manchester Bioinformatics Core Facility.

We have also generated 100 bp paired-end Illumina data from the same stock, to aid with error correction of the PacBio long reads. These data were generated by the Stowers Institute Molecular Biology Core facility. As above, genomic DNA was prepared from 10 starved, adult males using the Qiagen DNeasy Blood & Tissue Kit. 1 µg of DNA was fragmented to 250 base pair (bp) fragments using a Covaris S220 sonicator (Covaris Inc.), with the treatment time adjusted to 85 seconds. Following the manufacturer's directions, short fragment libraries were made using the KAPA Library Preparation Kit (KAPA Biosystems, Cat. No. KK8201) and Bioo Scientific NEXTflex™ DNA Barcodes (Bioo Scientific, Cat. No. 514104). The resulting libraries were purified using the Agencourt AMPure XP system (Beckman Coulter, Cat. No. A63880), then quantified using a Bioanalyzer (Agilent Technologies) and a Qubit Fluorometer (Life Technologies). The library was pooled with several other strains, re-quantified and run in high-output mode on six 100 bp paired-end lanes on an Illumina HiSeq 2000 instrument, using HiSeq Control Software 1.5.15.1 and Real-Time Analysis (RTA) version 1.13.48.0. Secondary analysis was run with CASAVA-1.8.2 to demultiplex reads and generate FASTQ files.

The Illumina 100 bp paired-end dataset in .fastq format can be found at ENA under accession ERX645969. Additional Illumina data for the ISO1 reference strain have been generated previously by Chuck Langley's lab, either from mixed adult males and females (SRX040484) or haploid embryos (SRX040485, SRX040486, SRX040491), and could be used to supplement the PacBio error correction.
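For bulk retrieval, ENA serves fastq files over FTP under a directory layout derived from the run accession. The sketch below assumes the standard /vol1/fastq/&lt;first 6 characters&gt;/&lt;accession&gt;/ convention for 9-character run accessions; ERR000001 is a hypothetical accession used only for illustration, and the actual run accessions for this dataset should be looked up under ERX645969 in the ENA browser:

```shell
# Build the ENA FTP fastq directory for a 9-character run accession
# (layout assumed from ENA's documented convention; verify in the ENA browser)
ena_fastq_dir () {
  acc=$1
  echo "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${acc:0:6}/${acc}/"
}
ena_fastq_dir ERR000001
# → ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000001/
```

The resulting directory can then be fetched with wget or an FTP client.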

As with previous unpublished data we have released from the Bergman Lab, we have chosen to release these genomic data under a Creative Commons CC-BY license, which requires only that you credit the originators of the work as specified below.  However, we hope that users of these data respect the established model of genomic data release under the Ft. Lauderdale agreement that is traditionally honored for major sequencing centers.  Until the paper describing these genomic data is published, please cite:

  • Miller, D.E., C.B. Smith, R.S. Hawley and C.M. Bergman (2013) PacBio Whole Genome Shotgun Sequences for the D. melanogaster Reference Strain. http://bergmanlab.genetics.uga.edu/?p=1971

Compendium of Drosophila Microbial Symbiont Genome Assemblies

Posted 25 Jan 2013 — by caseybergman
Category drosophila, genome bioinformatics, microbes

Following on from a recent post where we compiled genome assemblies for species in the Drosophila genus, here we bring together the growing list of genome assemblies for microbial symbionts of Drosophila species. Please add information in the comments below about any Drosophila symbiont genome assemblies that are missing from this table.

Species  |  Strain  |  Host  |  Assembly Producer  |  NCBI
Wolbachia pipientis wMel D. melanogaster TIGR AE017196
Wolbachia pipientis wMelPop D. melanogaster O’Neill Lab AQQE00000000
Wolbachia pipientis wAu D. simulans Sinkins Lab LK055284
Wolbachia pipientis wSim D. simulans Salzberg Lab AAGC00000000
Wolbachia pipientis wRi D. simulans Andersson Lab NC_012416
Wolbachia pipientis wHa D. simulans Andersson Lab NC_021089
Wolbachia pipientis wNo D. simulans Andersson Lab NC_021084
Wolbachia pipientis wSuzi D. suzukii Anfora Lab CAOU02000001
Wolbachia pipientis wAna D. ananassae Salzberg Lab AAGB00000000
Wolbachia pipientis n.a. D. willistoni Salzberg Lab AAQP00000000
Wolbachia pipientis wRec D. recens Bordenstein Lab JQAM00000000
Pseudomonas entomophila L48 D. melanogaster Genoscope NC_008027
Commensalibacter intestini A911 D. melanogaster Lee Lab AGFR00000000
Acetobacter pomorum DM001 D. melanogaster Lee Lab AEUP00000000
Gluconobacter morbifer G707 D. melanogaster Lee Lab AGQV00000000
Providencia sneebia DSM 19967 D. melanogaster Lazzaro Lab AKKN00000000
Providencia rettgeri Dmel1 D. melanogaster Lazzaro Lab AJSB00000000
Providencia alcalifaciens Dmel2 D. melanogaster Lazzaro Lab AKKM00000000
Providencia burhodogranariea DSM 19968 D. melanogaster Lazzaro Lab AKKL00000000
Lactobacillus brevis EW D. melanogaster Lee Lab AUTD00000000
Lactobacillus plantarum WJL D. melanogaster Lee Lab AUTE00000000
Lactobacillus plantarum DmelCS_001 D. melanogaster Douglas Lab JOJT00000000
Lactobacillus fructivorans DmelCS_002 D. melanogaster Douglas Lab JOJZ00000000
Lactobacillus brevis DmelCS_003 D. melanogaster Douglas Lab JOKA00000000
Acetobacter pomorum DmelCS_004 D. melanogaster Douglas Lab JOKL00000000
Acetobacter malorum DmelCS_005 D. melanogaster Douglas Lab JOJU00000000
Acetobacter tropicalis DmelCS_006 D. melanogaster Douglas Lab JOKM00000000
Lactococcus lactis Bpl1 D. melanogaster Douglas Lab JRFX00000000
Enterococcus faecalis Fly1 D. melanogaster Broad Institute ACAR00000000

Compendium of Drosophila Genome Assemblies

Posted 28 Apr 2012 — by caseybergman
Category drosophila, genome bioinformatics, high throughput sequencing

[UPDATE: The UCSD Drosophila Species Stock Center now maintains a list of information about Drosophila genome sequences: https://stockcenter.ucsd.edu/info/genomeresources. The compendium here will no longer be updated. To ensure a single, up-to-date source of information, please send suggested links to the DSSC.]

Since the publication of the Drosophila melanogaster genome, the D. pseudoobscura genome, and the remaining 10 species from the Drosophila 12 Genomes Project, a number of other Drosophila genomes have been sequenced and assembled using whole genome shotgun approaches. Periodically, I find myself needing to find versions of these assemblies from the data producer, NCBI, Flybase, or UCSC, and so to help streamline this process I’ve collected these links together into a single compendium here, along with links to the original stocks used for sequencing. Hopefully this will be of use to others as well.

Species  |  Data Producer  |  NCBI  |  Flybase  |  UCSC Genome Browser  |  Stock Center ID
D. melanogaster Berkeley Drosophila Genome Project AABU00000000 dmel dm3 1522
D. simulans Washington University Genome Sequencing Center AAGH00000000 dsim droSim1 n.a.
D. simulans Andolfatto Lab n.a. n.a. n.a. 1380
D. mauritiana Schlotterer Lab n.a. n.a. n.a. E-18912
D. sechellia Broad Institute AAKO00000000 dsec droSec1 1373
D. yakuba Washington University Genome Sequencing Center AAEU00000000 dyak droYak2 1385
D. santomea Andolfatto Lab n.a. n.a. n.a. 2367
D. erecta Agencourt Biosciences AAPQ00000000 dere droEre2 1374
D. ficusphila Baylor College of Medicine Genome Center AFFG00000000 n.a. n.a. 2332
D. eugracilis Baylor College of Medicine Genome Center AFPQ00000000 n.a. n.a. 2350
D. suzukii University of Edinburgh GenePool CAKG00000000 n.a. n.a. n.a.
D. biarmipes Baylor College of Medicine Genome Center AFFD00000000 n.a. n.a. 2312
D. takahashii Baylor College of Medicine Genome Center AFFI00000000 n.a. n.a. 2325
D. elegans Baylor College of Medicine Genome Center AFFF00000000 n.a. n.a. 2334
D. rhopaloa Baylor College of Medicine Genome Center AFPP00000000 n.a. n.a. 2327
D. kikkawai Baylor College of Medicine Genome Center AFFH00000000 n.a. n.a. 2330
D. bipectinata Baylor College of Medicine Genome Center AFFE00000000 n.a. n.a. 2329
D. ananassae Agencourt Biosciences AAPP00000000 dana droAna3 1368
D. pseudoobscura Baylor College of Medicine Genome Center AADE00000000 dpse dp4 1277
D. persimilis Broad Institute AAIZ00000000 dper droPer1 1369
D. miranda Bachtrog Lab AJMI00000000 n.a. n.a. n.a.
D. willistoni J. Craig Venter Institute AAQB00000000 dwil droWil1 1375
D. virilis Agencourt Biosciences AANI00000000 dvir droVir3 1371
D. americana Vieira Lab n.a. n.a. n.a. n.a.
D. mojavensis Agencourt Biosciences AAPU00000000 dmoj droMoj3 1370
D. grimshawi Agencourt Biosciences AAPT00000000 dgri droGri2 1225
D. albomicans Wang/Bachtrog Labs ACVV00000000 n.a. n.a. n.a.

Wolbachia and Mitochondrial Genomes from Drosophila melanogaster Resequencing Projects

As part of ongoing efforts to characterize complete genome sequences of microbial symbionts of Drosophila species, the Bergman Lab has been involved in mining complete genomes of the Wolbachia endosymbiont from whole-genome shotgun sequences of D. melanogaster. This work is inspired by Steve Salzberg and colleagues' pioneering 2005 paper showing that Wolbachia genomes can be extracted from the whole-genome shotgun sequence assemblies of Drosophila species. We have adapted this technique to utilize short-read next-generation sequencing data from population genomics projects in D. melanogaster, together with the reference Wolbachia genome published by Wu et al. (2004).

Currently we have identified infection status and extracted nearly-complete genomes from the two major resequencing efforts in D. melanogaster: the Drosophila Genetic Reference Panel (DGRP) and the Drosophila Population Genomics Project (DPGP). We employ a fairly standard BWA/SAMtools pipeline, with a few tricks that are essential for getting good quality assemblies and consensus sequences. The first iteration of this pipeline was used to predict infection status in the DGRP strains. These results were published in early 2012 in the main DGRP paper, and can be found in Supplemental Table 9 of that paper or, more usefully, in a comma-separated file at the DGRP website.
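For readers curious about the general approach, the classic samtools 0.1.x-era recipe for reference-based consensus calling is sketched below. This is a minimal sketch of that widely used workflow under the old samtools/bcftools interface, not our actual pipeline (which includes the additional tricks mentioned above), and the file names are placeholders:

```shell
# Minimal reference-based consensus sketch (samtools 0.1.x / bcftools era).
# NOT the exact Bergman Lab pipeline; file names are placeholders.
bwa index wMel_ref.fasta
bwa aln wMel_ref.fasta strain_reads.fastq > strain.sai
bwa samse wMel_ref.fasta strain.sai strain_reads.fastq \
  | samtools view -bS - \
  | samtools sort - strain.sorted
samtools index strain.sorted.bam
# Call variants from the pileup and convert to a consensus fastq
samtools mpileup -uf wMel_ref.fasta strain.sorted.bam \
  | bcftools view -cg - \
  | vcfutils.pl vcf2fq > strain_consensus.fastq
```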

We have now applied an updated version of our pipeline to both the DGRP and DPGP datasets and have used these "essentially complete" genome sequences to study the recent evolutionary history of Wolbachia in D. melanogaster. In the process of generating these data we received several inquiries about the status of this project, so, in the spirit of Open Science that makes genomics such a productive field, we have released the consensus sequences and alignments of 179 Wolbachia and 290 mitochondrial genomes from the DGRP and DPGP prior to publication of our manuscript.

Since we are not the primary data producer for these assemblies, it is not appropriate to employ a data release policy based on the NHGRI guidelines. Instead, we have chosen to release these data under a Creative Commons Attribution 3.0 Unported License. If you use these assemblies or alignments in your projects, please cite the main DGRP and DPGP papers for the raw data and cite the following reference for the mtDNA or Wolbachia assemblies:

Richardson, M.F., L.A. Weinert, J.J. Welch, R.S. Linheiro, M.M. Magwire, F.M. Jiggins & C.M. Bergman (2012) Population genomics of the Wolbachia endosymbiont in Drosophila melanogaster. PLOS Genetics 8:e1003129.

A tar.gz archive containing a reference-based multiple sequence alignment of release 1.0 of the complete D. melanogaster mtDNA and Wolbachia assemblies can be obtained here. If you have questions about the methods used to produce these assemblies, any associated metadata, or the content of our manuscript, please contact Casey Bergman for details.


Customizing a Local UCSC Genome Browser Installation III: Loading Custom Data

Posted 29 Mar 2012 — by timburgis
Category database, drosophila, genome bioinformatics, linux, UCSC genome browser

In the second post of this series we examined the grp and trackDb tables within an organism database in the UCSC genome browser schema, and how they are used to build a UCSC genome browser display by specifying: (i) what sort of data a track contains, (ii) how it should be displayed, and (iii) how tracks should be organized within the display. In this last post of three, we will look at how to load and subsequently display custom data into a UCSC genome browser instance. As an example, we’ll be using a set of data from the modENCODE project formatted for loading into the UCSC browser. For this you’ll need a working browser install and the dm3 organism database (see the first post in this series for details).

An important feature of the genome browser schema that permits easy customization is that you can have multiple grp and trackDb tables within an organism database.  This allows separation of tracks for different purposes, such as trackDb_privateData, trackDb_chipSeq, trackDb_UserName etc. The same goes for grp tables. Thus, you can pull in data from the main UCSC site, leave it untouched, and simply add new data by creating new trackDb and grp tables.

This post outlines the five steps for adding custom data into a local UCSC genome browser:

1) Getting the data into the correct format
2) Building the required loader code
3) Loading the data
4) Configuring the track’s display options
5) Creating the new trackDb and grp tables and configuring the browser to use them.

1) The UCSC browser supports a number of different data formats, which are described at the UCSC format FAQ. Defining where your data comes from, and what sort of data it is (e.g. an alignment, microarray expression data etc.) will go a long way towards determining what sort of file you need to load. In many cases your data will already be in one of the supported formats so the decision will have been made for you. If your data isn’t in one of the supported formats, you’ll need to decide which one is the most appropriate and then reformat the data using a suitable tool, e.g. a perl or bash script. The example data we will be loading is taken from this paper. The supplementary data contains an Excel spreadsheet that we need to reformat into a number of BED files and then load into the UCSC database.
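As an illustration of this kind of reformatting, the sketch below converts a tab-separated export into BED with awk. The column layout (chrom, start, end, sample, score) is hypothetical and chosen only for the example, not the layout of the actual supplementary spreadsheet:

```shell
# Create a small tab-separated example file (hypothetical column layout)
printf 'chrom\tstart\tend\tsample\tscore\n' > example.tsv
printf 'chr2L\t5982\t6949\tGAF\t5.54\n' >> example.tsv
printf 'chr2L\t65947\t68135\tGAF\t10.54\n' >> example.tsv

# Skip the header line and emit 5-column BED records
awk 'BEGIN{OFS="\t"} NR>1 {print $1, $2, $3, "ID="$4, $5}' example.tsv > example.bed
cat example.bed
```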

2) Each of the data formats that can be loaded into the browser system has its own loader program. These are found in the same source tree that we used to build the browser CGIs. The source code for the loaders is in the Kent source tree; the BED loader is in the directory kent/src/hg/makeDb/hgLoadBed. Assuming you have followed the instructions in the first post of this series, to compile the BED loader you simply cd to its directory and type make:

cd kent/src/hg/makeDb/hgLoadBed
make

The hgLoadBed executable will now be in ~/bin/$MACHTYPE. If you haven't already, it is a good idea to add this directory to your PATH by adding the following line to ~/.bashrc:

export PATH=$PATH:~/bin/$MACHTYPE

The other loader we’ll require for the example custom data included in this blog is hgBbiDbLink:

cd kent/src/hg/makeDb/hgBbiDbLink
make

3) To load data, e.g. a BED file, the command is simply:

hgLoadBed databaseName tableName pathToBEDfile

The data-loaders supplied with the browser can be divided into two broad groups. The first, represented by hgLoadBed, actually loads the data into a table within the organism database. For example, one of the histone-modification datasets looks like this as a BED file:

chr2L   5982    6949    ID=Embryo_0_12h_GAF_ChIP_chip.GAF       5.54
chr2L   65947   68135   ID=Embryo_0_12h_GAF_ChIP_chip.GAF       10.54
chr2L   72137   73344   ID=Embryo_0_12h_GAF_ChIP_chip.GAF       6.55
chr2L   107044  109963  ID=Embryo_0_12h_GAF_ChIP_chip.GAF       8.93
chr2L   127917  130678  ID=Embryo_0_12h_GAF_ChIP_chip.GAF       9.17

When this BED file is loaded using hgLoadBed it produces a table that has the following structure:

DESC MdEncHistonesGAFBG3DccId2651;
+------------+----------------------+------+-----+---------+-------+
| Field      | Type                 | Null | Key | Default | Extra |
+------------+----------------------+------+-----+---------+-------+
| bin        | smallint(5) unsigned | NO   |     | NULL    |       |
| chrom      | varchar(255)         | NO   | MUL | NULL    |       |
| chromStart | int(10) unsigned     | NO   |     | NULL    |       |
| chromEnd   | int(10) unsigned     | NO   |     | NULL    |       |
+------------+----------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

And contains (results set truncated):

select * from MdEncHistonesGAFBG3DccId2651;
| 754 | chrX     |   22237479 | 22239805 |
| 754 | chrX     |   22255450 | 22257285 |
| 754 | chrX     |   22257996 | 22259538 |
| 754 | chrX     |   22259653 | 22261174 |
| 754 | chrX     |   22264098 | 22265194 |
| 585 | chrYHet  |      84149 |    85164 |
| 585 | chrYHet  |     104864 |   105879 |
| 586 | chrYHet  |     150067 |   151212 |
+-----+----------+------------+----------+
5104 rows in set (0.06 sec)

The second type of loader doesn’t insert the data into a database table directly. Instead data in formats such as bigWig and bigBed are deposited in a particular location on the filesystem and a table with a single row simply points to them. For example:

desc mdEncReprocCbpAdultMaleTreat;
+----------+--------------+------+-----+---------+-------+
| Field    | Type         | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+-------+
| fileName | varchar(255) | NO   |     | NULL    |       |
+----------+--------------+------+-----+---------+-------+
1 row in set (0.00 sec)

Which contains:

select * from mdEncReprocCbpAdultMaleTreat;
+----------------------------------------------+
| fileName                                     |
+----------------------------------------------+
| /gbdb/dm3/modencode/AdultMale_treat.bw |
+----------------------------------------------+
1 row in set (0.02 sec)

The loader hgBbiDbLink is responsible for producing this simple table that points to the location of the data on the filesystem. When the browser comes to display a track that is specified as bigWig in trackDb's type field, it knows that the table specified in trackDb's tableName field will point it at the bigWig file rather than containing the data itself. Suppose we have placed our bigWig file in the /gbdb directory:

cp AdultTreat.bw /gbdb/dm3/modencode/.

Then we must execute hgBbiDbLink as follows:

hgBbiDbLink databaseName desiredTableName pathToDataFile

Specifically:

hgBbiDbLink dm3 mdEncReprocCbpAdultMaleTreat /gbdb/dm3/modencode/AdultTreat.bw

4) Once the data is loaded into the database we need to tell the browser how it should be displayed; this is the role of the trackDb table. We create a new trackDb table for our data using the command hgTrackDb, which is found in the same branch of the source tree as the individual loaders themselves. The way in which tracks are grouped and displayed by the browser is specified in a .ra file passed to hgTrackDb, which defines the contents of the different fields in the trackDb table, as well as the parent-child relationships for composite tracks. The example data provided in this blog post contains a trackDb.ra file that looks like this:

track dnaseLi
shortLabel DNAse Li 2011
longLabel DNAse Li Eisen Biggin Genome Biol 2011
group blog
priority 100
visibility hide
color 200,20,20
altcolor 200,20,20
compositeTrack on
type bed 3

    track dnaseLiStage5
    shortLabel Stage 5
    longLabel Li Dnase Stage 5
    parent dnaseLi
    priority 1

    track dnaseLiStage9
    shortLabel Stage 9
    longLabel Li Dnase Stage 9
    parent dnaseLi
    priority 2

    track dnaseLiStage10
    shortLabel Stage 10
    longLabel Li Dnase Stage 10
    parent dnaseLi
    priority 3

    track dnaseLiStage11
    shortLabel Stage 11
    longLabel Li Dnase Stage 11
    parent dnaseLi
    priority 4

    track dnaseLiStage14
    shortLabel Stage 14
    longLabel Li Dnase Stage 14
    parent dnaseLi
    priority 5

The example data (e.g. /root/ucsc_example_data) comes from 5 different stages of embryonic development, so we have 5 BED files. The trackDb.ra file groups these as a composite track: the top-level track (dnaseLi) is the parent of the 5 individual tracks. To load these definitions into the database we execute hgTrackDb as follows:

~/bin/i386/hgTrackDb drosophila dm3 trackDb_NEW kent/src/hg/lib/trackDb.sql /root/ucsc_example_data

The complete set of commands to download the example data and load it is therefore:

cd
wget http://genome.smith.man.ac.uk/downloads/blogData.tar.gz
tar xzvf blogData.tar.gz
cd ucsc_example_data
~/bin/i386/hgLoadBed dm3 dnaseLiStage5 stage5.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage9 stage9.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage10 stage10.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage11 stage11.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage14 stage14.bed
~/bin/i386/hgTrackDb drosophila dm3 trackDb_NEW kent/src/hg/lib/trackDb.sql /root/ucsc_example_data

5) Tell the browser to use our new trackDb table by editing the hg.conf configuration file in /var/www/cgi-bin and adding the new trackDb table name to the db.trackDb line. The db.trackDb line takes a comma-separated list of trackDb tables that the browser will display. Assuming that our new trackDb table is called trackDb_NEW, we must edit hg.conf from this:

db.trackDb=trackDb

to this:

db.trackDb=trackDb,trackDb_NEW
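If you prefer to script this change, a sed one-liner can append the new table name. The sketch below demonstrates on a local copy of the file; on a real install, point sed at /var/www/cgi-bin/hg.conf instead (this assumes GNU sed's in-place -i flag):

```shell
# Demonstrate the edit on a local copy of hg.conf
printf 'db.trackDb=trackDb\n' > hg.conf.example
sed -i 's/^db\.trackDb=.*/&,trackDb_NEW/' hg.conf.example
cat hg.conf.example
# → db.trackDb=trackDb,trackDb_NEW
```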

You’ll notice in the supplied trackDb.ra file that the tracks belong to a group called ‘blog’. This group doesn’t currently exist in the database. There are two ways we can go about rectifying this. The first is to add blog to the existing grp table like so:

INSERT INTO grp (name,label,priority) VALUES('blog','Example Data',11);

Or you can create an entirely new grp table. The browser handles multiple grp tables in the same fashion as multiple trackDb tables, i.e. they are specified as a comma-separated list in hg.conf. To create a new grp table, execute:

CREATE TABLE grp_NEW AS SELECT * FROM grp WHERE 1=2;
INSERT INTO grp_NEW (name,label,priority) VALUES('blog','Example Data',11);

And then edit hg.conf to add grp_NEW to the db.grp line:

db.grp=grp,grp_NEW

If you now refresh your web browser, the CGI scripts should create a new session with the new data tracks grouped and displayed as configured in the grp and trackDb tables.