Archive for the ‘population genomics’ Category

A Reference “Hologenome” for D. melanogaster

Reference genomes are crucial for many applications in next-generation sequencing (NGS) analyses, especially in whole genome resquencing studies that form the basis of population genomics. Typically, one uses an off-the-shelf reference genome assembly for your organism of interest (e.g. human, Drosophila, Arabidopsis) obtained from the NCBI, UCSC or Ensembl genome databases. However, single-organism reference genomes neglect the reality that most organisms live in symbiosis with a large number of microbial species. When performing a whole-genome shotgun sequencing experiment from a macro-organisms like D. melanogaster, it is inevitable that some proportion of its symbionts will also be sequenced, especially endosymbionts that live intracellularly like Wolbachia. To reduce mismapping and generate materials for population genomics of symbionts and their hosts, it is therefore approporiate to map to the genome of the species of interest as well as to the genomes of symbiotic microbial species, where available. However, reference genomes are currently stored on a per-species basis and the “hologenome” for any organism does not yet exist in a readily accessible form.

In my limited experience, attempting to construct such a “hologenome” can be very a tedious and potentially error-prone process since reference genomes for model metazoan species and microbes are not always available in the same location or format. In gearing up for the second iteration of our work on the population genomics of microbial symbionts of D. melanogaster, I have decided to script the construction of a D. melanogaster reference hologenome for all microbial symbionts associated with D. melanogaster whose genomes are publicly available, which I would thought may be of use to others. The code for this process is as follows and should work on any UNIX-based machine.

makeDmelHologenome.shview rawview file on GitHub

Finally, since some tools (e.g. SAMtools faidx) require all reference genome sequences to have the same number of characters per line, but since different genome databases use different numbers of characters per line the file above will have heterogeneous character counts for different species. To fix this, I use fasta_formatter from the FASTX toolkit to convert the dm3_hologenome.fa into a file with fixed character lengths. To download and run this script, with the conversion to fixed line lengths, execute the following:

$ wget
$ sh
$ fasta_formatter -i dm3_hologenome.fa -o dm3_hologenome_v1.fa -w 50

Credits: Thanks go to Douda Bensasson for some SED tips during the construction of this script.

Release of 20 European Drosophila melanogaster genomes

Posted 17 Jul 2012 — by caseybergman
Category drosophila, high throughput sequencing, population genomics

[Update: These genome sequences (and 30 others) have been deposited in ENA under project accession ERP009059. If you use these data in your work, please cite as Bergman & Haddrill (2015) Strain-specific and pooled genome sequences for populations of Drosophila melanogaster from three continents. F1000Research 2015, 4:31.]

Background: As part of an ongoing efforts to characterize genetic diversity in the nuclear and cytoplasmic genomes of D. melanogaster, the Haddrill and Bergman Labs have collaborated to sequence the complete genomes of 20 D. melanogaster isofemale strains collected by Penny Haddrill in Montpellier, France in August 2010. These 20 genomes represent a random sample of the full collection described by Haddrill and Vespoor (2011), which also describes microsatellite variation data for these strains.

Following on from the very generous early release of D. melanogaster genomes by major resequencing efforts in Drosophila, we have decided to follow suit and release these genomes prior to publication to maximise their utility by the wider research community and prevent unnecessary duplication of effort. One major aim for our sequencing of a reasonably high number of strains from this European population is to provide a complementary dataset to help interpret the larger samples of North American and (predominantly) African strains from the Drosophila Genetic Reference Panel and Drosophila Population Genomics Project, respectively. For more on the philosophy behind why we have made the decision to release these data early, please see this blog post on genomic data release by individual labs in the next-generation sequencing era.

Methods: Genomic DNA was prepared by Penny Haddrill for each isofemale line by pooling fifty females, snap freezing them in liquid nitrogen, extracting DNA using a standard phenol-chloroform extraction protocol with ethanol and ammonium acetate precipitation. 500 bp short-insert libraries were constructed and 91 bp paired-end reads were generated using an Illumina HiSeq 2000 to an estimated coverage of ~50x per strain by BGI-Hong Kong. Basic QC on reads was performed by BGI and mapping to the Wolbachia genome following the protocol in Richardson et al. (submitted) confirmed the same infection status for as determined by PCR in Haddrill and Vespoor (2011).

Conditions for use: The Haddrill and Bergman labs intend to use these data to study patterns of genetic diversity in the nuclear and cytoplasmic genomes, to estimate the ratio of diversity on the X chromosome relative to the autosomes, to detect signatures of both positive and negative selection in the nuclear and cytoplasmic genomes, and investigate the impact of variation in recombination rate around the genome.

We have decidede to release these genomic data under a Creative Commons CC-BY license, which requires only that you credit the originators of the work as specified below.  However, we hope that users of these data respect the established model of genomic data release under the Ft. Lauderdale agreement that is traditionally honored for major sequencing centers.  Until the paper describing these genomes is published, please cite these data as:

  • Haddrill, P. and C.M. Bergman (2012) 20 Drosophila melanogaster genomes from Montpellier, France.

[Update: These genome sequence have now been published. If you use these data in your work, please cite as Bergman & Haddrill (2015) Strain-specific and pooled genome sequences for populations of Drosophila melanogaster from three continents. F1000Research 2015, 4:31.]

The data: Gzipped Illumina fastq files for forward and reverse paired reads can be downloaded at the following locations. A script to dowload all files in serial can be found below the table.

[Update: These genome sequences (and 30 others) have been deposited in ENA under project accession ERP009059. If you use these data in your work, please cite accession numbers ERR705945-ERR705964 in your work.]

Wolbachia and Mitochondrial Genomes from Drosophila melanogaster Resequencing Projects

As part of ongoing efforts to characterize complete genome sequences of microbial symbioints of Drosophila species, the Bergman Lab has been involved in mining complete genomes of the Wolbachia endosymbiont from whole-genome shotgun sequences of D. melanogaster. This work is inspired by Steve Salzberg and colleagues’ pioneering paper in 2005 showing that Wolbachia genomes can be extracted from the whole-genome shotgun sequence assemblies of Drosophila species. We have adapted this technique to utilize short-read next generation sequencing data from population genomics projects in D. melanogaster, together with the reference Wolbachia genome published by Wu et al. (2005).

Currently we have identified infection status and extracted nearly-complete genomes from the two major resequencing efforts in D. melanogaster: the Drosophila Genetic Reference Panel (DGRP) and Drosophila Population Genomics Project (DPGP). We employ a fairly standard BWA/SAMtools  pipeline, with a few tricks that are essential for getting good quality assemblies and consensus sequences. Our first iteration of this pipeline was used to predict infection status in the DGRP strains. These results have been published in the DGRP main paper published in early 2012, and can be found in Supplemental Table 9 of the DGRP paper or more usefully in a comma-separated file at the DGRP website.

We have now applied an update version of our pipeline to both the DGRP and DPGP datasets and have used these “essentially complete” genome sequences to study the recent evolutionary history of Wolbachia in D. melanogaster. In the process of generating these data we received several inquiries about the status of this project so, in the spirit of Open Science that makes genomics such a productive field, we have released the consensus sequences and alignments of 179 Wolbachia and 290 mitochondrial genomes from the DGRP and DPGP prior to publication of our manuscript.

Since we are not the primary data producer for these assemblies, it is not approporiate to employ a data release policy based on the NHGRI guidelines. Instead, we have chosen to release these data under a Creative Commons Attribution 3.0 Unported License. If you use these assemblies or alignments in your projects, please cite the main DGRP and DPGP papers for the raw data and cite the following reference for the mtDNA or Wolbachia assemblies:

Richardson, M.F., L.A. Weinert, J.J. Welch, R.S. Linheiro, M.M. Magwire, F.M. Jiggins & C.M. Bergman (2012) Population genomics of the Wolbachia endosymbiont in Drosophila melanogaster. PLOS Genetics 8:e1003129.

A tar.gz archive contaning a reference-based multiple sequence alignment of release 1.0 of the complete D. melanogaster mtDNA and Wolbachia assemblies can be obtained here. If you have questions about the methods used to produce these assemblies, any associated meta-data from these assemblies, or the content of our manuscript please contact Casey Bergman for details.

Creative Commons License

Mapping between SRA IDs and accessing metadata with the Bioconductor SRAdb package

Posted 12 Oct 2011 — by caseybergman
Category drosophila, high throughput sequencing, population genomics, R

NCBI’s Sequence Read Archive (SRA) is one of the primary repositories for next-generation sequencing data.  Submissions to SRA are often complex, with multiple samples from a given submission, or multiple sequencing runs associated with a given sample. As a result, a given SRA submission “envelope” can be associated with database records at several levels, each with corresponding identifiers and nomenclature. Valid levels in the SRA data model include: the study (prefixed with SRP), the experiment (prefixed with SRX), the sample (prefixed with SRS), and the run (prefixed with SRR). Mapping identifiers between the different parts of an SRA submission is important when you need to access meta-data from different parts of a submission or you need to aggregate data across different levels of a submission (e.g. sequencing runs from the same biological sample).

In the past, we’ve been guilty of using PERL to screen-scrape meta-data from the SRA html pages to aggregate accessions from different sequencing runs conducted by the Drosophila Genetic Reference Panel (DGRP) project. However, this is a non-optimal strategy because of changes to the SRA html format that make this approach brittle. Recently, I stumbled across a much better way to map IDs and access meta-data in SRA, using the SRAdb package from Bioconductor. The developers of SRAdb provide a nicely repackaged version of all SRA meta-data in SQLlite that can easily be accessed using a few lines of R.

The following example shows how to map IDs at all levels for SRA accessions from the two major Drosophila resequencing projects the DGRP (SRA study SRP000694) and the Drosophila Population Genomics Project (SRA studies SRP000224 & SRP005599).

# Script to generate SRA metadata for Drosophila #
# melaogaster population genomics projects       #
# Casey M. Bergman (University of Manchester)    #

#install the SRAdb bioconductor package
#download and connect to the SRA SQLlite database
sqlfile <- getSRAdbFile()
sra_connection <- dbConnect(SQLite(), "SRAmetadb.sqlite")
# make linking table between SRP, SRA, SRS, SRX & SRR for DPGP & DGRP projects
conversion <- sraConvert(c("SRP000694","SRP000224", "SRP005599"), sra_con = sra_connection)
# open outfile & print header
outFile <- file("DrosophilaPopulationGenomicData.tsv", "w")
cat("Project\tAccession\tSample\tExperiment\tRun\tSampleAlias\tSampleDescription\n", file=outFile)
# loop through conversion table and look-up meta-data for each sample
for (i in 1:nrow(conversion)) {
sample_metadata_sql <- paste("select sample_alias, description from sample where sample_accession like '", conversion&#91;i,3&#93;, "'", sep="")
sample_metadata_results <- dbGetQuery(sra_connection, sample_metadata_sql)
cat(conversion&#91;i,1&#93;, conversion&#91;i,2&#93;, conversion&#91;i,3&#93;, conversion&#91;i,4&#93;, conversion&#91;i,5&#93;, sample_metadata_results&#91;1,1&#93;, sample_metadata_results&#91;1,2&#93;, "\n", file=outFile, sep = "\t")
#close filehandle

#clean up
system("rm SRAmetadb.sqlite")
&#91;/sourcecode&#93;<div class="wp-git-embed" style="margin-bottom:10px; border:1px solid #CCC; text-align:right; width:99%; margin-top:-13px; font-size:11px; font-style:italic;"><span style="display:inline-block; padding:4px;">getDrosPopGenData.R</span><a style="display:inline-block; padding:4px 6px;" href="" target="_blank">view raw</a><a style="display:inline-block; padding:4px 6px; float:left;" href="" target="_blank">view file on <strong>GitHub</strong></a></div>

This script can be downloaded and run as follows:

$ wget
$ R --no-save &gt; getDrosPopGenSraMetaData.R

For those interested in the actual output of this script please find the DrosophilaPopulationGenomicData.tsv file here.

Credits: Thanks to Pedro Olivares-Chauvet for R syntax help and Jack Zhu for a prompt bug fix in the SRAdb files.

Installing libsequence and analysis tools on OSX

After being relegated to the margins of biology for almost a century, the field of population genetics is now moving closer the the center of mainstream biology in the era of next-generation sequencing (NGS). In addition to providing genome-wide DNA variation data to study classical questions in evolution that are motivated by population genetic theory, NGS now permits population-genetic based techniques like GWAS (genome-wide association studies) and eQTL (expression quantitative trait locus) analysis to allow biologists to identify and map functional regions across the genome.

Handling big data requires industrial strength bioinformatics tool-kits, which are increasingly available for many aspects of genome bioinformatics (e.g. the Kent source tree, SAMtools, BEDtools, etc.). However, for molecular population genetics at the command line, researchers have a more limited palette that can be used for genome-scale data, including: VariScan, the PERL PopGen modules, and the libsequence C++ library.

Motivated by a recent question on BioStar and a request from a colleague to generate a table of polymorphic sites across a bacterial genome, I’ve recently had a play with the last of these — Kevin Thornton’s libsequence and accompanying analysis toolkit — which includes a utility called compute which is a “mini-DNAsp for the Unix command-line.” As ever, getting this installed on OSX was more of a challenge than desired, with a couple of dependencies (including Boost and GSL), but in the end was do-able as follows:

$ wget
$ tar -xvzf boost_1_47_0.tar.gz
$ cd boost_1_47_0
$ ./
$ sudo ./b2 install
$ cd ..

$ wget
$ tar -xvzf libsequence-1.7.3.tar.gz
$ cd libsequence-1.7.3
$ ./configure
$ make
$ sudo make install
$ cd ..

$ wget
$ tar -xvzf gsl-1.15.tar.gz
$ cd gsl-1.15
$ ./configure --disable-shared --disable-dependency-tracking
$ make
$ sudo make install
$ cd ..

$ wget
$ tar -xvzf analysis-0.8.0.tar.gz
$ cd analysis-0.8.0
$ ./configure
$ make
$ sudo make install
$ cd ..

With this recipe, compute and friends should now be happily installed in /usr/local/bin/.

Notes: This protocol was developed on a MacBook Air Intel Core 2 Duo running OSX 10.6.8.

Enhanced by Zemanta

On genome coordinate systems and transposable element annotation

[Update: For an extended discussion of the issues in this post, see: Bergman (2012) A proposal for the reference-based annotation of de novo transposable element insertions” Mobile Genetic Elements 2:51 – 54]

Before embarking on any genome annotation effort, it is necessary to establish the best way to represent the biological feature under investigation. This post discusses how best to represent the annotation of transposable element (TE) insertions that are mapped to (but not present in) a reference genome sequence (e.g. from mutagenesis or re-sequencing studies), and how the standard coordinate system in use today causes problems for the annotation of TE insertions.

There are two major coordinate systems in genome bioinformatics, that differ primarily in whether they anchor genomic feature to (“base coordinate systems”) or between (“interbase coordinate systems”) nucleotide positions. Most genome annotation portals (e.g. NCBI or Ensembl), bioinformatics software (e.g. BLAST) and annotation file formats (e.g. GFF) use the base coordinate system, which represents a feature starting at the first nucleotide as position 1. In contrast, a growing number of systems (e.g. UCSC, Chado, DAS2) employ the interbase coordinate system, whereby a feature starting at the first nucleotide is represented as position 0. Note, the UCSC genome bioinformatics team actually use both systems and refer to the base coordinate system as “one-based, fully-closed” (used in the UCSC genome browser display) and interbase coordinate system as “zero-based, half-open” (used in their tools and file formats), leading to a FAQ about this issue by users. The interbase coodinate system is also referred to as “space-based” by some authors.

The differences between base (bottom) and interbase (top) coordinate system can be visualized in the following diagram (taken from the Chado wiki).

There are several advantage for using the interbase coordinate system including: (i) the ability to represent features that occur between nucleotides (like a splice site), (ii) simpler arithmetic for computing the length of features (length=end-start) and overlaps (max(start1,start2), min(end1,end2)) and (iii) more rational conversion of coordinates from the positive to the negative strand (see discussion here).

So why is the choice of coordinate system important for the annotation of TEs mapped to a reference sequence? The short answer is that TEs (and other insertions) that are not a part of the reference sequence occur between nucleotides in the reference coordinate system, and therefore it is difficult to accurately represent the location of a TE on base coordinates. Nevertheless, base coordinate systems dominate most of genome bioinformatics and are an established framework that one has to work within.

How then should we annotate TE insertions on base coordinates that are mapped precisely to a reference genome? If a TE insertion in reality occurs between positions X and X+1 in a genome, do we annotate the start and end position both at the same nucleotide? If so, do we annotate the start/stop coordinate at position X, or both at position X+1? If we chose to annotate the insertion at position X, then we need to invoke a rule that the TE inserts after nucleotide X. However this solution breaks down if the insertion is on the negative strand, since we either need to map a negative strand insertion to X+1 or have a different rule for interpreting the placement of the TE on positive and negative strands. Alternatively, do we annotate the TE as starting at X and ending at X+1, attempting to fake interbase coordinates on a base coordinate system, but at face value implying that the TE insertion is not mapped precisely and spans 2 bp in the genome.

After grappling with this issue for some time, it seems that neither of these solutions is sufficient to deal with the complexities of TE insertion and reference mapping. To understand why, we must consider the mechanisms of TE integration and how TE insertions are mapped to the genome. Most TEs create staggered cuts to the genomic DNA that are filled on integration into the genome leading to short target site duplications (TSDs). Most TEs also target a palindromic sequence, and insert randomly with respect to orientation. A new TE insertion is typically mapped to the genome by sequencing a fragment that spans the TE into unique flanking DNA, either by directed (e.g. inverse/linker PCR) or random (e.g. shotgun re-sequencing) approaches. The TE-flank fragment can be obtained from the 5′ or 3′ end of the TE. However, where one places the TE insertion depends on whether one uses the TE-flank from the 5′ or 3′ end and the orientation of the TE insertion in the genome. As shown in the following diagram, for an insertion on the positive strand (>>>), a TE-flank fragment from the 5′ end is annotated to occur at the 3′ end of the TSD (shown in bold), whereas a 3′ TE-flank fragment is placed at the 5′ end of the TSD.  For an insertion on the negative strand (<<<), the opposite effect occurs. In both cases, TE-flank fragments from the 5′ and 3′ end map the TE insertion to different locations in the genome.

Thus, where one chooses to annotate a TE insertion relative to the TSD is dependent on the orientation of the TE insertion and which end is mapped to the genome. As a consequence, both the single-base and two-base representations proposed above are flawed, since TE insertions into the same target site are annotated at two different locations on the positive and negative strand. This issue lead us (in retrospect) to misinterpret some aspects of the P-element target site preference in a recent paper, since FlyBase uses a single-base coordinate system to annotate TE insertions.

As an alternative, I propose that the most efficient and accurate way to represent TE insertions mapped to a reference genome on base coordinates is to annotate the span of the TSD and label the orientation of the TE in the strand field. This formulation allows one to bypass having to chose where to locate the TE relative to the TSD (5′ vs. 3′, as is required under the one-base/two-base annotation framework), and can represent insertions into the same target site that occur on different strands. Furthermore, this solution allows one to use both 5′ and 3′ TE-flank information. In fact, the overlap between coordinates from the 5′ and 3′ TE-flank fragments defines the TSD. Finally, this solution requires no prior information about TSD length for a given TE family, and also accommodates TE families that generate variable length TSDs since the TSD is annotated on a per TE basis.

The only problem left open by this proposed solution is for TEs that do not create a TSD, which have been reported to exist. Any suggestions for a general solution that also allows for annotation of TE insertions without TSDs would be much appreciated….

Compiling Jody Hey’s Multilocus HKA on OSX

Posted 09 Mar 2009 — by caseybergman
Category molecular evolution, OSX hacks, population genomics

The Hudson, Kreitman, Aguade (HKA) test is one of the most widely used tools for testing the neutral theory of molecular evolution, combining information from both polymorphism and divergence among closely related species. The HKA test classically was applied to two loci, but can be extended to multiple loci as well, using software such as Jody Hey‘s Multilocus HKA program.

Unfortunately the source code for this software has a couple of Windows-specific features that make it difficult to compile on OS X. To get it compile on OS X (and linux) our sysadmin Nick Gresham has developed the following solution:

1) change the following .h and .c filenames in the source code distribution to lowercase:

$ mv HKAEXP.C hkaexp.c
$ mv HKAFUNCS.H hkafuncs.h
$ mv HKADIST.C hkadist.c
$ mv HKAEXTRN.H hkaextrn.h
$ mv HKAPRAM1.C hkapram1.c
$ mv HKA.C hka.c
$ mv HKASIMB.C hkasimb.c
$ mv HKAFILE.C hkafile.c
$ mv SETS192.H sets192.h
$ mv HKAPREP.H hkaprep.h

2) change line 44 of hkafile.c to the following:

double gamma(double x);

3) create a Makefile with the following commands and make.

CC = gcc

all: hka

hka: hkadist.o hkaexp.o hkafile.o hkapram1.o hkasimb.o hka.o


$(CC) $(ALL_CFLAGS) -c $< -o $@


-rm *.o *~ core hka

Notes: This solution works on a 2.4 Ghz Intel Core 2 Duo Macbook running Mac OS 10.5.6 using i686-apple-darwin9-gcc-4.0.1. Tabs in the makefile have been faked using the indentation function in the wordpress editor.