A Reference “Hologenome” for D. melanogaster

Reference genomes are crucial for many applications in next-generation sequencing (NGS) analyses, especially in whole genome resquencing studies that form the basis of population genomics. Typically, one uses an off-the-shelf reference genome assembly for your organism of interest (e.g. human, Drosophila, Arabidopsis) obtained from the NCBI, UCSC or Ensembl genome databases. However, single-organism reference genomes neglect the reality that most organisms live in symbiosis with a large number of microbial species. When performing a whole-genome shotgun sequencing experiment from a macro-organisms like D. melanogaster, it is inevitable that some proportion of its symbionts will also be sequenced, especially endosymbionts that live intracellularly like Wolbachia. To reduce mismapping and generate materials for population genomics of symbionts and their hosts, it is therefore approporiate to map to the genome of the species of interest as well as to the genomes of symbiotic microbial species, where available. However, reference genomes are currently stored on a per-species basis and the “hologenome” for any organism does not yet exist in a readily accessible form.

In my limited experience, attempting to construct such a “hologenome” can be very a tedious and potentially error-prone process since reference genomes for model metazoan species and microbes are not always available in the same location or format. In gearing up for the second iteration of our work on the population genomics of microbial symbionts of D. melanogaster, I have decided to script the construction of a D. melanogaster reference hologenome for all microbial symbionts associated with D. melanogaster whose genomes are publicly available, which I would thought may be of use to others. The code for this process is as follows and should work on any UNIX-based machine.

makeDmelHologenome.shview rawview file on GitHub

Finally, since some tools (e.g. SAMtools faidx) require all reference genome sequences to have the same number of characters per line, but since different genome databases use different numbers of characters per line the file above will have heterogeneous character counts for different species. To fix this, I use fasta_formatter from the FASTX toolkit to convert the dm3_hologenome.fa into a file with fixed character lengths. To download and run this script, with the conversion to fixed line lengths, execute the following:

$ wget https://github.com/bergmanlab/blogscripts/blob/master/makeDmelHologenome.sh
$ sh makeDmelHologenome.sh
$ fasta_formatter -i dm3_hologenome.fa -o dm3_hologenome_v1.fa -w 50

Credits: Thanks go to Douda Bensasson for some SED tips during the construction of this script.


Add Your Comment