Reference genomes are crucial for many applications in next-generation sequencing (NGS) analyses, especially in whole genome resquencing studies that form the basis of population genomics. Typically, one uses an off-the-shelf reference genome assembly for your organism of interest (e.g. human, Drosophila, Arabidopsis) obtained from the NCBI, UCSC or Ensembl genome databases. However, single-organism reference genomes neglect the reality that most organisms live in symbiosis with a large number of microbial species. When performing a whole-genome shotgun sequencing experiment from a macro-organisms like D. melanogaster, it is inevitable that some proportion of its symbionts will also be sequenced, especially endosymbionts that live intracellularly like Wolbachia. To reduce mismapping and generate materials for population genomics of symbionts and their hosts, it is therefore approporiate to map to the genome of the species of interest as well as to the genomes of symbiotic microbial species, where available. However, reference genomes are currently stored on a per-species basis and the “hologenome” for any organism does not yet exist in a readily accessible form.
In my limited experience, attempting to construct such a “hologenome” can be very a tedious and potentially error-prone process since reference genomes for model metazoan species and microbes are not always available in the same location or format. In gearing up for the second iteration of our work on the population genomics of microbial symbionts of D. melanogaster, I have decided to script the construction of a D. melanogaster reference hologenome for all microbial symbionts associated with D. melanogaster whose genomes are publicly available, which I would thought may be of use to others. The code for this process is as follows and should work on any UNIX-based machine.
#!/bin/bash ################################################ # Script to construct reference "hologenome" # # for D. melanogaster and associated microbes # # Casey M. Bergman (University of Manchester) # ################################################ # download D. melanogaster (dm3) reference genome from UCSC genome database # unpack dm3, remove composite mtDNA, extract mtDNA from chrU & combine into # one .fa file excluding chrU & chrUextra wget -q http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/chromFa.tar.gz tar -xvzf chromFa.tar.gz rm chrM.fa echo ">chrM_iso1" > chrM.fa faFrag chrU.fa 5288527 5305749 stdout | grep -v chrU >> chrM.fa cat chr2L.fa chr2LHet.fa chr2R.fa chr2RHet.fa chr3L.fa chr3LHet.fa chr3R.fa chr3RHet.fa chr4.fa chrX.fa chrXHet.fa chrYHet.fa chrM.fa > dm3.fa rm chromFa.tar.gz chr2L.fa chr2LHet.fa chr2R.fa chr2RHet.fa chr3L.fa chr3LHet.fa chr3R.fa chr3RHet.fa chr4.fa chrX.fa chrXHet.fa chrYHet.fa chrU.fa chrUextra.fa chrM.fa # download yeast (sacCer3) reference genome from UCSC genome database # unpack and rename sacCer3 chroms & combine into one .fa file wget -q http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/chromFa.tar.gz tar -xvzf chromFa.tar.gz for i in chr*fa; do cat $i | sed 's/>chr/>sacCer3_chr/g' >> sacCer3.fa; done rm chromFa.tar.gz chrI.fa chrII.fa chrIII.fa chrIV.fa chrIX.fa chrM.fa chrV.fa chrVI.fa chrVII.fa chrVIII.fa chrX.fa chrXI.fa chrXII.fa chrXIII.fa chrXIV.fa chrXV.fa chrXVI.fa # download genomes for microbes known to be associated with D. melanogaster # from NCBI, combine multi-contig draft assemblies into single fasta file # per species & rename fasta headers to human readable form # Wolbachia pipientis wget -q ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Wolbachia_endosymbiont_of_Drosophila_melanogaster_uid272/AE017196.fna echo ">w_pipientis" >> w_pipientis.fa grep -v \> AE017196.fna >> w_pipientis.fa rm AE017196.fna # Pseudomonas entomophila wget -q ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Pseudomonas_entomophila_L48_uid58639/NC_008027.fna echo ">p_entomophila" >> p_entomophila.fa grep -v \> NC_008027.fna >> p_entomophila.fa rm NC_008027.fna # Commensalibacter intestini wget -q ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/Commensalibacter_intestini_A911_uid73359/AGFR00000000.contig.fna.tgz tar -xvzf AGFR00000000.contig.fna.tgz cat AGFR01*fna | sed 's/>.*|/>/g' | sed 's/> />/' | awk -F, '{ print $1; }' | sed 's/ /_/g' | sed 's/>Commensalibacter_intestini_A911_74_/>c_intestini_/g' > c_intestini.fa rm AGFR01*fna AGFR00000000.contig.fna.tgz # Acetobacter pomorum wget -q ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/Acetobacter_pomorum_DM001_uid60787/AEUP00000000.contig.fna.tgz tar -xvzf AEUP00000000.contig.fna.tgz cat AEUP01*fna | sed 's/>.*|/>/g' | sed 's/> />/' | awk -F, '{ print $1; }' | sed 's/ /_/g' | sed 's/>Acetobacter_pomorum_DM001_Contig00/>a_pomorum_/g' > a_pomorum.fa rm AEUP01*fna AEUP00000000.contig.fna.tgz # Gluconobacter morbifer wget -q ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/Gluconobacter_morbifer_G707_uid73361/AGQV00000000.contig.fna.tgz tar -xvzf AGQV00000000.contig.fna.tgz cat AGQV01*fna | sed 's/>.*|/>/g' | sed 's/> />/' | awk -F, '{ print $1; }' | sed 's/ /_/g' | sed 's/>Gluconobacter_morbifer_G707_75_/>g_morbifer_/g' > g_morbifer.fa rm AGQV01*fna AGQV00000000.contig.fna.tgz # Providencia burhodogranariea wget -q ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/Providencia_burhodogranariea_DSM_19968_uid82573/AKKL00000000.contig.fna.tgz tar -xvzf AKKL00000000.contig.fna.tgz cat AKKL01*fna | sed 's/>.*|/>/g' | sed 's/> />/' | awk -F, '{ print $1; }' | sed 's/ /_/g' | sed 's/Providencia_burhodogranariea_DSM_19968_contig000/p_burhodogranariea_/g' > p_burhodogranariea.fa rm AKKL01*fna AKKL00000000.contig.fna.tgz # download genomes for microbes known to be associated with D. melanogaster # from Ensembl & rename fasta headers to human readable form # Providencia alcalifaciens wget -q ftp://ftp.ensemblgenomes.org/pub/release-19/bacteria/fasta/bacteria_18_collection/providencia_alcalifaciens_dmel2/dna/Providencia_alcalifaciens_dmel2.GCA_000314875.2.19.dna.toplevel.fa.gz gunzip Providencia_alcalifaciens_dmel2.GCA_000314875.2.19.dna.toplevel.fa.gz cat Providencia_alcalifaciens_dmel2.GCA_000314875.2.19.dna.toplevel.fa | sed 's/dna.*//g' | sed 's/>contig000/>p_alcalifaciens_/g' > p_alcalifaciens.fa rm Providencia_alcalifaciens_dmel2.GCA_000314875.2.19.dna.toplevel.fa* # Providencia rettgeri wget -q ftp://ftp.ensemblgenomes.org/pub/release-19/bacteria/fasta/bacteria_23_collection/providencia_rettgeri_dmel1/dna/Providencia_rettgeri_dmel1.GCA_000314835.1.19.dna.toplevel.fa.gz gunzip Providencia_rettgeri_dmel1.GCA_000314835.1.19.dna.toplevel.fa.gz cat Providencia_rettgeri_dmel1.GCA_000314835.1.19.dna.toplevel.fa | sed 's/dna.*//g' | sed 's/>Contig/>p_rettgeri_/g' > p_rettgeri.fa rm Providencia_rettgeri_dmel1.GCA_000314835.1.19.dna.toplevel.fa* # Enterococcus faecalis wget -q ftp://ftp.ensemblgenomes.org/pub/release-19/bacteria/fasta/bacteria_15_collection/enterococcus_faecalis_fly1/dna/Enterococcus_faecalis_fly1.GCA_000157415.1.19.dna.toplevel.fa.gz gunzip Enterococcus_faecalis_fly1.GCA_000157415.1.19.dna.toplevel.fa.gz cat Enterococcus_faecalis_fly1.GCA_000157415.1.19.dna.toplevel.fa | sed 's/dna.*//g' | sed 's/>GG/>e_faecalis_GG/g' > e_faecalis.fa rm Enterococcus_faecalis_fly1.GCA_000157415.1.19.dna.toplevel.fa* # create D. melanogaster "hologenome" # from individual species fasta files cat dm3.fa sacCer3.fa w_pipientis.fa p_entomophila.fa c_intestini.fa a_pomorum.fa g_morbifer.fa p_burhodogranariea.fa p_alcalifaciens.fa p_rettgeri.fa e_faecalis.fa > dm3_hologenome.fa
Finally, since some tools (e.g. SAMtools faidx) require all reference genome sequences to have the same number of characters per line, but since different genome databases use different numbers of characters per line the file above will have heterogeneous character counts for different species. To fix this, I use fasta_formatter from the FASTX toolkit to convert the dm3_hologenome.fa into a file with fixed character lengths. To download and run this script, with the conversion to fixed line lengths, execute the following:
$ wget https://github.com/bergmanlab/blogscripts/blob/master/makeDmelHologenome.sh $ sh makeDmelHologenome.sh $ fasta_formatter -i dm3_hologenome.fa -o dm3_hologenome_v1.fa -w 50
Credits: Thanks go to Douda Bensasson for some SED tips during the construction of this script.