Error-Corrected PacBio Sequences for the D. melanogaster Reference Strain

No Comments
Posted 06 Nov 2013 in drosophila, genome bioinformatics, high throughput sequencing

Using PacBio and Illumina whole genome shotgun sequences we recently released for the D. melanogaster reference strain, Sergey Koren and Adam Phillippy at the University of Maryland have recently run their pacBioToCA method to generate a dataset of error-corrected PacBio reads for this dataset, which the have kindly made available here for re-use without restriction. This pilot data set is not at high enough coverage and thus a whole genome assembly was not attempted. Nevertheless, both the raw and error-corrected datasets should be of use to better understand the nature of PacBio data and the pacBioToCA pipeline as applied to Drosophila genomes.

The 2057_PacBio_corrected.tgz archive contains the following files:

  • pacbio.blasr.spec – the specification file used for the pacBioToCA run.
  • corrected.trim.fastq.bz2 – the error-corrected PacBio reads.
  • corrected.trim.lens – a file containing columns with corrected read id, read length, running total BP, running mean, and running coverage assuming 140Mbp genome size.
  • corrected.trim.names – a look-up table to map read IDs in the pacBioToCA output to the original PacBio input IDs.

Sergey also sent along a quick summary of the correction:

The correction used the latest CA from the repository (as of 10/15/13). The max number of mappings each Illumina sequence could have was set to 20, the repeat separation was set to 10. The genome size was set to 180Mbp. BLASR was used to align the Illumina sequences to PacBio sequences for correction. The input for correction was 673 Mbp of PacBio data (max: 8,231, mean: 1,587) which corresponds to approximately 4.8X of a 140Mbp genome. The correction produced 469Mbp (max: 6,883, mean: 1,410) or 3.35X. The throughput was approximately 69%. To estimate, the accuracy of the data, it was mapped back to the reference D. melanogaster. The sequences average 99.85% identity. Approximately 10% of the sequences mapped in more than a single piece to the reference. Some of these mappings were due to short indel differences between the reference and the reads or N’s in the reference. Some sequences mapped across large distances in the reference. I did not confirm if these split mappings were supported by the uncorrected reads or not.

We thank both Sergey and Adam for taking the time to run their pacBioToCA pipeline and making these data available to the community, and hope these data are of use to others in their research.


Add Your Comment