Archive for the ‘R’ Category

Mapping between SRA IDs and accessing metadata with the Bioconductor SRAdb package

Posted 12 Oct 2011 — by caseybergman
Category drosophila, high throughput sequencing, population genomics, R

NCBI’s Sequence Read Archive (SRA) is one of the primary repositories for next-generation sequencing data.¬† Submissions to SRA are often complex, with multiple samples from a given submission, or multiple sequencing runs associated with a given sample. As a result, a given SRA submission “envelope” can be associated with database records at several levels, each with corresponding identifiers and nomenclature. Valid levels in the SRA data model include: the study (prefixed with SRP), the experiment (prefixed with SRX), the sample (prefixed with SRS), and the run (prefixed with SRR). Mapping identifiers between the different parts of an SRA submission is important when you need to access meta-data from different parts of a submission or you need to aggregate data across different levels of a submission (e.g. sequencing runs from the same biological sample).

In the past, we’ve been guilty of using PERL to screen-scrape meta-data from the SRA html pages to aggregate accessions from different sequencing runs conducted by the Drosophila Genetic Reference Panel (DGRP) project. However, this is a non-optimal strategy because of changes to the SRA html format that make this approach brittle. Recently, I stumbled across a much better way to map IDs and access meta-data in SRA, using the SRAdb package from Bioconductor. The developers of SRAdb provide a nicely repackaged version of all SRA meta-data in SQLlite that can easily be accessed using a few lines of R.

The following example shows how to map IDs at all levels for SRA accessions from the two major Drosophila resequencing projects the DGRP (SRA study SRP000694) and the Drosophila Population Genomics Project (SRA studies SRP000224 & SRP005599).

##################################################
# Script to generate SRA metadata for Drosophila #
# melaogaster population genomics projects       #
# Casey M. Bergman (University of Manchester)    #
##################################################

#install the SRAdb bioconductor package
source("http://www.bioconductor.org/biocLite.R")
biocLite("SRAdb")
library(SRAdb)
 
#download and connect to the SRA SQLlite database
sqlfile <- getSRAdbFile()
sra_connection <- dbConnect(SQLite(), "SRAmetadb.sqlite")
 
# make linking table between SRP, SRA, SRS, SRX & SRR for DPGP & DGRP projects
conversion <- sraConvert(c("SRP000694","SRP000224", "SRP005599"), sra_con = sra_connection)
 
# open outfile & print header
outFile <- file("DrosophilaPopulationGenomicData.tsv", "w")
cat("Project\tAccession\tSample\tExperiment\tRun\tSampleAlias\tSampleDescription\n", file=outFile)
 
# loop through conversion table and look-up meta-data for each sample
for (i in 1:nrow(conversion)) {
sample_metadata_sql <- paste("select sample_alias, description from sample where sample_accession like '", conversion&#91;i,3&#93;, "'", sep="")
sample_metadata_results <- dbGetQuery(sra_connection, sample_metadata_sql)
cat(conversion&#91;i,1&#93;, conversion&#91;i,2&#93;, conversion&#91;i,3&#93;, conversion&#91;i,4&#93;, conversion&#91;i,5&#93;, sample_metadata_results&#91;1,1&#93;, sample_metadata_results&#91;1,2&#93;, "\n", file=outFile, sep = "\t")
}
 
#close filehandle
close(outFile)

#clean up
system("rm SRAmetadb.sqlite")
&#91;/sourcecode&#93;<div class="wp-git-embed" style="margin-bottom:10px; border:1px solid #CCC; text-align:right; width:99%; margin-top:-13px; font-size:11px; font-style:italic;"><span style="display:inline-block; padding:4px;">getDrosPopGenData.R</span><a style="display:inline-block; padding:4px 6px;" href="https://raw.github.com/bergmanlab/blogscripts/master/getDrosPopGenData.R" target="_blank">view raw</a><a style="display:inline-block; padding:4px 6px; float:left;" href="https://github.com/bergmanlab/blogscripts/blob/master/getDrosPopGenData.R" target="_blank">view file on <strong>GitHub</strong></a></div>

This script can be downloaded and run as follows:


$ wget https://github.com/bergmanlab/blogscripts/blob/master/getDrosPopGenData.R
$ R --no-save &gt; getDrosPopGenSraMetaData.R

For those interested in the actual output of this script please find the DrosophilaPopulationGenomicData.tsv file here.

Credits: Thanks to Pedro Olivares-Chauvet for R syntax help and Jack Zhu for a prompt bug fix in the SRAdb files.