Automating ASCP-based submission of NGS data to ENA using expect

Posted 12 Dec 2014 — by caseybergman
Category EBI, filesharing, genome bioinformatics, high throughput sequencing, linux, NCBI

Submitting next-generation sequencing (NGS) data to International Nucleotide Sequence Database Collaboration repositories such as EBI or NCBI is an essential, but time-consuming, phase of many projects in modern biology. Mechanisms to speed up NGS data submission are therefore a welcome addition to many biologists’ toolkits. The European Nucleotide Archive (ENA) provides detailed instructions on how to submit NGS data to its “dropbox” staging area, which is a key part of the overall submission process for an NGS project. This can be done in three ways:

  • Using ENA’s Webin client in your web browser,
  • Using a standard FTP client at the command line, or
  • Using Aspera’s ASCP client at the command line.

For users with many samples to submit, the latter two command-line options are clearly the preferred methods. While FTP is installed by default on most Linux systems, transfer of large NGS data files by FTP is slow, and on some systems (such as ours) FTP is specifically disabled because of security concerns.

In contrast, ASCP is not installed by default on most Linux systems, but it provides a very fast method to transfer large data files to EBI or NCBI. One downside of using ASCP is that it interactively prompts for a password for each file transferred. This requires babysitting your ASCP-based command-line submission and supplying the same password for each file, thereby undermining much of the automation that a command-line solution should provide.

In searching around for solutions to the ASCP-babysitting problem, I stumbled on a documentation page entitled “Expect script for automating Aspera uploads to the EBI ENA” written by Robert Davey at The Genome Analysis Centre. I had never heard of the expect scripting language prior to reading this post, but it provides exactly the solution I was looking for:

[Expect] is used to automate control of interactive applications such as telnet, ftp, passwd, fsck, rlogin, tip, ssh, and others. Expect uses pseudo terminals (Unix) or emulates a console (Windows), starts the target program, and then communicates with it, just as a human would, via the terminal or console interface. (Wikipedia)

Robert’s expect script was a little more complicated than I needed for my purposes, and required a few tweaks to conform to updates to EBI’s ASCP submission process. Without too much trouble, I cobbled together a modified version that automates the transfer of any number of data files to the top level of your ENA dropbox:

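#!/usr/bin/expect

# usage: expect ena_submit.exp fofn Webin-ID Webin-Password
if { [llength $argv] != 3 } {
  puts stderr "usage: expect ena_submit.exp fofn Webin-ID Webin-Password"
  exit 1
}

set fofn [lindex $argv 0]
set dropbox [lindex $argv 1]
set pass [lindex $argv 2]

# read the whole file of filenames into memory, then close it
set files [open $fofn]
set subs [read $files]
close $files

# never time out while waiting for a large transfer to finish
set timeout -1

# transfer each listed file to the top level of the ENA dropbox,
# answering the password prompt that ascp issues for every file
foreach line [split $subs "\n"] {
  if { "" != $line } {
    spawn ascp -QT -l80M -d $line ${dropbox}@webin.ebi.ac.uk:.
    expect "Password:"
    send "$pass\r"
    expect eof
  }
}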

This script requires expect and ASCP to be installed globally on your system, and the user must provide three arguments:

  • a file of filenames, listing the full path to each file you would like to submit to ENA,
  • the ID for your Webin submission account, and
  • the password for your Webin submission account
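
Because the expect script simply hands each listed path to ascp, a mistake in the file of filenames only shows up partway through a long transfer. The following is a minimal sketch of a preflight check that could be run first. The function names check_tools and check_fofn are hypothetical helpers and not part of the script above; the sketch assumes a POSIX shell.

```shell
#!/bin/sh
# Preflight checks to run before ena_submit.exp.
# check_tools and check_fofn are hypothetical helper names, not part of the original script.

# Return 0 if every named command is found on PATH; report any that are missing.
check_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool" >&2
      rc=1
    fi
  done
  return $rc
}

# Return 0 if every non-empty line of the file of filenames is a full path
# to a readable regular file; report each line that is not.
check_fofn() {
  fofn=$1
  rc=0
  while IFS= read -r path || [ -n "$path" ]; do
    # skip blank lines, as the expect script does
    [ -z "$path" ] && continue
    case $path in
      /*) ;;
      *) echo "not a full path: $path" >&2; rc=1; continue ;;
    esac
    if [ ! -f "$path" ] || [ ! -r "$path" ]; then
      echo "missing or unreadable: $path" >&2
      rc=1
    fi
  done < "$fofn"
  return $rc
}
```

With these functions defined in your shell, something like `check_tools expect ascp && check_fofn fofn` should succeed silently before you start the submission, and otherwise print the offending tool or path.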

For example, if you have a directory of gzip’ed fastq files that you would like to submit to ENA, you can submit them in bulk by navigating to that directory and doing something like the following:

#download the script above from github
wget --no-check-certificate https://raw.githubusercontent.com/bergmanlab/blogscripts/master/ena_submit.exp

#create file of filenames (glob directly rather than parsing ls, so names with spaces survive)
j=`pwd`
for k in *gz; do echo "$j/$k"; done > fofn

#perform ASCP submission (replace Webin-ID and Webin-Password with your own ENA Webin account ID and password)
expect ena_submit.exp fofn Webin-ID Webin-Password