Archive for the ‘filesharing’ Category

Automating ASCP-based submission of NGS data to ENA using expect

Posted 12 Dec 2014 — by caseybergman
Category EBI, filesharing, genome bioinformatics, high throughput sequencing, linux, NCBI

Submitting next-generation sequencing (NGS) data to International Nucleotide Sequence Database Collaboration repositories such as those at EBI or NCBI is an essential, but time-consuming, phase of many projects in modern biology. Mechanisms to speed up NGS data submission are therefore a welcome addition to many biologists' toolkits. The European Nucleotide Archive (ENA) provides detailed instructions on how to submit NGS data to their “dropbox” staging area, which is a key part of the overall submission process for an NGS project. This can be done in three ways:

  • Using ENA’s Webin client in your web browser,
  • Using a standard FTP client at the command line, or
  • Using Aspera’s ASCP client at the command line.

For users with many samples to submit, the latter two command line options are clearly the preferred methods. While an FTP client is installed by default on most Linux systems, transfer of large NGS data files by FTP is slow, and on some systems (such as ours) FTP is disabled entirely because of security concerns.
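For completeness, a command line FTP submission might look like the sketch below, using lftp as the client. Webin-XXXXX and Webin-Password are placeholders for your account details, and the server name is an assumption based on the ASCP endpoint used later in this post, so check ENA's current FTP instructions before relying on it:

#FTP alternative (sketch): upload all gzip'ed fastq files in the current directory
lftp -u Webin-XXXXX,Webin-Password -e "mput *.fastq.gz; bye" webin.ebi.ac.uk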

In contrast, ASCP is not installed by default on most Linux systems, but it provides a very fast method for transferring large data files to EBI or NCBI. One downside of using ASCP is that it interactively prompts the user for a password for each file transferred. This requires babysitting your ASCP-based command line submission and supplying the same password over and over, thereby undermining much of the automation that a command line solution should provide.
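For reference, a single manual transfer looks something like the following (sample1.fastq.gz and Webin-XXXXX are placeholders), with ASCP stopping at the Password: prompt for every file:

#manual ASCP transfer of one file to the top level of your ENA dropbox
ascp -QT -l80M -d sample1.fastq.gz Webin-XXXXX@webin.ebi.ac.uk:.
Password: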

In searching around for solutions to the ASCP-babysitting problem, I stumbled on a documentation page entitled “Expect script for automating Aspera uploads to the EBI ENA” written by Robert Davey at The Genome Analysis Centre. I had never heard of the expect scripting language prior to reading this post, but it provides exactly the solution I was looking for:

[Expect] is used to automate control of interactive applications such as telnet, ftp, passwd, fsck, rlogin, tip, ssh, and others. Expect uses pseudo terminals (Unix) or emulates a console (Windows), starts the target program, and then communicates with it, just as a human would, via the terminal or console interface. (Wikipedia)

Robert’s expect script was a little more complicated than I needed for my purposes, and required a few tweaks to conform to updates to EBI’s ASCP submission process. Without too much trouble, I cobbled together a modified version that automates the transfer of any number of data files to the top level of your ENA dropbox:

#!/usr/bin/expect

# usage: expect ena_submit.exp <file_of_filenames> <Webin-ID> <Webin-password>
set fofn [lindex $argv 0]
set dropbox [lindex $argv 1]
set pass [lindex $argv 2]

# read the file of filenames into a single string
set files [open $fofn]
set subs [read $files]

set direxist 0
# disable expect's timeout so that large transfers are not killed prematurely
set timeout -1

# loop over each non-empty line (i.e. each file) and answer the password prompt automatically
foreach line [split $subs \n] {
  if { "" != $line } {
    set seqfile [exec basename $line]
    set lst [split $line "/"]
    spawn ascp -QT -l80M -d $line $dropbox@webin.ebi.ac.uk:.
    expect "Password:"
    send "$pass\r"
    expect eof
  }
}

This script requires expect and ASCP to be installed globally on your system, and for the user to provide three arguments:

  • a file of filenames listing the full paths to the files you would like to submit to ENA (one per line; see the example below),
  • the ID for your Webin submission account, and
  • the password for your Webin submission account
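The file of filenames is just a plain text file with one absolute path per line; the paths below are hypothetical and simply show the expected format:

/home/user/my_project/sample1_R1.fastq.gz
/home/user/my_project/sample1_R2.fastq.gz
/home/user/my_project/sample2_R1.fastq.gz
/home/user/my_project/sample2_R2.fastq.gz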

For example, if you have a directory of gzip’ed fastq files that you would like to submit to ENA, all you would need to do to submit your files in bulk is navigate to that directory and run something like the following:

#download the script above from github
wget --no-check-certificate https://raw.githubusercontent.com/bergmanlab/blogscripts/master/ena_submit.exp

#create file of filenames
j=`pwd`
for k in `ls *gz`; do echo $j/$k; done > fofn

#perform the ASCP submission (replace Webin-ID and Webin-Password with your ENA Webin account ID and password)
expect ena_submit.exp fofn Webin-ID Webin-Password

Hosting Custom Tracks for the UCSC Genome Browser on Dropbox

Posted 17 Sep 2013 — by caseybergman
Category filesharing, genome bioinformatics, UCSC genome browser

One of the most powerful features of the UCSC Genome Browser is the ability to analyze your own genome annotation data as Custom Tracks at the main site. Custom Track files can be uploaded via several mechanisms: by pasting data into a text box, by uploading a file through your web browser, or by placing a file on a web-accessible server and pasting a URL to the remote file into the browser. Loading custom tracks via remote URLs is the preferred method for large data files, and is mandatory for file types that require an associated “index” file, such as the BAM file format commonly used in next-generation sequence analysis.
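As a reminder of what a custom track looks like, here is a minimal example in BED format (the track name and coordinates are made up); saved as e.g. test.bed, it can be pasted into the text box, uploaded through the browser, or hosted at a web-accessible URL:

track name="my_regions" description="example custom track" visibility=2
chr1 11000 12000 region1
chr1 25000 26000 region2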

When developing a new pipeline, debugging a script, or doing preliminary data analysis, you may frequently need to submit slightly different versions of your custom tracks to the UCSC browser. When using the remote URL function to load your data, this may require moving your files to a web-accessible directory many times during the course of a day. This is not a big problem when working on a machine that is running a web server. But when working on a machine that is not configured to be a web server (e.g. most laptops), this often requires copying your files to another machine, which is a tiresome step in visualizing or analyzing your data.

During a recent visit to the lab, Danny Miller suggested using Dropbox to host custom tracks to solve this problem. Dropbox provides a very easy way to generate files on a local machine that sync automatically to a remote web-accessible server with effectively no effort by the user. Using Dropbox to host your UCSC custom tracks turns out to be a potentially very effective solution, and it looks like several others have stumbled on the same idea. In testing this solution, we have found that there are a few gotchas that one should be aware of to make this protocol work as easily as it should.

First, in July 2012 Dropbox changed its mechanism for making files web-accessible and no longer provides a “Public Folder” for new accounts. This was done ostensibly to improve security, since files in a Public Folder can be indexed by Google if they are linked to. Instead, any file in your Dropbox folder can now be made web-accessible using the “Share Link” function. This mechanism for providing URLs to the UCSC Genome Browser is suboptimal for two reasons.

The first problem with the Share Link function is that the URL automatically generated by Dropbox cannot be read by the UCSC Genome Browser. For example, the link generated to the file “test.bed” in my Dropbox folder is “https://www.dropbox.com/s/7sjfbknsqhq6xfw/test.bed”, which gives an “Unrecognized format line 1” error when pasted into the UCSC Browser. If you just want to load a single custom track to the UCSC Browser using Dropbox, this can easily be fixed by replacing “www.dropbox” in the URL generated by Dropbox with “dl.dropboxusercontent”. In this example, the corrected path to the file would be “https://dl.dropboxusercontent.com/s/7sjfbknsqhq6xfw/test.bed”, which can be loaded by the UCSC Genome Browser automatically.
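If you need to do this for more than one file, the substitution is trivial to script; for example, a shell one-liner using the Share Link URL from above would be:

#rewrite a Dropbox Share Link into a URL the UCSC Genome Browser can load
echo "https://www.dropbox.com/s/7sjfbknsqhq6xfw/test.bed" | sed 's/www\.dropbox\.com/dl.dropboxusercontent.com/'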

The second problem with using the “Share Link” function for posting custom tracks to UCSC is that the URL generated for each file using this function is unique. This is a problem for two reasons: (i) you need to share and modify links for each file you upload, and (ii) neither you nor the UCSC Genome Browser is able to anticipate the paths to multiple custom track files. This is problematic for custom track file formats that require an associated index file, which is assumed by the UCSC Genome Browser to be in the same directory as your custom track, with the appropriate extension. Since Dropbox generates a unique path to the index file, even if it is shared, there is no way for the Genome Browser to know where it is. Both of these issues can be solved by using the Public Folder function of your Dropbox account, rather than the Share Link function.

The good news is that Dropbox does still make the Public Folder function available for older accounts, and for newer accounts you can activate a Public Folder using this “secret” link. By placing your custom tracks in your Public Folder, you have a stable base URL for providing files to the UCSC Genome Browser that does not require editing the URL, does not require explicitly sharing files, and can be anticipated by both you and the browser. Following on from the example above, the link to the “test.bed” file inside a “custom_tracks” directory in my Dropbox Public Folder would be “https://dl.dropboxusercontent.com/u/dropboxuserid/custom_tracks/test.bed” (your dropboxuserid will be a long integer). Thus, if you are using Dropbox to host many custom tracks or files that require an index file, the Public Folder option is the way to go.
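For example, for an indexed BAM file named test.bam in that same custom_tracks directory (file names and dropboxuserid again being placeholders), the custom track is a single track line like the one below, and the Genome Browser will automatically look for the index file test.bam.bai at the same Public Folder URL:

track type=bam name="my_reads" bigDataUrl=https://dl.dropboxusercontent.com/u/dropboxuserid/custom_tracks/test.bam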

There are a couple of caveats to using Dropbox to host custom tracks. The first is that you are limited to the size of your Dropbox allocation, which you can increase by paying for it or by inviting new users to Dropbox. UPDATE: According to Dave Tang (see comment below), if you use Dropbox to host large BigWig files you may get blocked by Dropbox. The second is that any link you share or anything in your Public Folder is public, so any stable links to your files from webpages may ultimately be indexed by Google. Since there is no option to password-protect shared files in the Public Folder of your Dropbox account, users looking for a free option to sync large, password-protected custom track files with indexes to a remote web-accessible server should look into Bitbucket, which has no file size limit (unlike GitHub).

Credits: Danny Miller (Stowers Institute) had the original inspiration for this solution, and Michael Nelson (U. of Manchester) helped identify pros/cons of the various Dropbox hosting options.

Using Webdav to share files

Posted 17 Oct 2009 — by maxhaussler
Category apache, filesharing

I hate sending around the same text document all the time to the same people. With Google Docs still being too buggy for serious work and completely missing reference management, and with Etherpad and Zoho Writer not being up to the task either, I am searching for a better solution. At school, some people used Microsoft’s SharePoint Server, which allows you to open documents over the internet and also to save them back to the server, at a hefty price. It uses WebDAV as one of the supported protocols to access the files. WebDAV resembles HTTP, except that it also allows browsing folders and saving files to the server. Apache’s WebDAV module is a free implementation. You can keep your documents on a webserver, open them with Microsoft Office or OpenOffice, and save them directly back to your webserver. In fact, you can put all sorts of files that you want to share with other people into the WebDAV folder. The result is very similar to the commercial offering of dropbox.com, except that you don’t need to install a client and you are not limited to 2GB.

Here is how I installed WebDAV on my own machine as root (adapted from the Debian website):

  1. Enable the Apache modules that handle DAV and digest authentication (Debian: “a2enmod dav dav_fs auth_digest”)
  2. Add this to your Apache configuration file (e.g. /etc/apache2/http.conf):
    Alias /dav /var/www/dav
    DAVLockDB /var/www/davLock/davLock

    <Directory /var/www/dav>
      AllowOverride None
      Options None
      Order allow,deny
      allow from all
    </Directory>

    <Location /dav>
      DAV On
      DAVMinTimeout 600
      AuthType Digest
      AuthName "webdav"
      AuthUserFile /etc/apache2/webdav-password
      Require valid-user
    </Location>
  3. Create the directories for the files and the lock files: “mkdir /var/www/dav /var/www/davLock”. Change their owner to the Apache user so that both are writeable by the Apache daemon: “chown www-data /var/www/dav /var/www/davLock”.
  4. Create the user and password: “htdigest -c /etc/apache2/webdav-password webdav myuser”
  5. Restart Apache: “/etc/init.d/apache2 restart”

You should then be able to access the folder from the MacOS Finder via “Go / Connect to Server” by entering a server address like http://myserver.com/dav, with the username myuser and the password that you specified with htdigest.
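You can also test the setup from any command line with curl, which speaks enough WebDAV to upload a file with an HTTP PUT and list a folder with a PROPFIND request (the file name here is just an example):

#upload a document into the WebDAV folder
curl --digest -u myuser -T mydocument.odt http://myserver.com/dav/

#list the contents of the WebDAV folder
curl --digest -u myuser -X PROPFIND -H "Depth: 1" http://myserver.com/dav/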

(To make WebDAV work with Windows, see the comments below.)