Hosting Custom Tracks for the UCSC Genome Browser on Dropbox

Posted 17 Sep 2013 in filesharing, genome bioinformatics, UCSC genome browser

One of the most powerful features of the UCSC Genome Browser is the ability to analyze your own genome annotation data as Custom Tracks at the main site. Custom Track files can be uploaded via several mechanisms: by pasting into a text box, loading a file through your web browser, or placing a file on a web-accessible server and pasting in a URL to the remote file. Loading custom tracks via remote URLs is the preferred method for large data files, and is mandatory for file types that require an associated “index” file such as the BAM file format commonly used in next-generation sequence analysis.

When developing a new pipeline, debugging a script or doing preliminary data analysis, you may frequently need to submit slightly different versions of your custom tracks to the UCSC browser. When using the remote URL function to load your data, this may require moving your files to a web-accessible directory many times during the course of a day. This is not a big problem when working on a machine that is running a web server. But when working on a machine that is not configured to be a web server (e.g. most laptops), this often requires copying your files to another machine, which is a tiresome step in visualizing or analyzing your data.

During a recent visit to the lab, Danny Miller suggested using Dropbox to host custom tracks to solve this problem. Dropbox provides a very easy way to generate files on a local machine that sync automatically to a remote web-accessible server with effectively no effort by the user. Using Dropbox to host your UCSC custom tracks turns out to be a potentially very effective solution, and it looks like several others have stumbled on the same idea. In testing this solution, we have found that there are a few gotchas that one should be aware of to make this protocol work as easily as it should.

First, in July 2012 Dropbox changed their mechanism of making files web-accessible and no longer provide a “Public Folder” for new accounts. This was done ostensibly to improve security, since files in a Public Folder can be indexed by Google if they are linked to. Instead, now any file in your Dropbox folder can be made web accessible using the “Share Link” function. This mechanism for providing URLs to the UCSC Genome Browser is non-optimal for two reasons.

The first problem with the Share Link function is that the URL automatically generated by Dropbox cannot be read by the UCSC Genome Browser. For example, the link generated to the file “test.bed” in my Dropbox folder is “https://www.dropbox.com/s/7sjfbknsqhq6xfw/test.bed”, which gives an “Unrecognized format line 1” error when pasted into the UCSC Browser. If you just want to load a single custom track from Dropbox, this is easily fixed by replacing “www.dropbox” in the URL generated by Dropbox with “dl.dropboxusercontent”. In this example, the corrected path to the file would be “https://dl.dropboxusercontent.com/s/7sjfbknsqhq6xfw/test.bed”, which the UCSC Genome Browser can load without further editing.
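If you rewrite Share Links often, the host swap described above can be scripted. The following is a minimal Python sketch, not an official Dropbox or UCSC tool: the function name to_direct_link is made up for illustration, and it assumes Share Links still have the form shown in the example above, which Dropbox may change.

```python
from urllib.parse import urlsplit, urlunsplit

SHARE_HOST = "www.dropbox.com"             # host used in Share Links (from the example above)
DIRECT_HOST = "dl.dropboxusercontent.com"  # host that serves the raw file


def to_direct_link(share_url: str) -> str:
    """Return the Share Link with its host swapped for the direct-download host.

    Hypothetical helper: it only rewrites the host and leaves the path untouched.
    """
    parts = urlsplit(share_url)
    if parts.netloc != SHARE_HOST:
        raise ValueError(f"not a Dropbox Share Link: {share_url}")
    return urlunsplit(parts._replace(netloc=DIRECT_HOST))


if __name__ == "__main__":
    print(to_direct_link("https://www.dropbox.com/s/7sjfbknsqhq6xfw/test.bed"))
```

Run on the example link above, this should print the corrected “dl.dropboxusercontent.com” URL. Parsing the URL rather than doing a plain string replacement avoids accidentally rewriting a file name that happens to contain “www.dropbox”.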

The second problem with using the “Share Link” function for posting custom tracks to UCSC is that the URL generated for each file using this function is unique. This is a problem for two reasons: (i) you need to share and modify links for each file you upload and (ii) neither you nor the UCSC Genome Browser is able to anticipate what the path would be for multiple custom track files. This is problematic for custom track file formats that require an associated index file, which is assumed by the UCSC Genome Browser to be in the same directory as your custom track, with the appropriate extension. Since Dropbox makes a unique path to the index file, even if shared, there is no way for the Genome Browser to know where it is. Both of these issues can be solved by using the Public Folder function of your Dropbox account, rather than the Share Link function.

The good news is that Dropbox does still make the Public Folder function available for older accounts, and for newer accounts you can activate a Public Folder using this “secret” link. By placing your custom tracks in your Public Folder, you now have a stable base URL for providing files to the UCSC Genome Browser that does not require editing the URL, does not require explicitly sharing files, and can be anticipated by you and the browser. Following on from the example above, the link to the “test.bed” file inside a “custom_tracks” directory in my Dropbox Public Folder would be “https://dl.dropboxusercontent.com/u/dropboxuserid/custom_tracks/test.bed” (your dropboxuserid will be a long integer). Thus if you are using Dropbox to host many custom tracks or files that require an index file, the Public Folder option is the way to go.
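Because Public Folder URLs are predictable, the URL of a track and of its index file can be worked out before anything is uploaded. The Python sketch below illustrates this under stated assumptions: the function name public_urls and the suffix table are hypothetical, and the index naming it uses (“.bai” appended to a “.bam” file name, “.tbi” appended to a “.vcf.gz” file name) is the usual samtools/tabix convention rather than something stated in this post.

```python
from typing import List

# Base of Public Folder links, taken from the example URL above.
PUBLIC_BASE = "https://dl.dropboxusercontent.com/u"

# Assumed convention: the index sits next to the track and its name is the
# full track file name plus one of these suffixes.
INDEX_SUFFIX = {".bam": ".bai", ".vcf.gz": ".tbi"}


def public_urls(user_id: str, rel_path: str) -> List[str]:
    """Return the track URL, followed by its index URL if the format needs one.

    user_id  -- your Dropbox user id (a long integer, passed here as a string)
    rel_path -- path of the track relative to the Public Folder
    """
    track_url = f"{PUBLIC_BASE}/{user_id}/{rel_path.lstrip('/')}"
    urls = [track_url]
    for ext, suffix in INDEX_SUFFIX.items():
        if rel_path.endswith(ext):
            urls.append(track_url + suffix)
    return urls


if __name__ == "__main__":
    for url in public_urls("dropboxuserid", "custom_tracks/test.bam"):
        print(url)
```

For a BED file the function returns only the track URL, matching the “test.bed” example above. For a BAM file it also returns the URL at which the index would need to sit, which is exactly the location that cannot be predicted with Share Links.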

There are a couple of caveats to using Dropbox to host custom tracks. The first is that you are limited to the size of your Dropbox allocation, which you can increase by paying for it or inviting new users to use Dropbox. UPDATE: According to Dave Tang (see comment below), if you use Dropbox to host large BigWig files you may get blocked by Dropbox. The second is that any link you share or anything in your Public Folder is public. So any stable links to your files from webpages may ultimately be indexed by Google. Since there is no option to password protect shared files in the Public Folder of your Dropbox account, users looking for free options to sync large, password-protected custom track files with indexes to remote web-accessible servers should look into BitBucket, which has no filesize limit (unlike GitHub).

Credits: Danny Miller (Stowers Institute) had the original inspiration for this solution, and Michael Nelson (U. of Manchester) helped identify pros/cons of the various Dropbox hosting options.


10 Comments

  1. I had the same idea of hosting my files from my Dropbox Public folder but for bigWig files. I had a track hub set up on my personal server but due to space limitations I couldn’t host the files there. A nice feature of the trackDb.txt file is that I can use the URL of a bigWig file, i.e. I don’t need to host files on the server but elsewhere, in this case my Public Dropbox folder.

    However, after making a few queries on the UCSC Genome Browser, I got an email notification from Dropbox that my Public folder has been suspended for generating too much traffic. It was then I purchased a pro account, which supposedly allows a 200 GB per day bandwidth limit. So much for that. I can’t remember how many queries I made on the browser, but I cannot fathom how I could use up this limit in < 15 minutes.

    I wrote to Dropbox asking them how I could use so much bandwidth by making at most 30 or so queries on the genome browser and this is their reply:

    Subject: Your Dropbox public links have been suspended

    Hi Dave,

    Thank you for contacting us.

    Unfortunately no information about Public link downloads are stored by the Dropbox servers and so there are no statistics or metrics that are available to view. Sorry for any inconvenience this causes.

    If there is anything else I can help you with, please let me know.

    Regards,
    Sonya

    My guess is that if too many queries are made to the Dropbox Public folder, they are very sensitive about this. I wrote to the UCSC Genome Browser team and they replied that they don't know the traffic generated from the browser to a server.

    My solution? My workplace was nice enough to provide me with my own publicly accessible folder that's hosted on an Apache2 web server. So I'm using that instead. I don't have access to the log, so I can't check the usage.

    Hope that helps,

    Dave

  2. caseybergman

    Hi Dave –

    Thanks for the heads up on this problem and sharing your experience. As usual, you are (more than) one step ahead of things. We’ve only been using this for small numbers of BED, BAM and VCF files for Drosophila and have had no experience like yours (yet!). I wonder if this might be more of an issue for people working on the human genome to watch out for.

    Regardless, it is really useful to have this info written down somewhere so that people don’t follow the same potentially fruitless path. Maybe this “solution” will ultimately be an anti-solution, but still save people time in the end. It should still be useful for the odd one-off file exchange and testing small files. I’ll update if we end up getting blocked. Thanks again for taking the time to comment.

    Best,
    Casey

  3. Max

    If the file is a bam/bigwig/bigbed file, UCSC will read bits and pieces all across the file from dropbox, as you scroll over the chromosome. That might lead them to over-estimate the bandwidth actually used. I’m not sure what could be done against this, without some help from a dropbox engineer.

    For bed files as custom tracks, UCSC should read the file only once so that shouldn’t be a problem.

  4. caseybergman

    This is a good point. I suspect there will be less chance of a problem for static files that are just served once to the browser. But for bigdata files that are accessed multiple times, this might lead to a bandwidth issue. I wonder if bigWig/bigBed might be particularly problematic since they don’t have a separate index file like VCF or BAM. That is, it might be a combination of multiple accesses and filetype that is leading to the bandwidth problem.

  5. Curtis Hendrickson

    Thanks for providing a good summary of this approach and its issues (we use both this approach and our own public apache server for larger files).

    I would love to see a write-up of how to implement the password-protected custom track using bitbucket.org, which you mention as an alternative. I don’t see any documentation on UCSC on implementing custom tracks backed by data on authentication-enabled servers, or how those credentials would get entered into the browser.

    Thanks!

    Regards,
    Curtis

  6. Morri Feldman

    I also got blocked by dropbox when I tried to host a few bigwig files.

    I’m currently using S3 from Amazon Web Services to host my custom tracks and I’m pretty happy with it. The AWS free trial gives you 5GB of S3. You can create a public bucket using these instructions.
    http://havecamerawilltravel.com/tidbits/how-to-allow-public-access-to-an-amazon-s3-bucket/

    Once you have the bucket set up you can put items in it using a command-line tool like s3cmd, or even mount it as a local drive using various third-party solutions.

    –morri

  7. Andrew

    I also tried hosting some bigwigs on dropbox for use on the genome browser, and hit the 20 GB limit in a few hours of a few of us searching the bigwigs. I’m now a pro user, but don’t know if this is an especially scalable solution for me – does anyone who hosts their own files have any estimates for how much traffic on UCSC gets you to these 20 GB or 200 GB limits with bigwig files?

  8. Andrew: this is a very difficult question, because depending on what zoom level you’re operating on and how the summary levels are configured in your bigWig, the data transfer can vary widely. The easiest way would be to host a bigWig file on your server and measure it.

    This whole discussion suggests that dropbox is not the best place to host bigWig files. Something like S3 would be better, but S3 has a pay-per-GB model, which might be difficult to get through many university accounting departments… do you agree? It seems that what we need is a hosting provider that has a) a simple interface b) a flatrate model c) supports parallel requests and d) has a very generous transfer allowance. Or some way to extend it.

    We have something in development that solves most of these problems and if you drop me an email Andrew, I can send you a link to it.

  9. For anyone who comes across this thread, what I promised as a “solution” is actually somewhat overkill. But you can use it to open a track hub or a BAM from your local disk, without a webserver. Here are more details: https://genome.ucsc.edu/goldenPath/help/gbib.html

    NB: If you just want to load a bam or bigwig file, there is no need to type the whole “track name=xxx bigDataUrl=XXX” line anymore. Just paste the URL to your file and the rest will be auto-generated.

  10. I’ve spent quite a while on the Dropbox home pages and the UCSC source code. Here are some findings:

    1) Dropbox is not providing public folders anymore. You have to pay money now for this feature.

    2) Even with public folders, Dropbox does not allow UCSC to do parallel loading (as observed above). I can switch off parallel loading in the genome browser, but this will make any bigger track hub very slow. Upwards of a few tracks, it becomes noticeable.

    3) I contacted Dropbox and asked how this is supposed to work without public folders. It is possible to get links to files using the Dropbox Chooser, but these links are only valid for 4 hours.

    So in total, it would be possible to support loading a single bigBed or bigWig file using the public folder in a business account or support loading a single file for 4 hours. Depending on the traffic, dropbox might block UCSC at any moment. So if this solution becomes popular, we might get blocked.

    The drawbacks of files stored in Dropbox folders seem to be too high so we’ve currently stopped any further development in this direction.

    Microsoft One Drive seems to have almost identical drawbacks, and Amazon has no public folder that I can see.

    If anyone knows a storage provider that supports parallel downloading (at least to some extent) and links to file names that are not garbled or temporary, I’m very interested.


