Tutorial: Using the UCSC Genome Browser and Galaxy to study regulatory sequence evolution in Drosophila.

One of the most enjoyable parts of teaching genomics and bioinformatics is introducing people to the UCSC Genome Browser and Galaxy systems. Both systems are powerful, intuitive, reliable and user-friendly services, and lend themselves easily to student practicals, as the good folks at Open Helix have amply demonstrated. Over the last few years, I’ve developed a fairly reliable advanced undergraduate/early graduate teaching practical that uses the UCSC Genome Browser and Galaxy to study regulatory sequence evolution in Drosophila, based on data I curated a few years back into the Drosophila DNase I footprint database. As part of opening up the resources from my group, I thought I would post this tutorial with the hope that someone else can use it for teaching purposes or their own edification. The UCSC half can be done pretty reliably in a 50 minute session, while the Galaxy part is much more variable: some people take 20 min or less and others up to an hour. Feedback and comments are most welcome.

Enjoy!

Aims

  • Become familiar with the UCSC Genome Browser.
  • Become familiar with the UCSC Table Browser.
  • Become familiar with Galaxy Analysis Tools and Histories.
  • Study the conservation of transcription factor binding sites in Drosophila.

Introduction

This lab is an exercise in performing a comparative genomic analysis of transcription factor binding sites (TFBSs) in Drosophila. You will use the UCSC Genome Browser, Table Browser and Galaxy to identify TFBSs that are conserved across multiple Drosophila species. Highly conserved TFBSs are likely to play important roles in Drosophila development. TFBSs that are not conserved may represent regulatory sequences that have contributed to developmental evolution across Drosophila species, or those that simply have been lost as part of the process of TFBS turnover in a conserved regulatory element. The skills you learn in this lab will be generally applicable to many questions for a wide range of species with genome sequences in the UCSC Genome Database.

Finding the even-skipped region at the UCSC Genome Browser

1) Go to the UCSC Genome Bioinformatics site at http://genome.ucsc.edu.

2) Select the “Genome Browser” option from the light blue panel on the left hand side of the page.

3) From the Gateway page, select “Insect”, “D. melanogaster” and “Apr 2004” from the “clade”, “genome” and “assembly” pull-down menus. Take a minute to read the text at the bottom of the page “About the D. melanogaster Apr. 2004 (dm2) assembly”. [Note: it is essential that you make sure you have selected the correct genome assembly at this step.]

4) Enter the gene “eve” (an abbreviated name for the gene even-skipped) into the “position or search term” box and click the “submit” button.

5) From the search results page, select the top link under the “FlyBase Protein-Coding Genes” header taking you to “eve at chr2R:5491054-5492592”. This will take you to a graphical view of the even-skipped gene with boxes indicating exons and lines indicating introns. (Note: the thick parts of the boxes denote the parts of exons that are translated into protein; the thinner parts are UTRs.)
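If you want to work with positions like the one in step 5 outside the browser, the “chrom:start-end” string is easy to unpack. The sketch below is only an illustration: the function names parse_position and to_bed are made up for this example and are not part of any UCSC tool. It relies on the convention that positions shown in the browser are 1-based and inclusive, whereas BED files (used later in this tutorial) are 0-based and half-open.

```python
import re


def parse_position(position):
    """Split a 'chrom:start-end' string into (chrom, start, end)."""
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", position.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end position: {position!r}")
    chrom, start, end = match.groups()
    # Thousands separators are stripped before converting to integers
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))


def to_bed(chrom, start, end):
    """Convert a 1-based inclusive position to a 0-based half-open BED interval."""
    return chrom, start - 1, end


chrom, start, end = parse_position("chr2R:5491054-5492592")
length = end - start + 1  # inclusive coordinates, so add one
bed_interval = to_bed(chrom, start, end)
print(chrom, start, end, length, bed_interval)
```

The length computed here is the span of the eve gene model in base pairs, and the BED interval has the same end but a start that is one smaller.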

Customising the UCSC Genome Browser to display transcription factor binding sites in the even-skipped region

1) The Genome Browser page displays information about a selected genomic region and contains several components. They are, from top to bottom: a list of links to complementary tools and resources on the dark blue background; buttons for navigating left and right and zooming in and out on the chromosome; search boxes for jumping to new regions of the genome; a pictorial representation of the region shown on the chromosome; the main display window, comprising several tracks showing different types of information about the genomic region on display; a set of buttons to modify the main display; and several rows of pull-down menus, each controlling the display status of an individual track in the main display window.

2) Click the “hide all” button in the second panel of display controls to remove all tracks from the browser. Then select the “full” option from the drop-down FlyBase Genes menu under “Genes and Gene Prediction Tracks” and click refresh.

3) In the main display window, the top feature in light blue is the 2-exon gene eve. Click on one of the blue exons of the eve gene. This will send you to a detailed page about the eve gene. Scroll down to see the data linked to eve, including information imported from external sources like FlyBase, links to other resources like “Orthologous Genes in Other Species” also at the UCSC genome browser, and links to other resources outside the UCSC Database such as “Protein Domain and Structure Information” at the InterPro database. [Note: you can minimise subsections of this page by clicking the “-” button next to each major heading.]

4) Click the white “Genome Browser” link on the top blue banner to return to the main browser page. Scroll down the page and read the names of the various tracks available for this genome sequence. Sets of tracks are organised into logical groups (e.g. “Genes and Gene Prediction Tracks”). To find out more information about the data in any track, you can click on the blue title above each track’s pulldown menu, which leads to a detail page explaining the track and additional track-specific display controls. [Note: the same detail page can be displayed by clicking on the blue/grey rectangle at the very left of the track display in the main display window.]

5) Click on the blue title for the “FlyReg” track under the “Expression and regulation” heading. Take a minute to read about the FlyReg track and where the data comes from. Set the display mode pull-down menu to “pack” and then click “refresh”. [Note: visit http://www.flyreg.org/ for more information.]

6) Click the “zoom out 10x” button in the top right corner of the page to expand your window to display ten times the current sequence length. Each annotated regulatory element you are seeing in the FlyReg track is an experimentally-validated transcription factor binding site (TFBS) that regulates eve.

7) Click directly on one of the brown TFBS features in the FlyReg track in the 5′ intergenic region upstream of the eve gene. As with all data in different tracks, clicking on a feature will send you to a detail page about the feature, in this case the TFBS. In the detail page, click on the “Position” link and this will return you to the main Genome Browser window showing just the coordinates of the TFBS you just selected.

Investigating the conservation of individual TFBSs using the UCSC Genome Browser

1) Select the “full” option from the Conservation drop-down menu under “Comparative genomics” and click refresh.

2) The browser should now be displaying exactly one TFBS and conservation of this TFBS using the “12 Flies, Mosquito, Honeybee, Beetle Multiz Alignments & phastCons Scores” track. Is this TFBS conserved across the Drosophila species? If so, which ones? Is this TFBS conserved in Anopheles gambiae, Tribolium castaneum or Apis mellifera?

3) Click the “zoom out 10x” button twice and select a different TFBS from the FlyReg Track, click on the “Position” link in the detail page and evaluate if this TFBS is conserved and in which species. Now repeat for every TFBS in the genome (only joking!).

4) To see the general correspondence between TFBSs and highly conserved sequences, set the pull-down menu to “pack” for the “Most Conserved” Track. Zoom out 10x so you are displaying a few hundred bp. How well do the most highly conserved sequences correspond to the TFBSs? Are TFBSs typically longer or shorter than a highly conserved region? Zoom out 10x again so you are displaying ~1 Kbp. Are there more TFBSs than conserved sequences or vice versa? Why might this be the case?

Investigating the conservation of all known TFBSs using the UCSC Table Browser

1) Click the “Tools” menu on the dark blue background at the top of the Genome Browser window and select “Table Browser”. This will send you to an alternative interface to access data in the UCSC Genome Database called the “Table Browser.” The pull-down menus you see here correspond to the same tracks you saw in the Genome Browser.

2) Select “Insect”, “D. melanogaster” and “Apr 2004” from the “clade”, “genome” and “assembly” pull-down menus.

3) Select “Expression and Regulation” and “FlyReg” from the “group” and “track” pull-down menus, respectively.

4) Click the radio button (the circle with a dot indicating which option is selected) next to “genome” as the “region” you will analyse. This will select the whole genome for inclusion in your analysis.

5) Select “Hyperlinks to Genome Browser” from the “output format” pull-down menu and click “get output”. This will send you to a page with >1,300 hyperlinks that send you to all the annotated TFBS in Drosophila, each of which corresponds to one row in the FlyReg data track. [Note: this is a general method to export data from any track for the whole genome or a specific region.]

6) Click the “Tables” link on the dark blue background at the top of the page to return to the Table Browser. We are now going to use the Table Browser to ask “how many TFBS in Drosophila are found in highly conserved sequences?” We will do this by using the Table Browser to overlap all of the TFBS with all of the Most Conserved segments of the genome.

7) Click the “create” button next to the “intersection” option. This will send you to a page where you can set conditions for the overlap analysis of the FlyReg TFBS with the Most Conserved regions.

8) Select “Comparative Genomics” and “Most Conserved” from the “group” and “track” pull-down menus. Click the “All FlyReg records that have at least X% overlap with Most Conserved” option. Set the X% value to “100” and click “submit” to return to the main Table Browser page.

9) Notice that the Table Browser now shows the “intersection with phastConsElements15way” option is selected. Select “Hyperlinks to Genome Browser” from the “output format” pull-down menu and click “get output”.  This will send you to a page with hyperlinks that send you to all the annotated TFBS in Drosophila that are 100% contained in the Most Conserved regions of the genome. Click on a few links to convince yourself that this analysis has generated the correct results. At this point you have a great result – all fully conserved TFBS in Drosophila – but it is difficult to quantitatively summarize these data in the Table Browser, so let’s move on to Galaxy where analysis like this is a lot easier.
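To make the logic of the 100% overlap filter in steps 7 to 9 concrete, here is a minimal Python sketch of the same idea: keep a TFBS only if it lies entirely inside a conserved block on the same chromosome. This is not the Table Browser’s own code. The function name fully_contained and the toy coordinates are invented for illustration, and a site covered only by the combination of several abutting blocks would be missed by this sketch, which may differ from how the Table Browser computes overlap.

```python
from collections import defaultdict


def fully_contained(sites, conserved):
    """Return the sites that lie 100% inside at least one conserved block.

    Both inputs are lists of (chrom, start, end, ...) tuples using
    BED-style 0-based, half-open coordinates.
    """
    # Group conserved blocks by chromosome so each site is only
    # compared with blocks on its own chromosome
    blocks_by_chrom = defaultdict(list)
    for chrom, start, end, *_ in conserved:
        blocks_by_chrom[chrom].append((start, end))

    kept = []
    for site in sites:
        chrom, start, end = site[0], site[1], site[2]
        # A site is kept when a single block spans it completely
        if any(b_start <= start and end <= b_end
               for b_start, b_end in blocks_by_chrom[chrom]):
            kept.append(site)
    return kept


# Hypothetical toy data, not real FlyReg or Most Conserved records
sites = [
    ("chr2R", 100, 110, "bcd"),  # inside the first block
    ("chr2R", 195, 205, "hb"),   # only partly inside the second block
    ("chr3L", 50, 60, "Kr"),     # no conserved block on this chromosome
]
conserved = [("chr2R", 90, 120), ("chr2R", 200, 300), ("chr3R", 40, 70)]

kept = fully_contained(sites, conserved)
print(kept)
```

Only the first toy site survives the filter, which mirrors what setting X% to “100” asks for: partial overlaps are discarded.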

Quantifying TFBS conservation using Galaxy

1) Click the “Tables” link on the dark blue background at the top of the page to return to the Table Browser. Select “BED – browser extensible data” from the “output format” pull-down menu and check the “Galaxy” check box. Now click the “get output” button at the bottom of the page. This will send you to a detail page. Leave all the default options set here and click “Send query to Galaxy”.

2) This will launch a new Galaxy session as an anonymous user. [For more information on Galaxy, visit http://usegalaxy.org] The results of this Table Browser query will be executed as a Galaxy “analysis” which will appear in the right hand “History” pane of the Galaxy browser window as “1: UCSC Main on D. melanogaster: flyreg2 (genome)”. While the Table Browser query is running, this analysis will be grey in the history pane, but will turn green when it is completed.

3) When the analysis has completed, you will have created a new dataset numbered “1” in your Galaxy history. Click on the link entitled “1: UCSC Main on D. melanogaster: flyreg2 (genome)” in the history to reveal a dropdown which contains a snapshot of this dataset. [This dataset should contain 533 regions. If it doesn’t, stop here and get help before moving on.] Now click on the eye icon to reveal the contents of this data set in the middle pane of the Galaxy window. Use the scroll bar on the right hand side of the middle pane to browse the contents of this data set.

4) Click on the pencil icon to “Edit attributes” of this data set. In the middle pane replace “1: UCSC Main on D. melanogaster: flyreg2 (genome)” in the “Name” text box with something shorter and more descriptive like “conserved TFBS”. Click the “Save” button at the bottom of the middle pane.

5) Now let’s get the complete set of FlyReg TFBS by querying the UCSC Table Browser from inside Galaxy. Click “Get Data” under “Tools” in the left hand pane of the Galaxy page. This will expand a list of options to get data into Galaxy. Click “UCSC Main” which will bring up the Table Browser inside the middle pane of the Galaxy page. Click the “clear” button next to “intersection with phastConsElements15way:”. Make sure the “Galaxy” check box is selected and click “get output” and then click “Send query to Galaxy” on the next page. This will create a new dataset “2: UCSC Main on D. melanogaster: flyreg2 (genome)” which you can rename “all TFBS” as above using the pencil icon. [Note: this dataset should have 1,362 regions in it; if yours does not, please stop and ask for help.]

6) You can now perform many different analyses on these datasets using the many “Tools” available in the left hand pane of the Galaxy window. Let’s start by summarizing how many TFBS are present in the “1: conserved TFBS” and “2: all TFBS” datasets. To do this, click the “Statistics” link on the left hand side, which will open up a set of other analysis tools including the “Count” tool. Click on the “Count” tool, and in the middle pane select “1: conserved TFBS” in the “from dataset” menu and click on “c4” so that counting occurs on column 4, which contains the name of the TFBS in the “1: conserved TFBS” dataset. Repeat this analysis for the “2: all TFBS” dataset. This should lead to two more datasets of 70 and 88 lines, respectively, which you should again rename to something more meaningful than the default values, such as “3: conserved TFBS counts” and “4: all TFBS counts”.

7) Now let’s use the “Join, Subtract and Group->Join two Datasets” tool to join the results of the two previous analyses into one merged dataset. Click “Join, Subtract and Group->Join two Datasets”, select “4: all TFBS counts” in the “Join” drop-down menu, “c2” for the “using column” drop-down menu, “3: conserved TFBS counts” in the “with” drop-down menu and “c2” for the “and column” drop-down menu. Set the “Keep lines of first input that do not join with second input:”, “Keep lines of first input that are incomplete:”, and “Fill empty columns:” drop-down menus to “yes”. Setting the “Fill empty columns” menu to yes will pull up a few more menus, which should be set as follows: “Only fill unjoined rows:” set to “Yes”; “Fill Columns by:” set to “single Fill Value”, and “Fill value” to “0”. Now click “Execute”. What you have just achieved is one of the trickier basic operations in bioinformatics, and is the underlying process in most relational database queries. Pat yourself on the back!

8) Now let’s try to do some science and ask the question: “what is the proportion of conserved TFBS for each transcription factor?” This will give us some insight into whether TFBS turnover is the same for all TFs or might be higher or lower for some TFs. To do this, use the “Text manipulation->compute” Tool and set the “Add expression” box to “1-((c1-c3)/c1)” and “as a new column to:” to “5: Join two Datasets on data 3 and data 4” and click “Execute”. This will add a new column to the joined dataset with the proportion of conserved TFBS for each transcription factor.
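If it helps to see steps 6 to 8 outside the Galaxy interface, the count, join and compute operations can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than what Galaxy runs internally: the function names and the toy rows are invented, and the joined rows here keep only three columns (total count, name, conserved count) instead of repeating the name column as the Galaxy join does. Column positions c1 and c3 are kept the same so the expression from step 8 reads identically.

```python
from collections import Counter


def count_by_factor(bed_rows):
    """Count rows per value of column 4 (the TFBS name), like Count on c4."""
    return Counter(row[3] for row in bed_rows)


def join_with_fill(all_counts, conserved_counts, fill=0):
    """Keep every name in all_counts; fill missing conserved counts with `fill`."""
    return [(total, name, conserved_counts.get(name, fill))
            for name, total in sorted(all_counts.items())]


def add_proportion(joined):
    """Append 1-((c1-c3)/c1), which simplifies to c3/c1."""
    return [(total, name, conserved, 1 - ((total - conserved) / total))
            for total, name, conserved in joined]


# Hypothetical toy rows in BED order: chrom, start, end, name
all_rows = [
    ("chr2R", 100, 110, "bcd"),
    ("chr2R", 150, 160, "bcd"),
    ("chr2R", 195, 205, "hb"),
    ("chr3L", 50, 60, "Kr"),
]
conserved_rows = [("chr2R", 100, 110, "bcd")]

table = add_proportion(
    join_with_fill(count_by_factor(all_rows), count_by_factor(conserved_rows))
)
for row in table:
    print(row)
```

The fill value of 0 matters: without it, factors with no conserved sites would drop out of the join and never appear in the final proportions, which is why step 7 asks you to keep unjoined lines.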

For Further Exploration…

1) Format the output of your last analysis in a more meaningful manner using the “Filter and Sort->Sort” tool.

2) Plot the distribution of the proportion of conserved TFBS per TF using the “Graph/Display Data->Histogram” tool.

3) Go back to the original TFBS datasets and derive new datasets to investigate whether different chromosomes have different rates of TFBS evolution.

4) Develop your own question about TFBS evolution, create a custom analysis pipeline in Galaxy and wow us with your findings.

Running Taverna workflows in Galaxy on OSX

Posted 28 Jul 2011 — by caseybergman
Category galaxy, OSX hacks, taverna

Recently I’ve been bitten by the Galaxy bug, primarily because I needed a mechanism this year to supervise final year undergraduate projects of students without a strong background in bioinformatics. This was a great success, since students seem to pick up the interface really easily and I was able to track and comment on their progress explicitly via shared histories and workflows.

Because of this experience, I’ve become much more interested in using workflow systems to run and manage my bioinformatics pipelines in my research projects rather than relying on READMEs and UNIX shell scripts. Recent news that Kostas Karasavvas from NBIC has developed eGalaxy, a mechanism to run Taverna 2 workflows using Galaxy, is in my view a game-changer for the more widespread use of workflows by practicing bioinformaticians like myself, since it will permit mash-ups between the two main workflow systems and deployment of the large pre-established library of Taverna workflows in myExperiment to be used in a local Galaxy installation.

The easiest way of getting a Taverna workflow running in Galaxy is to search myExperiment for Taverna 2 workflows, and click the “Download Workflow as a Galaxy tool” button in the “Download” section of the page. This will send you to a “Galaxy tool download” page with instructions on how to get the Taverna workflow installed as a tool in Galaxy. The instructions are a bit spare at the moment and require familiarity with installing Galaxy locally and adding tools to a local Galaxy installation. They also only have installation notes for Debian-based systems, but with the help of Rob Haines from the Taverna team, I’ve been able to get a stable protocol working for OSX as well.

To give a bit more context, Taverna workflows are run in Galaxy as Ruby scripts that are added to your Galaxy tools directory like any other custom tool. Executing the Ruby script tool launches a connection to a remote Taverna 2 server, where the workflow is run. Results are then returned to the Ruby script and thence to Galaxy. Like all Galaxy tools, installing a Taverna tool requires the tool itself (a script or other executable program) and a description of the tool’s inputs/outputs in XML format to be placed in the “tools” directory, plus a notification to Galaxy that the tool exists in the tool_conf.xml file in the Galaxy main directory.

The Ruby script generated by myExperiment requires a few Ruby packages (aka “gems”) that are installed by RubyGems. Both Ruby and RubyGems are installed by default on OSX (in /usr/bin) so your kit is nearly complete. The following steps should allow you to run a test Taverna workflow to make sure your configuration is working properly on an OSX 10.6 machine. To help consolidate install notes for the entire process in one place, I’m copying the key steps for a local Galaxy installation here as well.

1) Install the Mercurial version control system for OSX from here, and make sure /usr/local/bin/ is in your path.

2) Checkout the Galaxy codebase using Mercurial in your home directory ($HOME):

$ hg clone https://bitbucket.org/galaxy/galaxy-dist/

3) Create a Taverna tools directory in your Galaxy distribution:

$ mkdir $HOME/galaxy-dist/tools/tavernaTools

4) Install the RubyGems needed for the Taverna tool to run. The critical gems to install are t2-server (which is needed to connect to the Taverna server that runs the workflow) and rubyzip (which is needed for compression of Galaxy results). Installation of t2-server will automatically install the libxml-ruby and hirb gems it is dependent on. libxml-ruby calls on the libxml2 C XML parser, which is also installed by default on OSX in /usr/include/libxml2/.

$ sudo gem install t2-server
$ sudo gem install rubyzip

5) Select a Taverna 2 workflow from myExperiment and download the Ruby script and XML file. For testing, use a workflow that does not require any input files, e.g. http://www.myexperiment.org/workflows/823/versions/1/galaxy_tool

6) Paste “http://test.mybiobank.org/taverna-server” into the “Taverna server URL:” textbox.

7) Click the “Download Galaxy tool” button, e.g. to your Downloads folder.

8) Unzip the Taverna 2 Galaxy tool and move the Ruby script and XML file into your Taverna tools directory, e.g.

$ unzip $HOME/Downloads/fetch_pdb_flatfile_from_rcsb_server_58764_galaxy_tool.zip -d $HOME/Downloads
$ mv $HOME/Downloads/fetch_pdb_flatfile_from_rcsb_server_58764_galaxy_tool.xml $HOME/galaxy-dist/tools/tavernaTools
$ mv $HOME/Downloads/fetch_pdb_flatfile_from_rcsb_server_58764_galaxy_tool.rb $HOME/galaxy-dist/tools/tavernaTools

9) Edit your tool_conf.xml file to include a new section for Taverna tools, e.g.

 <section name="Taverna Tools" id="tavernaTools">
    <tool file="tavernaTools/fetch_pdb_flatfile_from_rcsb_server_58764_galaxy_tool.xml" />
 </section>

 
10) Start the Galaxy server by running the run.sh script:

$ sh $HOME/galaxy-dist/run.sh

11) Open http://127.0.0.1:8080 in your web browser and you should see a “Taverna Tools” tool heading above the “Get Data” Tools heading, which when clicked on reveals the newly installed “Fetch PDB flatfile from RCSB server” tool. You can now run this Taverna tool like any other analysis tool in Galaxy. For this particular tool, this involves inputting a PDB id and clicking “execute”. Successful completion of the job should return a PDB file in the main Galaxy window.
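If you would rather not hand-edit tool_conf.xml in step 9, the same change can be sketched with Python’s standard XML library. This is an illustration, not part of Galaxy or eGalaxy: the function name register_tool is made up, and the sample file content is a minimal stand-in for a real tool_conf.xml (it assumes the root element is named toolbox). Note that this parser drops XML comments, so treat the sketch as a way to understand or check the edit rather than as a replacement for editing your real file by hand.

```python
import xml.etree.ElementTree as ET


def register_tool(tool_conf_text, section_id, section_name, tool_file):
    """Return tool_conf XML text with tool_file listed under the given section."""
    root = ET.fromstring(tool_conf_text)

    # Reuse the section if it already exists, otherwise create it
    section = None
    for candidate in root.findall("section"):
        if candidate.get("id") == section_id:
            section = candidate
            break
    if section is None:
        section = ET.Element("section", {"name": section_name, "id": section_id})
        # Insert first so the heading is listed ahead of the other sections
        root.insert(0, section)

    # Avoid listing the same tool file twice
    if not any(tool.get("file") == tool_file for tool in section.findall("tool")):
        ET.SubElement(section, "tool", {"file": tool_file})
    return ET.tostring(root, encoding="unicode")


# Hypothetical minimal stand-in for an existing tool_conf.xml
sample_conf = """<toolbox>
  <section name="Get Data" id="getext">
    <tool file="data_source/upload.xml" />
  </section>
</toolbox>"""

tool_xml = "tavernaTools/fetch_pdb_flatfile_from_rcsb_server_58764_galaxy_tool.xml"
updated = register_tool(sample_conf, "tavernaTools", "Taverna Tools", tool_xml)
print(updated)
```

Running the function a second time on its own output leaves the file unchanged, which is handy if you add several Taverna tools over time.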

In the early stages of testing this protocol, I selected a different Taverna workflow (916) that did not run out of the box and gave less-than-helpful error messages in Galaxy. Troubleshooting with Rob Haines pinpointed this problem to the http://test.mybiobank.org Taverna server not being able to execute aspects of this workflow. When tested on an alternate Taverna server, the workflow did run and completed with expected results. So if you experience a failed attempt at running a Taverna workflow on Galaxy, it may not have anything to do with your kit or the workflow in question.

From this initial experience, looking forward (from mid-2011) I’d like to see the eGalaxy system include a mechanism to generate tests for each Taverna tool automatically, either at the time of tool download from myExperiment or as a part of the Galaxy testing system. I’d also like to see an industrial-scale Taverna server hosted somewhere (preferably by the Galaxy or Taverna teams) so all Taverna tools can be used reliably out of the box on at least one tested server. In any event, I’m now convinced that the eGalaxy project is what it says on the tin, and can only improve with more folks trying it out and contributing feedback.

Notes: This protocol was developed on a MacBook Air Intel Core 2 Duo running OSX 10.6.8. Credits to Rob Haines for help troubleshooting and giving me a detailed walk through on the mechanics of the Taverna-Galaxy integration, as well as to Kostas Karasavvas of NBIC for having the inspiration to initiate the eGalaxy project.