Archive for the ‘regulatory sequences’ Category

Tutorial: Using the UCSC Genome Browser and Galaxy to study regulatory sequence evolution in Drosophila.

One of the most enjoyable parts of teaching genomics and bioinformatics introducing people to the UCSC Genome Browser and Galaxy systems.  Both systems are powerful, intuitive, reliable and user-friendly services, and lend themselves easily to student practicals, as the good folks at Open Helix have amply demonstrated. Over the last few years, I’ve developed a fairly reliable advanced undergraduate/early graduate teaching practical that uses the UCSC Genome Browser and Galaxy to  study regulatory sequence evolution in Drosophila, based on data I curated a few years back into the Drosophila DNAse I footprint database.  As part of opening up the resources from my group, I thought I would post this tutorial with the hope that someone else can use it for teaching purposes or their own edification. The UCSC half can be done pretty reliably in a 50 minute session, and the Galaxy part is much more variable – some people take 20 min or less and others up to an hour.  Feedback and comments are most welcome.



  • Become familiar with the UCSC Genome Browser.
  • Become familiar with the UCSC Table Browser.
  • Become familiar with Galaxy Analysis Tools and Histories.
  • Study the conservation of transcription factor binding sites in Drosophila.


This lab is an exercise in performing a comparative genomic analysis of transcription factor binding sites (TFBSs) in Drosophila. You will use the UCSC Genome Browser, Table Browser and Galaxy to identify TFBSs that are conserved across multiple Drosophila species. Highly conserved TFBSs are likely to play important roles in Drosophila development. TFBSs that are not conserved may represent regulatory sequences that have contributed to developmental evolution across Drosophila species, or those that simply have been lost as part of the process of TFBSs turnover in a conserved regulatory element. The skill you learn in this lab will be generally applicable to many questions for a wide range of species with genome sequences in the UCSC Genome Database.

Finding the even-skipped region at the UCSC Genome Browser

1) Go to the UCSC Genome Bioinformatics site at

2) Select the “Genome Browser” option from the light blue panel on the left hand side of the page.

3) From the Gateway page, select “Insect”, “D. melanogaster” and “Apr 2004” from the “clade”, “genome” and “assembly” pull-down menus. Take a minute to read the text at the bottom of the page “About the D. melanogaster Apr. 2004 (dm2) assembly”. [Note: it is essential that you make sure you have selected the correct genome assembly at this step.]

4) Enter the gene “eve” (an abbreviated name for the gene even-skipped) into the “position or search term” box and click the “submit” button.

5) From the search results page, select the top link under the “FlyBase Protein-Coding Genes” header taking you to “eve at chr2R:5491054-5492592”. This will take you to a graphical view of the even-skipped gene with boxes indication exons and lines indication introns. (Note: the thick part of the boxes denote the parts of exons are translated into protein, the thinner parts are UTRs)

Customising the UCSC Genome Browser to display transcription factor binding sites in the even-skipped region

1) The Genome Browser page displays information about a selected genomic region and contains several components. They are, from top to bottom: a list of links to complementary tools and resources on the dark blue background; buttons for navigating left and right and zooming in and out on the chromosome; search boxes for jumping to new regions of the genome; a pictoral representation of the region shown on the chromosome; the main display window, comprising several tracks showing different types of information about the genomic region on display; a set of buttons to modify the main display; and several rows of pull-down menus, each controlling the display status of an individual track in the main display window.

2) Click the “hide all” button in the second panel of display controls to remove all tracks from the browser. Then select the “full” option from the drop-down FlyBase Genes menu under “Genes and Gene Prediction Tracks” and click refresh.

3) In the main display window, the top feature in light blue is the 2-exon gene eve. Click on one of the blue exons of the eve gene. This will send you to a detailed page about the eve gene. Scroll down to see the data linked to eve, including information imported from external sources like FlyBase, links to other resources like “Orthologous Genes in Other Species” also at the UCSC genome browser, and links to other resources outside the UCSC Database such as “Protein Domain and Structure Information” at the InterPro database. [Note: you can minimise subsections of this page by clicking the “-” button next to each major heading.]

3) Click the white “Genome Browser” link on the top blue banner to return to the main browser page. Scroll down the page and read the names of the various tracks available for this genome sequence. Sets of tracks are organised into logical groups (e.g. “Genes and Gene Prediction Tracks”). To find out more information about the data in any track, you can click on the blue title above each track’s pulldown menu, which leads to a detail page explaining the track and additional track-specific display controls. [Note: the same detail page can be displayed by clicking on the blue/grey rectangle at the very left of the track display in the main display window.]

4) Click on the blue title for the “FlyReg” track under the “Expression and regulation” heading. Take a minute to read about the FlyReg track and where the data comes from. Set the display mode pull-down menu to “pack” and then click “refresh”. [Note: visit for more information.]

5) Click the “zoom out 10x” button in the top right corner of the page to expand your window to display ten times the current sequence length. Each annotated regulatory element you are seeing in the FlyReg track is an experimentally-validated transcription factor binding site (TFBS) that regulates eve.

6) Click directly on one of the brown TFBSs features in the FlyReg track in the 5′ intergenic region upstream of the eve gene. As with all data in different tracks, clicking on a feature will send you to a detail page about the feature, in this case the TFBS. In the detail page, click on the “Position” link and this will return you to the main Genome Browser window window showing just the coordinates of the TFBS you just selected.

Investigating the conservation of individual TFBSs using the UCSC Genome Browser

1) Select the “full” option from the Conservation drop-down menu under “Comparative genomics” and click refresh.

2) The browser should now be displaying exactly one TFBS and conservation of this TFBS using the “12 Flies, Mosquito, Honeybee, Beetle Multiz Alignments & phastCons Scores” track. Is this TFBS conserved across the Drosophila species? If so, which ones? Is this TFBS conserved in Anopheles gambiae, Tribolium castaneum or Apis mellifera?

3) Click the “zoom out 10x” button twice and select a different TFBS from the FlyReg Track, click on the “Position” link in the detail page and evaluate if this TFBS is conserved and in which species. Now repeat for every TFBS in the genome (only joking!).

4) To see the general correspondence between TFBS and highly conserved sequences, set the pull-down menu to “pack” for the “Most Conserved” Track. Zoom out 10x so you are displaying a few hundred bp. How well do the most highly conserved sequences correspond the TFBSs? Are TFBSs typically longer or shorter than a highly conserved region? Zoom out 10x so you are displaying a ~1 Kbp. Are there more TFBSs than conserved sequences or vice versa? Why might this be the case?

Investigating the conservation of all known TFBSs using the UCSC Table Browser

1) Click the “Tools” menu on the dark blue background at the top of the Genome Browser window and select “Table Browser”. This will send you to an alternative interface to access data in the UCSC Genome Database called the “Table Browser.” The pull-down menus you see here correspond to the same tracks you saw in the Genome Browser.

2) Select “Insect”, “D. melanogaster” and “Apr 2004” from the “clade”, “genome” and “assembly” pull-down menus.

3) Select “Expression and Regulation” and “FlyReg” from the “group” and “track” pull-down menus, respectively.

4) Click the radio button (the circle with a dot indicating which option is selected) next to “genome” as the “region” you will analyse. This will select the whole genome for inclusion in your analysis.

5) Select “Hyperlinks to Genome Browser” from the “output format” pull-down menu and click “get output”.  This will send you to a page with >1,300 hyperlinks that send you to all the annotated TFBS in Drosophila, each of which corresponds to one row in the FlyReg data track. [Note: this is a general method to export data from any track the whole genome or a specific region.]

6) Click the “Tables” link on the dark blue background at the top of the page to return to the Table Browser. We are now going to use the Table Browser to ask “how many TFBS in Drosophila are found in highly conserved sequences?” We will do this by using the Table Browser to overlap all of the TFBS with all of the Most Conserved segments of the genome.

7) Click the “create” button next to the “intersection” option. This will send you to a page where you can set conditions for the overlap analysis of the FlyReg TFBS with the Most Conserved regions.

8 ) Select “Comparative Genomics” and “Most Conserved” from the “group” and “track” pull-down menus. Click the “All FlyReg records that have at least X% overlap with Most Conserved”. Set the X% value to “100” and click “submit” to return to the main Table Browser page.

9) Notice that the Table Browser now shows the “intersection with phastConsElements15way” option is selected. Select “Hyperlinks to Genome Browser” from the “output format” pull-down menu and click “get output”.  This will send you to a page with hyperlinks that send you to all the annotated TFBS in Drosophila that are 100% contained in the Most Conserved regions of the genome. Click on a few links to convince yourself that this analysis has generated the correct results. At this point you have a great result – all fully conserved TFBS in Drosophila – but it is difficult to quantitatively summarize these data in the Table Browser, so let’s move on to Galaxy where analysis like this is a lot easier.

Quantifying TFBS conservation using Galaxy

1) Click the “Tables” link on the dark blue background at the top of the page to return to the Table Browser. Select “BED – browser extensible data” from the “output format” pull-down menu and check the “Galaxy” check box. Now click the “get output” button at the bottom of the page. This will send you to a detail page. Leave all the default options set here and click “Send query to Galaxy”.

2) This will launch a new Galaxy session as an anonymous user. [For more information on Galaxy, visit] The results of this Table Browser query will be executed as a Galaxy “analysis” which will appear in the right hand “History” pane of the Galaxy browser window as “1: UCSC Main on D. melanogaster: flyreg2 (genome)”. While the Table Browser query is running, this analysis will be grey in the history pane, but will turn green when it is completed.

3) When the analysis has completed, you will have created a new dataset numbered “1” in your Galaxy history. Click on the link entitled “1: UCSC Main on D. melanogaster: flyreg2 (genome)” in the history to reveal a dropdown which contains a snapshot of this dataset. [This dataset should contain 533 regions. If it doesn’t stop here and get help before moving on.] Now click on the eye icon to reveal the contents of this data set in the middle pane of the Galaxy window. Use the scroll bar on the right hand side of the middle pane to browse the contents of this data set.

4) Click on the pencil icon to “Edit attributes” of this data set. In the middle pane replace “1: UCSC Main on D. melanogaster: flyreg2 (genome)” in the “Name” text box with something shorter and more descriptive like “conserved TFBS”. Click the “Save” button at the bottom of the middle pane.

5) Now let’s get the complete set of FlyReg TFBS by querying the UCSC Table Browser from inside Galaxy. Click “Get Data” under “Tools” in the left hand pane of the Galaxy page. This will explode a list of options to get data into Galaxy. Click “UCSC Main” which will bring up the Table Browser inside the middle pane of the Galaxy page. Click the “clear” button next to “intersection with phastConsElements15way:”. Make sure the “Galaxy” check box is selected and click “get output” and then click “Send query to Galaxy” on the next page. This will create a new dataset “2: UCSC Main on D. melanogaster: flyreg2 (genome)” which you can rename “all TFBS” as above using the pencil icon. [Note: this dataset should have 1,362 regions in it; if your does not, please stop and ask for help.]

6) You can now perform many different analyses on these datasets using the many “Tools” available in the left hand pane of the Galaxy window. Let’s start by summarizing how many TFBS are present in the “1: conserved TFBS” and “2: all TFBS” datasets. To do this, click the “Statistics” link on the left hand side, which will open up a set of other analysis tools including the “Count” tool. Click on the “Count” tool, and in the middle pane select “1: conserved TFBS” in the “from dataset” menu and click on “c4” to activate the counting to occur on column 4 containing the name of the TFBS in the “1: conserved TFBS” dataset. Repeat this analysis for the “2: all TFBS” dataset. This should lead to two more datasets of 70 and 88 lines, respectively, which you should again rename to something more meaningful than the default values, such as “3: conserved TFBS counts” and 4: all TFBS counts”.

7) Now let’s use the “Join, Subtract and Group->Join two Datasets” tool to join the results of the two previous analyses into one merged dataset. Click “Join, Subtract and Group->Join two Datasets”, select “4: all TFBS counts” in the “Join” drop-down menu, “c2” for the “using column” drop-down menu, “3: conserved TFBS counts” in the “with” drop-down menu and “c2” for the “and column” drop-down menu. Set the “Keep lines of first input that do not join with second input:”, “Keep lines of first input that are incomplete:”, and “Fill empty columns:” drop-down menus to “yes”. Setting the ” Fill empty columns” menu to yes will pull up a few more menus, which should be set as follows: “Only fill unjoined rows:” set to “Yes”; “Fill Columns by:” set to “single Fill Value”, and “Fill value” to “0”. Now click “Execute”. What you have just achieved is one of the trickier basic operations on bioinformatics, and is the underlying process in most relational database queries.  Pat yourself on the back!

8 ) Now let’s try to do some science and ask the question: “what is the proportion of conserved TFBS for each transcription factor?” This will give us some insight into whether TFBS turnover is the same for all TFs or might be higher or lower for some TFs. To do this, use the “Text manipulation->compute” Tool and set the “Add expression” box to “1-((c1-c3)/c1)” and “as a new column to:” to “5: Join two Datasets on data 3 and data 4” and click “Execute”. This will add a new column to the joined dataset with the proportion of conserved TFBS for each transcription factor.

For Further Exploration…

1) Format the output of your last analysis in a more meaningful manner using the “Filter and Sort->Sort” tool.

2) Plot the distribution of the proportion of conserved TFBS per TF using the “Graph/Display Data->Histogram” tool.

3) Go back to the original TFBS datasets and derive new datasets to investigate if different chromosomes have different rates of TFBS evolution?

4) Develop your own question about TFBS evolution, create a custom analysis pipeline in Galaxy and wow us with your findings.

Compiling Nikolaus Rajewsky’s Ahab on OSX

Posted 11 Mar 2009 — by caseybergman
Category genome bioinformatics, OSX hacks, regulatory sequences

One of the most challenging areas of genome bioinformatics is the de novo prediction of regulatory sequences that control gene expression. This is especially true for fully functional cis-regulatory elements that can act far from their target gene, such as enhancers. The most frequently used approach to predict enhancers is to find regions of the genome with a high density of matches to the recognition sequences of known transcription factors. There are a large number of papers describing bioinformatics methods for enhancer prediction using this strategy, including Martin Frith‘s pioneering method Cister.

One of the most efficient and easy to use command line programs for detecting clusters of transcription factor binding sites is Nikolaus Rajewsky‘s program Ahab, which you can obtain by contacting Nikolaus directly.

I had trouble getting Ahab to run on OSX, partly because in the distribution Nikolaus sent there was no Makefile provided, but found that it can be compiled with the following modifications. The underlying cause of the problem is a conflict between a system wide definition of fmin and that provided by the Numerical Recipes in C nr.h file (see discussion here). To fix this and compile, our sysadmin Nick Gresham has developed the following solution:

1) change line 182 and line 640 in the file source/nr.h from:

float fmin(float x[]);
float fmin();
float nr_fmin(float x[]);
float nr_fmin();

2) create a Makefile with the following commands and make.

ALL_PROGRAMS = module_fit module_prof

FIT_OBJS = dbrent.o df1dim.o dlinmin.o f1dim.o frprmn.o mnbrak.o Wtmx.o readFas.o
PROF_OBJS = Profile.o readFas.o


.PHONY : all

module_fit : $(FIT_OBJS)

$(LD) $(LDFLAGS) $^ $(LIBS) -o $@

module_prof : $(PROF_OBJS)

$(LD) $(LDFLAGS) $^ $(LIBS) -o $@


rm -rf *.o $(ALL_PROGRAMS)

3) change the $exec and $exec_profile variables in perl/ to ‘module_fit’ & ‘module_prof’, respectively.

Notes: This solution works on a 2.4 Ghz Intel Core 2 Duo Macbook running Mac OS 10.5.6 using i686-apple-darwin9-gcc-4.0.1. Tabs in the makefile have been faked using the indentation function in the wordpress editor.