Customizing a Local UCSC Genome Browser Installation II: The UCSC Genome Browser Database Structure

No Comments
Posted 29 Mar 2012 in database, drosophila, genome bioinformatics, linux, UCSC genome browser

This is the second of three posts on how to set up a local mirror of the UCSC genome browser hosting custom data. In the previous post, I covered how to perform a basic installation of the UCSC genome In this post, I’ll provide an overview of the mySQL database structure that serves data to the genome browser using the Bergman Lab mirror (http://genome.smith.man.ac.uk/) as an example. In the next post, I focus on how to load custom data into a local mirror.

General databases

hgcentral – this is the central control database of the browser. It holds information relating to each assembly available on the server, organizing them into groups according to their clade and organism. For instance the mammalian clade may contain human and mouse assemblies, and for mouse we may have multiple assemblies (e.g. mm9, mm8 etc.).

The contents of the clade, genomeClade, dbDb and defaultDb tables control the three dropdown menus on the gateway page of the browser. The clade and genomeClade tables have a priority field that controls the order of the dropdown menus (see figure 1).

The defaultDb table specifies which assembly is the default for a particular organism. This is usually the latest assembly e.g. mm9 for mouse. You can alter this if, for whatever reason, you want to use a different assembly as your default:

UPDATE defaultDb SET name = 'mm8' WHERE genome = 'mouse';

Databases for each organism/assembly are listed in the dbDb table within hgcentral (see more below). For example the latest mouse assembly (mm9) has an entry that contains information relating to this build, an example query (not retrieving all the information contained within dbDb) looks like this:

> select name,description,organism,sourceName,taxId from dbDb where name = 'mm9';</pre>
 +------+-------------+----------+---------------+-------+
 | name | description | organism | sourceName | taxId |
 +------+-------------+----------+---------------+-------+
 | mm9 | July 2007 | Mouse | NCBI Build 37 | 10090 |
+------+-------------+----------+---------------+-------+

hgFixed – this database contains a bunch of human related information, predominately microarray stuff. If you are running a human mirror then you may wish to populate this database. If not, then it can be left empty.

Genome databases

We will now look at the structure of a genome database for a specific organism/assembly using mm9 as an example.

The first table to consider is chromInfo, this simple table specifies the name of each chromosome, its size and the location of its sequence file. Mouse has 19 chromosomes (named chr1 – chr19) plus the two sex chromosomes (chrX and chrY), mitochondrial DNA (chrM) and unmapped clone contigs (chrUn). If we list the tables within our minimal mm9 install we will see that most of the tables can clearly be identified as belonging to a particular chromosome by virtue of their names, e.g. all the tables with names starting chr10_* contain data relating to chromosome 10.

Of the remainder, the two tables of most interest to understanding how the browser works, and leading us towards the ability to load custom data, are the grp and trackDb tables. These two tables control the actual genome browser display, as we discussed above for the hgcentral tables that control the gateway web interface. Not surprisingly, the trackDb table holds information about all the data tracks you can see (see figure 2), and grp tells the browser how all these tracks should be organized (e.g. we may want a separate group for different types of experimental data, another for predictions made using some computational technique etc.).

Figure 2 shows a screenshot of the mm9 browser we developed in the Bergman Lab as part of the CISSTEM project.

Here, we can see that the tracks are grouped into particular types (Mapping & Sequence Tracks, Genes and Gene Prediction tracks etc.) – these are specified in mm9’s grp table and, like the dropdowns on the gateway page, ordered using a priority field

> select * from grp order by priority asc
+-------------+----------------------------------+----------+
| name        | label                            | priority |
+-------------+----------------------------------+----------+
| user        | Custom Tracks                    |        1 |
| map         | Mapping and Sequencing Tracks    |        2 |
| genes       | Genes and Gene Prediction Tracks |        3 |
| rna         | mRNA and EST Tracks              |        4 |
| regulation  | Expression and Regulation        |        5 |
| compGeno    | Comparative Genomics             |        6 |
| varRep      | Variation and Repeats            |        7 |
| x           | Experimental Tracks              |       10 |
| phenoAllele | Phenotype and Allele             |      4.5 |
+-------------+----------------------------------+----------+

The tracks that constitute these groups are specified in the trackDb table, we can see the tracks for a particular group like so:

> select longLabel from trackDb where grp = 'map' order by priority asc;
+-----------------------------------------------------------------------------+
| longLabel                                                                   |
+-----------------------------------------------------------------------------+
| Chromosome Bands Based On ISCN Lengths (for Ideogram)                       |
| Chromosome Bands Based On ISCN Lengths                                      |
| Mapability - CRG GEM Alignability of 36mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 40mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 50mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 75mers with no more than 2 mismatches  |
| STS Markers on Genetic and Radiation Hybrid Maps                            |
| Mapability - CRG GEM Alignability of 100mers with no more than 2 mismatches |
| Physical Map Contigs                                                        |
| Assembly from Fragments                                                     |
| Gap Locations                                                               |
| BAC End Pairs                                                               |
| Quantitative Trait Loci From Jackson Laboratory / Mouse Genome Informatics  |
| GC Percent in 5-Base Windows                                                |
| Mapability or Uniqueness of Reference Genome                                |
+-----------------------------------------------------------------------------+

These two tables provide information to the browser to know which groups to display, which tracks belong to a particular group, and in which order they should be displayed. The remaining fields in the trackDb table fill in the gaps: what sort of data the track contains and how the browser is to display it. I won’t say anymore about trackDb here – we will naturally cover the details of how trackDb functions when we talk about loading custom data in the next post.

You will probably have noticed that there is not an exact equivalence between the query above and the tracks in the browser (e.g. the query returns a number of Mapability tracks whereas there is only a single Mapability dropdown in the browser). This is because some tracks are composite tracks — if you click Mapability in the browser you will see that the multiple tracks are contained within it. Composite tracks are specified in trackDb, we will investigate this in the next part of this blog post.

[This tutorial continues in Part III: Loading Custom Data]


Add Your Comment