Archive for the ‘OSX hacks’ Category

Formatting Figures for PLoS ONE with Illustrator on OSX

Posted 30 Aug 2012 — by caseybergman
Category OSX hacks

We in the Bergman Lab are big supporters of the Public Library of Science, and increasingly have been submitting papers to PLOS ONE over the last few years because we believe this journal represents the true values of science: openness, technical rigor and objectivity. Because PLOS ONE uses a streamlined production process, their author guidelines are very strict, with article formatting responsibilities falling on the author that would be traditionally handled with the help of a copy editor.

One area of PLOS ONE article formatting that I have found particularly difficult in the past is getting figures to the exact specifications that pass the automated checks of the PLOS ONE Editorial Manager system. Personally, I find the guidelines for figure preparation for PLOS ONE to be somewhat bewildering. In my first few submissions, I wasted substantial time uploading files that failed the automated checks, and/or I had nerve-wracking requests to change figure formats at the post-acceptance stage, where I did not get another chance to look at the manuscript before it went live. After some trial and error, I have gotten my head around what actually works to prepare figures for a trouble-free PLOS ONE submission. So to save a headache and speed up the publication process for one or more scientists out there (and so I have access to these notes out of the office), I’ve typed up a protocol that should work for preparing PLOS ONE-ready figures using Illustrator on OSX.

1. Prepare your figure in your favorite software (R, Illustrator, etc.) and Save.

2. Open/Import into Illustrator.

3. In Illustrator, under the “File” menu, select “Export…”. You will now see a window entitled “Export”.

4. Select “TIFF (tif)” from the “Format” dropdown menu. You should see something like this:

5. Click “Export”. You will now see a window entitled “TIFF Options”.

6. Set the “Color Model” drop-down menu to “RGB”.

7. Click the “Other” radio button for the “Resolution” setting and set to 500 dpi.

8. Select the “Anti-Aliasing” check-box.

9. Select the “LZW Compression” check-box. At this point you should see a screen something like this:

10. Click “OK”.

You should now have a 500 dpi .tif file that is ready to upload with minimal (to no) complaints by the PLOS ONE Editorial Manager, and hopefully your next open-access manuscript will be speeding to publication soon.

Notes: This recipe was developed using Illustrator 11.0.0 on a MacBook Air running OSX 10.6.8. Please don’t laugh at my ancient software – I find upgrading is the enemy of efficiency.

Installing libsequence and analysis tools on OSX

After being relegated to the margins of biology for almost a century, the field of population genetics is now moving closer to the center of mainstream biology in the era of next-generation sequencing (NGS). In addition to providing genome-wide DNA variation data to study classical questions in evolution that are motivated by population genetic theory, NGS now permits population-genetic based techniques like GWAS (genome-wide association studies) and eQTL (expression quantitative trait locus) analysis to allow biologists to identify and map functional regions across the genome.

Handling big data requires industrial strength bioinformatics tool-kits, which are increasingly available for many aspects of genome bioinformatics (e.g. the Kent source tree, SAMtools, BEDtools, etc.). However, for molecular population genetics at the command line, researchers have a more limited palette that can be used for genome-scale data, including: VariScan, the PERL PopGen modules, and the libsequence C++ library.

Motivated by a recent question on BioStar and a request from a colleague to generate a table of polymorphic sites across a bacterial genome, I’ve recently had a play with the last of these — Kevin Thornton’s libsequence and accompanying analysis toolkit — which includes a utility called compute which is a “mini-DNAsp for the Unix command-line.” As ever, getting this installed on OSX was more of a challenge than desired, with a couple of dependencies (including Boost and GSL), but in the end was do-able as follows:

$ wget http://sourceforge.net/projects/boost/files/boost/1.47.0/boost_1_47_0.tar.gz
$ tar -xvzf boost_1_47_0.tar.gz
$ cd boost_1_47_0
$ ./bootstrap.sh
$ sudo ./b2 install
$ cd ..

$ wget http://molpopgen.org/software/libsequence/libsequence-1.7.3.tar.gz
$ tar -xvzf libsequence-1.7.3.tar.gz
$ cd libsequence-1.7.3
$ ./configure
$ make
$ sudo make install
$ cd ..

$ wget ftp://ftp.gnu.org/gnu/gsl/gsl-1.15.tar.gz
$ tar -xvzf gsl-1.15.tar.gz
$ cd gsl-1.15
$ ./configure --disable-shared --disable-dependency-tracking
$ make
$ sudo make install
$ cd ..

$ wget http://molpopgen.org/software/analysis/analysis-0.8.0.tar.gz
$ tar -xvzf analysis-0.8.0.tar.gz
$ cd analysis-0.8.0
$ ./configure
$ make
$ sudo make install
$ cd ..

With this recipe, compute and friends should now be happily installed in /usr/local/bin/.
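
A quick way to confirm the install is to ask the shell whether the programs are visible on your PATH. The following is a minimal sketch, not part of the original recipe: `report_missing` is a hypothetical helper name, and `compute` is the only program name taken from the text above, so add the other analysis programs you care about.

```shell
# Report which of the named commands cannot be found on PATH.
# Returns 0 if all are present, 1 otherwise.
report_missing() {
    rc=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "missing: $tool"
            rc=1
        fi
    done
    return $rc
}

# compute is named in the post; list further programs after it as needed.
report_missing compute || echo "check that /usr/local/bin is on your PATH"
```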

Notes: This protocol was developed on a MacBook Air Intel Core 2 Duo running OSX 10.6.8.

Running Taverna workflows in Galaxy on OSX

Posted 28 Jul 2011 — by caseybergman
Category galaxy, OSX hacks, taverna

Recently I’ve been bitten by the Galaxy bug, primarily because I needed a mechanism this year to supervise final year undergraduate projects of students without a strong background in bioinformatics. This was a great success, since students seem to pick up the interface really easily and I was able to track and comment on their progress explicitly via shared histories and workflows.

Because of this experience, I’ve become much more interested in using workflow systems to run and manage the bioinformatics pipelines in my research projects rather than relying on READMEs and UNIX shell scripts. Recent news that Kostas Karasavvas from NBIC has developed eGalaxy, a mechanism to run Taverna 2 workflows using Galaxy, is in my view a game-changer for the more widespread use of workflows by practicing bioinformaticians like myself, since it will permit mash-ups between the two main workflow systems and allow the large pre-established library of Taverna workflows in myExperiment to be deployed in a local Galaxy installation.

The easiest way of getting a Taverna workflow running in Galaxy is to search myExperiment for Taverna 2 workflows, and click the “Download Workflow as a Galaxy tool” button in the “Download” section of the page. This will send you to a “Galaxy tool download” page with instructions on how to get the Taverna workflow installed as a tool in Galaxy. The instructions are a bit spare at the moment and require familiarity with installing Galaxy locally and adding tools to a local Galaxy installation. They also only have installation notes for Debian-based systems, but with the help of Rob Haines from the Taverna team, I’ve been able to get a stable protocol working for OSX as well.

To give a bit more context, Taverna workflows are run in Galaxy as Ruby scripts that are added to your Galaxy tools directory like any other custom tool. Executing the Ruby script tool launches a connection to a remote Taverna 2 server, where the workflow is run. Results are then returned to the Ruby script and thence to Galaxy. Like all Galaxy tools, installing a Taverna tool requires the tool itself (a script or other executable program) and a description of the tool’s inputs/outputs in XML format to be placed in the “tools” directory, plus a notification to Galaxy that the tool exists in the tool_conf.xml file in the Galaxy main directory.

The Ruby script generated by myExperiment requires a few Ruby packages (aka “gems”) that are installed using RubyGems. Both Ruby and RubyGems are installed by default on OSX (in /usr/bin) so your kit is nearly complete. The following steps should allow you to run a test Taverna workflow to make sure your configuration is working properly on an OSX 10.6 machine. To help consolidate install notes for the entire process in one place, I’m copying the key steps for a local Galaxy installation here as well.

1) Install the Mercurial version control system for OSX from here, and make sure /usr/local/bin/ is in your path.

2) Check out the Galaxy codebase using Mercurial in your home directory ($HOME):

$ hg clone https://bitbucket.org/galaxy/galaxy-dist/

3) Create a Taverna tools directory in your Galaxy distribution:

$ mkdir $HOME/galaxy-dist/tools/tavernaTools

4) Install the gems needed for the Taverna tool to run. The critical gems to install are t2-server (which is needed to connect to the Taverna server that runs the workflow) and rubyzip (which is needed for compression of Galaxy results). Installation of t2-server will automatically install the libxml-ruby and hirb gems it is dependent on. libxml-ruby calls on the libxml2 C XML parser, which is also installed by default on OSX in /usr/include/libxml2/.

$ sudo gem install t2-server
$ sudo gem install rubyzip

5) Select a Taverna 2 workflow from myExperiment and download the Ruby script and XML file. For testing, use a workflow that does not require any input files, e.g. http://www.myexperiment.org/workflows/823/versions/1/galaxy_tool

6) Paste “http://test.mybiobank.org/taverna-server” into the “Taverna server URL:” textbox.

7) Click the “Download Galaxy tool” button, e.g. to your Downloads folder.

8) Unzip the Taverna 2 Galaxy tool and move the Ruby script and XML file into your Taverna tools directory, e.g.

$ unzip $HOME/Downloads/fetch_pdb_flatfile_from_rcsb_server_58764_galaxy_tool.zip -d $HOME/Downloads
$ mv $HOME/Downloads/fetch_pdb_flatfile_from_rcsb_server_58764_galaxy_tool.xml $HOME/galaxy-dist/tools/tavernaTools
$ mv $HOME/Downloads/fetch_pdb_flatfile_from_rcsb_server_58764_galaxy_tool.rb $HOME/galaxy-dist/tools/tavernaTools

9) Edit your tool_conf.xml file to include a new section for Taverna tools, e.g.

 <section name="Taverna Tools" id="tavernaTools">
    <tool file="tavernaTools/fetch_pdb_flatfile_from_rcsb_server_58764_galaxy_tool.xml" />
 </section>

 
10) Start the Galaxy server by running the run.sh script:

$ sh $HOME/galaxy-dist/run.sh

11) Open http://127.0.0.1:8080 in your web browser and you should see a “Taverna Tools” tool heading above the “Get Data” tools heading, which when clicked reveals the newly installed “Fetch PDB flatfile from RCSB server” tool. You can now run this Taverna tool like any other analysis tool in Galaxy. For this particular tool, this involves inputting a PDB id and clicking “execute”. Successful completion of the job should return a PDB file in the main Galaxy window.
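
If the new tool heading does not appear, one thing worth ruling out is a mismatch between the file names listed in tool_conf.xml (step 9) and the files actually moved into the tools directory (step 8). The sketch below is an illustration only: `missing_tool_files` is a hypothetical helper, and it assumes each `<tool>` element sits on its own line with a `file="..."` attribute given relative to the tools directory, as in the example section above.

```shell
# List tool XML files named in a Galaxy tool_conf.xml that are absent from the tools directory.
# Usage: missing_tool_files /path/to/tool_conf.xml /path/to/tools
# Assumes one <tool ... file="..."> element per line.
missing_tool_files() {
    conf="$1"
    tools_dir="$2"
    # Extract the value of the file="..." attribute from each <tool> line.
    sed -n 's/.*<tool[^>]*file="\([^"]*\)".*/\1/p' "$conf" |
    while IFS= read -r rel; do
        # Print only the entries that do not exist on disk.
        [ -f "$tools_dir/$rel" ] || echo "$rel"
    done
}

# Example (paths follow the layout used in this post):
#   missing_tool_files "$HOME/galaxy-dist/tool_conf.xml" "$HOME/galaxy-dist/tools"
```

No output means every listed tool file was found.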

In the early stages of testing this protocol, I selected a different Taverna workflow (916) that did not run out of the box and gave less-than-helpful error messages in Galaxy. Troubleshooting with Rob Haines pinpointed this problem to the http://test.mybiobank.org Taverna server not being able to execute aspects of this workflow. When tested on an alternate Taverna server, the workflow did run and completed with expected results. So if you experience a failed attempt at running a Taverna workflow on Galaxy, it may not have anything to do with your kit or the workflow in question.

From this initial experience, looking forward (from mid-2011) I’d like to see the eGalaxy system include a mechanism to generate tests for each Taverna tool automatically, either at the time of tool download from myExperiment or as a part of the Galaxy testing system. I’d also like to see an industrial-scale Taverna server hosted somewhere (preferably by the Galaxy or Taverna teams) so all Taverna tools can be used reliably out of the box on at least one tested server. In any event, I’m now convinced that the eGalaxy project is what it says on the tin, and can only improve with more folks trying it out and contributing feedback.

Notes: This protocol was developed on a MacBook Air Intel Core 2 Duo running OSX 10.6.8. Credits to Rob Haines for help troubleshooting and giving me a detailed walk-through of the mechanics of the Taverna-Galaxy integration, as well as to Kostas Karasavvas of NBIC for having the inspiration to initiate the eGalaxy project.

 

Converting a Mathematica Notebook to Run at the Command Line on OS X

Posted 09 Nov 2009 — by caseybergman
Category mathematica, OSX hacks

As part of a collaboration with Justin Blumenstiel at the University of Kansas, I’ve been getting reacquainted with Mathematica for the first time in many years. Although it is clearly one of the most powerful programming languages, I’ve always had difficulty getting comfortable with Mathematica since it is a heavily GUI-based system, and GUIs make me nervous since they don’t allow the reproducibility that I am accustomed to with command line systems. Mathematica does allow execution of code at the command line through the MathKernel, but after looking around I was not able to find an easy way to convert Mathematica code generated using the GUI to something that the kernel can run cleanly. Other solutions on the web suggest converting code to Initialization Cells and saving as a “package,” but this threw a bunch of errors for me. With the help of our local Mathematica guru, I’ve now got a reasonable (albeit manual) protocol that is relatively simple and hopefully works in other contexts.

The basic trick is to convert all code generated using the GUI utilities that have pretty mathematical notation into the InputForm of the code that is read by the kernel, then save as a text file for input to MathKernel.

1) Select cells with pretty expressions and click Cell > Convert to > InputForm.
2) Cut and paste the resulting InputForm code into a text file, e.g. script.m.
3) Repeat 1)-2) as needed.
4) Terminate the text file with:
Exit[]
5) Execute the script from the command line prompt with:
/Applications/Mathematica.app/Contents/MacOS/MathKernel -noprompt -run "<<script.m"

The MathKernel requires absolute file paths, whereas the GUI assumes file paths start in $HOME, so be sure to use absolute file paths in the original notebook. This approach should give equivalent results to what you would obtain through the GUI, which you can check by loading script.m and executing all the InputForm expressions in the GUI. I have, however, noticed very slight numerical differences (~15th significant figure) between the results from the pretty and input forms, which I’ll post more about if I find out the reason why.
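
Forgetting the final Exit[] in step 4 leaves the kernel waiting rather than returning to the shell, so it can be handy to add it automatically. This is a rough sketch under my own assumptions, not part of the original protocol: `ensure_exit` is a hypothetical helper name, and it only checks whether the last non-blank line of the script already contains Exit[].

```shell
# Append Exit[] to a MathKernel script unless its last non-blank line already contains it.
# Usage: ensure_exit script.m
ensure_exit() {
    script="$1"
    # Drop blank lines, then look at the last remaining line.
    last=$(sed -e '/^[[:space:]]*$/d' "$script" | tail -n 1)
    case "$last" in
        *'Exit[]'*) ;;                          # already terminated, nothing to do
        *) printf '\nExit[]\n' >> "$script" ;;  # leading newline guards against a missing final newline
    esac
}
```

Running `ensure_exit script.m` more than once is harmless, since the second call finds the existing Exit[] and leaves the file alone.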

Notes: Thanks to Michael Croucher for helping with this solution.

Configuring REannotate Apollo Tiers Files on OSX

Posted 23 Jun 2009 — by caseybergman
Category genome bioinformatics, OSX hacks, transposable elements

Vini Pereira has recently published a nice paper on his REannotate package for defragmenting pieces of dead and nested transposable elements (TEs) detected using RepeatMasker. Defragmentation is an important aspect of accurately annotating and data mining transposable elements (as Hadi Quesneville and I have discussed in our 2007 review article on TE bioinformatics resources).

One of the nice features of REannotate is the ability to output GFF files ready for import into the Apollo genome annotation and curation tool. Although there is excellent documentation for configuring Apollo, and REannotate provides a custom “tiers” file to properly display defragmented TE annotations, I struggled a bit to get these to work together on OSX based on the REannotate documentation. Partly my difficulty arose because the location of the “conf/” directory for the Apollo installation on OSX is not explicit, and in the end finding this directory provided two alternate solutions to this problem. Both solutions assume Apollo was installed via the .dmg file.

Solution 1: This solution should be platform independent and does not depend on the location of the conf/ directory.

$ cd $HOME/.apollo
$ wget -O ensj.tiers http://www.bioinformatics.org/reannotate/download/REannotate.tiers
Launch Apollo
Import REannotate GFF file and associated FASTA sequence.
Select File->Save Type Preferences...
Save As: ensj.tiers

Solution 2: This solution is OSX specific, and depends on the location of the conf/ directory.

$ cd $HOME/.apollo
$ wget -O REannotate.tiers http://www.bioinformatics.org/reannotate/download/REannotate.tiers
$ cat /Applications/Apollo.app/Contents/Resources/app/conf/ensj.tiers REannotate.tiers > ensj.tiers
Launch Apollo
Import REannotate GFF file and associated FASTA sequence.

One final note: I find it helpful to “grep -v un-RepeatMasked_sequence” the REannotate GFF files before loading into Apollo to get rid of non-RepeatMasked annotations.
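
The filtering step can be wrapped so that it also reports how many lines survived, which makes it easy to spot a file where nothing (or everything) was removed. This is a small sketch of my own: `filter_gff` and both file names are placeholders, and the only thing taken from the note above is the `grep -v un-RepeatMasked_sequence` pattern.

```shell
# Remove un-RepeatMasked_sequence lines from a REannotate GFF file and report what was kept.
# Usage: filter_gff input.gff output.gff   (both file names are placeholders)
filter_gff() {
    src="$1"
    dst="$2"
    # grep -v exits with status 1 when every line is filtered out; treat that as success,
    # but let real errors (status 2, e.g. an unreadable input file) propagate.
    grep -v 'un-RepeatMasked_sequence' "$src" > "$dst" || [ $? -eq 1 ] || return 2
    # wc pads its output with spaces on OSX, so strip them before printing.
    kept=$(wc -l < "$dst" | tr -d ' ')
    total=$(wc -l < "$src" | tr -d ' ')
    echo "kept $kept of $total lines"
}
```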

Compiling UCSC Source Tree Utilities on OSX

Posted 12 Mar 2009 — by caseybergman
Category genome bioinformatics, OSX hacks, UCSC genome browser

The UCSC genome bioinformatics site is widely regarded as one of the most powerful bioinformatics portals for experimental and computational biologists on the web. The ability to visualize genomics data through the genome browser and perform data mining through the table browser, coupled with the ability to easily import custom data, permits a large range of possible genome-wide analyses to be performed with relative ease. One of the limitations of web-based access to the UCSC genome browser is the inability to automate your own analyses, which has led to the development of systems such as Galaxy, which provide ways to record and share your analysis pipeline.

However, for those of us who would rather type than click, another solution is to download the source code (originally developed by Jim Kent) that builds and runs the UCSC genome browser and integrate the amazing set of stand-alone executables into your own command-line workflows. As concisely summarized by Peter Schattner in an article in PLoS Computational Biology, the Source Tree includes:

“programs for sorting, splitting, or merging fasta sequences; record parsing and data conversion using GenBank, fasta, nib, and blast data formats; sequence alignment; motif searching; hidden Markov model development; and much more. Library subroutines are available for everything from managing C data structures such as linked lists, balanced trees, hashes, and directed graphs to developing routines for SQL, HTML, or CGI code. Additional library functions are available for biological sequence and data manipulation tasks such as reverse complementation, codon and amino acid lookup and sequence translation, as well as functions specifically designed for extracting, loading, and manipulating data in the UCSC Genome Browser Databases.”

Compiling and installing the utilities from the source tree is fairly straightforward on most Linux systems, although my earliest attempts to install on a PowerPC OSX machine failed several times. The problems relate to building some executables around MySQL libraries, which I never fully sorted out, but I’ve now gotten a fairly robust protocol for installation on an i386 OSX machine. These instructions are adapted from the general installation notes in kent/src/README.

1) Install MySQL (5.0.27) and MySQL-dev (3.23.58) using fink.

2) Install libpng. [Note: my attempts to do this via Fink were unsuccessful.]

3) Obtain and Make Kent Source Tree Utilities

$ wget http://hgdownload.cse.ucsc.edu/admin/jksrc.zip
$ unzip jksrc.zip
$ mkdir -p $HOME/bin/i386
$ sudo mkdir /usr/local/apache/
$ sudo mkdir /usr/local/apache/cgi-bin-yourusername
$ sudo chown -R yourusername /usr/local/apache/cgi-bin-yourusername
$ sudo mkdir /usr/local/apache/htdocs/
$ sudo chown -R yourusername /usr/local/apache/htdocs
$ export PATH=$PATH:$HOME/bin/i386

[Note: it is necessary to add the bin directory to your path before making, since some parts of the build require executables that are put there earlier in the build.]

$ export MACHTYPE=i386
$ export MYSQLLIBS="/sw/lib/mysql/libmysqlclient.a -lz"
$ export MYSQLINC=/sw/include/mysql
$ cd kent/src/lib
$ make
$ cd ../jkOwnLib
$ make
$ cd ..
$ make

These instructions should (hopefully) cleanly build the code base that runs a mirror of the UCSC genome browser, as well as the ~600 utilities, including my personal favorite, overlapSelect (which I plan to write more about later).
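
If you rebuild often, or put the export lines in a shell startup file, the PATH step above will keep appending the same directory. One possible way around that, sketched here with a hypothetical helper name `path_append`, is to add the directory only when it is not already listed.

```shell
# Append a directory to PATH only if it is not already listed.
# Usage: path_append /some/directory
path_append() {
    # Wrapping both sides in colons lets the pattern match whole entries only.
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, leave PATH alone
        *) PATH="$PATH:$1" ;;
    esac
    export PATH
}

# Same directory as in the build instructions above.
path_append "$HOME/bin/i386"
```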

Notes: This solution works on a 2.4 GHz Intel Core 2 Duo MacBook running Mac OS 10.5.6 using i686-apple-darwin9-gcc-4.0.1. Thanks go to Max Haeussler for tipping me off to the Source Tree and the power of overlapSelect. This protocol was updated 19 March 2011 and works on the 9 March 2011 UCSC jksrc.zip file.

Compiling Nikolaus Rajewsky’s Ahab on OSX

Posted 11 Mar 2009 — by caseybergman
Category genome bioinformatics, OSX hacks, regulatory sequences

One of the most challenging areas of genome bioinformatics is the de novo prediction of regulatory sequences that control gene expression. This is especially true for fully functional cis-regulatory elements that can act far from their target gene, such as enhancers. The most frequently used approach to predict enhancers is to find regions of the genome with a high density of matches to the recognition sequences of known transcription factors. There are a large number of papers describing bioinformatics methods for enhancer prediction using this strategy, including Martin Frith’s pioneering method Cister.

One of the most efficient and easy to use command line programs for detecting clusters of transcription factor binding sites is Nikolaus Rajewsky’s program Ahab, which you can obtain by contacting Nikolaus directly.

I had trouble getting Ahab to run on OSX, partly because in the distribution Nikolaus sent there was no Makefile provided, but found that it can be compiled with the following modifications. The underlying cause of the problem is a conflict between a system wide definition of fmin and that provided by the Numerical Recipes in C nr.h file (see discussion here). To fix this and compile, our sysadmin Nick Gresham has developed the following solution:

1) change line 182 and line 640 in the file source/nr.h from:

float fmin(float x[]);
float fmin();
to:
float nr_fmin(float x[]);
float nr_fmin();

2) create a Makefile with the following commands and make.
CC=gcc
LD=$(CC)
CFLAGS=-O2
LDFLAGS=-lm

ALL_PROGRAMS = module_fit module_prof

FIT_OBJS = dbrent.o df1dim.o dlinmin.o f1dim.o frprmn.o mnbrak.o Wtmx.o readFas.o
PROF_OBJS = Profile.o readFas.o

all : $(ALL_PROGRAMS)

.PHONY : all

module_fit : $(FIT_OBJS)
	$(LD) $(LDFLAGS) $^ $(LIBS) -o $@

module_prof : $(PROF_OBJS)
	$(LD) $(LDFLAGS) $^ $(LIBS) -o $@

clean:
	rm -rf *.o $(ALL_PROGRAMS)

3) change the $exec and $exec_profile variables in perl/run_module_fits.pl to ‘module_fit’ & ‘module_prof’, respectively.
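
The edit to nr.h in step 1 can also be made non-interactively. What follows is a sketch under stated assumptions rather than Nick's actual method: `rename_fmin` is a hypothetical helper, and it assumes the two declarations begin at the start of a line exactly as shown above (`float fmin(`), so check the result by eye before building.

```shell
# Rename the Numerical Recipes fmin declarations in a header to nr_fmin.
# Usage: rename_fmin source/nr.h
rename_fmin() {
    header="$1"
    # Keep a copy of the untouched header next to it.
    cp "$header" "$header.orig" || return 1
    # Write to a temporary file first, because in-place editing flags
    # differ between the BSD sed shipped with OSX and GNU sed.
    sed -e 's/^float fmin(/float nr_fmin(/' "$header" > "$header.tmp" &&
    mv "$header.tmp" "$header"
}
```

Because the pattern requires the opening parenthesis right after `fmin`, other declarations whose names merely start with `fmin` are left alone.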

Notes: This solution works on a 2.4 GHz Intel Core 2 Duo MacBook running Mac OS 10.5.6 using i686-apple-darwin9-gcc-4.0.1. Command lines under each target in a makefile must begin with a tab character; the WordPress editor does not reliably preserve tabs, so check the indentation after copying.

Compiling Jody Hey’s Multilocus HKA on OSX

Posted 09 Mar 2009 — by caseybergman
Category molecular evolution, OSX hacks, population genomics

The Hudson, Kreitman, Aguade (HKA) test is one of the most widely used tools for testing the neutral theory of molecular evolution, combining information from both polymorphism and divergence among closely related species. The HKA test classically was applied to two loci, but can be extended to multiple loci as well, using software such as Jody Hey’s Multilocus HKA program.

Unfortunately the source code for this software has a couple of Windows-specific features that make it difficult to compile on OS X. To get it to compile on OS X (and Linux) our sysadmin Nick Gresham has developed the following solution:

1) change the following .h and .c filenames in the source code distribution to lowercase:

$ mv HKAEXP.C hkaexp.c
$ mv HKAFUNCS.H hkafuncs.h
$ mv HKADIST.C hkadist.c
$ mv HKAEXTRN.H hkaextrn.h
$ mv HKAPRAM1.C hkapram1.c
$ mv HKA.C hka.c
$ mv HKASIMB.C hkasimb.c
$ mv HKAFILE.C hkafile.c
$ mv SETS192.H sets192.h
$ mv HKAPREP.H hkaprep.h

2) change line 44 of hkafile.c to the following:

double gamma(double x);

3) create a Makefile with the following commands and make.

CC = gcc
CFLAGS = -O2
ALL_CFLAGS = -I. $(CFLAGS)
LDFLAGS = -lm

all: hka

hka: hkadist.o hkaexp.o hkafile.o hkapram1.o hkasimb.o hka.o

.c.o:
	$(CC) $(ALL_CFLAGS) -c $< -o $@

clean:
	-rm *.o *~ core hka
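
The ten renames in step 1 can be done in one pass with a loop. This is a sketch of an alternative to typing each mv by hand, not Nick's original approach: `lowercase_sources` is a hypothetical helper, written for sh or bash, and it touches only files ending in .C or .H in the current directory.

```shell
# Rename every upper-case .C and .H file in the current directory to lower case.
# Usage: cd into the source directory, then run lowercase_sources
lowercase_sources() {
    for f in *.C *.H; do
        [ -e "$f" ] || continue          # skip a pattern that matched nothing
        lower=$(printf '%s' "$f" | tr '[:upper:]' '[:lower:]')
        # Only rename when the lower-case name actually differs.
        [ "$f" = "$lower" ] || mv "$f" "$lower"
    done
}
```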

Notes: This solution works on a 2.4 GHz Intel Core 2 Duo MacBook running Mac OS 10.5.6 using i686-apple-darwin9-gcc-4.0.1. Command lines under each target in a makefile must begin with a tab character; the WordPress editor does not reliably preserve tabs, so check the indentation after copying.