This pipeline takes automated microscopy analysis outputs and reshapes them into a form suitable for clustering or classification. The raw microscopy output file contains a list of auto-detected feature values, both per-well averages and per-cell values. A plate typically contains both experimental and control wells. The full distribution of feature values across cells is more informative than the average alone (consider, for example, a bimodal distribution with the same mean as a unimodal one), so we want a measure of the distance between the entire experimental and control distributions for each feature. This pipeline implements several such distance measures. The most common measure of distance between two sample distributions is the Kolmogorov-Smirnov statistic. In some tests, however, this measure did not perform well for clustering. A measure that worked much better was to bin the values and sum the squared differences between the control and experimental histograms defined by those bins.
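The histogram-difference measure described above can be sketched in a few lines. This is an illustrative Python sketch, not the pipeline's actual (Groovy) implementation; equal-width bins over the combined range and frequency-normalized histograms are assumptions made here so that wells with different cell counts stay comparable:

```python
def histdiff(control, experimental, num_bins=20):
    """Sum of squared differences between binned feature distributions.

    Sketch only: bin layout and normalization are assumptions,
    not taken from the pipeline's implementation.
    """
    lo = min(min(control), min(experimental))
    hi = max(max(control), max(experimental))
    width = (hi - lo) / num_bins or 1.0  # guard against zero range

    def hist(values):
        # Count values per bin, clamping the max value into the last bin.
        counts = [0] * num_bins
        for v in values:
            i = min(int((v - lo) / width), num_bins - 1)
            counts[i] += 1
        total = float(len(values))
        return [c / total for c in counts]  # normalize to frequencies

    c, e = hist(control), hist(experimental)
    return sum((ci - ei) ** 2 for ci, ei in zip(c, e))
```

Note how two samples with the same mean but different shapes (e.g. unimodal vs. bimodal) yield a nonzero distance, which is exactly what a per-feature average would miss.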
The scripts are pretty rough around the edges. Apologies for that, but I ran out of time to polish them. Everything is set up on cscenter under user halo384 to run the pipeline, so you should be able to skip to the section "Running" and follow the steps there. I did my initial runs on Mac OS X; for that you will need to read all the information on this page.
You will need JDK 1.6 or later and Groovy 1.7 or later. To compile the code you will need a *recent* version of ant installed. On cscenter, as halo384, Java is already installed. Groovy has been downloaded into ~/bin/ and a symlink created so that ~/bin/groovy/ points to the current version. Several environment variables are set in .bashrc (bash!, not csh) to put Groovy on your path and improve Java performance:
# Groovy setup
export GROOVY_HOME=/home/halo384/bin/groovy
export PATH=$PATH:$GROOVY_HOME/bin
# Java options to optimize long-running large RAM processes...
export JAVA_OPTS="-Xms2500m -Xmx2500m -server"
On cscenter as user halo384, durbinlib.jar and its dependencies are already installed in ~/.groovy/lib/. Instructions for installing from source appear later in this document.
R is also already installed on cscenter machines in /usr/bin/R. R is only used to produce clusters and heatmaps. The pipeline needs a couple of libraries from Bioconductor in order to produce heatmaps. If you don't have write access to the global R installation, you can install them in a local library directory by specifying a lib parameter to the biocLite installation script. For example, I installed the two Bioconductor libraries I needed with:
R
> source("http://bioconductor.org/biocLite.R")
> biocLite("ctc", lib="~/R/x86_64-redhat-linux-gnu-library/2.9/")
> biocLite("gplots", lib="~/R/x86_64-redhat-linux-gnu-library/2.9/")
This installs Bioconductor from source, so there will be a fair bit of compiling. On Mac OS X, the GUI package manager suffices: simply search for and install bioconductor, ctc, and gplots. In addition, R must be told to look for these local library installations. I have done this for halo384 users by putting the following in .bashrc:
export R_LIBS_USER="~/R/x86_64-redhat-linux-gnu-library/2.9/"
Create a directory (e.g. IXM) with subdirectories heatmap_cdt and heatmap_pdf. Create a symlink to /home/halo384/lokeycyto/jamespipe/scripts. Copy or symlink the zipped raw run files (e.g. IXM405.zip) and the unzipped platemap file in CSV format (e.g. SP40013.csv) into this directory. Then run ./scripts/process.sh, passing the root name of the raw data file (e.g. IXM405), the name of the platemap file, the number of bins for the distribution histogram (e.g. 20), and the kind of measure to use (e.g. histdiff). For example, for IXM405 one could do the following:
{login as user halo384}
mkdir IXMTEST
cd IXMTEST
ln -s ~/cyto/IXM405.zip .
cp ~/cyto/CytoPlateMaps/SP40013.csv .
{edit headers to standard names, see note below}
ln -s ~/lokeycyto/jamespipe/scripts/ .
./scripts/process.sh IXM405 SP40013.csv 20 histdiff
After about 5 minutes, the pipeline should finish, and you should find heatmaps in PDF format in heatmap_pdf and in TreeView (CDT) format in heatmap_cdt. The processed data will be in the file IXM405.histdiff.csv, a matrix of Compound_Concentration values by Feature. Each entry in this table is the measure (e.g. histdiff) between the control and experimental distributions for that feature. The file IXM405.merged.csv is also saved. From it you can generate Compound_Concentration x Feature tables for other measures (e.g. histdiff, ksprob, logdiff). It is saved only as a convenience, since for large files it can take up to 30 minutes to regenerate.
IMPORTANT NOTES:
The platemap files have so far arrived in inconsistent formats: some are Excel spreadsheets, some are CSV files, and the header names are not always the same. None of these scripts rely on particular column positions, but they do rely on particular column NAMES. For platemap files, the relevant column names are Well, MoleculeID, and Concentration; all other columns are ignored. These three column names must match exactly, so until this becomes standardized you will have to export a CSV file from Excel and/or edit these three column headings to match these names (e.g. 'Well Name' ==> 'Well').
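Until the headers are standardized, normalizing them can also be scripted. The helper below is hypothetical, not part of the pipeline; the only mapping taken from the notes above is 'Well Name' -> 'Well', and you would extend HEADER_MAP for whatever variants your own platemap files contain:

```python
import csv

# Map nonstandard headers to the names the pipeline expects.
# Only 'Well Name' -> 'Well' comes from the notes above; extend as needed.
HEADER_MAP = {"Well Name": "Well"}

def normalize_headers(in_path, out_path):
    """Rewrite a platemap CSV with standardized column names."""
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader)
        # Rename known variants; pass unknown headers through untouched.
        writer.writerow([HEADER_MAP.get(h, h) for h in header])
        writer.writerows(reader)
```

This leaves the ignored columns alone, which is fine since the scripts only look for the three names listed above.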
The software is currently installed on halo384@cscenter. There are two components to the software: a set of scripts and a compiled library. The scripts consist of one shell script, one R script, and several scripts written in Groovy. The second component is a compiled set of Java libraries called durbinlib.jar, which itself has several .jar dependencies. (durbinlib.jar, by the way, contains MANY other things besides the cyto pipeline.) The scripts can be found already installed in:
/home/halo384/lokeycyto/jamespipe/scripts
durbinlib.jar and the other .jar dependencies can be found in:
/home/halo384/.groovy/lib/
By default, groovy includes all .jar files found in ~/.groovy/lib/ in its classpath, so you do not need to edit a classpath to pick up additional dependencies for Groovy scripts.
The scripts are under version control on GitHub at:
git://github.com/jdurbin/lokeycyto.git
You can obtain a copy of these scripts with:
git clone git://github.com/jdurbin/lokeycyto.git
A public key for halo384 has been added to GitHub, so someone logged in as halo384 can also push code back to GitHub. See the GitHub help pages for instructions on doing that.
The library, durbinlib.jar, is also under version control on GitHub at:
git://github.com/jdurbin/durbinlib.git
You can obtain a copy of this source from GitHub, along with all of its library dependencies, and compile it as follows:
mkdir ~/.groovy/lib
git clone git://github.com/jdurbin/durbinlib.git
cd durbinlib
ant install
This will compile durbinlib and copy durbinlib.jar and required dependencies to ~/.groovy/lib/.