Prepare Bioconductor

This lab uses Bioconductor. You can either use an SOE computer that has R 2.5 installed, such as swingdance.cse.ucsc.edu, or you can install Bioconductor on your personal computer and run it from there. Check out a tutorial (PDF) if you need an introduction or refresher with R.

Using swingdance

In order to let R know where Bioconductor is installed, you must set the environment variable R_LIBS to point to /cse/classes/bme211/Spring08/Rlib. If you are using csh (the SOE default), add this line to the end of ~/.cshrc:

setenv R_LIBS /cse/classes/bme211/Spring08/Rlib

If you are using bash, add this line to the end of your ~/.bashrc:

export R_LIBS=/cse/classes/bme211/Spring08/Rlib

Installing Bioconductor on your personal computer

If R is not yet installed on your computer, download and install it from the R homepage. Once you have R, launch it, and then issue the following three commands:

source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite("GEOquery")

This will download the base Bioconductor packages, and the GEOquery package which we use to obtain the data for this lab, and can take several minutes to half an hour.

Get Data

The 2005 Zapala et. al study profiled the gene expression of many parts of the mouse nervous system and some other tissues. They deposited their expression data in the Gene Expression Omnibus database.

We will use Bioconductor to download the data from the Zapala et al study and convert it into a Bioconductor ExpressionSet object. Read through the tutorial for the GEOquery (pdf) library. In particular, look at the getGEO() and GDS2eSet? () functions, and use them to load the data from Zapala dataset and then convert to an ExpressionSet object. It will take a few minutes (3-7) to download and parse the data.

1. How many features are in the microarray data that you loaded? How many samples are there?

Now's a good time to save your progress, using save.image(), so that you can easily come back if needed.

Subset the data to muscle and cerebellum

The ExpressionSet class contains three types of data about a microarray dataset:

  • phenoData - metadata about each microarray sample, e.g. stress conditions, genetic strain, tissue, etc.
  • fetaureData - metadata about the microarray probes, e.g. gene name, alternative mappings, etc.
  • assayData - the experimental data

For information about the microarrays, use phenoData(eset), which returns another Bioconductor class, AnnotatedDataFrame. To see the classes of information, use varLabels(phenoData(eset)). Each of these vectors of metadata can be accessed off the phenoData as in a list. For example, phenoData(eset)$strain will give the genetic strain of mouse for each microarray in our dataset. For convenience, this can also be accessed from the original ExpressionSet, so eset$strain will return the same vector as phenoData(eset)$strain.

2. Based on your eSet, what classes of metadata did Zapala et al annotate for each of the microarrays?

An ExpressionSet object can be subsetted just as they are in a matrix, e.g. eset[1:5,1:5]. Use the phenoData information to subset your data to just skeletal muscle and cerebellum.

3. How many muscle and how many cerebellum samples are in your subsetted data?

Use featureData(eset) to access the probe metadata. As with phenoData(), featureData(eset) returns an AnnotatedDataFrame, and the varLabels() function reveals the metadata categories. Later, we are going to perform an analysis using sets of Entrez Gene IDs, so we want to find out which probes are mapped to more than one Entrez Gene so that we can later remove those genes from consideration. Which field in the featureData contains the Entrez Gene ID? Investigate how the mapping has been represented. Some helpful functions for this are head(), tail(), and page(..., method="print").

4. What type of object is the Entrez Gene ID Mapping?

Find out which features map to more than one Entrez Gene ID, and subset the ExpressionSet object to remove those features. Also, create a vector that corresponds to each probes Entrez Gene ID for future use.

To access the assayData of ExpressionSet eset, use exprs(eset), which will return a normal matrix. Missing data is represented by NA. To simplify future analysis, let's take a very strict approach and remove any probes with missing data in these two tissues. Use is.na() to count the missing spots in each row, and filter out rows with any.

5. How many features are left in the dataset?

Calculate per-probe differential expression

Create a function, robust.t.test(x, classification), that performs a modified T-test. The variable x is the data, and classification is a boolean vector specifying the two classes. In order to mitigate the effect of outliers in the data, change the minimum value to be equal to the second smallest, and the maximum value to be equal to the second largest, and then calculate the one-sided T statistic of the modified data set (you can use t.test() here).

Josh: I'm not sure we need to have accurate p-values, just using the T-statistic may be fine

Implement the GSEA enrichment score

Read the 2005 Subramanian et al GSEA paper. We're going to use GO categories as sets. We have pre-parsed them into a tab-delimited file at /projects/sysbio/map/Data/GeneSets/Go/Mouse/lists_t.tab. You can use the following function to read the GO sets into a list:

readSetList 

This function returns a list, each element a vector of Entrez Gene IDs. The names attribute of the list are the GO categories.

Write a function called gseaEnrichmentScore(set, score, geneids). The variable set is a vector of Entrez Gene IDs, like the elements of the set list from above. Score is a numeric vector of scores for each probe, and geneids is the Entrez Gene ID of each probe, that you created above. Some functions that you may find helpful are %in% and cumsum.

Calculate the GSEA scores

Implement a function that scores all GO sets with GSEA, to calculate the raw GSEA scores.

6. Based on the raw GSEA enrichment scores, which 20 GO categories are most enriched in cerebellum? Which of these seem biologically related? (It's likely that you are unfamiliar with many of the GO categories. Look them up to discover what the GO term refers to.) Which seem unlikely to be enriched in cerebellum? How about for skeletal muscle?

As in the GESA paper, we need to calculate a null distribution of GSEA scores in order to calculate significance of the GSEA scores. To get by with less computation, we are only going to use a null distribution of 100, and then use parametric estimates of the null distribution for each GO category. Before you calculate the full null distribution, use sytem.time() to see how long it takes to run a single set of scores, and then multiply that by 50 to estimate how long the full calculation will take. It's a good idea to save your work both before and after this calculation. If you need to disconnect from a remote computer while R is calculating the distribution, launch R from within
screen
to make sure that R is not killed when you disconnect, and to let you continue work once you reconnect.

Recalculate GSEA enrichment scores for each GO category from 100 relabellings of the tissue type, and store them in a 7650 x 100 matrix. Examine the null scores for 5-10 of the GO categories.

7. How would you describe the distribution of null scores for each category? Would a normal distribution serve as a good estimate? Why or why not? What statistical test could you use to show this?

GSEA score significance

In the GSEA paper, they perform 1000 permutations of tissue labels in order to assess p-value. Since this will take a significant amount of time, we will instead just investigate how our GSEA scores compare to scores from randomization. First, plot each GO set's GSEA score vs. the log of the size of the GO set. When counting the size of a GO set, only include genes for which we have expression data. Scatter plots in R are made with the plot() command. Use log="x" in the plot command to make the x-axis have a log scale. (Look at the documentation for par() for more details and other plotting options.)

Write a function to permute the tissue labels, recalculate the T-statistics, and re-score the GO sets. Plot these randomized scores on top of the real enrichment scores using the points() function. You can use col="red" as an argument to points to dinstinguish the random scores from the real scores. Repeat this with several more randomization so that you aren't biased by a single sample. In order to save plots in R, you can either use the dev2bitmap() function to get a copy of the current plot. You can create plots in a wider variety of formats by first creating a new plotting device, e.g. with a pdf() command, re-issuing the plotting commands, and then using dev.off()

8. Save a pdf version of your plot. What proportion of 1 or 2 gene GO sets look significant? What proportion of 1000 member sets seem significant? Devise a method using set size to estimate a p-value.

Visualize the expression data

We are going to use Java TreeView to look at a heatmap of the data. Reorder the data by increasing T-statistic (the order() function may be helpful). Subtract out the mean from every probe's data. Next, identify the top 5 muscle and the top 5 cerebellum GO categories with 10 or more members. For each of these GO categories, add a column onto the expression data indicating which probes are members of the category (the functions cbind() and %in% may be helpful here). Finally, save the data matrix to a tab-delimited file. If your data matrix is called d, then

write.table(d, sep="\t", file="data.tab", quote=FALSE)

will achieve this. However, R leaves the upper left cell blank, shifting column names to the left by one. This can be overcome with a little finnagling:

write.table(data.frame(Probe=rownames(d), d), sep="\t", file="data.tab",
            row.names=FALSE, quote=FALSE)

Load the heatmap into Java TreeView? and create an 500x1000 pixel image file.

-- Main.cvaske - 05 Apr 2008