Thesis

We would like to know whether there are general signatures, transcending platforms and/or organisms, that characterize cell state.

We need to address several questions: (a) do core signatures really exist? (b) how characteristic are they within well-defined tissues?

[We can use existing microarray expression data to build classifiers that boost our ability to recognize the expression state of new cell populations. We will use SVMs to build binary classifiers for cell populations within an experiment and use them to classify cell populations across experiments.]
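As a sketch of this setup, here is a minimal train-on-one-experiment, classify-another version in Python; scikit-learn's LinearSVC stands in for the SVM tools actually used in these notes (Gist/SVMlight), and the data are purely illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_tissue_classifier(X_train, is_tissue, C=1.0):
    """Train a one-vs-rest binary SVM for a single tissue.
    X_train: samples x genes expression matrix from one experiment.
    is_tissue: boolean labels (True = samples of the tissue of interest)."""
    clf = LinearSVC(C=C)
    clf.fit(X_train, is_tissue)
    return clf

def classify_across_experiments(clf, X_test):
    """Apply a classifier trained on one experiment to samples from
    another experiment (the feature vectors must be aligned first)."""
    return clf.predict(X_test)

# Toy data: 10 samples x 5 genes; the first 5 samples are the positive tissue,
# marked by elevated expression of gene 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
X[:5, 0] += 3.0
y = np.array([True] * 5 + [False] * 5)
clf = train_tissue_classifier(X, y)
```

The key point is that `clf` is fit entirely within one experiment and can then be applied to feature-aligned samples from a different experiment.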

Results

Known-tissue experiments

We start the analysis with experiments and data from tissue expression compendia, because the tissue types are well defined (no issues with sample purity or mixed populations).

Data

We have started with two mouse expression datasets that profile a variety of tissues (both neural and non-neural):

  • Zapala - neural tissue profiling, 150 samples, 24 tissues, 10 body regions, approximately half of the tissues are neural
  • Groden - large-scale tissue profiling study, 161 samples, various tissues

I have now also included a third mouse dataset:

  • Hughes - compendium of approx. 55 tissues, with a single sample per tissue

I have also incorporated three human datasets (add information about each dataset):

  • Shyamsundar
  • Ge
  • Yanai

The distribution of tissues for which I am building and testing classifiers in these datasets is as follows:

Dataset\Tissue   Heart   Kidney   Liver   Lung   Thymus
Ge                   1        1       1      1        1
Shyamsundar          6        5       5      4        2
Yanai               10       10      10     10       10
Groden               8        8      12     16        2
Zapala               5        5       5      -        5
Hughes               1        1       1      1        1

Methodology

We will use two different types of features for classification. With gene-based features, there are several different ways of building the training sets, including using the expression data directly, or transforming it into either ranks or significance-of-expression scores (like Z-scores).

Gene-based features

I have tried two different ways of building gene-based feature vectors for each tissue within each dataset. One issue is that different experiments test different numbers and sets of genes, but SVM programs do not allow missing data or differing features between training and testing datasets. To avoid this problem, I originally used a feature vector consisting of the intersection of the genes tested in both (or all) experiments; I have since also tested a different idea: a universal reference gene feature set consisting of all genes with unique identifiers in Entrez Gene (approx. 70k in mouse).
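A minimal sketch of the universal-reference-set idea, assuming each sample is held as a dict of gene ID to expression value (the gene IDs and values below are illustrative):

```python
import numpy as np

def to_universal_features(expr_by_gene, universal_genes, missing_value=0.0):
    """Project one sample's expression values onto a fixed universal gene
    list (e.g. all unique Entrez Gene IDs).  Genes the platform did not
    test get `missing_value`, so every experiment yields an identically
    ordered feature vector, as the SVM tools require."""
    return np.array([expr_by_gene.get(g, missing_value)
                     for g in universal_genes])

# Illustrative universal list and one sample measured on a subset of it.
universal = ["g1", "g2", "g3", "g4"]
sample = {"g1": 2.5, "g3": -1.0}   # g2 and g4 are not on this platform
vec = to_universal_features(sample, universal)
```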

Direct expression usage

  • I have not performed self training and testing in this analysis (ideally, I should also build training data from part of a dataset and then test on a different set of samples from the same dataset; in the tables below these would be the results along the diagonal)
  • Note that I usually use 5 positive examples (samples) and 5 negative examples (samples) in the training procedure. In the results below, training on the Hughes dataset performs very poorly because of the lack of replicate tissue samples: each tissue type is represented by only a single positive example, which is probably not sufficient for training.
  • Also note that I use a value of 0 for missing genes, which may not be appropriate across platform types, since 0 has a different significance on single-channel and dual-channel platforms
  • I used to use accuracy, (TP+TN)/ALL, as the measure of success, but have now moved to sensitivity and specificity (I also keep the raw counts)
  • Sensitivity = TP/(TP+FN)
  • Specificity = TN/(TN+FP)
  • The tables below also do not account for the fact that some of the datasets did not test the tissue of interest (for example, lung is not tested in all three experiments)
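Written out, the measures above are:

```python
def sensitivity(tp, fn):
    """Fraction of true positives recovered: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of true negatives recovered: TN / (TN + FP)."""
    return tn / (tn + fp)

def accuracy(tp, tn, fp, fn):
    """The previous measure of success: (TP + TN) / ALL."""
    return (tp + tn) / (tp + tn + fp + fn)
```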

Train\Test   Groden        Zapala        Hughes
Groden       -             Sens / Spec   Sens / Spec
Zapala       Sens / Spec   -             Sens / Spec
Hughes       Sens / Spec   Sens / Spec   -

  • I have also performed randomizations on the testing data (10 randomizations per test dataset; only the test data are randomized, never the training data; randomization is performed per gene, across tissues). In the original setup I permuted within a tissue across all genes, but with the introduction of the universal reference gene feature set, randomizing between genes within a tissue no longer seems appropriate.
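The per-gene randomization can be sketched as follows (numpy; data illustrative):

```python
import numpy as np

def randomize_test_data(X, rng):
    """Permute the test matrix per gene: for each gene (column), shuffle
    its values across the samples/tissues independently.  The training
    data are left untouched.  X: samples x genes."""
    Xr = X.copy()
    for j in range(Xr.shape[1]):
        rng.shuffle(Xr[:, j])   # in-place shuffle of this gene's column
    return Xr

# Toy matrix: 4 samples x 3 genes.
rng = np.random.default_rng(10)
X_test = np.arange(12.0).reshape(4, 3)
X_rand = randomize_test_data(X_test, rng)
```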

Train\Test   Groden        Zapala        Hughes
Groden       -             Sens / Spec   Sens / Spec
Zapala       Sens / Spec   -             Sens / Spec
Hughes       Sens / Spec   Sens / Spec   -

  • We also have the scores for all assignments/classifications. (note to self: attach the plots for scores)

Rank transformation

In mouse

  • Ranking genes removes the problem of having to assign an appropriate expression value (0 or some other value) to missing genes.
  • The rank assignments can be done in several different ways
    • Currently, in each dataset rank is assigned independently (if dataset A has 4,000 genes and dataset B has 3,500 genes, the ranks in A will vary between 1 and 4,000 and the ones in B will vary between 1 and 3,500); during the transformation to the universal reference gene feature set, the missing genes are assigned a rank of 0.
    • SVM score distributions (from Gist) for real (PDF) and randomized (PDF) data across all datasets, sorted by tissue. High absolute values indicate high confidence in the predictions; positive values indicate a positive prediction; negative values a negative prediction.
    • Sensitivity and specificity results for real and randomized data:
      • Real data

Train\Test   Groden        Zapala        Hughes
Groden       -             Sens / Spec   Sens / Spec
Zapala       Sens / Spec   -             Sens / Spec
Hughes       Sens / Spec   Sens / Spec   -
      • Randomized data (randomization performed as described above)

Train\Test   Groden        Zapala        Hughes
Groden       -             Sens / Spec   Sens / Spec
Zapala       Sens / Spec   -             Sens / Spec
Hughes       Sens / Spec   Sens / Spec   -

    • The missing genes can instead be assigned the worst possible rank. Does this make a difference?
      • Real data

Train\Test   Groden        Zapala        Hughes
Groden       -             Sens / Spec   Sens / Spec
Zapala       Sens / Spec   -             Sens / Spec
Hughes       Sens / Spec   Sens / Spec   -
      • Randomized data (randomization performed as described above)

Train\Test   Groden        Zapala        Hughes
Groden       -             Sens / Spec   Sens / Spec
Zapala       Sens / Spec   -             Sens / Spec
Hughes       Sens / Spec   Sens / Spec   -
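The rank transformation with both missing-gene conventions above can be sketched as follows (the rank direction, 1 = most highly expressed, is an assumption, and ties are ignored for simplicity; gene IDs are illustrative):

```python
import numpy as np

def rank_features(expr_by_gene, universal_genes, missing="zero"):
    """Convert one sample's expression values to within-dataset ranks
    (assumed here: rank 1 = most highly expressed, rank N = least),
    then project onto the universal gene list.  Untested genes get
    rank 0 under missing="zero", or the worst possible rank, N + 1,
    under missing="worst"."""
    genes = list(expr_by_gene)
    vals = np.array([expr_by_gene[g] for g in genes])
    order = np.argsort(-vals)                      # descending expression
    ranks = {genes[i]: float(r) for r, i in enumerate(order, start=1)}
    fill = 0.0 if missing == "zero" else float(len(genes) + 1)
    return np.array([ranks.get(g, fill) for g in universal_genes])

# One sample tested on 3 genes, projected onto a 4-gene universal set.
sample = {"a": 5.0, "b": 1.0, "c": 3.0}
universal = ["a", "b", "c", "d"]
v_zero = rank_features(sample, universal)            # gene d -> rank 0
v_worst = rank_features(sample, universal, "worst")  # gene d -> rank 4
```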

In human

  • The methodology for setting up the classifiers is the same as for mouse
  • The purpose is first to see whether we observe similar performance within the human datasets, and then to test whether cross-species classifiers can be used successfully.

Train\Test    Ge            Shyamsundar   Yanai
Ge            -             Sens / Spec   Sens / Spec
Shyamsundar   Sens / Spec   -             Sens / Spec
Yanai         Sens / Spec   Sens / Spec   -

Cross-species analysis

  • This analysis now includes all 6 datasets (3 mouse and 3 human) and a set of 4 tissues for which classifiers are built and tested
  • Genes are mapped from mouse into human gene space using best reciprocal mapping
  • The expression data are converted to rankings, and the universal reference gene set is used, with a rank of 0 assigned to all missing values
  • All pairs of datasets are compared, so this analysis is a superset of some of the previous analyses. Note that self-comparisons are performed for all datasets as well (although the positive samples are never specifically held out in training; that test should be performed separately)
  • One thing to note is that datasets like Hughes and Ge, which have only a single positive sample in a given training set, are very hard to use as classifiers and show very low sensitivity when used to classify the other datasets
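The best-reciprocal-mapping filter can be sketched as follows; the inputs would be each gene's best hit in the other species (e.g. from sequence-similarity searches), and the gene IDs below are purely illustrative:

```python
def best_reciprocal_pairs(best_hit_m2h, best_hit_h2m):
    """Keep only gene pairs whose mapping is mutual: the best human hit
    of a mouse gene must itself map back to that same mouse gene.
    Inputs: dicts of gene -> best hit in the other species."""
    return {m: h for m, h in best_hit_m2h.items()
            if best_hit_h2m.get(h) == m}

# Illustrative one-way best hits; m3/h3 is not reciprocal and is dropped.
mouse_to_human = {"m1": "h1", "m2": "h2", "m3": "h3"}
human_to_mouse = {"h1": "m1", "h2": "m2", "h3": "m9"}
orthologs = best_reciprocal_pairs(mouse_to_human, human_to_mouse)
```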

Train\Test    Ge            Shyamsundar   Yanai         Groden        Zapala        Hughes
Ge            Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec
Shyamsundar   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec
Yanai         Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec
Groden        Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec
Zapala        Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec
Hughes        Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec

* Random results
Train\Test    Ge            Shyamsundar   Yanai         Groden        Zapala        Hughes
Ge            Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec
Shyamsundar   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec
Yanai         Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec
Groden        Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec
Zapala        Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec
Hughes        Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec   Sens / Spec

* Positive predictive values for all tested datasets

Train\Test Ge Shyamsundar Yanai Groden Zapala Hughes
Ge PPV PPV PPV PPV PPV PPV
Shyamsundar PPV PPV PPV PPV PPV PPV
Yanai PPV PPV PPV PPV PPV PPV
Groden PPV PPV PPV PPV PPV PPV
Zapala PPV PPV PPV PPV PPV PPV
Hughes PPV PPV PPV PPV PPV PPV

* Random positive predictive values

Train\Test Ge Shyamsundar Yanai Groden Zapala Hughes
Ge PPV PPV PPV PPV PPV PPV
Shyamsundar PPV PPV PPV PPV PPV PPV
Yanai PPV PPV PPV PPV PPV PPV
Groden PPV PPV PPV PPV PPV PPV
Zapala PPV PPV PPV PPV PPV PPV
Hughes PPV PPV PPV PPV PPV PPV

Standard SAM analysis

In Mouse

  • For each tissue common to the Groden and Zapala datasets, I performed SAM analysis, with the tissue of interest as one class and all other tissues pooled together as the other. I then took the SAM d-values and performed clustering analysis (average linkage; Pearson uncentered correlation metric). The tissues of interest were thymus, testes, muscle, liver and heart. Genesis output cannot easily be stored for such a large set of genes, but the same tissues from the different datasets cluster separately: tissues cluster more closely with other tissues from the same experiment than with the matching tissue in the other experiment.

  • I also performed GSEA analysis: for each tissue, I looked for the enrichment of GO categories within the ranked list of genes. GO categories were restricted to sizes between 5 and 500 genes. Results for the analysis can be seen below:
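The clustering step above (average linkage on d-value profiles) can be sketched with scipy; since uncentered Pearson correlation is equivalent to cosine similarity, the 'cosine' metric is used here, and the profiles are purely illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_tissues(d_values, method="average"):
    """Average-linkage hierarchical clustering of tissue profiles
    (rows = tissues, columns = per-gene SAM d-values).  Uncentered
    Pearson correlation is the same as cosine similarity, so the
    distance metric used is 'cosine'."""
    dist = pdist(d_values, metric="cosine")
    return linkage(dist, method=method)

# Toy check: two pairs of near-identical profiles should pair up.
profiles = np.array([[1.0, 2.0, 3.0],
                     [1.1, 2.0, 3.1],
                     [-3.0, 0.5, 1.0],
                     [-3.1, 0.4, 1.1]])
Z = cluster_tissues(profiles)
labels = fcluster(Z, t=2, criterion="maxclust")
```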

Feature Selection

  • Instead of using all genes for classification (given that SVMlight allows missing data as well), I'd like to test how the classification accuracy improves or deteriorates if I use a smaller set of genes as features, or if I use pathways as features
    • Metric 1: t-statistic. I test how well a gene separates the two classes of interest: I take all available positive-class samples and randomly select negative-class samples to match the number of samples in the positive class; then, for every gene tested in the experiment, I perform a two-sample t-test to measure the significance of the difference between the two classes.
      • Distribution of t-statistics within each tissue for the Groden and Zapala experiments (PDF)
      • I have also used the t-statistic information to construct pathway features. I restrict gene sets to those with between 1 and 100 genes in the compendium (list of gene sets: TAB). For each gene set I compute a mean t-statistic, which summarizes the significance of the individual genes for separating the two classes (gene set size is not taken into account). I estimate the significance of this mean by performing 100 randomizations for each tissue type and pathway set; in each randomization, the assignments of t-statistics to genes are permuted without replacement across all genes in the experiment (here is a comparison of the scores across real and random pathways in kidney for the Groden dataset: Histogram). Also, here are the top-performing pathways sorted by z-score:
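A sketch of the per-gene t-statistic and the permutation-based pathway z-score described above (scipy/numpy; data and set sizes are illustrative, not the actual compendium):

```python
import numpy as np
from scipy.stats import ttest_ind

def gene_t_stats(X_pos, X_neg):
    """Two-sample t-statistic per gene, positive-tissue samples vs
    pooled negative samples (rows = samples, columns = genes)."""
    t, _ = ttest_ind(X_pos, X_neg, axis=0)
    return t

def pathway_zscore(t_stats, member_idx, rng, n_rand=100):
    """Score a gene set by its mean t-statistic; significance comes from
    permuting the t-statistic assignments across all genes without
    replacement (taking the first k of a permutation draws k genes
    without replacement), then forming a z-score."""
    real = t_stats[member_idx].mean()
    rand_means = np.array([rng.permutation(t_stats)[:len(member_idx)].mean()
                           for _ in range(n_rand)])
    return (real - rand_means.mean()) / rand_means.std()

# Toy example: 10 strongly separating genes among 50; a pathway made of
# exactly those 10 genes should score far above the random background.
t_demo = np.concatenate([np.full(10, 5.0), np.zeros(40)])
z_demo = pathway_zscore(t_demo, np.arange(10), np.random.default_rng(0))
```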

Pathway-based features