ChromWaves | Systems Biology Group

Detecting spatial and temporal patterns across expressed genes

Thesis

Questions

Are coexpressed genes near each other on the chromosome?
- How many genes within a distance d on the chromosome are coexpressed?
- Choose a null model: break the location dependency and permute the genes' expression values (100,000 simulations)
Do coexpressed genes that are near each other also correlate temporally?
How significant are the sizes of clusters of coexpressed genes that show particular patterns of temporal/spatial relationships?
Are there periodic patterns of expression along the chromosome, as calculated using an FFT approach?
Is there a consistent topology for an entire chromosome, or for segments of the chromosome, as determined by gene expression data?
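
A minimal sketch of the first question (how many coexpressed gene pairs lie within a distance d, compared with a permutation null). Everything here is an assumption for illustration: a single chromosome, one coordinate per gene, coexpression given as a set of unordered gene pairs, and the hypothetical function names. The null shuffles which gene sits at which coordinate; for this statistic that is equivalent to shuffling the expression profiles among genes. The default number of simulations is kept small; the 100,000 mentioned above would be passed as n_sim.

```python
import random
from itertools import combinations


def count_close_coexpressed(positions, coexpressed, d):
    """Count gene pairs that are coexpressed AND at most d apart.

    positions:   dict gene -> coordinate (one chromosome assumed)
    coexpressed: set of frozenset({gene_a, gene_b})
    """
    return sum(
        1
        for a, b in combinations(sorted(positions), 2)
        if frozenset((a, b)) in coexpressed
        and abs(positions[a] - positions[b]) <= d
    )


def permutation_p_value(positions, coexpressed, d, n_sim=1000, seed=0):
    """Empirical one-sided P-value under a location-breaking null."""
    rng = random.Random(seed)
    observed = count_close_coexpressed(positions, coexpressed, d)
    genes = sorted(positions)
    coords = [positions[g] for g in genes]
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(coords)  # break the gene <-> location association
        shuffled = dict(zip(genes, coords))
        if count_close_coexpressed(shuffled, coexpressed, d) >= observed:
            hits += 1
    # +1 in numerator and denominator keeps the estimate away from zero
    return observed, (hits + 1) / (n_sim + 1)


# Toy data (made up for illustration only)
toy_positions = {"g1": 100, "g2": 150, "g3": 5000, "g4": 5100, "g5": 9000}
toy_coexpressed = {frozenset(("g1", "g2")), frozenset(("g3", "g4"))}
observed, p = permutation_p_value(toy_positions, toy_coexpressed, d=200, n_sim=200)
```

The pairwise loop is quadratic in the number of genes, so a real data set would probably need a sorted-coordinate scan instead.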

Metrics to be used on all GSE/GDS time series experimental sets

Preprocessing
- Compress the expression data to a network of correlated genes (genes that are coexpressed within a timepoint?) and test whether the genes in dense subnetworks are significantly closer to each other than random collections of genes.
  - Get a network: (use different correlation cutoffs)
  - Get clusters from the network: a) Connected subnets: Identify connected subnetworks in the network b) Gene-centric neighborhoods: For every gene, find all genes it is correlated with above a certain cutoff and estimate the distances between all the genes (take an average of the distances across all genes) c) Make clusters/modules using MODES
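
The "get a network" and "connected subnets" steps could look roughly like the sketch below. The assumptions are mine, not settled choices: Pearson correlation across time points as the coexpression measure, a single positive cutoff, and hypothetical function names. Options b) and c) are not covered.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_network(expr, cutoff):
    """Link two genes when their correlation is at least `cutoff`.

    expr: dict gene -> list of expression values (one per time point)
    Returns an adjacency dict gene -> set of neighbouring genes.
    """
    adj = {g: set() for g in expr}
    for a, b in combinations(expr, 2):
        if pearson(expr[a], expr[b]) >= cutoff:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def connected_subnets(adj):
    """Connected components of the network, found by depth-first search."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            g = stack.pop()
            if g in comp:
                continue
            comp.add(g)
            stack.extend(adj[g] - comp)
        seen |= comp
        components.append(comp)
    return components


# Toy profiles (made up); rerun with several cutoffs to compare networks
toy_expr = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1],
    "g4": [1, 3, 2, 4],
}
toy_subnets = connected_subnets(correlation_network(toy_expr, cutoff=0.9))
```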

* Null model construction

  - PermCoord: Permute the association of each gene to its coordinates
  - PermLink: Permute correlation links among genes (e.g. by rewiring the network)
  - RandSubgroups: Randomly pick gene sets of a certain size from the genes that were tested by the experiment and have genomic coordinates
  - PermData: Permute the columns/rows/both of the expression data and recompute networks (very slow!)
  - PermRanks: Generate a random permutation of the genes
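
Two of these null models, sketched under assumptions. For the coordinate permutation, coordinates are shuffled among all genes, ignoring chromosome boundaries. For the link permutation, "rewiring" is taken to mean degree-preserving double-edge swaps, which is one common choice and not necessarily the one intended here. Function names are hypothetical.

```python
import random


def perm_coord(coords, rng):
    """Reassign the existing coordinates to genes at random."""
    genes = list(coords)
    shuffled = [coords[g] for g in genes]
    rng.shuffle(shuffled)
    return dict(zip(genes, shuffled))


def perm_link(edges, rng, n_swaps=100):
    """Rewire links with double-edge swaps: (a,b),(c,d) -> (a,d),(c,b).

    Every gene keeps its number of links. Swaps that would create a
    self-link or a duplicate link are skipped. Needs at least two edges
    and assumes the input has no duplicate or self links.
    """
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if len(new1) < 2 or len(new2) < 2 or new1 in present or new2 in present:
            continue
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
    return edges


# Toy usage (made-up genes and links)
rng = random.Random(42)
null_coords = perm_coord({"g1": 10, "g2": 20, "g3": 30}, rng)
null_links = perm_link([("a", "b"), ("c", "d"), ("e", "f"), ("a", "c")], rng, n_swaps=50)
```

Preserving degrees keeps hub genes as hubs in the null network, so any excess proximity cannot be explained by a few highly connected genes alone.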

Measure 1 - Are genes that are coexpressed near each other on the chromosomes?
  - Assess genomic proximity of the genes in the cluster: a) best hypergeometric P-value from overlapping the genes in the cluster with the genes on each chromosome b) median intergenic distance between consecutive genes in the module for every chromosome
  - Null models: Permute the links (correlated genes); randomly pick modules of the same size and estimate average distance/correlation.
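
A sketch of the two proximity scores for Measure 1. Assumptions: the gene universe is every gene with a chromosome assignment, the hypergeometric test is one-sided (enrichment), and "best" means the smallest P-value across chromosomes with no multiple-testing correction. Function names are hypothetical.

```python
from math import comb
from statistics import median


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K are on the chromosome."""
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)


def best_chromosome_enrichment(module, chrom_of):
    """Return (chromosome, P-value) with the smallest enrichment P-value.

    module:   iterable of gene names in the cluster
    chrom_of: dict gene -> chromosome for all genes in the universe
    """
    N = len(chrom_of)
    members = [g for g in module if g in chrom_of]
    n = len(members)
    best = None
    for chrom in set(chrom_of.values()):
        K = sum(1 for c in chrom_of.values() if c == chrom)
        k = sum(1 for g in members if chrom_of[g] == chrom)
        p = hypergeom_sf(k, N, K, n)
        if best is None or p < best[1]:
            best = (chrom, p)
    return best


def median_consecutive_distance(positions):
    """Median gap between consecutive module genes on one chromosome."""
    pos = sorted(positions)
    if len(pos) < 2:
        return None
    return median(b - a for a, b in zip(pos, pos[1:]))


# Toy data (made up): 6 genes on chr1, 6 on chr2, module entirely on chr1
toy_chrom = {g: "chr1" for g in "abcdef"}
toy_chrom.update({g: "chr2" for g in "ghijkl"})
toy_best = best_chromosome_enrichment(["a", "b", "c"], toy_chrom)
toy_gap = median_consecutive_distance([100, 400, 150])
```

Both scores would then be compared against the same scores computed on the null modules.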

Measure 2 - Are genes that are near each other coexpressed?
  - GSEA on every chromosome with the module, using the genes' genomic coordinates as the rank.
  - Estimate the median correlation of genes within a certain window of size d and slide the window down
  - Estimate the number of coexpressed genes within a certain window of size d and slide the window down
    - Issues: Should I normalize to account for the gene density within the regions of interest or is this taken care of by the null model? How do I choose an appropriate window size?
  - Rank genes based on their expression value in the experiment, and again use a sliding window approach to estimate the average rank of the genes within that window
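
One way the sliding-window median correlation could be sketched. Assumptions: a window of fixed width d in base pairs moved by a fixed step (both values are placeholders, since the window size is an open issue above), Pearson correlation between profiles, and hypothetical function names. The number of genes in each window is returned alongside the median so that gene density can be inspected; this does not by itself settle the normalization question.

```python
from itertools import combinations
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def sliding_window_median_correlation(genes, d, step):
    """Median pairwise correlation in windows [start, start + d).

    genes: list of (position, expression_profile) on ONE chromosome.
    Returns (window_start, n_genes_in_window, median_correlation) for
    every window holding at least two genes.
    """
    genes = sorted(genes, key=lambda g: g[0])
    if not genes:
        return []
    results = []
    start, last = genes[0][0], genes[-1][0]
    while start <= last:
        inside = [prof for pos, prof in genes if start <= pos < start + d]
        if len(inside) >= 2:
            r = median(pearson(a, b) for a, b in combinations(inside, 2))
            results.append((start, len(inside), r))
        start += step
    return results


# Toy chromosome (made up): two tight pairs of genes far apart
toy_genes = [(0, [1, 2, 3]), (50, [2, 4, 6]), (1000, [3, 2, 1]), (1040, [1, 2, 3])]
toy_windows = sliding_window_median_correlation(toy_genes, d=100, step=500)
```

The count of coexpressed genes per window follows the same loop, replacing the median with a count of pairs above a correlation cutoff.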

Measure 5
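- Estimate the ratio of the probability that two genes within a certain distance of each other are coexpressed to the product of the probability that two genes are coexpressed and the probability that two genes are within that distance (i.e. observed co-occurrence over what is expected if proximity and coexpression were independent).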
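
A sketch of this measure, reading it as P(near and coexpressed) / (P(near) * P(coexpressed)) with all three probabilities estimated as fractions of gene pairs. That reading, the single-chromosome setup, and the function name are assumptions. A value above 1 would mean nearby pairs are coexpressed more often than independence predicts.

```python
from itertools import combinations


def proximity_coexpression_ratio(positions, coexpressed, d):
    """Observed/expected ratio for pairs that are both near and coexpressed.

    positions:   dict gene -> coordinate (one chromosome assumed)
    coexpressed: set of frozenset({gene_a, gene_b})
    Returns None when no pair is near or no pair is coexpressed.
    """
    pairs = list(combinations(sorted(positions), 2))
    total = len(pairs)
    near = coex = both = 0
    for a, b in pairs:
        is_near = abs(positions[a] - positions[b]) <= d
        is_coex = frozenset((a, b)) in coexpressed
        near += is_near
        coex += is_coex
        both += is_near and is_coex
    if near == 0 or coex == 0:
        return None
    # (both/total) / ((near/total) * (coex/total)), simplified
    return both * total / (near * coex)


# Toy data (made up): the two coexpressed pairs are also the two near pairs
toy_positions = {"g1": 0, "g2": 10, "g3": 1000, "g4": 1010}
toy_coexpressed = {frozenset(("g1", "g2")), frozenset(("g3", "g4"))}
toy_ratio = proximity_coexpression_ratio(toy_positions, toy_coexpressed, d=50)
```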

Measure 6
- FFTs?
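
A rough sketch of what a periodicity check could look like, under strong assumptions: one expression value per gene, genes taken in chromosomal order as if evenly spaced (real intergenic distances are ignored), and a naive discrete Fourier transform standing in for a true FFT to keep the example dependency-free. Function names are hypothetical, and the significance of a peak would still need one of the null models.

```python
import cmath
import math


def power_spectrum(signal):
    """Power at frequencies k = 1 .. n//2 (cycles per n genes).

    Naive O(n^2) DFT of the mean-centred signal; an FFT library
    gives the same values faster.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        spectrum.append((k, abs(coeff) ** 2 / n))
    return spectrum


def dominant_period(signal):
    """Period (in genes) of the strongest frequency component."""
    k, _ = max(power_spectrum(signal), key=lambda kp: kp[1])
    return len(signal) / k


# Toy signal (synthetic): expression oscillating every 8 genes
toy_signal = [math.sin(2 * math.pi * t / 8) for t in range(64)]
toy_period = dominant_period(toy_signal)
```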

* Bigger issue: what common format should I compress the GSE/GDS files to such that all these measures can be applied to them? I need a common format for all time series expression data.

Grow proximal coexpression groups: Use a greedy approach, where you put together links between genes that are closest to each other. Compute stats of the grown groups; e.g. connectivity, average correlation, clustering coefficient, etc.
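
The greedy growth could be sketched as below. Assumptions: links are visited in order of increasing genomic distance and merged with a union-find structure until a distance cutoff max_dist is exceeded; the cutoff, the data layout, and the function name are hypothetical. Only group size, link count, and mean correlation are computed; connectivity and clustering coefficient are left out.

```python
def grow_proximal_groups(links, positions, max_dist):
    """Merge coexpression links, closest gene pairs first.

    links:     dict (gene_a, gene_b) -> correlation
    positions: dict gene -> coordinate (one chromosome assumed)
    Returns a list of dicts with the genes, link count and mean correlation
    of every group that received at least one link.
    """
    parent = {g: g for pair in links for g in pair}

    def find(g):
        # Union-find root lookup with path halving
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    def dist(pair):
        return abs(positions[pair[0]] - positions[pair[1]])

    used = []
    for a, b in sorted(links, key=dist):
        if dist((a, b)) > max_dist:
            break  # all remaining links are farther apart
        parent[find(a)] = find(b)
        used.append((a, b))

    groups = {}
    for g in parent:
        groups.setdefault(find(g), {"genes": set(), "corrs": []})["genes"].add(g)
    for a, b in used:
        groups[find(a)]["corrs"].append(links[(a, b)])

    return [
        {
            "genes": sorted(v["genes"]),
            "n_links": len(v["corrs"]),
            "mean_corr": sum(v["corrs"]) / len(v["corrs"]),
        }
        for v in groups.values()
        if v["corrs"]
    ]


# Toy data (made up): one long-range link that should not be merged
toy_links = {("g1", "g2"): 0.9, ("g2", "g3"): 0.8, ("g4", "g5"): 0.95, ("g1", "g5"): 0.85}
toy_positions = {"g1": 0, "g2": 100, "g3": 250, "g4": 9000, "g5": 9050}
toy_groups = grow_proximal_groups(toy_links, toy_positions, max_dist=200)
```

With a fixed cutoff the final groups do not depend on merge order; the closest-first ordering matters if growth is instead stopped after a set number of merges or a maximum group size.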

-- Main.martina - 27 Apr 2007