ChromWaves

Detecting spatial and temporal patterns across expressed genes

Thesis

Questions

  • Are coexpressed genes near each other on the chromosome?
    • How many genes within a distance d on the chromosome are coexpressed?
    • Choose a null model: break the location dependency and permute the genes' expression values (100,000 simulations)
  • Do coexpressed genes that are near each other also correlate temporally?
  • How significant are the sizes of clusters of coexpressed genes that have certain patterns of temporal/spatial relationships?
  • Are there periodic patterns of expression along the chromosome, as calculated using an FFT approach?
  • Is there a consistent topology for an entire chromosome, or for segments of the chromosome, as determined by gene expression data?

Metrics to be used on all GSE/GDS time series experimental sets

  • Preprocessing
    • Compress the expression data to a network of correlated genes (genes that are coexpressed within a timepoint?) and estimate if the genes in dense subnetworks are significantly close to each other compared to random collections of genes.
      • Get a network: (use different correlation cutoffs)
      • Get clusters from the network:
        • a) Connected subnets: identify connected subnetworks in the network
        • b) Gene-centric neighborhoods: for every gene, find all genes it is correlated with above a certain cutoff and estimate the distances between all of those genes (take an average of the distances across all genes)
        • c) Make clusters/modules using MODES
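
A minimal sketch of the "get a network" and "connected subnets" steps, assuming the expression data is a genes x timepoints matrix and that "correlated" means Pearson correlation at or above a cutoff. The function names and the matrix layout are illustrative assumptions, not part of these notes.

```python
import numpy as np

def correlation_network(expr, cutoff):
    """Boolean adjacency matrix linking genes whose Pearson correlation
    across timepoints is at least `cutoff` (rows = genes, assumed layout)."""
    corr = np.corrcoef(expr)
    adj = corr >= cutoff
    np.fill_diagonal(adj, False)   # drop self-links
    return adj

def connected_subnets(adj):
    """Connected components found by depth-first search.
    Returns a list of sorted lists of gene indices."""
    n = adj.shape[0]
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        stack, comp = [start], []
        seen[start] = True
        while stack:
            g = stack.pop()
            comp.append(g)
            for nb in np.flatnonzero(adj[g]):
                nb = int(nb)
                if not seen[nb]:
                    seen[nb] = True
                    stack.append(nb)
        components.append(sorted(comp))
    return components
```

Running this over several cutoffs gives one set of subnetworks per cutoff, which can then be fed to the proximity measures below. Genes with constant expression yield undefined correlations and would need to be filtered out first.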

  • Null model construction
      • PermCoord: Permute association of gene to its coordinates
      • PermLink: Permute correlation links among genes (e.g. by rewiring the network)
      • RandSubgroups: Randomly pick genes of a certain size from the genes that were tested by the experiment and have genomic coordinates
      • PermData: Permute the columns/rows/both of the expression data and recompute networks (very slow!)
      • PermRanks: Generate a random permutation for the genes
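
As an illustration of how one of these null models could be turned into an empirical P-value, here is a sketch of PermCoord. It assumes all genes sit on a single chromosome with one coordinate each, and it uses mean pairwise distance within the cluster as the test statistic; both choices, and the function names, are assumptions made for the example.

```python
import numpy as np

def mean_pairwise_distance(positions):
    """Average distance over all unordered pairs (needs >= 2 positions)."""
    positions = np.asarray(positions, dtype=float)
    diffs = np.abs(positions[:, None] - positions[None, :])
    n = len(positions)
    return diffs.sum() / (n * (n - 1))   # the sum counts each pair twice

def perm_coord_pvalue(coords, cluster, n_perm=1000, seed=0):
    """Empirical P-value that `cluster` (gene indices) is at least as tight
    as observed when gene-to-coordinate assignments are permuted."""
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float)
    cluster = np.asarray(cluster)
    observed = mean_pairwise_distance(coords[cluster])
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(coords)   # reassign coordinates to genes
        if mean_pairwise_distance(shuffled[cluster]) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)         # add-one keeps P above zero
```

With only one cluster and one chromosome, PermCoord and RandSubgroups sample the same null distribution; they differ once several clusters or the network structure are tested together.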

  • Measure 1 - Are genes that are coexpressed near each other on the chromosomes?
      • Assess genomic proximity of the genes in the cluster:
        • a) best hypergeometric P-value from overlapping the genes in the cluster with the genes on each chromosome
        • b) median intergenic distance between consecutive genes in the module, for every chromosome
      • Null models: Permute the links (correlated genes); Randomly pick modules of the same size and estimate average distance/correlation.
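
The two proximity scores for Measure 1 could be computed along these lines. This is a sketch: it treats the hypergeometric test as a one-sided enrichment test, and it approximates "intergenic distance" by the gap between single per-gene coordinates (e.g. start positions), which is an assumption.

```python
from math import comb
from statistics import median

def hypergeom_pvalue(n_genes, n_on_chrom, module_size, overlap):
    """P(X >= overlap) when drawing `module_size` genes without replacement
    from `n_genes`, of which `n_on_chrom` lie on the chromosome."""
    total = comb(n_genes, module_size)
    upper = min(n_on_chrom, module_size)
    tail = sum(comb(n_on_chrom, k) * comb(n_genes - n_on_chrom, module_size - k)
               for k in range(overlap, upper + 1))
    return tail / total

def median_intergenic_distance(positions):
    """Median gap between consecutive module genes on one chromosome.
    Returns None when fewer than two genes are given."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return median(gaps) if gaps else None
```

The "best" P-value for a module would then be the minimum of `hypergeom_pvalue` over all chromosomes. Taking a minimum over chromosomes is itself a multiple-testing step, so the same minimum should be taken in each null-model replicate.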

  • Measure 2 - Are genes that are near each other coexpressed?
      • GSEA on every chromosome with the module. Use the genes' genomic coordinates as their rank.
      • Estimate the median correlation of genes within a certain window of size d and slide the window down
      • Estimate the number of coexpressed genes within a certain window of size d and slide the window down
        • Issues: Should I normalize to account for the gene density within the regions of interest or is this taken care of by the null model? How do I choose an appropriate window size?
      • Rank genes based on their expression value in the experiment, and use a sliding window approach again to estimate the average rank of the genes within each window
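
A sketch of the sliding-window median correlation for one chromosome. It assumes windows are defined in base pairs as [start, start + d), moved by a fixed `step`, and that expression is a genes x timepoints matrix; the function name and return layout are illustrative.

```python
import numpy as np

def sliding_window_median_corr(positions, expr, d, step):
    """Return (window_start, n_genes, median pairwise correlation) for
    every window that contains at least two genes."""
    positions = np.asarray(positions)
    corr = np.corrcoef(expr)
    results = []
    start = int(positions.min())
    last = int(positions.max())
    while start <= last:
        idx = np.flatnonzero((positions >= start) & (positions < start + d))
        if len(idx) >= 2:
            sub = corr[np.ix_(idx, idx)]
            upper = sub[np.triu_indices(len(idx), k=1)]   # each pair once
            results.append((start, len(idx), float(np.median(upper))))
        start += step
    return results
```

Returning the gene count per window makes the gene-density question explicit: windows can be compared against null-model windows with the same number of genes. Swapping the median for a count of pairs above a cutoff gives the "number of coexpressed genes" variant.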

  • Measure 5
    • Estimate the probability that two genes within a certain distance of each other are coexpressed, relative to the probability that two genes are coexpressed and the probability that two genes are within that distance.
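
One possible reading of this measure is an observed-over-expected ratio over all gene pairs: P(near and coexpressed) / (P(near) x P(coexpressed)). That reading is an assumption, as are the single-chromosome setup and the function name in the sketch below.

```python
import numpy as np

def proximity_coexpression_ratio(positions, adj, d):
    """Ratio of the joint frequency of (within distance d, coexpressed)
    to the product of the two marginal frequencies, over all gene pairs.
    Values above 1 suggest neighboring genes are coexpressed more often
    than independence would predict."""
    positions = np.asarray(positions, dtype=float)
    iu = np.triu_indices(len(positions), k=1)     # each gene pair once
    near = (np.abs(positions[:, None] - positions[None, :]) <= d)[iu]
    coexp = np.asarray(adj, dtype=bool)[iu]
    p_near, p_coexp = near.mean(), coexp.mean()
    if p_near == 0 or p_coexp == 0:
        return float("nan")                       # ratio undefined
    return float(np.mean(near & coexp) / (p_near * p_coexp))
```

Here `adj` is a boolean coexpression matrix, for example a correlation matrix thresholded at a cutoff.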

  • Measure 6
    • FFTs?
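
One way the FFT idea might look in practice: treat expression along the chromosome as a signal sampled in gene order and look for the strongest non-zero frequency. Sampling by gene order rather than by base-pair position is an assumption made to keep the sketch short; real coordinates are unevenly spaced and would need binning or interpolation first.

```python
import numpy as np

def dominant_period(signal):
    """Period (in gene-order steps) carrying the most spectral power."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                 # remove the constant offset
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0)     # cycles per gene step
    k = 1 + int(np.argmax(power[1:]))               # skip frequency 0
    return float(1.0 / freqs[k])
```

The peak power can be compared against the peaks obtained after permuting gene order, which ties this measure to the null models above.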

  • Bigger issue: what common format should I compress the GSE/GDS files to, so that all these measures can be applied to them? I need a common format for all time series expression data.

  • Grow proximal coexpression groups: Use a greedy approach, where you put together links between genes that are closest to each other. Compute stats of the grown groups; e.g. connectivity, average correlation, clustering coefficient, etc.
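
A sketch of the greedy growth step, under assumptions: coexpression links are gene pairs with correlation at or above `cutoff`, links are processed from shortest genomic distance to longest, and growth stops at `max_dist`. Groups are tracked with a union-find structure; the names and the stopping rule are illustrative, not settled choices.

```python
import numpy as np

def grow_proximal_groups(positions, corr, cutoff, max_dist):
    """Merge coexpressed gene pairs in order of increasing genomic distance.
    Returns groups (lists of gene indices) with more than one member."""
    n = len(positions)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    links = [(abs(positions[i] - positions[j]), i, j)
             for i in range(n) for j in range(i + 1, n)
             if corr[i][j] >= cutoff]
    for dist, i, j in sorted(links):        # closest links first
        if dist > max_dist:
            break
        parent[find(i)] = find(j)

    groups = {}
    for g in range(n):
        groups.setdefault(find(g), []).append(g)
    return [m for m in groups.values() if len(m) > 1]

def group_stats(members, corr):
    """Size and average pairwise correlation of one grown group."""
    pairs = [(i, j) for a, i in enumerate(members) for j in members[a + 1:]]
    return {"size": len(members),
            "mean_corr": float(np.mean([corr[i][j] for i, j in pairs]))}
```

Connectivity and clustering coefficient could be added to `group_stats` by counting links within each group in the same way.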

-- Main.martina - 27 Apr 2007