Detecting spatial and temporal patterns across expressed genes
Thesis
Questions
- Are coexpressed genes near each other on the chromosome?
- How many genes within a distance d on the chromosome are coexpressed?
- Choose a null model: break the location dependency and permute the genes' expression values (100,000 simulations)
- Do coexpressed genes that are near each other also correlate temporally?
- How significant are the sizes of clusters of coexpressed genes that show particular patterns of temporal/spatial relationships?
- Are there periodic patterns of expression along the chromosome, as detected with an FFT approach?
- Is there a consistent topology for an entire chromosome, or for segments of the chromosome, as determined by gene expression data?
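The periodicity question could be probed with a periodogram over genes taken in chromosomal order. The sketch below is only an illustration: it assumes one expression value per gene, treats gene order (not base-pair position) as the sampling axis, and uses a direct O(n^2) discrete Fourier transform as a stand-in for an FFT so that it needs nothing beyond the standard library. The function names and the toy data are hypothetical. A library FFT such as `numpy.fft.rfft` would normally replace the inner loop.

```python
import cmath
import math


def periodogram(signal):
    """Power at each non-zero frequency k = 1 .. n//2, via a direct DFT."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # drop the constant (DC) component
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power.append(abs(coeff) ** 2 / n)
    return power


def dominant_period(signal):
    """Period, in number of genes, of the strongest frequency component."""
    power = periodogram(signal)
    k = max(range(len(power)), key=power.__getitem__) + 1
    return len(signal) / k


# Toy data (assumption): 64 genes in chromosomal order, period of 8 genes.
expr_along_chromosome = [math.sin(2 * math.pi * i / 8) for i in range(64)]
period = dominant_period(expr_along_chromosome)
```

Whether a peak is significant would still have to be judged against a null model, for instance by recomputing the periodogram after permuting the gene order.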
Metrics to be used on all GSE/GDS time series experimental sets
- Preprocessing
- Compress the expression data to a network of correlated genes (genes that are coexpressed within a timepoint?) and estimate whether the genes in dense subnetworks are significantly closer to each other than genes in random collections.
- Get a network: (use different correlation cutoffs)
- Get clusters from the network:
a) Connected subnets: identify connected subnetworks in the network
b) Gene-centric neighborhoods: for every gene, find all genes it is correlated with above a certain cutoff and estimate the distances between all of those genes (take an average of the distances across all genes)
c) Make clusters/modules using MODES
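Options a) and b) can be sketched as follows. This is a minimal illustration, not a settled implementation: it assumes Pearson correlation across timepoints as the coexpression measure, a signed cutoff (anticorrelated genes are not linked), and that distances are only meaningful between genes on the same chromosome. All names and the toy data are hypothetical, and MODES (option c) is not covered.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def build_network(expr, cutoff):
    """Link every gene pair whose correlation is at or above the cutoff."""
    adj = {g: set() for g in expr}
    for g, h in combinations(expr, 2):
        if pearson(expr[g], expr[h]) >= cutoff:
            adj[g].add(h)
            adj[h].add(g)
    return adj


def connected_subnets(adj):
    """a) Connected subnetworks, found with a depth-first traversal."""
    seen, subnets = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, subnet = [start], set()
        while stack:
            g = stack.pop()
            if g not in subnet:
                subnet.add(g)
                stack.extend(adj[g] - subnet)
        seen |= subnet
        subnets.append(subnet)
    return subnets


def neighborhood_mean_distance(adj, coords, gene):
    """b) Mean pairwise distance within a gene's neighborhood (same chromosome only)."""
    members = [gene, *adj[gene]]
    dists = [abs(coords[a][1] - coords[b][1])
             for a, b in combinations(members, 2)
             if coords[a][0] == coords[b][0]]
    return sum(dists) / len(dists) if dists else None


# Toy data (assumption): expression over four timepoints, coords as (chromosome, position).
expr = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 3, 2, 1],
        "g4": [1, 3, 2, 4], "g5": [8, 6, 4, 2]}
coords = {"g1": ("chr1", 1000), "g2": ("chr1", 3000), "g3": ("chr2", 500),
          "g4": ("chr1", 9000), "g5": ("chr2", 700)}
adj = build_network(expr, cutoff=0.9)
subnets = connected_subnets(adj)
g1_distance = neighborhood_mean_distance(adj, coords, "g1")
```

Rerunning `build_network` with several cutoffs gives the family of networks mentioned above.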
* Null model construction
- PermCoord: Permute association of gene to its coordinates
- PermLink: Permute correlation links among genes (e.g. by rewiring the network)
- RandSubgroups: Randomly pick genes of a certain size from the genes that were tested by the experiment and have genomic coordinates
- PermData: Permute the columns/rows/both of the expression data and recompute networks (very slow!)
- PermRanks: Generate a random permutation of the genes' ranks
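Two of these null models, PermCoord and RandSubgroups, are simple enough to sketch. The example below is an illustration under stated assumptions: a single chromosome, one position per gene, the median gap between consecutive cluster genes as the test statistic, and an empirical one-sided p-value with the usual +1 correction. The function names, the statistic, and the toy data are hypothetical choices, and the toy run uses 1,000 permutations for speed.

```python
import random
from statistics import median


def median_adjacent_gap(genes, coords):
    """Median distance between consecutive genes of a group on one chromosome."""
    positions = sorted(coords[g] for g in genes)
    return median(b - a for a, b in zip(positions, positions[1:]))


def perm_coord(coords, rng):
    """PermCoord: shuffle which gene is attached to which coordinate."""
    genes = list(coords)
    positions = [coords[g] for g in genes]
    rng.shuffle(positions)
    return dict(zip(genes, positions))


def rand_subgroup(universe, size, rng):
    """RandSubgroups: random gene set of the given size from the tested genes."""
    return rng.sample(universe, size)


def empirical_p(observed, null_values):
    """Fraction of null statistics at least as small as the observed one."""
    hits = sum(1 for v in null_values if v <= observed)
    return (hits + 1) / (len(null_values) + 1)


# Toy data (assumption): 200 evenly spaced genes, cluster of 5 adjacent genes.
rng = random.Random(0)
coords = {f"g{i}": i * 1000 for i in range(200)}
cluster = [f"g{i}" for i in range(10, 15)]
observed = median_adjacent_gap(cluster, coords)

n_perm = 1000
null_coord = [median_adjacent_gap(cluster, perm_coord(coords, rng))
              for _ in range(n_perm)]
null_sub = [median_adjacent_gap(rand_subgroup(list(coords), len(cluster), rng), coords)
            for _ in range(n_perm)]
p_coord = empirical_p(observed, null_coord)
p_sub = empirical_p(observed, null_sub)
```

For a single fixed cluster the two nulls are nearly equivalent. They differ once the statistic depends on the network, which is where PermLink and PermData come in.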
- Measure 1 - Are genes that are coexpressed near each other on the chromosomes?
- Assess genomic proximity of the genes in the cluster: a) best hypergeometric P-value from overlapping the genes in the cluster with the genes on each chromosome; b) median intergenic distance between consecutive genes in the module, for every chromosome
- Null models: Permute the links (correlated genes); Randomly pick modules of the same size and estimate average distance/correlation.
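Both proximity scores in Measure 1 can be written down compactly. The sketch below assumes the hypergeometric test is the one-sided upper tail (probability of seeing at least the observed overlap), uses the difference between gene start positions as a simplified stand-in for intergenic distance, and represents coordinates as (chromosome, position) pairs. Names and toy data are hypothetical.

```python
from collections import defaultdict
from math import comb
from statistics import median


def hypergeom_tail(k, N, K, n):
    """P(X >= k) when drawing n of N genes, K of which lie on the chromosome."""
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return total / comb(N, n)


def best_chromosome_pvalue(module, coords):
    """a) Smallest per-chromosome enrichment P-value and the chromosome giving it."""
    on_chrom = defaultdict(set)
    for gene, (chrom, _) in coords.items():
        on_chrom[chrom].add(gene)
    module = set(module)
    scores = {chrom: hypergeom_tail(len(module & genes), len(coords),
                                    len(genes), len(module))
              for chrom, genes in on_chrom.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


def median_gap_per_chromosome(module, coords):
    """b) Median distance between consecutive module genes on each chromosome."""
    positions = defaultdict(list)
    for gene in module:
        chrom, pos = coords[gene]
        positions[chrom].append(pos)
    gaps = {}
    for chrom, pos in positions.items():
        pos.sort()
        if len(pos) > 1:  # at least two genes are needed to define a gap
            gaps[chrom] = median(b - a for a, b in zip(pos, pos[1:]))
    return gaps


# Toy data (assumption): 10 genes on each of two chromosomes, module on chr1 only.
coords = {f"a{i}": ("chr1", i * 1000) for i in range(10)}
coords.update({f"b{i}": ("chr2", i * 1000) for i in range(10)})
module = ["a0", "a1", "a2", "a5"]
best_chrom, best_p = best_chromosome_pvalue(module, coords)
gaps = median_gap_per_chromosome(module, coords)
```

Taking the best P-value over chromosomes is itself a selection step, so its significance is better read off the null models than taken at face value.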
- Measure 2 - Are genes that are near each other coexpressed?
- Run GSEA on every chromosome with the module, using the genes' genomic coordinates as the rank.
- Estimate the median correlation of genes within a certain window of size d and slide the window down
- Estimate the number of coexpressed genes within a certain window of size d and slide the window down
- Issues: Should I normalize to account for the gene density within the regions of interest or is this taken care of by the null model? How do I choose an appropriate window size?
- Rank genes based on their expression value in the experiment, then use a sliding window approach again to estimate the average rank of the genes within each window
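The sliding-window estimates in Measure 2 might look like the sketch below. It assumes a single chromosome, a window of width d measured in base pairs, a fixed step, and Pearson correlation over timepoints. It reports the gene count per window next to the median correlation, which keeps the gene-density issue visible. The function names, parameters, and toy data are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def sliding_window_median_corr(positions, expr, d, step):
    """Return (window start, gene count, median pairwise correlation) per window."""
    ordered = sorted(positions, key=positions.get)
    start, last = positions[ordered[0]], positions[ordered[-1]]
    results = []
    while start <= last:
        in_window = [g for g in ordered if start <= positions[g] < start + d]
        corrs = [pearson(expr[a], expr[b]) for a, b in combinations(in_window, 2)]
        # Windows with fewer than two genes have no defined correlation.
        results.append((start, len(in_window), median(corrs) if corrs else None))
        start += step
    return results


# Toy data (assumption): a coexpressed block followed by an uncoordinated block.
positions = {"a": 0, "b": 100, "c": 200, "d": 1000, "e": 1100, "f": 1200}
expr = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 2, 3, 5],
        "d": [1, 2, 3, 4], "e": [4, 3, 2, 1], "f": [1, 3, 2, 4]}
windows = sliding_window_median_corr(positions, expr, d=300, step=1000)
```

Counting coexpressed genes per window, or averaging expression ranks, only changes the statistic computed from `in_window`.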
- Measure 5
- Estimate the probability that two genes are both within a certain distance d of each other and coexpressed, divided by the product of the probability that two genes are coexpressed and the probability that two genes are within distance d.
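One reading of this measure is an observed-over-expected ratio: the fraction of gene pairs that are both nearby and coexpressed, divided by what independence of the two properties would predict. The sketch below follows that reading, which is an interpretation and not something the notes spell out. It assumes a single chromosome and a precomputed set of coexpressed pairs. All names and the toy data are hypothetical.

```python
from itertools import combinations


def proximity_coexpression_ratio(positions, coexpressed, d):
    """P(near and coexpressed) / (P(near) * P(coexpressed)) over all gene pairs.

    positions:   gene -> position on a single chromosome
    coexpressed: set of frozenset({gene_a, gene_b}) pairs deemed coexpressed
    d:           maximum distance for two genes to count as near
    """
    pairs = list(combinations(sorted(positions), 2))
    n_pairs = len(pairs)
    near = coexp = both = 0
    for a, b in pairs:
        is_near = abs(positions[a] - positions[b]) <= d
        is_coexp = frozenset((a, b)) in coexpressed
        near += is_near
        coexp += is_coexp
        both += is_near and is_coexp
    if near == 0 or coexp == 0:
        return None  # ratio undefined when either marginal is empty
    # (both / n) / ((near / n) * (coexp / n)) simplifies to the line below.
    return both * n_pairs / (near * coexp)


# Toy data (assumption): two tight pairs, each of which is also coexpressed.
positions = {"g1": 0, "g2": 100, "g3": 5000, "g4": 5100}
coexpressed = {frozenset(("g1", "g2")), frozenset(("g3", "g4"))}
ratio = proximity_coexpression_ratio(positions, coexpressed, d=500)
```

A ratio above 1 means nearby pairs are coexpressed more often than independence would predict. A ratio near 1 means no association.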
* Bigger issue: what common format should I compress the GSE/GDS files to such that all these measures can be applied to them? I need a common format for all time series expression data.
- Grow proximal coexpression groups: Use a greedy approach that joins links between the genes that are closest to each other first. Compute stats of the grown groups, e.g. connectivity, average correlation, clustering coefficient, etc.
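A possible shape for the greedy growth step is sketched below. The notes leave the stopping rule open, so the sketch makes one up: correlation links are processed from shortest to longest genomic distance and merged with a union-find structure until a link exceeds a hypothetical `max_gap`. It assumes a single chromosome, and the statistics shown (size, link count, density, mean correlation) are only a subset of those listed above.

```python
def grow_groups(links, positions, max_gap):
    """Merge genes joined by correlation links, shortest genomic distance first.

    links:     (gene_a, gene_b) -> correlation, for linked pairs only
    positions: gene -> position on a single chromosome
    max_gap:   links longer than this are never used for merging (assumption)
    """
    parent = {g: g for g in positions}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving keeps trees shallow
            g = parent[g]
        return g

    def distance(pair):
        return abs(positions[pair[0]] - positions[pair[1]])

    for a, b in sorted(links, key=distance):
        if distance((a, b)) > max_gap:
            break  # all remaining links are even longer
        parent[find(a)] = find(b)

    groups = {}
    for g in positions:
        groups.setdefault(find(g), set()).add(g)
    return [members for members in groups.values() if len(members) > 1]


def group_stats(group, links):
    """Size, internal link count, link density and mean correlation of a group."""
    inside = [r for (a, b), r in links.items() if a in group and b in group]
    n = len(group)
    return {"size": n,
            "links": len(inside),
            "density": len(inside) / (n * (n - 1) // 2),
            "mean_corr": sum(inside) / len(inside)}


# Toy data (assumption): one long-range link that should not merge the two groups.
positions = {"g1": 0, "g2": 100, "g3": 250, "g4": 9000, "g5": 9100}
links = {("g1", "g2"): 0.9, ("g2", "g3"): 0.8,
         ("g1", "g4"): 0.95, ("g4", "g5"): 0.85}
groups = grow_groups(links, positions, max_gap=500)
stats = [group_stats(g, links) for g in groups]
```

The grown groups can then be scored against the same null models as the network-derived clusters.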
-- Main.martina - 27 Apr 2007