DrugTargets

Drug target prediction project

Thesis

How do we combine drug-sensitivity data in a way that improves predictions? We have two different types of data, HAP / DIP (non-essential / essential), and we want a way to make use of both.

  • Pathway-level perspective of genome-wide drug sensitivity data improves ability to predict mode of action of a drug.

Results

  • Simulated data: Comparison of robustness of different methods of combination of dip/hap info in simulated data.
    • Different types of noise: noise in the SL network, and noise in the drug sensitivity network
    • Comparison of performance as function of complexity (avg. degree?) of pathway networks. * How does the complexity of the underlying...
    • Parameter: err: Performance as function of FP / FN rate
    • Parameter: k: Performance as function of number of genes in each pathway
    • Parameter: pi: Performance as function of number of distinct pathways that a drug hits (parameter: # pathways hit)
    • Parameter: d: Is the pathway that the drug actually hits known?
    • Parameter: f: fraction of genes that are known to belong to the drug's pathway
    • Instead of GO / KEGG / MIPS, we are making our OWN fake pathway data, with some degree of accuracy. We name them something like Hit1, Hit2, NoHit1, NoHit2. In this context, the meaning of FP / FN is: False negative: a gene was removed from a pathway even though it belongs there. False positive: a gene was added to a pathway even though it isn't really part of it. Note that we can express FP as precision instead.
    • Things to vary: GO category "true"-ness, SL network, Drug sensitivity data, and network complexity
    • Generate sensitivity based on a "sensitive" mean and "non-sensitive" mean. This will create optical-density (OD)-style sensitivity data.
    • A pathway in our method is a path from the essential root up to a leaf node.
    • Generate all the data from a "true" pathway, but what we actually use for computation is going to be a "noisy" version of this true pathway.

Positive controls

  • Obs. 2: Comparison of different methods of combining hap/dip data for predicting drug pathway targets, and how well these do with the known positive controls.

Methods

  • Topic 1: Simulating gene pathways
    • Make a gene network that feeds into a bunch of "Functions" that are essential to the cell. Construct an SL map from this, as well as drug sensitivity profiles. The essential genes (actually, all the genes) can be tested with the diploid method, and the non-essential remainder can be tested with the haploid method.
    • Parameters: number of essential functions, number of genes, average pathway overlap (how many genes appear in more than one pathway?), pathway length, number of pathways
    • Way to do this: tri-partite graph? Functions / Pathways / Genes. Note that we don't care about the particular order within the pathway.

    • Identify the essential and nonessential genes
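
A minimal Python sketch of the Functions / Pathways / Genes structure described in Topic 1. All names (`simulate_pathways`, `essential_genes`, the `G`/`F` labels) are hypothetical. The definition of "essential" used here (a gene that sits in every pathway feeding some function) is an assumption borrowed from the "orphaned function" idea in Topic 5, not something these notes fix.

```python
import random

def simulate_pathways(n_functions, pathways_per_function, genes_per_pathway,
                      n_genes, seed=0):
    """Return {function: [set_of_genes, ...]}: each function is fed by
    several pathways, and each pathway is an unordered set of genes."""
    rng = random.Random(seed)
    genes = [f"G{i}" for i in range(n_genes)]
    network = {}
    for f in range(n_functions):
        # Each pathway is drawn independently, so genes may overlap pathways.
        network[f"F{f}"] = [set(rng.sample(genes, genes_per_pathway))
                            for _ in range(pathways_per_function)]
    return network

def essential_genes(network):
    """Assumed definition: a gene is essential if it is in every pathway
    that feeds some function, so deleting it orphans that function."""
    essential = set()
    for pathways in network.values():
        essential |= set.intersection(*pathways)
    return essential

toy = {"F0": [{"a", "b"}, {"a", "c"}], "F1": [{"d"}, {"e"}]}
print(sorted(essential_genes(toy)))  # prints ['a']
```

Order within a pathway is ignored, matching the note above; average pathway overlap is not controlled directly here and would need its own parameter.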

  • Topic 2: Simulating the SL network given the "True" network from above

      • False negative / positive rate: for a control, check the real Tong-Boone data. Some of the genes were both reference genes AND also tested in the large full-genome panel. We can see if there were cases in which an SL interaction was discovered one way (when the gene was a "reference" gene), but not the other (when the gene was a "panel" gene). Check the paper.

      • We assume that pathways that terminate in the same function are SL with each other (i.e., at least one is required). Parallel pathways are pathways that do not have a gene in common, but end up in the same function.
        • False positive synthetic lethal connection: an SL link between two genes that are NOT in parallel pathways (i.e., they are either in the same one or in unrelated pathways).
        • False negative: an SL connection is missing, but it should have been there (SL connection missing between two genes in parallel pathways).
        • "Completeness" parameter: number of genes for which we have SL data. In other words, what fraction of gene pairs were actually tested? Tong-Boone tested ~143x5000 pairs, so completeness = (reference_genes * test_genes) / (test_genes * (test_genes - 1)), which is about 143/5000.
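
A minimal Python sketch of Topic 2, assuming the network is stored as `{function: [set_of_genes, ...]}`. The function names, the per-pair independent noise model, and the "reference gene x every other gene" testing scheme are illustrative assumptions, not a settled design.

```python
import itertools
import random

def true_sl_pairs(network):
    """Gene pairs lying in parallel pathways: two pathways that feed the
    same function and share no gene."""
    pairs = set()
    for pathways in network.values():
        for p, q in itertools.combinations(pathways, 2):
            if p & q:
                continue  # pathways share a gene, so they are not parallel
            pairs.update(frozenset((a, b)) for a in p for b in q)
    return pairs

def observed_sl_pairs(network, reference_genes, fp_rate, fn_rate, seed=0):
    """Noisy SL map restricted to tested pairs (reference gene x any gene)."""
    rng = random.Random(seed)
    truth = true_sl_pairs(network)
    all_genes = sorted(set().union(*(p for ps in network.values() for p in ps)))
    observed = set()
    for ref in sorted(reference_genes):
        for gene in all_genes:
            if gene == ref:
                continue
            pair = frozenset((ref, gene))
            if pair in truth:
                keep = rng.random() >= fn_rate  # a false negative drops a true link
            else:
                keep = rng.random() < fp_rate   # a false positive adds a spurious link
            if keep:
                observed.add(pair)
    return observed
```

Completeness is controlled here only through the size of `reference_genes` relative to the full gene list, mirroring the 143/5000 figure above.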

  • Topic 3: Simulating the gene functional categories (like GO/MIPS)
      • Pathway_HIT: Pathways that are hit somewhere by the drug.
        • False positives: add a few unrelated genes according to a PRECISION parameter (note: not a global FP rate)
        • False negatives: leave out some of the genes in this real pathway
      • Pathway_NOT_HIT: Pathways that don't have any gene member that was hit by the drug.
        • False positives: add a few unrelated genes
        • False negatives: leave out some of the genes in this real pathway
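
A minimal Python sketch of turning a "true" pathway into a noisy annotated category, as in Topic 3. The function name and the exact way precision is converted into a number of added genes are assumptions made for illustration.

```python
import random

def noisy_pathway(true_members, all_genes, precision, fn_rate, seed=0):
    """Return an annotated gene set derived from a true pathway.

    fn_rate: chance that each true member is left out (false negative).
    precision: target fraction of annotated genes that are true members;
    unrelated genes are added to reach it (false positives). Must be > 0.
    """
    rng = random.Random(seed)
    kept = {g for g in sorted(true_members) if rng.random() >= fn_rate}
    # kept / (kept + n_fp) = precision  =>  n_fp = kept * (1 - precision) / precision
    n_fp = round(len(kept) * (1 - precision) / precision)
    outsiders = sorted(set(all_genes) - set(true_members))
    extras = rng.sample(outsiders, min(n_fp, len(outsiders)))
    return kept | set(extras)
```

Because the number of added genes scales with the pathway size, this is a per-pathway precision rather than a global FP rate, which is the distinction the Pathway_HIT note draws.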

  • Topic 4: Simulating drug targets
      • Number of pathways hit
      • Fraction of genes in each pathway that are hit (note that using a fraction makes large pathways easy to detect, whereas a raw number instead of a fraction would make small pathways easier and large ones harder)

  • Topic 5: Simulating drug sensitivity data.
      • Simulating haploid deletion+drug results (for NONessential gene deletions)
        • Identify any non-essential gene (NOT essentials, we can't test them here) in a pathway parallel to the pathway that the drug hit. Note that such a gene is SL with the WHOLE PATHWAY, not just the one gene that the drug hit.
        • In other words, a gene "g" is sensitive in the haploid case if removing it AND adding the drug leaves no remaining path to the critical function(s) F that "g" previously led to (a function F is orphaned). Result: the sensitive set.
        • (Now that we have the sensitive set, we make up numbers for it.)

      • Simulating diploid hemizygous (one out of two copies) deletion+drug results (primarily for essential gene deletions)
        • Here the sensitive set is just the genes that were picked to be sensitive to the drug. (We might relax this: maybe the ones in the same pathway are more sensitive than ones in unrelated pathways.)

      • Sensitivity: now that we've decided what was sensitive in the various experiment types, we apply optical-density-style growth values to them based on the mean of the "sensitive" and "non-sensitive" groups (with some error). The overlap between the means / STDEVs of these peaks is what causes our fp/fn rate.
      • Now we have an optical density value, and we decide whether we would have CALLED the mutant sensitive, based on whether its value is closer to the "non-sensitive" or the "sensitive" peak.
      • We're just going to assume that the sens/non-sens peaks are unit-normal gaussians with some separation parameter (mu_SENS-mu_NON_SENS).
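
A minimal Python sketch of Topic 5, assuming the network is stored as `{function: [set_of_genes, ...]}`. All names are hypothetical. The separation is treated as an absolute distance between the two peaks, with the sensitive peak placed at the lower optical density; that sign convention is an assumption.

```python
import random
from statistics import NormalDist

def haploid_sensitive_set(network, drug_hits, nonessential):
    """Non-essential genes whose deletion, combined with the drug, leaves
    some function they feed with no intact pathway (it is orphaned)."""
    drug_hits = set(drug_hits)
    sensitive = set()
    for gene in nonessential:
        for pathways in network.values():
            feeds = any(gene in p for p in pathways)
            orphaned = all((gene in p) or (p & drug_hits) for p in pathways)
            if feeds and orphaned:
                sensitive.add(gene)
                break
    return sensitive

def simulate_calls(genes, sensitive, separation, seed=0):
    """Draw an OD-style value per gene from a unit-variance Gaussian and
    call it sensitive if it is closer to the sensitive peak."""
    rng = random.Random(seed)
    mu_non, mu_sens = 0.0, -abs(separation)
    midpoint = (mu_non + mu_sens) / 2
    calls = {}
    for gene in genes:
        mu = mu_sens if gene in sensitive else mu_non
        value = rng.gauss(mu, 1.0)
        calls[gene] = (value, value < midpoint)
    return calls

def expected_error_rate(separation):
    """With two unit-variance peaks and a midpoint threshold, the FP and FN
    rates are both the normal CDF evaluated at -separation / 2."""
    return NormalDist().cdf(-abs(separation) / 2)
```

For example, a separation of 2 gives an expected FP and FN rate of about 0.159 each, so the single separation parameter fixes both error rates under this model.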

"Which gene should we screen the NCI compound collection with in order to find one that targets __ (for example, DNA damage)?"

  • Which gene-deletion mutant should we check with our NCI compound sets?
    • We want one with a SL link to lots of DNA damage genes, but NOT a lot of extraneous SL links to genes in other pathways.
    • check SOD1 / DNA damage
    • Find a pathway with lots of SL connections to our pathway of interest, but not a lot to other pathways.
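
One way to make the selection criterion above concrete is to score each gene by its SL links into the pathway of interest minus its SL links elsewhere. The sketch below is a hypothetical scoring scheme, not a method from these notes; the function name and the "inside minus outside" score are assumptions.

```python
def rank_screening_genes(sl_pairs, target_genes):
    """Rank candidate deletion mutants by SL links into the target pathway
    versus SL links to genes outside it."""
    target = set(target_genes)
    counts = {}
    for a, b in sl_pairs:
        # Each SL link counts for both of its endpoints.
        for gene, partner in ((a, b), (b, a)):
            inside, outside = counts.get(gene, (0, 0))
            if partner in target:
                inside += 1
            else:
                outside += 1
            counts[gene] = (inside, outside)
    # Hypothetical score: reward links into the target, penalize extraneous ones.
    ranked = sorted(counts.items(),
                    key=lambda item: item[1][0] - item[1][1],
                    reverse=True)
    return [(gene, inside, outside) for gene, (inside, outside) in ranked]
```

A ratio-based score would favor genes with a single, perfectly specific link; the difference used here favors genes with many links into the target, which seems closer to "lots of DNA damage genes".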

Extra / Confusing things

  • Simulation parameters
    • number of genes hit by a drug
    • noise
      • Introduce noise to SL network
      • Introduce noise to drug sensitivity profile

    • simulate a pathway

  • Positive Control Pathways
    • how did we make these

  • Predicting target pathways from diploid (het) sensitivity data

  • Predicting target pathways from haploid (hom) sensitivity data
    • making use of SL networks to predict a pathway

Current Plans

  • Figure out why replicates of genes are in the Guri 07 data.
    • Answer: At least some of them look like this: YBR020W:chr00_18 and YBR020W:chr2_2. We strip off the :chr_XYZ part, but that part sometimes differs, so the same ORF shows up more than once.
  • New question: what does "ORF:chr00_##" mean? What is chr00?

  • Make a way to compare the HET and HOM experiments for the same drug.
  • Actually compare HOM and HET data in the correct place in the comparison pipeline. Currently, we are treating HET data like HOM in generating pathways.
  • Add a list of the total UNIVERSE of genes for the sets overlap for network.mak... We need to find a list of ALL THE GENES in Guri's study. That needs to be intersected with anything we have in a pathway. So (all Guri's) + (all of pathway file). Note that this will be different for HET and HOM. Make sure to make two different files for the HET and HOM, since the genes involved are (subtly) different.
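
For the replicate question in the first item of this list, a minimal Python sketch that groups labels by ORF name. It assumes labels have the form `ORF:chrNN_MM` and splits at the first colon; the third label in the example is made up for illustration.

```python
from collections import defaultdict

def group_replicates(labels):
    """Map each ORF name to the full labels it appears under, keeping only
    ORFs that occur under more than one label."""
    groups = defaultdict(list)
    for label in labels:
        orf = label.split(":", 1)[0]  # drop the assumed ':chr...' suffix
        groups[orf].append(label)
    return {orf: found for orf, found in groups.items() if len(found) > 1}

print(group_replicates(["YBR020W:chr00_18", "YBR020W:chr2_2", "YAL001C:chr1_1"]))
# prints {'YBR020W': ['YBR020W:chr00_18', 'YBR020W:chr2_2']}
```

Listing the distinct suffixes per ORF this way may also help answer what "chr00" means, since it shows which ORFs carry it.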

Prediction Results

Paper & Figures

Data Summaries

Note that ~/r is being used as a temp file in these commands! Hopefully you don't have an important file in your home directory with that name, if you decide to run them!

egrep -ic 'SL$' tong04.tab = 929 synthetic lethal interactions in the Tong-Boone data

egrep -ic 'SS$' tong04.tab = 839 synthetic sick interactions in the Tong-Boone data

cut -f 1 tong04.tab > ~/r  ;  cut -f 2 tong04.tab >> ~/r  ; sort -u ~/r | wc = 1040 distinct genes involved in these synthetic interactions
  1. 1040 5989

cut -f 1,2 tong04.tab > ~/r  ;  cut.pl -f 2,1 ~/r > ~/s  ;  cat ~/r ~/s > ~/t  ;  sort -k 1 ~/t > ~/u  ; sets.pl ~/u | sort -k 1 > ~/v  ; row_stats.pl -count ~/v > ~/w = This gives us a histogram-style list of the number of SL AND SS partners for each gene. Note that they are all nonzero here, because this is a list of all the lethals / sicks, and it doesn't include any non-lethal interactions that were tested.

tail -n +2 ~/w | cut -f 2 | transpose.pl -q | row_stats.pl -h 0 -sc 0 -mean -median -std -count -max -min > ~/x

Including both SL and SS interactions, we get (~/x):

Mean Median Std Count Max Min
7.796 2 16.8960 1039 155 1

egrep -i 'SL$' /projects/sysbio/map/Projects/DrugTargets/Data/tong04.tab | cut -f 1,2 > ~/rt  ;  cut.pl -f 2,1 ~/rt > ~/rs  ;  cat ~/rt ~/rs | sort -k 1 | sets.pl | sort -k 1 | row_stats.pl -count > ~/wsl

tail -n +2 ~/wsl | cut -f 2 | transpose.pl -q | row_stats.pl -h 0 -sc 0 -mean -median -std -count -max -min > ~/xsl

With synthetic lethals ONLY (~/xsl), we have:

Mean Median Std Count Max Min
4.081 1 8.123 417 84 1
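
The pipelines above depend on lab-local Perl scripts (cut.pl, sets.pl, row_stats.pl). As a rough cross-check, the partner-count summary could be sketched in Python as below. This assumes tong04.tab is tab-separated with the two gene names in the first two columns and the interaction type (SL or SS) in the last column, counts distinct partners, and uses the sample standard deviation; none of these is confirmed to match what the Perl scripts do.

```python
import statistics
from collections import defaultdict

def partner_stats(rows, kinds=("SL", "SS")):
    """rows: iterable of (gene_a, gene_b, kind). Count distinct partners
    per gene, treating each interaction as symmetric."""
    partners = defaultdict(set)
    for a, b, kind in rows:
        if kind.upper() in kinds:
            partners[a].add(b)
            partners[b].add(a)
    counts = [len(p) for p in partners.values()]
    return {
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "std": statistics.stdev(counts) if len(counts) > 1 else 0.0,
        "count": len(counts),
        "max": max(counts),
        "min": min(counts),
    }

def read_rows(path):
    """Yield (gene_a, gene_b, kind) from a tab-separated file (assumed layout)."""
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                yield fields[0], fields[1], fields[-1]
```

Usage would be along the lines of `partner_stats(read_rows("tong04.tab"), kinds=("SL",))` for the lethal-only summary.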

Here is a histogram of the exact values for the above SL-only data.

Raw Data

Data directories: http://sysbio.cse.ucsc.edu/Projects/DrugTargets/files/

The main page is at: http://sysbio.cse.ucsc.edu/Projects/DrugTargets/

Useful software

Cluster 3.0 (Mac / Win / Linux) Note that the command-line version of this is already installed on jig et al., as cluster-eisen.

Java TreeView (Direct links: OS X version or Windows version)

VisANT Graph Viewer (Java)