Meta-Analysis for Omics Datasets

Pratyaksha "Asa" Wirapati
SIB Bioinformatics Core Facility, Swiss Institute of Bioinformatics
bcf.isb-sib.ch
www.isb-sib.ch

Bioinformatics in Genomic and Proteomic Data
November 25-27, 2009, Brno, Czech Republic

Lausanne UNIL EPFL
Vlad Popovici, Jenny Miggliavacca, Thierry Sengstag, Eva Budinska, Asa Wirapati, Mauro Delorenzi, M. Faouzi

[Logos, partly legible: Ipsogen, Diagnoplex, Merck, Pfizer, Cancer Profiler, Serono, Novartis]

Growth of Gene Expression Omnibus (GEO) Database

[Figure: growth of GEO over time (year), 2002 to 2010]

Technology | # samples
in situ oligonucleotide | 209391
spotted DNA/cDNA | 76911
spotted oligonucleotide | 54941
oligonucleotide beads | 17013
SAGE | 1660
other | 1193
high-throughput sequencing | 853
RT-PCR | 497
spotted protein | 390
antibody | 337
MPSS | 194
mixed spotted oligo/cDNA | 109
MS | 94
SARST | 12

Legend: genomics (DNA), transcriptomics (RNA), proteomics (protein), *omics (everything else)

Other data sources: ArrayExpress, journal supplementary data, investigators' websites

Omics Biology and Medicine

Data "supertable": studies (rows) x omics variables (columns)

Column groups: DNA, RNA, Protein, Phenotype, Environment
Variables, in column order: SNP; CNV, CGH; UHTS; mRNA; miRNA; SAGE; IHC; proteomics; clinical; imaging, metabolomics, physiology; drug, therapy; pathogen, toxin

Rows:
• Study design 1 (human breast cancer patients, retrospective, clinical outcome, drug): Study 1, Study 2, Study 3, Study 4, Study 5, Study 6
• Study design 2 (experimental, time-series, tissue culture): Study a, Study b
• Study design 3 (cancer cell lines): Study x, Study y, Study z

"Horizontal integration": same samples, various omics variables
"Vertical integration": similar variables, multiple studies => our focus

Why re-analyze existing datasets?

• Critical review of the original findings
• Confirmation/validation of results from other studies
• More solid discoveries based on larger sample size
• New discoveries in larger scopes/contexts

Issues in Co-Analysis of Multiple Datasets I.
Dataset curation

• Survey of relevant datasets that are available
  - Search literature, public databases, and the web
• Independence of datasets
  - Reorganize datasets to ensure non-redundant samples
• Non-uniform variable names and representation
  - Rename and recode variables
• Re-mapping probe(set)s and matching across platforms
  - Align to a reference sequence database; reduce to a single probe per gene
• Quality control of quantitative variables (e.g., gene expression)
  - Ensure same unit/transformation; renormalize and rescale if necessary

Issues in Co-Analysis of Multiple Datasets II. Downstream Analysis

How to do combined analysis of heterogeneous datasets?
• Differences in study designs, populations and sample selection criteria
• Incommensurable quantitative data; systematic measurement artefacts
How to produce the "total" results based on all datasets?
How to assess and incorporate heterogeneity?
How to visualize and present the analysis results?
How to adapt to omics data?
How to adapt to complex analyses, such as hierarchical clustering and prediction?

Outline

• A brief introduction to statistical meta-analysis
• Applications of meta-analysis to omics data
  - An example: breast cancer clinical-expression datasets
  - Differential expression
  - Clustering of genes
  - Clustering of samples
  - Prediction
• Conclusion and future work

Intro to meta-analysis: an example dataset

UC Berkeley graduate school admission, 1973 [1]

 | Male | Female
Admitted | 1198 | 557
Rejected | 1493 | 1278

Was there a sex bias in the graduate school admission process?

odds ratio: $\frac{1278/557}{1493/1198} = 1.84$, 95% CI: [1.62, 2.09]
p-value: $< 2.2 \times 10^{-16}$

[1] Bickel, Hammel, O'Connell (1975) Science 187:398-403

Stratified Analysis and Forest Plot

Dept. | admitted M | admitted F | rejected M | rejected F | odds ratio | 95% C.I. | p-value
A | 512 | 89 | 313 | 19 | 0.35 | [0.20, 0.59] | 10^?
B | 353 | 17 | 207 | 8 | 0.80 | [0.30, 2.00] | 0.68
C | 120 | 202 | 205 | 391 | 1.13 | [0.84, 1.52] | 0.39
D | 138 | 131 | 279 | 244 | 0.92 | [0.68, 1.25] | 0.60
E | 53 | 94 | 138 | 299 | 1.22 | [0.80, 1.83] | 0.36
F | 22 | 24 | 351 | 317 | 0.83 | [0.43, 1.58] | 0.55
pooled | 1198 | 557 | 1493 | 1278 | 1.84 | [1.62, 2.09] | 10^-16

[Forest plot: departments A to F and pooled; x-axis: odds ratio (favor male vs female), 0.2 to 2.0]

Simpson's Paradox: "the whole contradicts its parts"
The danger of pooling data => biases due to hidden factors

Meta-Analytical Solution

• Analyze each stratum/study separately
• Average using the inverse variance as weight

$$\hat\beta_0 = \frac{\sum_{i=1}^{k} \hat\beta_i / (\sigma_i^2 + \tau^2)}{\sum_{i=1}^{k} 1 / (\sigma_i^2 + \tau^2)}$$

$\hat\beta_i$, $\hat\beta_0$: effect size (per study and total)
$\sigma_i^2$: within-study variance of $\hat\beta_i$, i.e. $[\mathrm{SE}(\hat\beta_i)]^2$
$\tau^2$: between-study variance

If $\tau^2$ is fixed to zero (may not be realistic!): fixed effects meta-analysis (FEMA)
If $\tau^2$ is estimated from the data: random effects meta-analysis (REMA)
$I^2$: proportion of variation due to between-study heterogeneity

[Forest plot: departments A to F; pooled p = 10^-16; FEMA p = 0.18; REMA p = 0.16, I^2 = 0.72; x-axis: odds ratio (favor male vs female), 0.2 to 2.0]

$I^2 = (Q - (k - 1))/Q$, see Higgins & Thompson (2002) Stat Med

Hierarchical Sampling Models

[Diagram: $\beta_0$ at the top; study-level effects $\beta_1, \ldots, \beta_k$ with $\beta_i \sim N(\beta_0, \tau^2)$ (the REMA level); observed data $Y_1, \ldots, Y_k$ below them (the FEMA level)]

Single study:
• Inference about ($\beta_0$ + study biases: technical, design, population, ...)
Fixed-effect models
• Inference about $\bar\beta = \sum_i \beta_i / k$ (the mean of the specific datasets in hand)
• Confidence interval is not affected by between-study variability $\tau^2$
Random-effect/hierarchical models
• Inference about $\beta_0$ (the "truth"; expectation of future studies)
• Confidence interval is small if $I^2$ is small (and vice versa)

Alternative Methods

(Empirical) Bayes Hierarchical Models
• This is the theoretically "proper" way to fit hierarchical models
• More flexible than REMA (not limited to normal summaries)
• Simultaneous fitting of model parameters at all levels of the hierarchy (while REMA is stage-wise).
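The stratified analysis of the admission data can be sketched in a few lines. The block below computes the pooled odds ratio, the per-department log odds ratios, their inverse-variance (fixed-effect) average, Cochran's Q and I^2. Assumptions that are not in the slides: Wald variances for the log odds ratios (the slides do not say how their intervals and p-values were obtained, so p-values from this sketch need not match the forest plot), all function names, and the department counts, which are the commonly published figures whose columns add up to the pooled totals.

```python
import math

# (male admitted, male rejected, female admitted, female rejected) per department.
# Assumed counts: chosen so the columns sum to 1198, 1493, 557 and 1278.
DEPARTMENTS = {
    "A": (512, 313, 89, 19),
    "B": (353, 207, 17, 8),
    "C": (120, 205, 202, 391),
    "D": (138, 279, 131, 244),
    "E": (53, 138, 94, 299),
    "F": (22, 351, 24, 317),
}


def log_odds_ratio(m_adm, m_rej, f_adm, f_rej):
    """Log odds ratio (male vs female) and its Wald variance for one 2x2 table."""
    beta = math.log((m_adm * f_rej) / (m_rej * f_adm))
    var = 1 / m_adm + 1 / m_rej + 1 / f_adm + 1 / f_rej
    return beta, var


def fixed_effect(betas, variances):
    """Inverse-variance average with tau^2 = 0, plus Cochran's Q and I^2."""
    k = len(betas)
    weights = [1 / v for v in variances]
    beta0 = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    q = sum(w * (b - beta0) ** 2 for w, b in zip(weights, betas))
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0
    p = math.erfc(abs(beta0 / se) / math.sqrt(2))  # two-sided normal p-value
    return {"beta0": beta0, "se": se, "Q": q, "I2": i2, "p": p}


# Pooling the raw counts ignores the departments (Simpson's paradox).
pooled = [sum(table[i] for table in DEPARTMENTS.values()) for i in range(4)]
pooled_or = math.exp(log_odds_ratio(*pooled)[0])

# Stratified analysis: one log odds ratio per department, then average.
per_dept = [log_odds_ratio(*table) for table in DEPARTMENTS.values()]
stratified = fixed_effect([b for b, _ in per_dept], [v for _, v in per_dept])
stratified_or = math.exp(stratified["beta0"])
```

Under these assumptions the pooled odds ratio rounds to 1.84, the stratified average falls below 1, and I^2 comes out close to 0.72, so the apparent bias in favor of men disappears once departments are analyzed separately.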
• Computationally more expensive (need to maximize the marginal likelihood via EM, MCMC, quadrature, etc.)

REMA is an approximate approach to hierarchical models (may even be equivalent in some cases), but easier to calculate.
Trade-off: possibly less than optimal for a large number of very small studies.

For categorical explanatory variables (e.g. ANOVA or contingency tables), the study indicator can be treated as another term, and the heterogeneity is modelled as interaction terms.

Which summary to combine?

[Figure: four forest plots of the admission data (departments and total). Odds ratio: total p = 0.16, I^2 = 0.72. Correlation: total p = 0.4, I^2 = 0.7. Z-test: total p = 0.12. p-value: total p = 0.0081.]

• odds ratio: regression coefficient (average using REMA)
• correlation: measure of dependence or mutual information (average using REMA)
• Z-test: significance (signed); accumulate using Stouffer's method: $\sum_i Z_i / \sqrt{k}$
• p-value: significance (unsigned); accumulate using Fisher's method: $-2 \sum_i \log p_i$
• vote counting method: count rejected null hypotheses (reject: yes, no, no, no, no, no => 1/6)

Spectrum of possibilities in combining analysis

1. Combine raw data
   (+) easy to apply
   (-) potential bias, no heterogeneity assessment
2. Combine coefficients (fold change, hazard and odds ratios, ...)
   (+) physical interpretability
   (-) affected by measurement unit
3. Combine correlation/dependence ($R^2$, $\tanh^{-1}(r)$, ...)
   (+) unit-free
   (-) affected by sampling/design
4. Combine significance measures (t-test, Z-test, p-value, etc.)
   (-) strong effect + low power = weak effect + high power
5. Combine decisions (reject/accept hypothesis, gene lists)
   (+) easy to apply
   (-) lacks power

Outline

• A brief introduction to statistical meta-analysis
• Applications of meta-analysis to omics data
  - An example: breast cancer clinical-expression datasets
  - Differential expression
  - Clustering of genes
  - Clustering of samples
  - Prediction

Breast cancer data collection

Wirapati et al. 2008 Breast Cancer Res
Susanne Kunkel

Dataset symbol | No. of arrays | Institution | Reference | Platform | Data source | No. of GeneIDs
Genomic platforms
NKI | 337 | Nederlands Kanker Instituut | van't Veer 2002, van de Vijver 2002 | Agilent | author's website | 13120
EMC | 286 | Erasmus Medical Center | Wang 2005 | Aff. U133A | GEO:GSE2034 | 11837
UPP | 249 | Karolinska Institute (Uppsala) | Miller 2005, Calza 2006 | Aff. U133A,B | GEO:GSE4922 | 15684
STOCK | 159 | Karolinska Institute (Stockholm) | Pawitan 2005, Calza 2006 | Aff. U133A,B | GEO:GSE1456 | 15684
DUKE | 171 | Duke University | Huang 2005, Bild 2006 | Aff. U95Av2 | author's website | 8149
UCSF | 161+8 | UC San Francisco | Korkola 2003 | cDNA | author's website | 6178
UNC | 143+10 | University of North Carolina | Hu 2006 | Agilent HuA1 | author's website | 13784
NCH | 135 | Nottingham City Hospital | Naderi 2006 | Agilent HuA1 | AE:E-UCON-1 | 13784
STNO | 115+7 | Stanford Univ./Norwegian Radium Hosp. | Sorlie 2003 | cDNA | author's website | 5614
JRH1 | 99 | John Radcliffe Hospital | Sotiriou 2003 | cDNA | journal's website | 4112
JRH2 | 61 | John Radcliffe Hospital | Sotiriou 2006 | Aff. U133A | GEO:GSE2990 | 11837
MGH | 60 | Massachusetts General Hospital | Ma 2004 | Agilent | GEO:GSE1379 | 11421
expO | 239 | International Genomic Consortium | http://www.intgen.org | Aff. U133v2 | GEO:GSE2109 | 16634
TGIF1 | 49 | EORTC trial 10994 | Farmer 2005 | Aff. U133A | GEO:GSE1561 | 11837
BWH | 40+7 | Brigham and Women's Hospital | Richardson 2006 | Aff. U133v2 | GEO:GSE3744 | 16634
Small diagnostic platforms
TRANSBIG | 253 | TRANSBIG Consortium | Buyse 2006 | Agilent | AE:E-TABM-77 | 1052
EMC2 | 180 | Erasmus Medical Center | Foekens 2006 | Aff. (custom) | GEO:GSE3453 | 86
HPAZ | 96 | Hospital La Paz, Madrid | Espinosa 2005 | RT-PCR | paper's appendix | 61

Total: 2865 = 2833 carcinomas + 32 non-malignant breast tissues
No. of the union of all GeneIDs: 17198
No. of GeneIDs common to genomic platforms: 1963
Abbreviations: No. = number, GEO: = Gene Expression Omnibus accession, AE: = ArrayExpress accession, Aff. = Affymetrix

• Reorganize datasets into independent, non-redundant cohorts
• Remap probe(set)s to the same version of the RefSeq subset (NM_* only) using BLAT
• Use the most variable probe(set) as the unique representative of a gene

Clinical variable availability and distributions

[Figure: one row per dataset, with number of carcinomas: NKI 337, TRANSBIG 253, HPAZ 96, EMC 286, EMC2 180, UPP 249, STOCK 159, DUKE 171, UCSF 161, NCH 135, UNC 143, STNO 115, JRH1 99, JRH2 61, MGH 60, TGIF1 49, BWH 40, expO 239, total 2833. Columns: age at diagnosis (year), ER status (-/+), histologic grade (1, 2, 3), size > 2 cm (-/+), lymph node (-/+), adjuvant treatment, available outcome. Outcome totals: R 1890, M 1015, O 2019.]

treatment: u untreated, h hormone, c chemo, b both, x unspecified
patient outcome: R relapse-free, M metastasis-free, O overall survival

Heterogeneity in survival data

[Figure: two panels of survival curves (%), one curve per dataset, follow-up 0 to 10 years. Left panel: UCSF, EMC2, JRH2, STOCK, NCH, UPP, TRANSBIG, NKI, MGH, JRH1, UNC, STNO. Right panel: HPAZ, EMC2, UPP, TRANSBIG, STOCK, NCH, NKI, UCSF, UNC, STNO, DUKE, JRH1.]

Left panel, number at risk by follow-up (year) and number of events:
group | 0 | 2.5 | 5 | 7.5 | 10 | events
STNO | 115 | 39 | 10 | 1 | - | 60
UNC | 128 | 33 | 10 | 4 | - | 32
JRH1 | 99 | 75 | 59 | 30 | 1 | 45
MGH | 60 | 50 | 42 | 28 | 18 | 25
NCH | 135 | 110 | 97 | 81 | 66 | 47
NKI | 319 | 260 | 216 | 131 | 75 | 121
TRANSBIG | 253 | 207 | 170 | 147 | 118 | 101
UPP | 249 | 185 | 158 | 140 | 107 | 88
JRH2 | 61 | 55 | 44 | 38 | 33 | 15
STOCK | 159 | 140 | 124 | 68 | - | 40
EMC2 | 180 | 164 | 149 | 94 | 40 | 37
UCSF | 132 | 97 | 68 | 37 | 11 | 20
total | 1890 | 1415 | 1147 | 799 | 469 | 631

Right panel, number at risk by follow-up (year) and number of events:
group | 0 | 2.5 | 5 | 7.5 | 10 | events
STNO | 115 | 54 | 14 | 3 | 1 | 46
DUKE | 170 | 86 | 46 | 12 | 3 | 43
UNC | 129 | 39 | 14 | 5 | - | 22
JRH1 | 99 | 85 | 70 | 32 | 1 | 45
UCSF | 132 | 104 | 74 | 43 | 14 | 37
NKI | 319 | 290 | 248 | 147 | 89 | 74
STOCK | 159 | 148 | 130 | 64 | - | 40
NCH | 135 | 122 | 111 | 96 | 81 | 34
TRANSBIG | 253 | 240 | 212 | 190 | 154 | 57
UPP | 232 | 198 | 173 | 152 | 122 | 51
EMC2 | 180 | 175 | 166 | 103 | 44 | 23
HPAZ | 96 | 87 | 55 | 20 | 3 | 12
total | 2019 | 1628 | 1313 | 867 | 512 | 484

Variability between studies is greater than that due to natural risk factors or treatments
=> potential bias in pooled (unstratified) analysis

Quality control of the original authors' normalization

• Plot SD-vs-mean of each probe in a dataset
• A characteristic trend for each (platform, normalization) combination

[Figure: three SD vs mean log intensity scatter plots. Panels: "NKI: Agilent (Rosetta) + ?", "EMC: Affy U133 + MAS 5.0", "TGIF1: Affy U133A + RMA".]

Raw instrument data (e.g. CEL files) for renormalization from scratch are not always available.
Possible "post-hoc" corrections:
• Non-parametric variance stabilizing transform
• Global scaling between studies
• Lowess calibration against the mean profile
(In subsequent results in this talk, we used the original data without correction)

Differential Expression Analysis

The transcriptome is "scanned" to search for genes whose change in expression is related to changes in other variables (e.g. clinical outcome or experimental conditions).

Adaptation for multiple datasets:
1. Choose appropriate models that produce an estimate ± standard error (with normal sampling variation, independent of the location estimate); a transformation may be used when appropriate
2. If a gene is missing from a platform, the summary is considered a missing value (and simply ignored)
3. Calculate REMA (estimate, SE, heterogeneity)
4. The usual analysis: ranking, multiple testing, etc. on the combined estimates from REMA

Generalized Linear Models

[Figure: example models of type normal, logistic and survival. Variables shown: histologic grade (low, med, high), estrogen receptor status, survival vs time-to-follow-up (year) for low, med and high grade.]

ER | grade low | grade med | grade high
0 | 4 | 11 | 66
1 | 75 | 98 | 83
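Steps 2 to 4 of the multi-dataset adaptation (ignore missing genes, calculate REMA, then rank and correct for multiple testing) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not say how tau^2 is estimated, so the DerSimonian-Laird moment estimator is an assumption, as are the function names `rema` and `scan`, the use of `None` for a gene absent from a platform, and the toy numbers.

```python
import math


def rema(estimates, std_errors):
    """Random-effects summary of one gene across datasets.

    A dataset whose platform lacks the gene is passed as None and ignored.
    """
    pairs = [(b, s * s) for b, s in zip(estimates, std_errors)
             if b is not None and s is not None]
    k = len(pairs)
    if k == 0:
        return None
    w = [1 / v for _, v in pairs]
    fixed = sum(wi * b for wi, (b, _) in zip(w, pairs)) / sum(w)
    q = sum(wi * (b - fixed) ** 2 for wi, (b, _) in zip(w, pairs))
    tau2 = 0.0
    if k > 1:
        # Assumed estimator: DerSimonian-Laird moment estimate of tau^2.
        c = sum(w) - sum(wi * wi for wi in w) / sum(w)
        tau2 = max(0.0, (q - (k - 1)) / c)
    w_star = [1 / (v + tau2) for _, v in pairs]
    est = sum(wi * b for wi, (b, _) in zip(w_star, pairs)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    z = est / se
    return {
        "est": est,
        "se": se,
        "z": z,
        "pval": math.erfc(abs(z) / math.sqrt(2)),  # two-sided normal p-value
        "I2": max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0,
        "k": k,
    }


def scan(per_gene):
    """Rank genes by REMA p-value and add a Bonferroni-corrected p-value."""
    rows = []
    for gene, (estimates, std_errors) in per_gene.items():
        summary = rema(estimates, std_errors)
        if summary is not None:
            rows.append({"gene": gene, **summary})
    for row in rows:
        row["p_bonf"] = min(1.0, row["pval"] * len(rows))
    return sorted(rows, key=lambda row: row["pval"])


# Toy per-dataset coefficients and standard errors (invented for illustration).
toy = {
    "geneA": ([0.50, 0.45, None, 0.55], [0.10, 0.12, None, 0.15]),
    "geneB": ([0.60, -0.40, 0.10, None], [0.20, 0.20, 0.25, None]),
    "geneC": ([None, None, None, None], [None, None, None, None]),
}
ranked = scan(toy)
```

In the toy data the consistent gene ranks first, the heterogeneous one gets a large I^2 and a wide interval, and the gene absent from every platform is dropped.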
[Figure x-axes: AURKA expression, GATA3 expression, RACGAP1 expression]

An example: prognostic genes in breast cancer

Gene: RACGAP1; Model: Cox proportional hazard
Response variable: metastasis-free survival; explanatory variable: log2 expression

[Forest plots, one row per dataset (DUKE, EMC, EMC2, HPAZ, JRH1, JRH2, MGH, NCH, NKI, STNO, STOCK, TRANSBIG, UCSF, UNC, UPP) plus total. Panels and totals: coeff, p-value 1.6e-09, I^2 = 0.37; std. coeff, p-value 1.1e-18, I^2 = 0.047; Z-test, 1.5e-17; p-value, 3.1e-17.]

coeff: $\log_e$(hazard change) / $\log_2$(fold change); effect size with physical interpretation
std. coeff: measure of correlation (mutual information), equivalent to (pseudo) $R^2$
Z-test: significance, equivalent to p-value, but with direction of effect (-/+)
Only significant (after multiple testing) in two studies

Another example gene: AURKA

[Forest plots, same datasets and panels. Totals: coeff, p-value 2.5e-13, I^2 = 0.4; std. coeff, p-value 5.5e-17, I^2 = 0.15; Z-test, 4.2e-18; p-value, 7.8e-16.]

Coefficients are less heterogeneous than in RACGAP1
Present in all genome-wide platforms

Another example gene: MELK

[Forest plots, same datasets and panels. Totals: coeff, p-value 5.8e-05, I^2 = 0.76; further legible totals: 5.4e-19 and 2.1e-15.]

Coefficients are heterogeneous; correlation (std. coeff) is homogeneous
Normalization issue? Or is the log2 scale less consistent in general?
Not significant in individual studies

Another example gene: BTG2

[Forest plots, same datasets and panels. Totals: coeff, p-value 1.6e-11, I^2 = 0.18; std. coeff, p-value 2.3e-11, I^2 = 0.19; Z-test, 1.2e-13; p-value, 1.4e-10.]

Negative effects (over-expression is protective)

Yet Another Example gene: RPL11

[Forest plots, same datasets and panels. Totals: coeff, p-value 0.063, I^2 = 0.58; std. coeff, p-value 0.074, I^2 = 0.58; Z-test, 0.0048; p-value, 0.00022.]

A gene that doesn't work. (It's a housekeeping gene)

The Usual Analysis and Visualization

Gene rank table
[Table of top-ranked genes with columns including est, z, pval and p.bonf; the values are illegible in the source. Legible gene symbols include CEP55, BIRC5, AURKA, TCEB1, TXNRD1, MYBL2, NDRG1, DDX39, TGFBI, ZWINT.]

p-value histogram
[Figure: histogram of p-values]

Volcano plots
[Figure: two volcano plots, one against coeff and one against std. coeff; red lines mark p.bonf 0.05]

Many significant genes even after the stringent Bonferroni multiple testing correction for > 17,000 genes (red lines, p.bonf 0.05)
Standardized coefficients yield more significant genes (about 400 vs about 300)

Hierarchical Clustering of Genes

1. Calculate the Pearson correlation $r_{ijk}$ for each pair of genes $(i, j)$ in each study $k$
2. $r$ isn't normal (bounded by $[-1, 1]$, asymmetric variance) => transform using (yet another) Fisher's method: $z_{ijk} = \tanh^{-1}(r_{ijk})$, $\mathrm{Var}(z_{ijk}) = 1/(n - 3)$, with $n$ the number of samples in study $k$
3. Combine $z$ using REMA
4. Treat the combined correlations as similarity measures in hierarchical agglomerative clustering. No need to back-transform $z_{ij0}$ to $r_{ij0}$ (irrelevant for single- and complete-link, maybe even better for average-link)
5. Display the heatmaps in a stratified manner

[Figure: stratified heatmaps with gene dendrogram (UPP panel legible), survival in subtypes, and a log level scale (L, B, H). Labelled genes: RPL10, AR, ESR1, GATA3, SPDEF, FOXA1, TFF1, SCUBE2, PGR, PERLD1, ERBB2, GRB7, VEGFA, MKI67, BIRC5, AURKA, CCNB2, STAT1, MX1, GZMA, NDRG1, ADM, LMO4, EGFR, KRT14, FOXC1, DCN, PLAU, MMP11, SPARCL1.]

Hierarchical Clustering of Samples

This doesn't fit the framework of REMA. (Dis)similarity measures are not summary statistics from a regression model; rather, they are a kind of distance.
We would need a separate clustering tree for each study, yet we also need to know the correspondence across studies. Pooling the data is inevitable.
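Steps 1 to 3 of the gene-clustering procedure (per-study Pearson correlation, Fisher z-transform with variance 1/(n - 3), REMA across studies) can be sketched as below. The result is a combined z per gene pair that could be passed, without back-transformation, to any agglomerative clustering routine as a similarity. Assumptions not in the slides: the DerSimonian-Laird estimate of tau^2, the clamping of r to keep the transform finite, the data layout (one dict of gene to expression values per study) and all names.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def combined_z(pairs_by_study):
    """REMA-combined Fisher z for one gene pair.

    pairs_by_study holds one (x, y) tuple per study: the expression vectors of
    the two genes over that study's samples (more than 3 samples each).
    """
    zs, variances = [], []
    for x, y in pairs_by_study:
        r = max(-0.999999, min(0.999999, pearson(x, y)))  # keep atanh finite
        zs.append(math.atanh(r))
        variances.append(1 / (len(x) - 3))
    w = [1 / v for v in variances]
    fixed = sum(wi * z for wi, z in zip(w, zs)) / sum(w)
    tau2 = 0.0
    if len(zs) > 1:
        # Assumed estimator: DerSimonian-Laird moment estimate of tau^2.
        q = sum(wi * (z - fixed) ** 2 for wi, z in zip(w, zs))
        c = sum(w) - sum(wi * wi for wi in w) / sum(w)
        tau2 = max(0.0, (q - (len(zs) - 1)) / c)
    w_star = [1 / (v + tau2) for v in variances]
    return sum(wi * z for wi, z in zip(w_star, zs)) / sum(w_star)


def similarity(studies, genes):
    """Combined z for every gene pair measured together in at least one study."""
    sim = {}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            shared = [(s[g], s[h]) for s in studies if g in s and h in s]
            if shared:
                sim[(g, h)] = combined_z(shared)
    return sim


# Toy expression vectors (invented): g2 follows g1, g3 runs against it.
study = {"g1": [1, 2, 3, 4, 5], "g2": [1, 2, 3, 5, 4], "g3": [5, 4, 3, 1, 2]}
sim = similarity([study, study], ["g1", "g2", "g3"])
```

Because a gene missing from a platform simply contributes no term for that study, the same code covers gene pairs that are only measured together in some of the datasets.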
Expression profiles will be compared between and within studies.
The problem: how to ensure that the similarity measures are biological rather than technical (e.g. due to batch effects), which would result in clustering by dataset of origin.
Simplest solution: mean-center each gene within each dataset before clustering

[Figure: sample clustering without mean centering and with mean centering; stratify by splitting the tree]

Extension to Multi-level Gene Clustering

Multi-stage random-effects meta-analysis can be used both to combine the correlations and to assess differential co-expression using the between-strata variance.
Example: cluster genes in multiple types of cancer, each having multiple studies

Examples of consistently correlated pairs (left) and breast-cancer-only pairs (right)

[Forest plots of z-transformed correlations (x-axis -1.5 to 1.5). Left: AURKA vs TPX2 (proliferation genes). Right: ESR1 vs AR (estrogen and androgen receptor). Rows in both: breast-EMC, breast-NKI, breast-UPP, colon-AARHUS, colon-CINCI, colon-TOKYO, lung-DUKE, lung-MERLION, prostate-SKCC, prostate-TMHS; per-cancer summaries for breast, colon, lung, prostate; total; heterogeneity.]

Dendrogram of 16742 genes

[Figure: stratified heatmaps. Breast cancer: NKI, EMC, UPP. Prostate cancer: SKCC, TMHS. Colon cancer: AARHUS, CINCI, TOKYO. Lung cancer: DUKE, MERLION. Legible sample sizes: n = 148, 65, 155, 105, 84, 198, 72.]

Prediction

Components of classifiers:
• Gene list ("signature"): identified by the feature selection step
• Model parameters (e.g. coefficients, neural network weights, etc.): identified by model fitting
• Cutoff: very difficult to calibrate. Sensitive to changes in the distribution of both predictor variables and outcome, e.g. disease prevalence (or baseline hazard in survival data) in the target populations may be different from those in retrospective study datasets

Naive/Idiot Bayes predictors

Assume conditional independence amongst predictor variables (conditioned on the response).
DLDA, Tukey's compound covariate, etc. are based on this principle. Penalized regression is similar, if the penalty is large.
• Fit gene-by-gene models
• Select top genes
• Use the gene-by-gene coefficients or significance (t-stat or Z-score, or simply the ± signs) as weights in the linear predictor $\sum_i w_i x_i$; the cutoff is to be calibrated from the training set
Still one of the best for microarray data.
=> Most amenable to cross-platform applications, because it's insensitive to the exact weights or missing genes.

Cross validation schemes

1. Within dataset
• Split each dataset into learning and test parts
• Select top genes (ranking based on REMA summaries)
• In each dataset, apply the model with dataset-specific parameters to the test part
• Combine performance
2. Cross-dataset
• Split datasets into learning and test datasets
• Fit the model in the learning datasets
• Apply to the test datasets: global weights, local cutoff (needs its own CV)
=> "Leave-one-dataset-out CV" (LODOCV) is particularly simple

Example of LODOCV: Breast cancer datasets

Cutoff is 30% low-risk

[Figure, left: survival curves (%) of the lo-risk and hi-risk groups, follow-up 0 to 10 years.]

group | 0 | 2.5 | 5 | 7.5 | 10 | events
hi-risk | 1633 | 1216 | 941 | 617 | 363 | 556
lo-risk | 793 | 697 | 592 | 412 | 229 | 119
total | 2426 | 1913 | 1533 | 1029 | 592 | 675
(number at risk by follow-up year)

[Figure, right: per-dataset 5-year DMFS (%) of the hi-risk and lo-risk groups, and hazard ratio (1 to 30), for DUKE, EMC, EMC2, HPAZ, JRH1, JRH2, MGH, NKI, STNO, STOCK, UCSF, UNC, UPP, NCH, TRANSBIG and Total.]

Summary

Multiple omics datasets can be co-analyzed under the framework of "standard" statistical methods (e.g. generalized linear models, meta-analysis, hierarchical sampling models).
Extension to complex analyses (e.g., prediction, cluster analysis) is possible by incorporating REMA for combining summaries at the appropriate stage of the analysis.

Future Work

Release (hopefully soon) R packages for:
• Fast, meta-analytical scanning of GLM (normal, logit, survival)
• Fast multilevel meta-analytical hierarchical clustering

A system for data clean-up and curation (this is the most time-consuming part):
• text mining of clinical data and mapping to ontologies
• QC and renormalization/retransformation of expression data
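As a supplement to the cross-validation slides, the leave-one-dataset-out scheme with global weights and a local cutoff can be sketched as follows. Everything here is illustrative: the function names are hypothetical, the gene weights are reduced to ± signs, Stouffer's method is swapped in to combine per-dataset Z-scores over the training datasets (the slides rank on REMA summaries in the within-dataset scheme and do not specify this step), and the 30% low-risk cutoff is taken from the example slide.

```python
import math


def lodocv_splits(names):
    """Yield (training datasets, held-out dataset) pairs."""
    for held_out in names:
        yield [name for name in names if name != held_out], held_out


def global_weights(z_by_dataset, training, top=2):
    """Sign weights for the top genes, learned from the training datasets only.

    z_by_dataset maps dataset -> {gene: Z-score}; a gene missing from a
    platform is simply absent from that dataset's dict.
    """
    combined = {}
    for gene in set().union(*(z_by_dataset[d] for d in training)):
        zs = [z_by_dataset[d][gene] for d in training if gene in z_by_dataset[d]]
        combined[gene] = sum(zs) / math.sqrt(len(zs))  # Stouffer's method
    chosen = sorted(combined, key=lambda g: (-abs(combined[g]), g))[:top]
    return {g: (1 if combined[g] > 0 else -1) for g in chosen}


def risk_score(sample, weights):
    """Linear predictor sum_i w_i x_i over signature genes present in the sample."""
    return sum(w * sample[g] for g, w in weights.items() if g in sample)


def call_low_risk(scores, low_risk_fraction=0.30):
    """Local cutoff: the lowest-scoring fraction of the held-out dataset is low-risk."""
    ranked = sorted(scores)
    n_low = int(round(low_risk_fraction * len(ranked)))
    cutoff = ranked[n_low - 1] if n_low > 0 else float("-inf")
    return [score <= cutoff for score in scores]


# Toy per-dataset Z-scores (invented for illustration).
z_by_dataset = {
    "A": {"g1": 3.0, "g2": -2.0, "g3": 0.1},
    "B": {"g1": 2.5, "g2": -2.5},
    "C": {"g1": 2.0, "g3": -0.2},
}
splits = list(lodocv_splits(["A", "B", "C"]))
weights_for_a = global_weights(z_by_dataset, splits[0][0])
```

The weights never see the held-out dataset, while the cutoff is set inside it, which is what makes the predictor tolerant of differences in scale and baseline risk between cohorts.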