Omics technologies: genomics, transcriptomics, metabolomics, databases, personalized medicine and big data

Vendula Pospíchalová, PhD (pospich@sci.muni.cz)
Department of Experimental Biology, Animal Physiology and Immunology
BÍ5599 Applied Biochemistry and Cell Biology Methods
2021-11-08

Figure: Schematic representation of omics technologies, their corresponding analysis targets, and assessment methods. Taken from Wu RD et al. JDR 2011; 90:561-572.
Labels recoverable from the figure (starting material: tissue/cell lines):
• Comparative genome hybridization to DNA microarrays; DNA sequencing
• Mutation screening: mass-spectrometry-based genotyping, mutation-specific PCR
• Metabolomics and lipidomics: metabolomic profiling of lipids, carbohydrates and amino acids
• Gene-expression profiling: DNA microarrays, multiplex PCR
• MicroRNA-expression profiling: DNA microarrays, multiplex PCR
• Proteomics: proteomic profiling (mass spectrometry); phosphoproteomic profiling (mass spectrometry after immunoprecipitation with phosphotyrosine-specific antibodies)

Contents
1. Introduction: what are -omics technologies + history
2. How does big data look and how to approach it
3. From -omics technologies to biomarkers and personalized medicine
4. Genomics: genomes vs exomes vs genotypes + DTC service
5. Cancer databases: COSMIC, TCGA and others
6. Transcriptomics: microarrays vs. RNA sequencing
7. Gene set analysis
8. Metabolomics
9. Cutting edge: single cell -omics and single cell multi-omics
10. Summary and take home messages

What are "-omics" technologies
• Omics refers to a field of study in biology ending in -omics, such as genomics, proteomics or metabolomics
• The related suffix -ome is used to address the objects of study of such fields, such as the genome, proteome or metabolome
• -ome = many/collectivity or whole/all/complete in Greek
• -omics = study of large sets of biomolecules
• High-throughput experimental technologies characterized by automation, miniaturized assays and large-scale data analysis
• The analysis usually takes much longer than the experiment itself, so bioinformatics skills are needed
• Raw data are the "gem" but usually come in a user-unfriendly format
• Interpreting the functional consequences of millions of discovered events is one of the biggest challenges

Big -omics data challenges
Figure: easy data generation (DNA sequencer).
Data sharing policy
• The concepts of data sharing and open data are becoming increasingly important in science
• Funding bodies, journals and societies now encourage or mandate data sharing (usually of the raw data)
• Sharing data publicly is an important way of improving reproducibility and showing that researchers are confident in their work
• Studies with raw data shared in a repository also receive more citations than those without publicly available data
• But raw -omics data are hard to analyse, so many platforms gather the publicly available data, thoroughly analyze and curate them, and share them in a user-friendly format

Leveraging Public Databases to Identify Actionable Targets
Figure: public resources used for target discovery, drug discovery and drug response biomarker discovery: 1000 Genomes, The Cancer Genome Atlas (National Cancer Institute, National Human Genome Research Institute), GTEx Portal, Connectivity Map, ChEMBL, ClinicalTrials.gov, CCLE, PubChem.

DIKW pyramid
"Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom." - Clifford Stoll
Jennifer Rowley, published 2007 in J. Information Science, DOI 10.1177/0165551506070706

What is the aim of OMICS technologies
In genomics:
• Omic data: a sequence of 3 billion letters in a .txt file
• Biological information: where the individual's genome varies from the reference sequence
• Clinical knowledge: CYP2C9 or TPMT genotype, which has known pharmacogenomic associations
• Action: individualize the dose of a new warfarin prescription

The "Omic" funnel (figure), from wide to narrow:
• High-throughput sequence data, methylation, tissue array, tertiary structure, etc.
• SNPs, network activation, indels, CNVs, rearrangements, etc.
• Clinically relevant "omic" findings
• Personalized health care

The DIKW pyramid metaphor: "know-nothing" (Data), "know-what" (Information), "know-how" (Knowledge), "know-why" (Wisdom). Zeleny (2005)
http://www.jpathinformatics.org/text.asp?2015/6/1/46/163985

What is personalized health care?
Personalized medicine, sometimes referred to as precision or individualized medicine, is an emerging field of medicine that uses diagnostic tools to identify specific biological markers, often genetic, to help assess which medical treatments and procedures will be best for each patient.

Figure: therapy without and with personalized medicine.
• Without personalized medicine (some benefit, some do not): patients receive the same therapy; outcomes are benefit, no benefit, or adverse effects
• With personalized medicine (each patient receives the right medicine for them): biomarker diagnostics sort patients into those who respond to the normal dose, a lower dose, a higher dose, or an alternative medication, so each patient benefits from individualized treatment
https://pharma.bayer.com/en/research-and-development/research-focus/oncology/personalized-medicine/index.php

Value of personalized medicine
• $5 billion: estimated annual cost of wasted prescription drugs in the U.S.
• $3 billion: estimated cost of wasted hospital ...
• 27% of all NMEs approved by the FDA in 2016 are personalized medicines
• 50% of the personalized medicines approved by the FDA in 2016 are oncology drugs
https://invivo.pharmaintellig

Oncology is on the Leading Edge of Personalized Medicine
In ten years, cancer patients have seen a four-fold increase in their personalized medicine treatment options.
Figure: Breakdown of Oncology Treatment Modalities, Global Market Share 2003-2013 (2003 vs 2013; categories: targeted, cytotoxics, supportive care, hormonals).

Personalized Medicine Can Create Efficiencies in the Health Care System
• Breast cancer: a reduction in chemotherapy use would occur if women with breast cancer received a genetic test of their tumor prior to treatment
• Metastatic colorectal cancer: $604 million in annual health care cost savings would be realized if patients with metastatic colorectal cancer received a genetic test for the KRAS gene prior to treatment
• Stroke: strokes could be prevented each year if a genetic test were used to properly dose blood thinners

Personalized medicines on the market
A greater understanding of the molecular basis of disease has transformed what was once known collectively as "disease of the blood" into multiple subtypes of leukemias and lymphomas with a collective 5-year survival rate of 70%.
https://www.mtan.org/
• Nearly 250 medicines are in development for blood cancers as survival rates have grown to 70%
• Leukemia is now divided into chronic and acute leukemia, with ~40 unique leukemia types identified
• Lymphoma is now divided into indolent and aggressive lymphoma, with ~50 unique lymphoma types identified
Figure: 5-year survival rates for CML patients nearly triple after the introduction of imatinib (prior to vs after the introduction of imatinib; 2008, 2012).
And many more examples: see http://www.personalizedmedicinecoalition.org for more detailed information on PM.

What is a biomarker?
A biomarker is a characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, disease processes, or biological responses to a therapeutic intervention. Biomarkers can be used to reduce uncertainty and guide clinical care.
Biomarkers help inform medical decisions:
• Prevention measures?
• Which diagnosis?
• Treat or don't treat?
• What dose?

How do you detect a biomarker?
Diagnostics: blood draw, microscopic analysis, gene sequencing, biopsy, protein analysis.
Molecular biomarkers can include proteins.
-OMICS technologies and their integration are crucial for biomarker discovery and validation.
Source: National Cancer Institute, "NCI Dictionary of Cancer Terms" (accessed M

History of "-omics" technologies

Sanger vs NGS (NGS = next generation sequencing)
• Human genome: 3.3×10^9 bases, ~20,000 genes
• Sequencing of the human genome using Sanger technology took more than a decade and cost an estimated $70 million
• In 3 days (one run), an Illumina HiSeq 4000 is able to produce 1,680×10^9 bases for ~$32,000
• The genome is the central part of all -omics technologies

Brief history of DNA sequencing
• 1953: Discovery of DNA structure by Watson and Crick
• 1973: First sequence of 24 bases published
• 1977: Sanger sequencing method published
• 1982: GenBank started
• 1987: First automated sequencer: Applied Biosystems Prism 373 (up to 600 bases)
• 1996: First capillary sequencer: ABI 310
• 2000-2003: Human genome sequenced
• 2005-: First NGS sequencers: 454 Life Sciences, Solexa/Illumina, Helicos, Ion Torrent
© slideshare.net

Sanger vs next generation sequencing
• Sanger sequencing: https://www.youtube.com/watch?v=e2G5zx-OJIw
• Next generation sequencing (Illumina is shown as an example): https://www.youtube.com/watch?v=9YxExTSwqPM

Figure: the same SNP (single nucleotide polymorphism) seen by Sanger sequencing (chromatogram) and by Illumina sequencing (reads carrying the "A" allele and the "C" allele).

Sanger sequencing
• Advantages: lowest error rate (1.5%); long read length (~750 bp); can target a primer; used to confirm NGS results
• Disadvantages: high cost per base

Figure: Illumina sequencing, where each cluster yields one read (Cluster 1 > Read 1: GAGT..., Cluster 2 > Read 2: TTGA..., Cluster 3 > Read 3: CTAG..., Cluster 4 > Read 4: ATAC...).
The reads are written to a text file.

B. Cluster amplification: The library is loaded into a flow cell and the fragments are hybridized to the flow cell surface. Each bound fragment is amplified into a clonal cluster through bridge amplification.

C. Sequencing: Sequencing reagents, including fluorescently labeled nucleotides, are added and the first base is incorporated. The flow cell is imaged and the emission from each cluster is recorded. The emission wavelength and intensity are used to identify the base. This cycle is repeated "n" times to create a read length of "n" bases.

D. Alignment and data analysis: Reads are aligned to a reference sequence with bioinformatics software. After alignment, differences between the reference genome and the newly sequenced reads can be identified.
www.illumina.com/technology/next-generation-sequencing.html

DTC (direct-to-consumer) genetic testing
Example providers: ancestry.com ("Build a family tree to see your story emerge"), 23andMe.

Genotyping vs Sequencing
23andMe Ancestry + Traits Service, $79 (reduced from $99): for the most comprehensive ancestry breakdown on the market.
• 2000+ geographic regions
• Automatic Family Tree Builder
• 30+ Trait reports
• DNA Relative Finder

Genotyping: determining which genetic variants an individual possesses through a variety of different methods, especially genotyping chips (based mostly on SNPs, single nucleotide polymorphisms). It is cheap, but requires prior identification of the variants of interest.

23andMe Health + Ancestry Service: for a more complete picture of your health with insights from your genetic data. Everything in Ancestry + Traits, plus:
• 65+ health reports and features
• Health Predisposition reports
• Wellness reports
• Carrier Status reports
• Family Health History Tree

23andMe+ Membership: the Health + Ancestry Service plus access to new premium reports and features throughout the year. Everything in Health + Ancestry, plus instant access to exclusive reports and features, including:
• Heart Health reports
• Pharmacogenetics reports (how you process certain medications)
• Migraine report (powered by 23andMe research)
• Obstructive Sleep Apnea report (powered by 23andMe research)
• New reports and features as more discoveries are made

Methods
We use genotyping technology to look at specific genetic variants in the genome that can be most informative about an individual's health and ancestry. Unlike sequencing, which analyses all nucleotides in a gene to identify changes, genotyping detects specific known variants within the genome. 23andMe uses a custom Illumina HumanOmniExpress-24 format chip that analyses approximately half a million variants.
This custom chip has been designed to include variants:
• In medically relevant genes
• Involved in drug metabolism, efficacy and side effects
• With known disease associations
• Associated with traits
• Used to assign genetic ancestry and ethnicity
https://www.23andme.com/

How SNP genotyping works
https://www.youtube.com/watch?v=Naona1 v I2U
• For more information see the YouTube channel Useful Genetics: https://www.youtube.com/channel/UCtXCrx28msMBQ-vFUIOIReA

GENOMES vs EXOMES vs GENOTYPES
https://www.jax.org/news-and-insights/jax-blog/2016/september/genomes-versus-exomes-versus-genotypes

SNP - single nucleotide polymorphisms
• The most common type of genetic variation
• Occur almost once in every 1,000 nucleotides on average, which gives 4 to 5 million SNPs in a person's genome
• May be unique or occur in many individuals; scientists have found more than 100 million SNPs in populations around the world
• Found most commonly in non-coding DNA
• Can act as biological markers, helping locate genes associated with disease
• Most SNPs have no effect on health or development
• Some SNPs have proven to be very important in the study of human health: they may help predict an individual's response to certain drugs, susceptibility to environmental factors such as toxins, and risk of developing particular diseases
• SNPs can also be used to track the inheritance of disease genes within families

How SNP genotyping works
Figure: panels (A), (B) and (C), showing Sample 1 (GG type) and Sample 2 (AA type).
There are two types of microarray commonly used in multiplexing SNP analysis: allele-specific oligonucleotide (ASO) hybridization and allele-specific primer (ASP) extension. (A) ASO hybridization: The allele-specific oligonucleotide for every SNP is synthesized and separately immobilized onto the glass plate. Fluorescence-labeled targets containing SNP sites are produced from a PCR reaction and plotted separately into each well to conduct the hybridization reaction.
A mismatched base pair between target and oligonucleotide decreases the binding strength, so the fluorescence-labeled target is removed by stringent washing. A fluorescence signal is detected only for a perfectly matched base pair; (B) Allele-specific primer (ASP) extension: The specific primer for the SNP location is designed and separately immobilized onto a microarray. A differently fluorescence-labeled dNTP is used individually in an extension reaction. An extended fragment showing a fluorescence signal is found only when the 3' end of the primer is perfectly matched (AA type in this case), in contrast to the mismatched primer (GG type in this case); (C) The SNP genotype can be determined according to the fluorescence intensity from the products/target DNA.
https://doi.org/10.3390/microarrays4040570

DTC genome sequencing as popular demand
Dante Labs analyzes 100% of your DNA, so that we can give you reports on predispositions on any genetic disease. You will receive easy reports for you and your doctor, as well as raw data to explore.
www.dantelabs.com
My Full DNA: Whole Genome Sequencing with mtDNA, €449.00 EUR (reduced from €850.00 EUR; you save €401.00 EUR)

Sequencing - WGS and WES
• Determining the exact DNA sequence
• Whole genome sequencing: ~3,000,000,000 bases (100% of human genome)
• Whole exome sequencing: ~60,000,000 bases (~2% of human genome)
• Large-scale genotyping: ~1,000,000 bases (~0.03% of human genome)

"Non-coding DNA" was long thought of as junk DNA, but as we understand more about our genetics we now know these regions play a hugely important role in regulating the coding portions of our DNA. Our understanding of these regions and their interactions is relatively poor compared to our knowledge of the DNA coding regions.
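The percentages quoted for whole genome sequencing, whole exome sequencing and large-scale genotyping follow from simple division. A minimal sketch, assuming a haploid human genome of roughly 3 billion bases (a round figure); the constant, dictionary and function names are illustrative and not part of the lecture:

```python
# Assumption: a round figure of ~3 billion bases for the haploid human genome.
GENOME_BASES = 3_000_000_000

# Approximate number of bases interrogated by each approach (from the slide).
TARGETS = {
    "Whole genome sequencing": 3_000_000_000,
    "Whole exome sequencing": 60_000_000,
    "Large-scale genotyping": 1_000_000,
}


def fraction_of_genome(bases: int, genome: int = GENOME_BASES) -> float:
    """Return the share of the genome that is interrogated, in percent."""
    return 100.0 * bases / genome


for name, bases in TARGETS.items():
    share = fraction_of_genome(bases)
    print(f"{name}: {bases:,} bases = {share:.2f}% of the genome")
```

Under these assumptions the exome comes out at 2% and a one-million-position genotyping chip at roughly 0.03% of the genome.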
https://www.mygenefoo

Genomes vs exomes vs genotypes
WGS (whole genome sequencing)
• Sequencing region: whole genome
• Sequencing depth: >30X
• Covers everything: can identify all kinds of variants including SNPs, INDELs and SV
• Results are sometimes challenging to interpret
WES (whole exome sequencing)
• Sequencing region: whole exome
• Sequencing depth: >50X to 100X
• Identifies all kinds of variants including SNPs, INDELs and SV in the coding region
• Cost effective
• Good alternative to WGS in terms of clinical use
Hotspot sequencing = targeted sequencing
• Sequencing region: specific regions (could be customized)
• Sequencing depth: >500X
• Identifies all kinds of variants including SNPs and INDELs in specific regions
• Most cost effective
• Most sensitive: able to detect rare tumor cells in a biopsy
https://2wordspm.wordpress.com/2017/10/30/ngs-%EA%B2%80%EC%82%AC-whole-genome-exom

What to expect
• Genetic testing provided by most of the companies is more or less for fun (ancestry, health and wellness, nutrigenetics, skincare, sports, ...)
• More expensive, and more complete, sequencing like that provided by Illumina can be used for medical investigation
• Do not expect your genome sequence to tell you your life expectancy, whether you are likely to get cancer, and so on
• So far our knowledge of the "implications" of the genome is quite limited
• What we can already do in health care is to look at the genome once you have been diagnosed with a specific ailment and look for specific genes that would make one treatment more effective than another (this has become normal practice in the treatment of some cancers)
Based on http://sites.ieee.org/futuredirections/2017/12/26/did-you-get-your-genome-sequenced-for-christmas/

Example of genetic testing in clinical practice
BRCA gene testing for PARP inhibitor treatment
BRACAnalysis CDx®, ovarian cancer, overview: Mutations in BRCA1 or BRCA2 cause Hereditary Breast and Ovarian Cancer Syndrome (HBOC).
Now mutations in the BRCA1 and BRCA2 genes provide an indication for treatment with Lynparza® (olaparib) for patients with ovarian cancer. Specifically, BRACAnalysis CDx® is the only FDA-approved laboratory developed test approved to be used to inform treatment decisions for the PARP inhibitor, Lynparza. A positive BRACAnalysis CDx result in patients with ovarian cancer is also associated with enhanced progression-free survival (PFS) from Zejula™ (niraparib) maintenance therapy.
More info: https://www.youtube.com/watch?v=ilwMGRH276M

PARP inhibitors
In December 2014, the drug olaparib (Lynparza) became the first of a new class of treatments known as PARP (poly(ADP-ribose) polymerase) inhibitors to be licensed for clinical use, heralding a new era for personalised, targeted treatment and turning the promise of "synthetic lethality" into reality.

Figure: synthetic lethality concept (BER = base excision repair).
A. Functioning PARP enzyme: single-strand DNA break, PARP, DNA repair.
B. PARP enzyme inhibited: single-strand DNA break, PARP inhibitor, no DNA repair, collapsed replication fork, double-strand DNA break. A double-strand break is repaired by one of two pathways:
• Homologous recombination (HR): uses the sister chromatid as template; G2/M, after DNA replication; high fidelity, error-free; BRCA1 and BRCA2 dependent (lost in BRCA deficiency)
• Non-homologous end joining (NHEJ): no template; DNA trimmed and ligated; error-prone; leads to genetic instability
C. Deficiency in HR and BER together leads to synthetic lethality:

Condition | HR | BER | Outcome
Normal cells | + | + | Viable
BRCA deficient | - | + | Viable
Normal cells, PARP inhibitor | + | - | Viable
BRCA deficient, PARP inhibitor | - | - | Cell death

More info on PARPi: https://www.youtube.com/watch?v=mgW30YyaJz4
https://doi.org/10.1016/j.ygyno.2015.02.017

The Present and Future of Genome Sequencing
• Genomics England: 100,000 patients with rare diseases, their families, and cancer patients
• Precision Medicine Initiative (PMI): 1-million-volunteer health study, data including genetics and lifestyle factors
• GenomeAsia 100K: genomic data for Asian populations
• ... and many more initiatives
https://labiotech.eu/features/genome-sequencing-review-projects/
• How to handle such a huge amount of data and the ethical implications?
• The US has the Genetic Information Nondiscrimination Act (2008), but most other countries have no such act and the legal position in Europe is somewhat grey

COSMIC: Catalogue of Somatic Mutations in Cancer
https://cancer.sanger.ac.uk/cosmic/
COSMIC v94, released 28-MAY-21
COSMIC, the Catalogue Of Somatic Mutations In Cancer, is the world's largest and most comprehensive resource for exploring the impact of somatic mutations in human cancer. Start using COSMIC by searching for a gene, cancer type, mutation, etc. below.
e.g. Braf, COLO 829, Carcinoma, V600E, BRCA-UK, Campbell

Projects
COSMIC is divided into several distinct projects, each presenting a separate dataset or view of the data:
• COSMIC: the core of COSMIC, an expert-curated database of somatic mutations
• Cell Lines Project: mutation profiles of over 1,000 cell lines used in cancer research
• COSMIC-3D: an interactive view of cancer mutations in the context of 3D structures
• Cancer Gene Census: a catalogue of genes with mutations that are causally implicated in cancer
• Cancer Mutation Census: classification of genetic variants driving cancer
• Actionability: mutations actionable in precision oncology

COSMIC News
• Digging for rare finds - three breast cancer publications to keep a watch for in v95: COSMIC v95 will have a focus on rare female cancers, including rare breast cancers.
• Curating the future of precision oncology, an interview with Steve Jupe: the curation process, the background to Actionability, and innovative uses of COSMIC data.
• COSMIC release v94 is live: a focus on rare lung cancers and rare pancreatic cancers, and curation of somatic mutations in 12 hallmark apoptosis genes. Along with this, data for 9 cancer hallmark genes are also updated.

Tools
♦ Cancer Browser: browse COSMIC data by tissue type and histology
♦ Genome Browser: browse the human genome with COSMIC annotations
♦ GA4GH Beacon: access COSMIC data through the GA4GH Beacon Project
♦ COSMIC in BigQuery: search COSMIC via the ISB Cancer Genomics Cloud

How to use the COSMIC database: https://www.youtube.com/watch?v=2FD5RabgK6o, https://www.youtube.com/watch?v=k477uAiKx74

TCGA: The Cancer Genome Atlas
The Cancer Genome Atlas Program
The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between the National Cancer Institute and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions. Over the next dozen years, TCGA generated over 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data. The data, which have already led to improvements in our ability to diagnose, treat, and prevent cancer, will remain publicly available for anyone in the research community to use.
TCGA Outcomes & Impact: TCGA has changed our understanding of cancer, how research is conducted, how the disease is treated in the clinic, and more.
TCGA's PanCancer Atlas: A collection of cross-cancer analyses delving into overarching themes on cancer, including cell-of-origin patterns, oncogenic processes and signaling pathways. Published in 2018 at the
https://www.youtube.com/watch?time_continue=249&v=epsZjJ_Aly4
https://cancergenome.nih.gov/

TCGA: Overview
• Initiated in 2005
• A joint effort of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI)
• 27 participating institutes in the US and Canada
• The overarching goal of TCGA is to improve our ability to diagnose, treat and prevent cancer, through the application of genome analysis technologies, including large-scale genome sequencing
• The Cancer Genome Atlas Network has published more than 20 papers since the project began (https://tcga-data.nci.nih.gov/docs/publications/)

Infographic: TCGA by the numbers
• TCGA produced over 2.5 petabytes of data. To put this into perspective, 1 petabyte of data is equal to 212,000 DVDs
• TCGA data describes 33 different tumor types, including 10 rare cancers, based on paired tumor and normal tissue sets collected from 11,000 patients, using 7 different data types

TCGA results & findings
• Molecular basis of cancer: improved our understanding of the genomic underpinnings of cancer. For example, a TCGA study found the basal-like subtype of breast cancer to be similar to the serous subtype of ovarian cancer on a molecular level, suggesting that despite arising from different tissues in the body, these subtypes may share a common path of development and respond to similar therapeutic strategies.
• Tumor subtypes: TCGA revolutionized how cancer is classified by identifying tumor subtypes with distinct sets of genomic alterations.
• Therapeutic targets: identified genomic characteristics of tumors that can be targeted with currently available therapies or used to help with drug development. TCGA's identification of targetable genomic alterations in lung squamous cell carcinoma led to NCI's Lung-MAP Trial, which will treat patients based on the specific genomic changes in their tumor.
• The team: collaborating institutions across the United States and Canada

What's next?
The Genomic Data Commons (GDC) houses TCGA and other NCI-generated data sets for scientists to access from anywhere. The GDC also has many expanded capabilities that will allow researchers to answer more clinically relevant questions with increased ease.
www.cancer.gov/ccg

TCGA Data Portal
https://portal.gdc.cancer.gov/
NCI Genomic Data Commons Data Portal: harmonized cancer datasets. Get started by exploring Projects, Exploration, Analysis or Repository, or search for e.g. BRAF, Breast, TCGA-BLCA, TCGA-A5-A0G2.
Data Portal Summary, Data Release 31.0 - October 29, 2021:
• Projects: 70
• Primary sites: 67
• Cases: 85,415
• Files: 649,152
• Genes: 23,621
• Mutations: 3,599,319
The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

TCGA: A Valuable Resource for Research Community
TCGA data types
• Clinical data
• DNA sequencing
• miRNA sequencing
• Protein expression
• mRNA sequencing
• Total RNA sequencing
• Array-based expression
• DNA methylation
• Copy number variations
+ Computational tools
Figure: number of publications per year (TCGA research network).
How to use TCGA: https://www.youtube.com/playlist?n

Transcriptomics
Study of the transcriptome, the sum of all RNA transcripts.
The two most widely studied types of RNA:
• mRNA - protein-coding RNA; the transcriptome of expressed genes. Usually transcripts with a poly(A) tail.
• miRNA - small non-coding RNA (about 21-25 nucleotides), important in gene regulation.
Array-based expression profiling: https://www.youtube.com/watch?v=6ZzFihESjpO

Types of RNA molecules
• mRNA (messenger RNA): protein-coding RNA
• ncRNA (non-coding RNA): transcribed RNA with a structural, functional or catalytic role
  • rRNA (ribosomal RNA): participates in protein synthesis
  • tRNA (transfer RNA): interface between mRNA and amino acids
  • snRNA (small nuclear RNA): includes RNA that forms part of the spliceosome
  • snoRNA (small nucleolar RNA): found in the nucleolus, involved in modification of rRNA
  • RNAi (RNA interference): small non-coding RNA involved in regulation of expression
    • miRNA (microRNA): small RNA involved in regulation of expression
    • siRNA (small interfering RNA): active molecules in RNA interference
  • Other: including large RNA with roles in chromatin structure and imprinting

Microarrays vs RNA-seq
DNA microarray (cDNA samples hybridized; fluorescent relative intensity gives expression levels)
• Low sensitivity
• Low dynamic range
• Known transcripts only
• No alternative splicing information
• Lower cost
RNA-seq (cDNA samples sequenced; sequencing reads give expression levels)
• High sensitivity
• High dynamic range
• Novel transcript sequences identified
• Structural variation and alternative splicing revealed
• Unlimited sample comparisons
• While methods for analyzing microarray data are fully mature and straightforward, there is no consensus on which pipelines (series of computational steps) to use to analyze RNA-seq data.
https://www.the-scientist.com/lab-tools/an-array-of-options-35381

Overview of RNA-seq
Figure: RNA-seq workflow.
1. Samples of interest: condition 1 (e.g. tumor), condition 2 (e.g. normal)
2. Isolate RNAs
3. Generate cDNA, fragment, size select, add linkers
4. Sequence ends: 100s of millions of paired reads, 10s of billions of bases of sequence
5. Map to genome, transcriptome, and predicted exon junctions (short reads, including short reads split by introns)
6. Downstream analysis
By Malachi Griffith, Jason R. Walker, Nicholas C. Spies, Benjamin J. Ainscough, Obi L. Griffith, http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004393, CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=53055894

RNA sequencing downstream analysis
• https://www.youtube.com/watch?v=tlf6wYJrwKY (from 13:10)
• More info about microarray vs. RNA-seq: https://www.youtube.com/watch?v=2c3t3tDEmsU
• More info about RNA-seq: https://www.youtube.com/watch?v=MFRkwXq6v I
• Useful detailed info about anything connected to RNA-seq: https://www.rna-seqblog.com

Examples of transcriptomics data outputs
Figure: heat map (samples vs genes) and volcano plot (x axis: log FC; points classified as Down, Not Sig, Up).

Cellular/functional/pathway analysis
Cellular/functional/pathway analysis is a valuable tool to summarize high-dimensional gene expression data in terms of biologically relevant sets. Genes are aggregated into gene sets on the basis of shared biological or functional properties as defined by a reference knowledge base. Knowledge bases are database collections of molecular knowledge which may include molecular interactions, regulation, molecular product(s) and even phenotype associations.
Useful info in Czech language: https://portal.matematickabiologie.cz

Figure: from measurement to pathway analysis.
• Microarray: hybridization, scanning images, quantification; raw intensities; preprocessing (background correction, normalization, summarization); expression levels of transcripts (continuous)
• RNA-seq: sequencing, base call; short reads; aligned to reference genome, known isoform and exon-junction sequences; expression levels of transcripts (counts); novel transcripts
• Both: statistical analysis (usually hundreds to thousands of genes), then cellular/functional/pathway analysis

KEGG
Database resources for understanding high-level functions and utilities of the biological system.
Database tools:
• KEGG (Kyoto Encyclopedia of Genes and Genomes) (https://www.genome.jp/kegg/)
• Disadvantage: does not provide statistical significance of particular pathways
• And many others available online

KEGG analysis
Figure: KEGG cell cycle pathway map with expression data overlaid on the KEGG graph (rendered by Pathview).

Gene-set analysis (GSA)/Pathway analysis
Gene Ontology (GO) analysis (http://geneontology.org/)
Current release 2021-10-26: 1,542,562 gene products from 5,086 species (see statistics on the site).
The Gene Ontology resource: The mission of the GO Consortium is to develop a comprehensive, computational model of biological systems, ranging from the molecular to the organism level, across the multiplicity of species in the tree of life.
GO Enrichment Analysis is powered by PANTHER.
The Gene Ontology (GO) knowledgebase is the world's largest source of information on the functions of genes.
This knowledge is both human-readable and machine-readable, and is a foundation for computational analysis of large-scale molecular biology and genetics experiments in biomedical research.
• Ontology: the network of biological classes describing the current best representation of the "universe" of biology: the molecular functions, cellular locations, and processes gene products may carry out.
• Annotations: statements, based on specific, traceable scientific evidence, asserting that a specific gene product is a real exemplar of a particular GO class.
• GO Causal Activity Model (GO-CAM): provides a structured framework to link standard GO annotations into a more complete model of a biological system.
• Tools to curate, browse, search, visualize and download both the ontology and annotations. Includes bioinformatic guides (Notebooks) and simple API access to integrate the GO into your research.

Example data of GO enrichment analysis
GO enrichment analysis
• One of the main uses of the GO is to perform enrichment analysis on gene sets. For example, given a set of genes that are up-regulated under certain conditions, an enrichment analysis will find which GO terms are over-represented (or under-represented) using annotations for that gene set.
• 3 main GO aspects (molecular function, biological process, cellular component)
http://geneontology.org/docs/go-enrichment-analysis/
[Figure: pie chart of enriched GO terms, including vacuolar and endosomal transport, cellular iron ion homeostasis, Met, Cys and Thr metabolism, proteasome complex, ubiquitin binding, multivesicular body sorting pathway, protein metabolic process, ubiquitin-dependent protein catabolic process]
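The over-representation idea behind GO enrichment analysis can be illustrated with a one-sided hypergeometric test. The following is a minimal sketch in Python under stated assumptions, not the procedure used by the GO/PANTHER tool: the function names `enrichment_p_value` and `term_enrichment` and the toy gene identifiers are hypothetical and chosen only for illustration.

```python
from math import comb


def enrichment_p_value(n_universe, n_in_term, n_selected, n_overlap):
    """Probability of seeing at least n_overlap annotated genes by chance.

    Upper tail of the hypergeometric distribution: n_selected genes are
    drawn without replacement from n_universe genes, of which n_in_term
    carry the annotation.
    """
    max_overlap = min(n_in_term, n_selected)
    total = comb(n_universe, n_selected)
    favourable = sum(
        comb(n_in_term, k) * comb(n_universe - n_in_term, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return favourable / total


def term_enrichment(universe, term_genes, selected):
    """Return (overlap size, p-value) for one annotation term."""
    # Restrict both sets to the genes that were actually measured.
    term_genes = term_genes & universe
    selected = selected & universe
    overlap = len(term_genes & selected)
    p = enrichment_p_value(len(universe), len(term_genes), len(selected), overlap)
    return overlap, p


# Toy data (hypothetical identifiers): 20 measured genes, 5 annotated to one
# term, 4 up-regulated genes of which 3 carry the annotation.
universe = {f"gene{i}" for i in range(1, 21)}
term_genes = {"gene1", "gene2", "gene3", "gene4", "gene5"}
upregulated = {"gene1", "gene2", "gene3", "gene10"}

overlap, p = term_enrichment(universe, term_genes, upregulated)
print(f"overlap={overlap}, p={p:.4f}")
```

In a real analysis this test would be repeated for every term, so the resulting p-values need a multiple-testing correction before any term is called enriched.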
Reactome
More info at: https://www.youtube.com/user/Reactome/videos
Version 78 released on October 13, 2021:
• 2,546 human pathways
• 13,890 reactions
• 10,720 proteins
• 1,940 small molecules
• 507 drugs
• 34,025 literature references

Metabolomics
• Metabolomics: large-scale systematic study of the metabolome
• Metabolome: total complement of metabolites present in a biological sample under given genetic, nutritional or environmental conditions; the unique biochemical fingerprint of all cellular processes
• Metabolite: low molecular weight (usually 50-1,500 Da) organic compound, typically involved in a biological process as a substrate or product
• Metabolomics yields many insights into basic biological research in areas such as systems biology, metabolic modelling, pharmaceutical research, nutrition and toxicology
[Figure: the metabolome sits between genome and phenotype and is shaped by genetic and epigenetic factors, environment, diseases and nutrition; amino acids, lipids (lipidomics) and other metabolites are shown next to phenotype/function]
https://polypdx.com/for-healthcare-providers/metabolomics

Metabolites are important
• >95% of all diagnostic clinical assays test for small molecules
• 89% of all known drugs are small molecules
• 50% of all drugs are derived from preexisting metabolites
• 30% of identified genetic disorders involve diseases of small molecule metabolism
• Small molecules serve as cofactors and signaling molecules to 1000's of proteins
• Metabolomics can therefore be seen as bridging the gap between genotype and phenotype

Human Metabolomes (2015)
• 3670 (T3DB): toxins/env. chemicals
• 1240 (DrugBank): drug metabolites
• 28500 (FooDB)
• 1550 (DrugBank): drugs
• 19700 (HMDB): endogenous metabolites

Theoretical Human Metabolomes
• 100,000 (Lipidome): lipids/lipid derivatives
• 10,000 (Drug metabolome): secondary drug metabolites
• 100,000 (Food metabolome): secondary food metabolites
• 10,000 (Secondome): secondary endogenous metabolites
[Figure: both panels are drawn against a concentration axis running from mM down to fM]

Metabolomics technologies
Metabolomics - "a snapshot" in time
Conceptual approaches in metabolomics:
• Target analysis: has been applied for many decades and includes the determination and quantification of a small set of known metabolites (targets) using one particular analytical technique of best performance for the compounds of interest.
• Metabolite profiling: aims at the analysis of a larger set of compounds, both identified and unknown with respect to their chemical nature. This approach has been applied for many different biological systems using GC-MS, including plants, microbes, urine, and plasma samples.
• Metabolomics: employs complementary analytical methodologies, for example, LC-MS/MS, GC-MS, and/or NMR, in order to determine and quantify as many metabolites as possible, either identified or unknown compounds.
• Metabolic fingerprinting: a metabolic "signature" or mass profile of the sample of interest is generated and then compared in a large sample population to screen for differences between the samples.
When signals that can significantly discriminate between samples are detected, the metabolites are identified and the biological relevance of that compound can be elucidated, greatly reducing the analysis time.
[Figure: the main types of metabolic reactions that take place in a cell (binding, dissociation, classic biochemical reaction, degradation, modification, transport), shown as they are represented in the database Reactome.]

Metabolomics data analysis
• From spectra to lists
• From lists to pathways

[Figure: enrichment panels listing functional category, cellular component and molecular function terms with fraction (%) and FDR. From: Molecular Architecture of the Mouse Nervous System, DOI: https://doi.org/10.1016/j.cell.2018.06.021]

Common applications of scRNA-seq
a) Deconvolving heterogeneous cell populations (heterogeneous tissue or tumor): dimensionality reduction (e.g. PCA)
b) Trajectory analysis of cell state transitions (cell differentiation, or response to stimulus): trajectory analysis pipeline (e.g. Monocle, Wanderlust)
c) Dissecting transcription mechanics (transcriptional bursting and stochastic gene expression): gene transcription "off" (RNA polymerase dissociated from gene) vs. gene transcription "on" (RNA polymerase bound and transcribing gene)
d) Network inference (identifying modules of co-regulated genes): inference of gene regulatory networks/subnetworks
https://f1000research.com/articles/5-182/v1
For more info go to: https://omicstools.com

scRNA-seq databases
https://www.ebi.ac.uk/gxa/sc/home
EMBL-EBI: single cell gene expression across species
• Search across 18 species, 229 studies, 5 978 348 cells
• Ensembl 104, Ensembl Genomes 51, WormBase ParaSite 15, EFO 3.10.0
• Search by gene ID or gene symbol, e.g. CFTR (gene symbol), MGI:98354 (MGI ID)
• Species: Homo sapiens (103 experiments), Mus musculus (76 experiments), Drosophila melanogaster (9 experiments), Danio rerio (7 experiments), Gallus gallus, Schistosoma mansoni (2 experiments)

Single-cell multi-omics
[Figure: central dogma (DNA, RNA, protein). Single cell genomics: CNV, SNP, DNA methylation, histone modification, chromatin, RNA expression, RNA structure, protein expression. Single cell multi-omics combinations: DR-Seq (Dey et al., 2015), G&T-seq (Macaulay et al., 2015), scM&T-seq (Angermueller et al., 2016), scMT-seq (Hu et al., 2016), scTrio-seq (Hou et al., 2016), CITE-seq (Stoeckius et al., 2017), REAP-seq (Peterson et al., 2017)]
FIGURE 2 | Strategies for multi-omics profiling of single cells. Three major types of molecules relating to biological central dogma (Top).
Single cell genomics methods profiling the genome, epigenome, transcriptome, and proteome are shown by different shapes with variable colors (Middle). Single cell multi-omics methods are built by combining different single cell sequencing methods to simultaneously profile multiple types of molecules of a single cell genome wide (Bottom). For example, G&T-seq was built by combining genome (orange) and transcriptome (yellow) to simultaneously detect DNA and RNA of the same cell genome wide.

Challenges:
• There are no commercial kits available yet for any single-cell multi-omics techniques, and many are technically challenging.
• Researchers must modify existing single-cell protocols so that they're compatible with multiple types of molecules and take great care to minimize the loss or contamination of samples.
https://www.the-scientist.com/lab-tools/integrating-multiple-omics-in-individual-cells-64829

Difficulty squared
• Combining modalities only multiplies the difficulty
• All the weaknesses, all the noise, all the challenges from each technology, it just gets exacerbated by combining them into a multimodal assay
Single-cell analysis enters the multiomics age: https://www.nature.com/articles/d41586-021-01994-w#correction-1
https://www.frontiersin.org/articles/10.3389/fcell.2018.00028/full

Summary
• Omics technologies: "the data deluge"
• Genomics and transcriptomics rely on two main approaches: microarrays (hybridization) and NGS (sequencing by synthesis)
• Proteomics and metabolomics rely heavily on mass spectrometry
[Figure: DNA (genomics), RNA (transcriptomics), proteins (proteomics), biochemicals (metabolomics)]
• Omics technologies are revolutionizing science and medicine
• From data to actionable knowledge: integrated omics data
• Precision medicine is the ultimate goal of many -omics efforts
• Despite the progress made, we still have a long way to go ...
[Figure: personalized medicine. Understand which differences are important; best treatment; use current medicines better; new drugs; new diagnostic tests]

Take home messages
• We have been generating big data, but we hardly understand it
• Big data is publicly available; go through the databases before you even start planning your experiment. It can save you enormous time and money
• Databases contain huge datasets of patients you would never be able to gather by yourself; test your hypothesis in silico before the "wet-lab" work
• If you cannot find the "yes/no" or "a few genes" answer, use the cellular/functional/pathway analyses to help you out
• Learning bioinformatics skills (e.g. programming in R) is a good investment for your future (scientific) career

Thank you for your attention
Any questions?
Jay Flatley, Executive Chairman of Illumina: "Everyone is going to get sequenced, it is gonna be part of their health record and it will be used to manage their health care throughout their lifetime"