Central European Institute of Technology, Brno, Czech Republic

Understanding miRNA binding behaviour through Deep Learning
David Cechak, Katarina Gresova
CEITEC, Masaryk University, Brno, Czech Republic

Biology
DNA Structure
[Figures: DNA double strand with base pairing A-T, G-C; DNA and RNA strands with base pairing A-U, G-C; RNA and amino acids (Met, Ile, Ser)]

Verification - correlation with in vitro experiment
[Figure: bar charts for miR-7 and miR-278; x-axis: mismatch position (from miRNA 5')]
miRNA        miR-7   miR-278
correlation  0.59    0.85

Functional MicroRNA Targeting
Simple miRNA-mRNA binding model

Task overview
How will the amount of a gene's products (proteins) change if a certain miRNA is introduced into the environment in larger quantities?
Ago: ucagcauagcuacgacguc (miRNA, ~20 nt long)
Search for binding. If it binds -> suppress the mRNA.
Messenger RNA, 100s - 100,000s nt long: auggacacgcggggcgcgaucgugucacguagcuacagucaugcaugucguagcuagcacucgucgucgagcuacgugggagacugcgaaaaaaaccacaauucgac...
Regression: how much less protein product will we get?
(in comparison to a normal cell); approximate range <0, -2>

RISC (RNA-induced silencing complex)
[Figure: base pairing A-U, G-C]

Dataset - labels
microRNA          count      mean      std       min       25%       50%       75%  max
hsa-miR-16-5p     7915       -0.072    0.122     -1.297    -0.104    -0.004    0    0
hsa-miR-106b-5p   7902       -0.056    0.109     -1.255    -0.07     0         0    0
hsa-miR-200a-3p   7934       -0.075    0.144     -1.82     -0.098    0         0    0
hsa-miR-200b-3p   7966       -0.077    0.139     -1.372    -0.105    0         0    0
hsa-miR-215-5p    7976       -0.089    0.164     -1.477    -0.122    0         0    0
hsa-let-7c-5p     8002       -0.063    0.119     -1.334    -0.079    0         0    0
hsa-miR-103a-3p   7489       -0.069    0.152     -1.498    -0.072    0         0    0
average           7883.429   -0.07158  0.135495  -1.43614  -0.09286  -0.00057  0    0
E.g. hsa-let-7c-5p: train 4046, test 2050, because transcripts without signal are removed.

Dataset - inputs
Lengths of transcript sequences
[Figure: histogram; x-axis: transcript length, ticks from 2000 to 14000]

State-of-the-art so far - manual feature extraction
[Figure: determinants of miRNA targeting.
 Seed region and 3' region.
 Canonical sites: 8-mer, 7-mer-m8, 7-mer-A1, 6-mer.
 Noncanonical sites: 6-mer-A1, offset 7-mer, offset 6-mer, CDNST 1, CDNST 2, CDNST 3, CDNST 4.
 Features: 3'UTR length, minimum distance to 3'UTR end, structural accessibility, local AU content, seed pairing stability, 3' pairing, thermodynamic pairing stability, binding affinity (AGO-RBNS), target site abundance, evolutionary conservation, 3'UTR isoforms (affected isoform ratio).]

Table 1. A table of representative computational tools for miRNA target prediction and the determinants they use
Model             Seed  TPS  EC  SA  Dist.  AU  Len.  3Sup.  TA  ORFS
TargetScan7       O     SPS  O   O   O      O   O     O      O   8m
miRanda-mirSVR    O     X    O   O   O      O   O     O      X   X
DIANA-microT-CDS  O     O    O   O   O      O   X     X      X   O
MIRZA-G           O     O    O   O   O      X   X     X      X   X
PITA              Opt.  O    X   O   X      X   X     X      X   X
PicTar            O     O    O   X   X      X   X     X      X   X
RNAhybrid         Opt.  O    X   X   X      X   X     X      X   X
MicroTar          O     O    X   X   X      X   X     X      X   X
Seed, seed match or site type; TPS, thermodynamic pairing stability; EC, evolutionary conservation; SA, structural accessibility; Dist., distance to 3'UTR ends or relative position of the target sites in the 3'UTR; AU, AU or GC content; Len., length of transcript or UTR; 3Sup., 3' supplementary pairing; TA, target abundance; ORFS, ORF or CDS sites; Opt., optional; SPS, seed pairing stability; 8m, number of 8-mer sites in the ORF.

Scanning - prediction only
Ago: ucagcauagcuacgacguc (miRNA, ~20 nt long)
Messenger RNA, 100s - 100,000s nt long: auggacacgcggggcgcgaucgugucacguagcuacagucaugcaugucguagcuagcacucgucgucgagcuacgugggagacugcgaaaaaaaccacaauucgac...
[Animation frames: the Ago-miRNA complex at successive positions along the messenger RNA]

Scanning - prediction only
Narrowing the peaks

Scanning - including attribution score
Narrowing the peaks
[Figure: alignment of the miRNA with the target site]
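The scanning idea on the slides above (slide a miRNA-sized window along the messenger RNA and score every position) can be illustrated with a minimal sketch. The function names `toy_seed_score` and `scan` are hypothetical, and the scorer is only a stand-in that counts Watson-Crick pairs in the seed (miRNA nucleotides 2-8); it is not the trained model from the talk.

```python
from typing import Callable, List, Tuple

# Watson-Crick pairs for RNA; G-U wobble pairs are ignored in this toy scorer.
PAIRS = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g")}


def toy_seed_score(mirna: str, window: str) -> float:
    """Hypothetical stand-in for a trained binding model.

    Returns the fraction of seed positions (miRNA nucleotides 2-8, counted
    from the 5' end) that pair with the window when the two strands are
    aligned antiparallel, end to end.
    """
    seed_positions = range(1, 8)  # 0-based indices of nucleotides 2-8
    paired = sum((mirna[i], window[-1 - i]) in PAIRS for i in seed_positions)
    return paired / len(seed_positions)


def scan(
    mirna: str,
    mrna: str,
    model: Callable[[str, str], float] = toy_seed_score,
    step: int = 1,
) -> List[Tuple[int, float]]:
    """Score every miRNA-sized window of the mRNA; returns (start, score) pairs."""
    width = len(mirna)
    return [
        (start, model(mirna, mrna[start:start + width]))
        for start in range(0, len(mrna) - width + 1, step)
    ]


if __name__ == "__main__":
    mirna = "ucagcauagcuacgacguc"  # example miRNA shown on the slides
    mrna = "auggacacgcggggcgcgaucgugucacguagcuacagucaugcaugucguagcuagcacucg"
    best_start, best_score = max(scan(mirna, mrna), key=lambda item: item[1])
    print(best_start, best_score)
```

Passing a different callable as `model` swaps the toy scorer for any function that maps a (miRNA, window) pair to a score, which is the only interface this sketch assumes.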
[Figure: model score, attribution score and ground truth plotted against position (from gene 3')]

[Figure: distance between seed starts; axis ticks from 15 to 40]

Scanning - using attribution score
Score at a position = prediction * attribution_score

miRNA regression model
[Diagram: a CNN takes the miRNA (ucagcauagcuacgacguc, ~20 nt long) and the messenger RNA (100s - 100,000s nt long) and outputs a prediction]

Dataset - inputs
Lengths of transcript sequences: sequences too long
[Figure: histogram; x-axis: transcript length, ticks from 2000 to 14000]

Compressed inputs
Lengths of compressed signals, per transcript. Longest: 2719
[Figure: histogram; x-axis: compressed signal length, ticks from 1000 to 2500]

Signals preprocessing
Highly sparse -> compression: (number_of_zeroes % 100) + 1
Normalization to <0.00001, 1>
0 used for padding

Signal samples (compressed)
[Figure: example compressed signals]

Signal values histogram over all samples (before padding)
[Figure: histogram; x-axis: signal values]

CNN + RNN + pooling
State-of-the-art based on feature selection (TargetScan)
test correlation = 0.09586669597532436
Bartel test correlation, weighted context score = 0.3146302627568912
[Figure panels: prediction; weighted context++ score]

Summary
[Diagram: miRNA regression model, as above]

Advantages & Disadvantages of our two-part approach
Advantages
1. Shields from sequence and overfitting on simple patterns like seed binding
2. Generalizes across miRNAs
Disadvantages
1. The first model is not perfect, so its mistakes accumulate in the second model
2. Error cannot be propagated through the second model to the first model

Summary
miRNA: TGAGGTAGTAGGTTGTATAG
Binding site: ATGTCAACCTACCTACTTCTAAGCACAGGGTATGAAGCTCTCTTTCCACT
[Diagram sequence: driver and target RNA with base pairing A-T, G-C; miRNA and binding site -> NN black box -> 0.84; miRNA positions 1 to 20 (tgaggtagtaggttgtatag) aligned with the binding site]

Future work
1. Include other features
   a. Genomic conservation - score / multiple sequence alignment / tree
   b. RNA Binding Proteins - binding sites
   c. Sequence?
   d.
   e. Ablation studies
2. If the two-part approach does not work
   a. Simplify regression to a classification task?
   b. Skip the two-part approach and go with sequence? (in progress)
      i. HyenaDNA - pretrained single-nucleotide-resolution model for long sequences
         1. Use for embeddings
         2. Use full with regression head

Sources
SHAP
https://github.com/shap/shap
Determinants of Functional MicroRNA Targeting
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9880601/
miRBind: A Deep Learning Method for miRNA Binding Classification
https://www.mdpi.com/2073-4425/13/12/2323
Using Attribution Sequence Alignment to Interpret Deep Learning Models for miRNA Binding Site Prediction
https://www.mdpi.com/2079-7737/12/3/369

Thank you for your attention!
[Logos: CESNET, MetaCentrum, Marie Curie, GACR]
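As a supplement to the "Scanning - using attribution score" and "Signals preprocessing" slides, the sketch below shows one way the per-position signal could be built and prepared for the second model. It rests on assumptions: the normalization is taken to be min-max scaling into <0.00001, 1>, the function names are hypothetical, and the zero-run compression step is left out because the slide does not say how its formula is applied.

```python
from typing import List, Sequence

LOW, HIGH = 1e-5, 1.0  # target range from the slide: <0.00001, 1>
PAD_VALUE = 0.0        # 0 is reserved for padding


def position_scores(
    predictions: Sequence[float], attribution_scores: Sequence[float]
) -> List[float]:
    """Score at a position = prediction * attribution_score."""
    if len(predictions) != len(attribution_scores):
        raise ValueError("predictions and attribution scores must align")
    return [p * a for p, a in zip(predictions, attribution_scores)]


def normalize(signal: Sequence[float]) -> List[float]:
    """Min-max scale into [LOW, HIGH] (assumed scheme), so no real value is 0."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        # A constant signal has no spread; map it to the top of the range.
        return [HIGH] * len(signal)
    scale = (HIGH - LOW) / (hi - lo)
    return [LOW + (value - lo) * scale for value in signal]


def pad(signals: Sequence[Sequence[float]], length: int) -> List[List[float]]:
    """Right-pad each signal with PAD_VALUE; length should be >= the longest signal."""
    return [list(s) + [PAD_VALUE] * (length - len(s)) for s in signals]


if __name__ == "__main__":
    scores = position_scores([-0.5, -1.0, 0.0], [0.2, 0.5, 0.9])
    batch = pad([normalize(scores), normalize(scores[:2])], length=4)
    for row in batch:
        print(row)
```

Keeping the lower bound above zero is what lets 0 act as an unambiguous padding marker: after normalization every real position is at least 0.00001.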