Week 3: Pattern Recognition
Introduction to Bioinformatics (LF:DSIB01)

Sequence Patterns
• We will learn:
  1. How to define a pattern
  2. How to identify the presence of a pattern in a sequence
  3. How to count pattern occurrences
  4. Calculation of overrepresentation
  5. Creation of elaborate queries

Defining pattern
• There are several ways to define a pattern:
• Deterministic Patterns
• Patterns with mismatches
• Position Weight Matrices
• Stochastic Models

Deterministic Patterns
• Defined alphabet: Σ = [A, T, G, C]
• Simple sequence: e.g. TATAAAA
• Ambiguous character: e.g. TAT[AT]AAA, where [AT] = either A or T
• Wildcard: TAT.AAA, where . = any character
• Flexible gap: TAT.{1,3}AAA, where .{1,3} = any character, one to three times

Patterns with mismatches
• Allow matching of a Deterministic Pattern with up to a certain number of mismatches
• This category will be covered in depth over the next 2 lectures (Weeks 4 and 5)

Position Probability Matrices (PPM)
• An ambiguous symbol such as [AT] gives the same probability (%) to both symbols
• A PPM is a Position x Alphabet table containing probability (%) scores
• Using a PPM, one can score the probability that the pattern produces a given sequence
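
To make the PPM scoring idea concrete, here is a minimal Python sketch. The 4-position matrix, the names `PPM`, `ppm_probability` and `scan`, and the test sequences are all hypothetical illustrations, not the matrix shown on the slide. It assumes positions are independent, so the probability of a sequence is the product of the per-position probabilities.

```python
from math import prod

# Hypothetical 4-position PPM (NOT the matrix from the slide).
# One dict per position, mapping each base to its probability.
# Every column sums to 1.
PPM = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5},  # behaves like the set [AT]
]


def ppm_probability(seq, ppm=PPM):
    """Probability that the PPM produces `seq`, assuming independent positions."""
    if len(seq) != len(ppm):
        raise ValueError("sequence length must equal the number of PPM positions")
    # Bases outside the alphabet (e.g. N) are given probability 0 here.
    return prod(column.get(base, 0.0) for column, base in zip(ppm, seq))


def scan(seq, ppm=PPM):
    """Score every window of PPM width; returns (start, probability) pairs."""
    width = len(ppm)
    return [
        (start, ppm_probability(seq[start:start + width], ppm))
        for start in range(len(seq) - width + 1)
    ]


if __name__ == "__main__":
    print(ppm_probability("TATA"))  # 0.7 * 0.7 * 0.7 * 0.5
    print(ppm_probability("TATC"))  # C has probability 0 at the last position
    print(max(scan("GTATAC"), key=lambda hit: hit[1]))  # best-scoring window
```

Note that a single zero in the matrix forces the whole product to zero, so a PPM with zero entries rejects such sequences outright.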

Position Weight Matrices (PWM)
• More useful is the log-odds score (weight)
• Log-odds score = log2(frequency / background frequency)
• Sum over positions: how different is the scored sequence from a random background sequence
• 0 => equally probable under the motif model M and the background; + => M more probable; - => M less probable

Information Content
Height = information content (bits)
Information = degree of decrease in uncertainty
Hartley 1928: I(N) = log(N)
Shannon 1948: definition of a bit
https://erilllab.umbc.edu/files/2016/04/Introduction_Information_Theory.pdf

Entropy
Entropy is a measure of the unpredictability of a state (average information content)
https://erilllab.umbc.edu/files/2016/04/Introduction_Information_Theory.pdf

How to read a Seq Logo
Stack height indicates the information content per position (Rseq(I))
Letter height indicates the base frequency per position

Stochastic models
• A set of rules or a machine learning method
• Must be able to discriminate / classify / score sequences
• Commonly used: Hidden Markov Models
• We will not talk about stochastic models in this course

HMM Logo (PFAM)
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-7

Hunting for Models

Motif Discovery with MEME
http://meme-suite.org/
>SEQ1
TCAGTGCAGTCATGCACATGCATGCATGCATGCATGCATGCATGCATG
>SEQ2
TGTGCTGACTGCATGACTCTATCTGCATGACTGTTTCTGCGCGGC
>SEQ3
TGCATGCATGCACTGAAAAAAAAAATGCATGCATGCACTGACATGCTGACTGA

Regular Expressions
• For Deterministic Patterns
• Regex is a commonly used language for defining Deterministic Patterns
• Extremely powerful syntax, but watch out for unintended results
  1. Meta-characters
  2. Special Characters
  3. Sets
https://www.w3schools.com/python/python_regex.asp

Meta-characters
https://www.w3schools.com/python/python_regex.asp

Special Characters
\A vs ^: \A matches only at the start of the string, while ^ also matches at the start of each line when re.MULTILINE is set (without that flag, ^ matches only at the start of the string)
https://www.w3schools.com/python/python_regex.asp

Sets
https://www.w3schools.com/python/python_regex.asp

Python regex
https://www.debuggex.com/cheatsheet/regex/python

www.ceitec.eu CEITEC @CEITEC_Brno
Thank you for your attention!
60 minutes lunch break.
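
As a backup to the regular-expression slides, here is a minimal Python sketch using the standard `re` module on the deterministic patterns from this lecture. The sequence `seq`, the FASTA-like string `fasta`, and the helper `find_starts` are made up for illustration. It also shows two "unintended results" worth watching for: non-overlapping counting, and the difference between `^` and `\A`.

```python
import re

# Made-up DNA sequence for illustration only.
seq = "GGTATAAAACCTATTAAAGGTATCGAAA"


def find_starts(pattern, sequence):
    """Start positions (0-based) of all non-overlapping matches."""
    return [match.start() for match in re.finditer(pattern, sequence)]


# Deterministic patterns from the lecture.
print(find_starts("TATAAAA", seq))       # simple sequence
print(find_starts("TAT[AT]AAA", seq))    # ambiguous character
print(find_starts("TAT.AAA", seq))       # wildcard
print(find_starts("TAT.{1,3}AAA", seq))  # flexible gap

# Presence of a pattern: re.search returns None when there is no match.
print(re.search("TAT[AT]AAA", seq) is not None)

# Unintended result 1: matches are counted without overlaps by default.
plain_count = len(re.findall("AA", "AAAA"))
# A zero-width lookahead counts overlapping occurrences instead.
overlap_count = len(re.findall("(?=AA)", "AAAA"))
print(plain_count, overlap_count)

# Unintended result 2: ^ versus \A on multi-line text.
fasta = ">SEQ1\nTATA\n>SEQ2\nGGCC"
line_starts = re.findall(r"^>(\w+)", fasta, flags=re.MULTILINE)
string_start = re.findall(r"\A>(\w+)", fasta, flags=re.MULTILINE)
print(line_starts, string_start)
```

The lookahead form `(?=AA)` is one common way to count overlapping occurrences. Which count is appropriate depends on the question being asked.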