DSIB01 Autumn 2021
05 Motif Detection
Mgr. Ondřej Vaculík
437307@mail.muni.cz
CEITEC at Masaryk University

Overview
•Peak calling - brief overview
•Motif representation in biology
  •PPM
  •PWM
  •sequence logos
•Tools
  •BEDOPS
  •bedtools
  •The MEME Suite
    •MEME-ChIP
    •Tomtom
•Demo on real dataset
•Homework - Individual work

CLIP-seq analysis - peak calling

Sequence motifs
•a nucleotide or amino-acid sequence pattern that is widespread and usually assumed to be related to the biological function of the macromolecule
•short, recurring patterns in DNA/RNA that are presumed to have a biological function. Often they indicate sequence-specific binding sites for proteins such as nucleases, transcription factors and RNA-binding proteins. Others are involved in important processes at the RNA level, including ribosome binding, mRNA processing (splicing, editing, polyadenylation) and transcription termination.

Sequence motif representation - PPMs
•a position probability matrix
•in general:
  •there’s one row for each symbol of the alphabet and one column for each position in the pattern
  •in a PPM, each entry is the probability of a nucleotide occurring at the given position (each column sums to 1)

Sequence motif representation - PWMs
•a position weight matrix
•also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM)
•the most commonly used representation
•the elements of a PWM are calculated as log-likelihood ratios (log-odds): the logarithm of the PPM probability divided by the background probability of that nucleotide
•PWMs are often derived from a set of aligned sequences that are thought to be functionally related, and they have become an important part of many software tools for computational motif discovery.
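The relationship between a PPM and a PWM can be illustrated with a minimal Python sketch. It is not taken from the slides: the function names, the toy binding sites, the pseudocount of 1 and the uniform background (0.25 per nucleotide) are all assumptions chosen for illustration, and the code expects sequences containing only A, C, G and T.

```python
import math

ALPHABET = "ACGT"  # assumption: DNA alphabet without ambiguity codes


def position_probability_matrix(sequences, pseudocount=0.0):
    """Build a PPM (symbol -> list of per-position probabilities)
    from equal-length aligned sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all aligned sequences must have the same length")
    total = len(sequences) + pseudocount * len(ALPHABET)
    ppm = {sym: [0.0] * length for sym in ALPHABET}
    for pos in range(length):
        for sym in ALPHABET:
            count = sum(1 for s in sequences if s[pos] == sym)
            ppm[sym][pos] = (count + pseudocount) / total
    return ppm


def position_weight_matrix(ppm, background=None):
    """Convert a PPM to log2-odds scores against a background model."""
    if background is None:
        # assumption: uniform background, 0.25 per nucleotide
        background = {sym: 1.0 / len(ALPHABET) for sym in ALPHABET}
    return {
        sym: [
            math.log2(p / background[sym]) if p > 0 else float("-inf")
            for p in probs
        ]
        for sym, probs in ppm.items()
    }


def score(pwm, sequence):
    """Sum the PWM entries along a sequence of the same length as the motif."""
    return sum(pwm[sym][pos] for pos, sym in enumerate(sequence))


# Hypothetical aligned binding sites, for illustration only.
sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
ppm = position_probability_matrix(sites, pseudocount=1.0)
pwm = position_weight_matrix(ppm)
```

The pseudocount keeps every probability above zero, so no PWM entry becomes minus infinity when a nucleotide is never observed at a position. A sequence that resembles the aligned sites, such as ACGT, then receives a higher score than an unrelated one, such as TTTT.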

Sequence motif representation - Sequence logos
•graphical representation of PWMs
•the bigger the letter, the higher the chance that the nucleotide appears at that position

Tools - BEDOPS + bedtools
•BEDOPS:
  •open-source command-line toolkit that performs efficient and scalable Boolean and other set operations, statistical calculations, archiving, conversion and other management of genomic data of arbitrary scale
  •https://bedops.readthedocs.io/en/latest/
  •functions for today: sort-bed, bedextract
•bedtools:
  •a Swiss-army knife of tools for a wide range of genomics analysis tasks
  •allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files and in many different formats (.bed, .bam, .gff, …)
  •https://bedtools.readthedocs.io/en/latest/
  •function for today: getfasta

Tools - The MEME Suite
•The MEME Suite is a powerful, integrated set of web-based tools for studying sequence motifs in proteins, DNA and RNA.
•MEME-ChIP
  •web service designed to analyze ChIP-seq ‘peak regions’ - short genomic regions surrounding declared ChIP-seq ‘peaks’
  •also works with CLIP-seq ‘peak regions’
  •given a set of genomic regions, it performs:
    •ab initio motif discovery
    •motif enrichment analysis
    •motif visualization
    •binding affinity analysis
    •motif identification
  •https://meme-suite.org/meme/tools/meme-chip

Tools - The MEME Suite
•Tomtom
  •web service that allows the user to compare motifs discovered by the suite, by other tools, or taken from the literature against all of the motifs in a selected motif database
  •aligns each input motif with each motif in the selected database and reports the most similar pairs, along with estimates of the statistical significance of each match
  •https://meme-suite.org/meme/tools/tomtom

Real dataset
1. Download the dataset: BED file with peaks (choose isogenic replicates 1, 2)
   https://www.encodeproject.org/experiments/ENCSR570WLM/
2. Download the chromosome 1 FASTA reference
3. Unzip the files

Homework
•Re-do the motif analysis on the artificial dataset
•4 different datasets (1 dataset per student) + 1 bonus dataset
  •will be sent by email
•Task:
  •download the data
  •extend the intervals to 100 nt
  •extract sequences for the intervals
  •use MEME-ChIP to analyse motifs in the dataset
  •try to identify the domain/protein/protein family (also look at the CISBP and Pfam databases by clicking through the results)
•Bonus task 1:
  •download the motifs in MEME text format, upload the file to the Tomtom tool, choose the CISBP-RNA Single Species RNA (Homo sapiens) motif database and look at the results of the motif comparison
•Bonus task 2:
  •repeat the analysis on the bonus (voluntary) dataset
•We’ll discuss the results in the practical on 3. 12. 2021
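
The "extend the intervals to 100 nt" step can be sketched in Python as below. The slides do not say how the extension should be done, so this is only one possible reading: it assumes a 100 nt window centred on the midpoint of each peak, BED coordinates (0-based start, exclusive end) and tab-separated input. The function names are hypothetical.

```python
def resize_interval(start, end, size=100):
    """Return (start, end) of a window of `size` nt centred on the
    midpoint of the interval. Coordinates follow the BED convention
    (0-based start, exclusive end)."""
    mid = (start + end) // 2
    # Clamp at 0 so that peaks near the chromosome start stay valid.
    new_start = max(0, mid - size // 2)
    return new_start, new_start + size


def resize_bed_line(line, size=100):
    """Resize the interval of one tab-separated BED line.
    All columns after the third are passed through unchanged."""
    fields = line.rstrip("\n").split("\t")
    start, end = resize_interval(int(fields[1]), int(fields[2]), size)
    fields[1], fields[2] = str(start), str(end)
    return "\t".join(fields)


# Hypothetical peak, for illustration only.
example = "chr1\t1000\t1040\tpeak1\t0\t+"
resized = resize_bed_line(example)
```

This sketch does not check the chromosome length, so a window close to the chromosome end may extend past it. The resized intervals can then be sorted and passed to bedtools getfasta together with the reference FASTA to extract the sequences for MEME-ChIP.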