DSIB01 Autumn 2021
05 Motif Detection
Mgr. Ondřej Vaculík
437307@mail.muni.cz
Overview
CEITEC at Masaryk University 2
• Peak calling - brief overview
• Motif representation in biology
• PPM
• PWM
• sequence logos
• Tools
• Bedops
• Bedtools
• The MEME Suite
• MEME-ChIP
• Tomtom
• Demo on real dataset
• Homework - Individual work
3
Clip-seq analysis - peak calling
• a statistical procedure, which uses coverage properties of CLIP and Input
samples to find regions which are enriched due to protein binding
• requires mapped reads, and outputs a set of regions, which represent the
putative binding locations. Each region is usually associated with a significance
score which is an indicator of enrichment
• many different tools for peak calling available:
• iCount
• Paraclu
• PureCLIP
• Piranha
4
Sequence motifs
• a nucleotide or amino-acid sequence pattern that is widespread and usually
assumed to be related to biological function of the macromolecule
• short, recurring patterns in DNA/RNA that are presumed to have a biological
function. Often they indicate sequence-specific binding sites for proteins
such as nucleases, transcription factors, RNA-binding proteins. Others are
involved in important processes at the RNA level, including ribosome
binding, mRNA processing (splicing, editing, polyadenylation) and transcription
termination.
5
Sequence motif representation - PPMs
• a position probability matrix
• in general:
• there’s one row for each symbol of the alphabet and one column for each position in the
pattern
• in PPM each number is a probability of nucleotide occurrence in given
position (sum of each column is 1)
6
Sequence motif representation - PWMs
• a position weight matrix
• also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix
(PSSM)
• the most commonly used
• the elements in PWMs are calculated as log likelihoods
• PWMs are often derived from a set of aligned sequences that are thought to be
functionally related and have become an important part of many software tools
for computational motif discovery.
7
Sequence motif representation - Sequence logos
• Graphical representation of PWMs
• the bigger letter the higher chance for the nucleotide to appear in the position
8
Tools - BEDOPS + bedtools
• BEDOPS:
• open-source command-line toolkit that performs efficient and scalable Boolean and other
set operations, statistical calculations, archiving, conversion and other management of
genomic data of arbitrary scale
• https://bedops.readthedocs.io/en/latest/
• functions for today: sort-bed, bedextract
• bedtools:
• a swiss-army knife of tools for a wide-range of genomics analysis tasks
• allows one to intersect, merge, count, complement, and shuffle genomic intervals from
multiple files and in many different formats (.bed, .bam, .gff, …)
• https://bedtools.readthedocs.io/en/latest/
• function for today: getfasta
9
Tools - The MEME Suite
• The MEME Suite is a powerful, integrated set of web-based tools for studying
sequence motifs in proteins, DNA and RNA.
• MEME-ChIP
• web service designed to analyze ChIP-seq ‘peak regions’ - short genomic regions
surrounding declared ChIP-seq ‘peaks’
• works also with CLIP-seq ‘peak regions’
• Given a set of genomic regions, it performs:
• ab initio motif discovery
• motif enrichment analysis
• motif visualization
• binding affinity analysis
• motif identification
• https://meme-suite.org/meme/tools/meme-chip
10
Tools - The MEME Suite
• The MEME Suite is a powerful, integrated set of web-based tools for studying
sequence motifs in proteins, DNA and RNA.
• Tomtom
• web service that allows the user to compare motifs discovered by the suite, by other tools,
or taken from the literature to all of the motifs in a selected database of motifs
• aligns each input motif with each motif in the selected database and reports the most similar
pairs, along with estimates of the statistical significance of each match
• https://meme-suite.org/meme/tools/tomtom
11
Real dataset
1. Download the dataset: bed file with peaks, choose isogenic replicate 1,2
https://www.encodeproject.org/experiments/ENCSR570WLM/
2. Download the chromosome 1 fasta reference
3. Unzip the files
12
Real dataset
4. Create and activate conda environment for today’s practicals
• Open the terminal
conda create --name practicals
conda activate practicals
5. Installation of necessary packages:
• if it turns out you’re missing a channel for installing some of the tool, you can add them by following cmd:
conda config --add channels NAME
conda install bedops
conda install –c bioconda bedtools
13
Real dataset
6. Sort intervals in downloaded file and then extract chromosome 1 positions
• sort-bed PATH/TO/peaks.bed > PATH/TO/OUTPUT/sorted_peaks.bed
7. Unify intervals length to 100 nt
• awk -F '\t' '{X=50; mid=(int($2)+int($3))/2;printf("%s\t%d\t%d\t%s\n",$1,(midX<0?0:mid-X),mid+X,
$4);}' PATH/TO/chr1_peaks.bed >
PATH/TO/OUTPUT/chr1_peaks_extended.bed
14
Real dataset
8. Extract sequences from a reference FASTA file for each of the intervals
bedtools getfasta -s -fi PATH/TO/chr1.fasta -bed PATH/TO/chr1_peaks_extended.bed -fo
PATH/TO/QKI_chr1.fa
15
Real dataset
9. Open the MEME Suite web
10. Open the MEME-ChIP tool
11. Pick appropriate setup
12. Run the analysis
16
Homework
• Re-do the motif analysis on the artificial dataset
• 4 different datasets (1 dataset per student) + 1 bonus dataset
• will be sent by email
• Task:
• download the data
• extend the intervals to 100 nt
• extract sequences for the intervals
• use MEME-ChIP to analyse motifs in dataset
• try to identify domain/protein/protein family
(look also at the CISBP database and pfam database - by clicking through the results)
• Bonus task 1:
• Download the Motifs in MEME Text Format, upload the file to Tomtom tool, choose the
CISBP-RNA Single Species RNA (Homo Sapiens) motif database and look at the results of
the motif comparison tool
• Bonus task 2:
• Repeat the analysis on the bonus (voluntary) dataset
• We’ll discuss the results on the practicals 3. 12. 2021