Alignment and mapping Ing. Stanislav Smatana The Main Goal Genome Fragmentation DNA fragments SequencingReads Mapping on reference genome The Main Goal ● Mapping is the essential step in re-sequencing ● This means we try to explore something with known reference sequence ● We can also construct the reference but let's keep this for another time ● In theory it is quite simple – take a read, compare it with the reference and find the correct place The main goal Reference sequence The Main Goal (SNP/SNV) It’s a sequence alignment problem Reference genome Read from our experiment We need to align reads from sequencing experiment to their corresponding place on reference genome sequence For simplicity, let’s first focus on single (non-paired) reads Sequence Alignment Gene 1 Gene 2 Genome/database Short sequence Global alignment Local alignment Needlman - Wunch Smith - Waterman We want to align two sequences to the same length in order to illuminate evolutionary relationship between them. We want to find occurences of shorter sequence in much longer sequence. Sequence Alignment Gene 1 Gene 2 Genome/database Short sequence Global alignment Local alignment Needlman - Wunch Smith - Waterman We want to align two sequences to the same length in order to illuminate evolutionary relationship between them. We want to find occurences of shorter sequence in much longer sequence. We need this ! Naive approach to local alignment ACTCTCGAGCTAGCTATTCGATCTGAGTCGTGATC ATGT 42 30 ... ● Compare query to subject string at every position and calculate score ● Correct alignment is at position with the highest score Indels Complicate Things ACTCTCGAGCTAGCTATTCGATCTGAGTCGTGATC ATGT TATT ATGT T_ATT ATGT TA_TT ATGT TATT A_TGT TATT AT_GT Much more work ! Dealing with indels - global alignment TGCTGTACTG TATACCA TGCTGTACTG T_ A_ _TACCA We want to align two sequences to the same length by inserting gaps in order to illuminate evolutionary relationship between them. Graph representation of the problem A vs A C vs A C vs GA vs G AC vs AG AC vs -AG -AC vs AG -AC vs -AG -AC vs -AG match = 1 mismatch = 0 gap = -1 Needlman - Wunch algorithm T G C T G T A C T G 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 T -1 A -2 T -3 A -4 C -5 C -6 A -7 d=max(a + match, b - gap, c - gap) Needlman - Wunch algorithm T G C T G T A C T G 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 T -1 1 A -2 T -3 A -4 C -5 C -6 A -7 Needlman - Wunch algorithm T G C T G T A C T G 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 T -1 1 0 A -2 T -3 A -4 C -5 C -6 A -7 Needlman - Wunch algorithm T G C T G T A C T G 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 T -1 1 0 -1 A -2 T -3 A -4 C -5 C -6 A -7 Needlman - Wunch algorithm T G C T G T A C T G 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 T -1 1 0 -1 -2 -3 -4 -5 -6 -7 -8 A -2 0 1 0 -1 -2 -3 -3 -4 -5 -6 T -3 -1 0 1 1 0 -1 -2 -3 -3 -4 A -4 -2 -1 0 1 1 0 0 -1 -2 -3 C -5 -3 -2 0 0 1 1 0 1 0 -1 C -6 -4 -3 -1 0 0 1 1 1 1 0 A -7 -5 -4 -2 -1 0 0 2 1 1 1 Needlman - Wunch algorithm T G C T G T A C T G 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 T -1 1 0 -1 -2 -3 -4 -5 -6 -7 -8 A -2 0 1 0 -1 -2 -3 -3 -4 -5 -6 T -3 -1 0 1 1 0 -1 -2 -3 -3 -4 A -4 -2 -1 0 1 1 0 0 -1 -2 -3 C -5 -3 -2 0 0 1 1 0 1 0 -1 C -6 -4 -3 -1 0 0 1 1 1 1 0 A -7 -5 -4 -2 -1 0 0 2 1 1 1 Needlman - Wunch summary Alignment score = 4x match - 3x gap Alignment score = 1 Given scoring parameters, the algorithm guarantees to find all optimal alignments between the two sequences Local alignment with indels (Smith - Waterman) Genome/database Hit 1 Hit 2 We want to find occurences of shorter sequence in much longer sequence. ● This can be solved by modification of Needlman - Wunch algorithm ○ First row and first column of the matrix are initialized to zeros ○ Mismatch must have negative score (e.g. -1) ○ If score goes below zero, it is saturated to zero ○ Backtracking from all cells with maximum score ● This modification is called Smith - Waterman algorithm ● This algorithm is guaranteed to find all occurrences of the shorter sequence in the longer sequence Tuning alignment - scoring matrices ● Proteins - matrices ○ Blosum (empiric) ○ PAM (based more on theory) ● Nucleotides ○ Typically only match and mismatch score ● Gaps ○ Gap opening penalty ○ Gap extension penalty You can create your own scoring matrix based on your domain knowledge ! Is raw score enough ? We have aligned sequence X to the database Z and the alignment score is 42. Yay ! Is raw score enough ? We have aligned sequence X to the database Z and the alignment score is 42. Yay ! Happy ? Karlin-Altschul alignment statistics (E-value) E = kmne- λS score database size [bp]sequence length [bp] Expected number of random alignments with score S The problem with Smith - Waterman algorithm Number of nucleotides Time needed for computation 100 0.2 ms 1,000 0.02 s 10,000 2 s 100,000 3 m 1,000,000 5 h 10,000,000 23 days 100,000,000 6.5 years 1,000,000,000 650 years Calculated by Ing. Tomáš Martínek, PhD. from BUT FIT. Single Xeon 3Ghz CPU. Indexing - Making local alignment faster Reference sequence Special data structure Reads Indexing Alignments Mapping Idea: Genome is first transformed from plain text into some different form that is more suitable for fast alignment. Indexing - Making local alignment faster Reference sequence Special data structure Reads Indexing Alignments Mapping Idea: Genome is first transformed from plain text into some different form that is more suitable for fast alignment. Indexing - hash table ACT CTA TAG Genome Hash table Indexing - hash table ACT CTA TAG Genome Hash table Input sequence ACTAG Indexing - hash table ACT CTA TAG Genome Hash table Input sequence ACTAG ACT Indexing - hash table ACT CTA TAG Genome Hash table Input sequence ACTAG ACT CTA Indexing - hash table ACT CTA TAG Genome Hash table Input sequence ACTAG ACT CTA TAG Indexing - hash table ACT CTA TAG Genome Hash table Input sequence ACTAG ACT CTA TAG Indexing - hash table ACT CTA TAG Genome Hash table Input sequence ACTAG ACT CTA TAG Indexing - hash table ACT CTA TAG Genome Hash table Input sequence ACTAG ACT CTA TAG Indexing - hash table ACT CTA TAG Genome Hash table Input sequence ACTAG ACT CTA TAG Indexing - hash table ACT CTA TAG Genome Hash table Input sequence ACTAG ACT CTA TAG Indexing - hash table ACT CTA TAG Genome Hash table Input sequence ACTAG ACT CTA TAG Indexing - hash table ACT CTA TAG Genome Hash table Input sequence ACTAG ACT CTA TAG Indexing - hash table ACT CTA TAG Genome Hash table Input sequence ACTAG ACT CTA TAG Indexing - hash table ACT CTA TAG Genome Hash table Input sequence ACTAG ACT CTA TAG Indexing - hash table ACT CTA TAG Genome Hash table Input sequence ACTAG ACT CTA TAG Indexing - hash table ACT CTA TAG Genome Hash table Input sequence ACTAG ACT CTA TAG Our sequence will be somewhere in this region, we will use full Smith-Waterman only here ! Indexing - hash table ACT CTA TAG Genome Hash table Input sequence ACTAG ACT CTA TAG Our sequence will be somewhere in this region, we will use full Smith-Waterman only here ! Fast, but what price do we pay for this ? Indexing - suffix tree Substring of a given string is a prefix of one of its suffixes. String: BANANA Substring: NAN Suffixes: BANANA$ ANANA$ NANA$ ANA$ NA$ A$ Similar strategies: suffix array, Burrows-Wheeler transform Indexing - how does it look in practice ? Example using bowtie2 genome mapper: 1. Build index for reference genome bowtie2-build my_reference.fasta my_index_name 2. Align reads using the index bowtie2 -U my_reference.fasta -x my_index_name Note: Some aligners do the index creation implicitly. General Aligners vs Genome Aligners General aligners NGS Aligners (mappers) BLAST, HMMER, MMSeqs2, ... bowtie2, STAR, bwa, ... ● Typically used for search in large databases (e.g. NCBI nt) ● Do not make use of paired reads ● Do not make use of sequence quality information ● Intended for general search of sequences, not only short reads ● Used to align large number of short reads to genome ● Can take advantage of sequence quality information ● Can make use of paired reads ● Optimized for the task of short read mapping ● Produce output in standardized format (SAM/BAM) General Aligners vs Genome Aligners General aligners NGS Aligners (mappers) BLAST, HMMER, MMSeqs2, ... bowtie2, STAR, bwa, ... ● Typically used for search in large databases (e.g. NCBI nt) ● Do not make use of paired reads ● Do not make use of sequence quality information ● Intended for general search of sequences, not only short reads ● Used to align large number of short reads to genome ● Can take advantage of sequence quality information ● Can make use of paired reads ● Optimized for the task of short read mapping ● Produce output in standardized format (SAM/BAM) NGS Aligners - how can paired reads help ATTTTG ATTTTG GTCCT ATTTTG ??? One of the biggest challenges for alignments to genome are repetitions. genome single read NGS Aligners - how can paired reads help ATTTTG ATTTTG GTCCT ATTTTG One of the biggest challenges for alignments to genome are repetitions. genome GTCCTPaired reads remove some of the ambiguity !!! NGS aligners - repeat masking Another way to deal with repeats is to mask them. General Aligners vs Genome Aligners General aligners NGS Aligners (mappers) BLAST, HMMER, MMSeqs2, ... bowtie2, STAR, bwa, ... ● Typically used for search in large databases (e.g. NCBI nt) ● Do not make use of paired reads ● Do not make use of sequence quality information ● Intended for general search of sequences, not only short reads ● Used to align large number of short reads to genome ● Can take advantage of sequence quality information ● Can make use of paired reads ● Optimized for the task of short read mapping ● Produce output in standardized format (SAM/BAM) General Aligners vs Genome Aligners General aligners NGS Aligners (mappers) BLAST, HMMER, MMSeqs2, ... bowtie2, STAR, bwa, ... ● Typically used for search in large databases (e.g. NCBI nt) ● Do not make use of paired reads ● Do not make use of sequence quality information ● Intended for general search of sequences, not only short reads ● Used to align large number of short reads to genome ● Can take advantage of sequence quality information ● Can make use of paired reads ● Optimized for the task of short read mapping ● Produce output in standardized format (SAM/BAM) NGS Aligners - standardized output (SAM/BAM) https://samtools.github.io/hts-specs/SAMv1.pdf NGS Alignment - what can we use alignment for ? ● Whole genome sequencing - we map reads onto reference to find variation ● Exome sequencing - same as before, but only exomic DNA is captured. Saves a lot of money if you are only interested in genes. ● ChIP-Seq/CLIP - sequencing of DNA regions where binding of proteins happens. ● Transcriptome sequencing - sequencing of transcribed RNA in order to get expression profile of genes NGS Alignment - what can we use alignment for ? ● Whole genome sequencing - we map reads onto reference to find variation ● Exome sequencing - same as before, but only exomic DNA is captured. Saves a lot of money if you are only interested in genes. ● ChIP-Seq/CLIP - sequencing of DNA regions where binding of proteins happens. ● Transcriptome sequencing - sequencing of transcribed RNA in order to get expression profile of genes Transcriptome sequencing Gene 1 Gene 2 Expression reads genome Transcriptome sequencing - splicing But how about Eukaryotes and their splicing ? Exon ExonIntron Intron Read What to do about this ? Transcriptome sequencing - splicing 1. Do not map reads onto a reference genome, but reference transcriptome. Reference transcriptome contains whole continuous transcribed sequences after splicing. No worries about introns. 2. Use splice-aware aligner Use aligner which is designed for transcriptome alignment and takes splicing into account. Examples: bowtie2, STAR, BWA, HISAT2 ... Never use non splice-aware aligner to map RNA-seq reads onto a reference genome ! (e.g. bowtie, BFAST, …) NGS aligners - summary ● Choose right aligner for the task at hand ! ○ splice-aware vs non splice-aware ○ gapped vs non-gapped alignment ○ exact alignment vs fast approximate location (e.g. Kallisto) ● Aligners have often optimized default parameters for specific reference. ● Always read the manual. ● Never use settings without knowing what they are ! ● Read the reviews ! http://bioinformatics.oxfordjournals.org/content/27/20/2790 http://www.ncbi.nlm.nih.gov/pubmed/24185836 http://www.biomedcentral.com/1471-2105/14/184 http://bib.oxfordjournals.org/content/11/5/473.full http://omictools.com/read-alignment-c83-p1.html Reference sequences ● It can be ○ reference genome ○ reference transcriptome ○ just some collection of sequences ● Usually a FASTA file ● Usually one long sequence per chromosome ● Unassembled parts of the genome at the end ● Naming of records is important ! Reference sequences ● It can be ○ reference genome ○ reference transcriptome ○ just some collection of sequences ● Usually a FASTA file ● Usually one long sequence per chromosome ● Unassembled parts of the genome at the end ● Naming of records is important ! Reference sequences - naming Reference sequences - human genome ● One representative human genome reference sequence ○ Derived from DNA of 13 volunteers from Buffalo, NY ● Maintained by the Genome Reference Consortium (GRC) ○ New versions are released periodically ○ Results from different versions are not compatible ! ○ Releases are provided by UCSC and NCBI ○ Different sources use different chromosome identifiers (chr1 vs 1) ! Reference sequences - annotations ● Additional description of reference - e.g. annotations of different regions on the reference ● GTF and GFF file format ● Names of chromosomes/sequences have to match the names in reference ● Different types of features ○ Manually verified genes ○ Predicted genes ○ Introns ○ ... Considerations ● How many mismatches to allow ? ○ Vary depending on biology or genome completeness ● How many matches to report ? ○ Are you interested in multiple matches ? ● Require best match, first/any match ? ○ First match only is usually much faster. ● Quality of the reference sequence ○ How much can I trust my results ? If the reference is bad no aligner can save me ! You have to think about these questions !