CEITEC
Central European Institute of Technology BRNO | CZECH REPUBLIC
Modern Methods of Genome Analysis - Analysis
Mgr. Nikola Tom
Before we start analysis
We have to know what we are dealing with... and what we want to find out...
Concept of the project
DNA/RNA/methylation/...
DNA
Targeted sequencing (amplicons, gene panels, exomes)
Whole genome sequencing
- Finding differences to known reference genome = re-sequencing
De novo assembly
- Genome construction
Before we start analysis...
RNA
- Gene expression, alternative splicing
Metagenomics (bacteria, viruses)
- Their composition, variants
ChIP sequencing (DNA-protein interactions)
Library preparation - example of DNA library
[Figure (2013, source credit illegible): DNA library preparation. PCR cycle 0 (extension to fill 3' ends) is followed by PCR amplification, giving sequencing-ready DNA fragments labelled Read 1 and Read 2.]
Library preparation - example of mRNA library
1) PolyA+ RNA captured
2) RNA fragmented and primed
3) First strand cDNA synthesized
4) Second strand cDNA synthesized
5) 3' ends adenylated and 5' ends repaired
6) DNA sequencing adapters ligated
7) Ligated fragments PCR amplified
[Figure labels: Barcode, Rd1, Rd2, Index]
Bioinformatics
•Bioinformatics is quite a new field (first NGS in 2005)
•Analysing the data derived from NGS is the bottleneck of NGS
•There are a lot of tools/software for NGS data analysis
•Most of the tools are command-line based
•No tool works perfectly
•Each tool solves only a piece of the cake
•NO tool is able to perform the analysis from the very beginning to the end => need to set up a pipeline
Bioinformatics
Exception: commercial software and ready-to-use pipelines, BUT they usually have non-transparent settings and/or too few options
The pipeline heavily depends on the type of experiment, library preparation and project
A laptop or PC is usually not enough; a computing cluster is needed
Base calling
Signal to sequence conversion and assigning base quality scores (fastq file)
Phred score - log-scaled probability that the base call is an error: Q = -10 * log10(P), e.g. Q20 means 1 error in 100 bases
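The log-based relation between a Phred score and the error probability can be sketched in a few lines. This is a minimal illustration of the standard definition Q = -10 * log10(P); the function names are made up for this example.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Print the error probability for a few commonly quoted quality scores.
for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

Each step of 10 in the score corresponds to a tenfold drop in the error probability.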
fastq
•Consists of reads - biological sequences
(each read represents 1 input molecule sequenced on flowcell)
•Corresponding quality score for each base
•ASCII character
•Other formats: fasta + qual, csfasta + csqual, sff
•Paired-end sequencing - 2 fastq files
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
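A small reader for the four-line fastq record layout might look like the sketch below. It assumes the Phred+33 ASCII encoding (Sanger / Illumina 1.8+); older Illumina files use an offset of 64. The function name `read_fastq` is illustrative, not part of any library.

```python
import io

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ encoding


def read_fastq(handle):
    """Yield (read_id, sequence, qualities) from a 4-line-per-record fastq stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the reader
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record: {header!r}")
        # Each ASCII character encodes one Phred score.
        yield header[1:], seq, [ord(c) - PHRED_OFFSET for c in qual]


example = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)

for read_id, seq, quals in read_fastq(io.StringIO(example)):
    print(read_id, len(seq), "bases, mean quality", sum(quals) / len(quals))
```

For paired-end data the same reader would be run over both fastq files in parallel, since record N in the first file pairs with record N in the second.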
Quality control (FastQC)
[Screenshot: FastQC report for bad_sequence.txt and good_sequence_short.txt]
FastQC modules:
•Basic Statistics
•Per base sequence quality
•Per sequence quality scores
•Per base sequence content
•Per base GC content
•Per sequence GC content
•Per base N content
•Sequence Length Distribution
•Sequence Duplication Levels
•Overrepresented sequences
•Kmer Content
[Plot: Quality scores across all bases (Illumina >v1.3 encoding)]
Cleaning reads (Cutadapt)
•Adaptor trimming (miRNA)
•Quality trimming
•Length filtering
[Figure: read structure details. Rd1 Seq Primer, Index Seq Primer, INDEX, Sequence of Interest]
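The three cleaning steps (adaptor trimming, quality trimming, length filtering) can be illustrated with a simplified sketch. This is not how Cutadapt is implemented: Cutadapt tolerates mismatches in the adaptor and uses a running-sum quality trimming algorithm, whereas the code below only does exact matching and cuts trailing low-quality bases. The adaptor sequence and all thresholds are illustrative assumptions.

```python
ADAPTER = "AGATCGGAAGAGC"  # illustrative 3' adaptor sequence (assumption)


def trim_adapter(seq, adapter, min_overlap=3):
    """Remove a 3' adaptor: a full occurrence, or a partial adaptor prefix at the read end."""
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]
    # Look for the longest adaptor prefix that the read ends with.
    for k in range(min(len(adapter), len(seq)), min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:len(seq) - k]
    return seq


def trim_quality(seq, quals, threshold=20):
    """Cut bases from the 3' end while their quality is below the threshold."""
    end = len(seq)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]


def clean_read(seq, quals, adapter=ADAPTER, min_length=15, threshold=20):
    """Apply adaptor trimming, quality trimming and length filtering; None means discard."""
    trimmed = trim_adapter(seq, adapter)
    quals = quals[:len(trimmed)]
    trimmed, quals = trim_quality(trimmed, quals, threshold)
    if len(trimmed) < min_length:
        return None
    return trimmed, quals


# A 20 bp insert followed by the first 6 bases of the adaptor.
print(clean_read("ACGT" * 5 + "AGATCG", [30] * 26))
```

Adaptor trimming matters most for short inserts such as miRNA, where the read is longer than the molecule and runs into the adaptor.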
Read mapping (alignment)
•Usually reads are mapped onto a reference sequence (DNA/cDNA/16S/other sequences) to find the corresponding location & differences
•Problem: too many reads and references billions of bp long => need for special algorithms (Burrows-Wheeler transform, hash table indexing)
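To make the hash table indexing idea concrete, here is a toy seed-and-verify mapper. It is only a sketch under simplifying assumptions (forward strand only, no indels, seed taken from the start of the read); it is not the algorithm of any particular aligner, and tools based on the Burrows-Wheeler transform work differently. All names are illustrative.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash table: every k-mer of the reference -> list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Seed with the read's first k-mer, then verify each candidate by counting mismatches."""
    hits = []
    for start in index.get(read[:k], []):
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # read would run past the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


reference = "TATATTTAAGATGTTTTGCCTGAAAAGTGAGCGAACGATAAAG"
k = 8
index = build_kmer_index(reference, k)

read = reference[12:32]
read = read[:15] + "A" + read[16:]  # introduce one substitution
print(map_read(read, reference, index, k))
```

The index is built once per reference; each read then costs one dictionary lookup plus a short verification, instead of a scan over the whole reference.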
Mapping of DNA reads
•Onto an existing DNA reference sequence (available for many organisms)
•To find substitutions, insertions, deletions, inversions, etc. Precisely!
- BWA, Bowtie, BFAST, SHRiMP
Example of DNA re-sequencing
[Figure: read pile-up of unpaired Illumina MiSeq reads on contig 44, around positions 74,300-74,360, with a coverage track; annotated features: sudden coverage change, unaligned ends]
Mapping of RNA reads - alternative splicing
Reads can span exon junctions - mRNA splicing
[Figure: DNA with several exons is transcribed and spliced into mRNA; alternative splicing combines the exons differently, giving mRNAs that are translated into Protein A and Protein B]
Mapping of RNA reads
•To measure gene expression OR alternative splicing
•On an existing DNA reference sequence
- To find alternative splicing
- More tricky, complicated, slower
- TopHat (de novo splice aligner)
•On transcriptome reference sequences
- Reads can map to multiple transcripts (shared exons)
- Easier, faster, no need for special aligners
- BWA
•On miRNA sequences - miRBase
- Grouping and annotation against miRBase
De novo assembly
•To uncover unknown genomes/transcriptomes
•To detect large structural variants
[Figure: assembly workflow. Reads -> assemble -> contigs -> map reads to contigs (Contig1, Contig2) -> assemble contigs to scaffolds (gaps shown as NNNNNN) -> gap filling; long sequence clustering and assembly -> Unigene]
SAM/BAM
Headers:
@HD VN:1.0 SO:unsorted
@SQ SN:gi|110640213|ref|NC_008253.1| LN:4938920
@PG ID:bowtie2 PN:bowtie2 VN:2.1.0
Alignments:
gi|110640213|ref|NC_008253.1|_418_952_1:0:0_1:0:0_0/1 0 gi|110640213|ref|NC_008253.1| 418 42 70M * 0 0 CCAGGCAGTGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCATCTGGTAGCGATGAT 2222222222222222222222222222222222222222222222222222222222222222222222 AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:8G61 YT:Z:UU
gi|110640213|ref|NC_008253.1|_31_476_0:0:0_0:0:0_1/1 16 gi|110640213|ref|NC_008253.1| 407 42 70M * 0 0 GGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCATCTG 2222222222222222222222222222222222222222222222222222222222222222222222 AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:70 YT:Z:...
gi|110640213|ref|NC_008253.1|_210_743_2:0:0_1:1:0_2/1 0 gi|110640213|ref|NC_008253.1| 210 42 70M * 0 0 CATTACCACCACCATCACCATTACCACAGGAAACGGTGCGGGCTGACGCGTACAGGAAACACCGAAAAAA 2222222222222222222222222222222222222222222222222222222222222222222222 AS:i:-6 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:30T31A7
Each row describes a single alignment of a raw read against the reference genome. Each alignment has 11 mandatory fields, followed by any number of optional fields.
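The 11 mandatory fields plus optional tags can be split apart with a few lines of code. The sketch below follows the field order of the SAM format (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL); the record in the example is a shortened, made-up alignment, and `parse_sam_line` is an illustrative name rather than a library function.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag (0x10 = read mapped to the reverse strand)
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description, e.g. 70M
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # ASCII-encoded base qualities
    tags: dict   # optional TAG:TYPE:VALUE fields


def parse_sam_line(line):
    """Parse one tab-separated alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    tags = {}
    for item in fields[11:]:
        tag, typ, value = item.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


# Hypothetical, shortened record for illustration only.
line = "read1\t16\tchr1\t407\t42\t10M\t*\t0\t0\tGGAAAGCAAT\t2222222222\tAS:i:0\tNM:i:0\tMD:Z:10"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, "reverse strand" if rec.flag & 0x10 else "forward strand", rec.tags)
```

BAM stores the same records in compressed binary form, so in practice a library is used to read it rather than text splitting.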
Mapping, Coverage reports
•Repeat alignment/other steps with different criteria?
•Important check of the lab protocol
•Specificity of PCR
•Settings of variant calling thresholds, CNV
Indel realignment
•Usually the alignment is not perfect - false positive indels & substitutions => need for local indel realignment
[Figure: read pile-up before and after local realignment]
Remove PCR duplicates
THEORY:
Each read represents 1 input molecule
E.g. in DNA re-sequencing, 1 diploid cell is represented by 2 reads because of its 2 chromosome copies
BUT
PCR is used to amplify the genetic material so that it can be analysed => 1 input molecule from 1 cell can be represented by several reads after PCR => biased variant allele frequency
How to solve it?
1) Molecular barcodes (very new method)
2) Identity of start-end positions of read pair
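The second approach, identical start-end positions of a read pair, can be sketched as follows. Fragments that share chromosome, start, end and strand are treated as PCR copies of one input molecule, and only the copy with the best mean base quality is kept. The dictionary layout and the function name are assumptions made for this example.

```python
def find_duplicates(fragments):
    """Return indices of fragments flagged as PCR duplicates.

    Fragments with the same (chrom, start, end, strand) are assumed to come
    from one input molecule; the one with the highest mean quality is kept.
    """
    best = {}
    for i, frag in enumerate(fragments):
        key = (frag["chrom"], frag["start"], frag["end"], frag["strand"])
        if key not in best or frag["mean_qual"] > fragments[best[key]]["mean_qual"]:
            best[key] = i
    keep = set(best.values())
    return [i for i in range(len(fragments)) if i not in keep]


# Hypothetical read pairs, described by the outer coordinates of each pair.
fragments = [
    {"chrom": "chr1", "start": 100, "end": 350, "strand": "+", "mean_qual": 30},
    {"chrom": "chr1", "start": 120, "end": 360, "strand": "+", "mean_qual": 33},
    {"chrom": "chr1", "start": 100, "end": 350, "strand": "+", "mean_qual": 35},
    {"chrom": "chr1", "start": 120, "end": 400, "strand": "+", "mean_qual": 31},
]
print(find_duplicates(fragments))
```

This heuristic fails for amplicon libraries, where all reads of one amplicon start and end at the same positions by design; molecular barcodes avoid that problem.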
Molecular barcodes
[Figure: single molecule tagging (SMT) workflow. Custom Probe 1 and Custom Probe 2 carry the upstream and downstream targeting sequences; PCR rounds 1-4; clean-up to remove P5-SMT; addition of P5 and P7-index; additional rounds of PCR; result: indexed and single-molecule-tagged amplicon library ready for cluster generation and sequencing]
[Plot: duplicate vs. unique reads per sample; x-axis: duplicate cluster size (times SMT observed per target, log scaled)]
Smith et al. 2014
DNA Seq - variant calling
•To detect differences from reference sequence
•Single/multi-nucleotide
•Substitutions
•Insertions
•Deletions
•Inversions
•Large structural variations (translocations, indels)
•Copy number variations
DNA Seq variant calling
Based on many criteria, such as:
•Coverage
•Variant allele frequency
•Base quality
Depends also on:
•Genomic context (homopolymers)
•Nucleotide type
•Position in read (errors at the read end)
•Alignment errors (importance of realignment)
•Presence in both forward and reverse reads
The type of library preparation must be taken into account (single-end, paired-end, mate-pair)
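A few of these criteria (coverage, variant allele frequency, base quality, support on both strands) can be combined into a simple hard filter. The sketch below is only an illustration: the thresholds are arbitrary assumptions, the input dictionary is a made-up layout, and real variant callers use statistical models rather than fixed cut-offs.

```python
def variant_allele_frequency(alt_reads, depth):
    """Fraction of reads at a position that support the variant allele."""
    return alt_reads / depth if depth else 0.0


def passes_filters(site, min_depth=20, min_vaf=0.05, min_base_qual=20):
    """Hard filter on coverage, allele frequency, base quality and strand support.

    All thresholds are illustrative assumptions, not recommended values.
    """
    alt_reads = site["alt_fwd"] + site["alt_rev"]
    vaf = variant_allele_frequency(alt_reads, site["depth"])
    return (
        site["depth"] >= min_depth
        and vaf >= min_vaf
        and site["mean_alt_base_qual"] >= min_base_qual
        and site["alt_fwd"] > 0
        and site["alt_rev"] > 0  # variant seen in both forward and reverse reads
    )


# Hypothetical candidate sites.
good = {"depth": 100, "alt_fwd": 12, "alt_rev": 10, "mean_alt_base_qual": 32}
strand_biased = {"depth": 100, "alt_fwd": 22, "alt_rev": 0, "mean_alt_base_qual": 32}
print(passes_filters(good), passes_filters(strand_biased))
```

The strand check is only meaningful when the library design produces reads from both strands; for some amplicon designs it has to be switched off.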
DNA Seq variant calling
•Mate-pair library
•Detection of large indels & translocations
[Figure: mate-pair library preparation. 0.6 - 25 kb fragment; add biotin labels; non-sequenced insert of known size]
Andy Vierstraete (2012)
DNA Seq variant calling
[Figure: read alignment around positions 36,661,660-36,661,680 showing the reference sequence, the consensus and a coverage track, with mismatches visible in individual reads]
vcf file
Example
Header lines:
##fileformat=VCFv4.0
##fileDate=20100707
##source=VCFtools
##reference=NCBI36
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Al...
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 member...
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality (phred score)">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##ALT=<ID=DEL,Description="Deletion">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant">
[Figure: VCF body with the columns #CHROM, POS, ID, ..., QUAL, FILTER, INFO, FORMAT and two sample columns; four records on chromosome 1 at POS 1, 2, 5 and 100; INFO values include H2;AA=T and SVTYPE=DEL;END=300; FORMAT values are GT:DP, GT:GQ and GT:GQ:DP]
Figure annotations:
•Mandatory header lines
•Optional header lines (metadata about the annotations in the VCF body)
•Record types: deletion, insertion, large SV, other event
•Reference alleles (GT=0)
•Alternate alleles (GT>0 is an index to the ALT column)
•Phased data (G and C above are on the same chromosome)
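How the GT field indexes into the REF and ALT columns can be shown with a short decoder. In the VCF format, 0 refers to the reference allele, 1 and higher index the comma-separated ALT alleles, "/" separates unphased and "|" phased alleles. The record values in the example are hypothetical and the function name is illustrative.

```python
def decode_genotype(ref, alt, fmt, sample):
    """Translate a sample column into allele sequences using the FORMAT keys.

    Returns (called_alleles, phased, all_format_values).
    """
    alleles = [ref] + alt.split(",")           # index 0 = REF, 1.. = ALT alleles
    values = dict(zip(fmt.split(":"), sample.split(":")))
    gt = values["GT"]
    phased = "|" in gt
    indices = gt.split("|" if phased else "/")
    called = [None if i == "." else alleles[int(i)] for i in indices]
    return called, phased, values


# Hypothetical record: REF=A, ALT=G,T, FORMAT=GT:DP, sample=1/2:13
print(decode_genotype("A", "G,T", "GT:DP", "1/2:13"))
# Hypothetical phased heterozygous call: REF=A, ALT=G, FORMAT=GT:GQ, sample=0|1:100
print(decode_genotype("A", "G", "GT:GQ", "0|1:100"))
```

A missing allele is written as "." in GT and is returned as None here.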
DNA Seq variant calling
• Tumor only (amplicon sequencing & diagnostics)
• Tumor & normal (exome sequencing)
- to make variant calling and genotyping more precise (somatic vs. germline mutations)
• Option is also to analyze tumor vs. group of tumors
Application of many statistical tests:
• negative beta-binomial test
• Bayesian statistics
• Fisher exact test
The higher the coverage, the higher the sensitivity and specificity (up to a limit)
More about statistics and RNA sequencing in the next courses
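As an example of one of the listed tests, a two-sided Fisher exact test on tumor vs. normal read counts can be written with the standard library only. This is a minimal sketch of the textbook definition (sum of the probabilities of all tables that are at most as likely as the observed one); the read counts are invented, and real somatic callers combine such a test with further filters.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Here: a, b = variant and reference reads in the tumor,
          c, d = variant and reference reads in the normal sample.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x variant reads in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance so that equally likely tables are counted.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Invented counts: 12 of 100 tumor reads carry the variant, 0 of 100 normal reads do.
print(fisher_exact_two_sided(12, 88, 0, 100))
```

With the same allele frequency but only a tenth of the coverage, the table no longer gives a small p-value, which illustrates why higher coverage improves sensitivity.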
Annotating and filtering of detected variants
•Gene
•Transcript
•dbSNP
•Regulation
•Comparative genomics
•Repeats
•Functional
•Gene ontology
•Etc.