CEITEC
Central European Institute of Technology, BRNO | CZECH REPUBLIC
Modern Methods of Genome Analysis: Bioinformatics I
Mgr. Nikola Tom
Brno, 11.11.20
EUROPEAN UNION
EUROPEAN REGIONAL DEVELOPMENT FUND INVESTING IN YOUR FUTURE
OP Research and Development for Innovation
Bioinformatics
Bioinformatics is quite a new field... (first NGS in 2005)
How to analyse data derived from NGS = the bottleneck of NGS
AIM: clean the data and give it biological meaning
Bioinformatics SOLUTION 1:
•  Commercial software and ready-to-use pipelines, BUT they usually have non-transparent settings and/or too few options (good programs are expensive)
CLC bio (a QIAGEN company)
SOPHiA GENETICS
Bioinformatics SOLUTION 2:
• Command-line based tools/software. Each tool solves only a part of the analysis
• Need to set up the pipeline & tune the programs' parameters (challenging & more precise!!!)
ubuntu
The choice of programs & settings depends heavily on the type of experiment, the library preparation and the biological question
A laptop or PC is usually not enough... a computing cluster is needed
Before we start analysis
We have to know what we are dealing with... and what we want to find out...
Concept of the project
DNA/RNA/methylation/...
DNA
• Targeted sequencing (amplicons, gene panels, exomes)
• Whole genome sequencing
- Finding differences from a known reference genome = re-sequencing
• De novo assembly
- Genome construction
RNA
- Gene expression, ncRNA, alternative splicing
Metagenomics (bacteria, viruses)
- Composition of organisms in the sample, genetic variants
ChIP sequencing (DNA-protein interactions)
Bioinformatics' starting point
Raw sequencing data - READS
Produced during base calling
- signal-to-sequence conversion and assignment of base quality scores (fastq file)
Fastq file
• Consists of reads - biological sequences
(each read represents 1 input molecule sequenced on flowcell)
• Corresponding quality score for each base
• Phred score - log-scaled probability of a base-calling error (Q = -10 log10 P)
• Encoded as an ASCII character
• (Other formats: fasta + qual, csfasta + csqual, sff)
• Paired-end sequencing - 2 fastq files
example.fastq
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
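A minimal Python sketch of how a fastq record can be read and its quality string decoded. It uses a cleaned-up copy of the example record and assumes the Phred+33 ASCII offset (older Illumina files use +64); the function names are illustrative, not from any library.

```python
import io

def phred_to_error_prob(q):
    """Error probability for a Phred score Q: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def read_fastq(handle, offset=33):
    """Yield (read_id, sequence, phred_scores) from a 4-line-per-record fastq stream.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        scores = [ord(ch) - offset for ch in qual]
        yield header[1:], seq, scores

example = io.StringIO(
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
records = list(read_fastq(example))
read_id, seq, scores = records[0]
```

Here `scores[0]` is 0 (character `!`), i.e. an error probability of 1, while a score of 20 corresponds to an error probability of 0.01.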
NGS pipeline
Input: Reads (fastq), Reference
Pre-alignment Quality Control (fastq) -> Alignment (Mapping) (bam) -> Variant calling
Output: Variants
Annotation: dbSNP, HGMD, COSMIC
Quality control (FastQC)
[Screenshot: FastQC report for bad_sequence.txt and good_sequence_short.txt. Modules: Basic Statistics, Per base sequence quality, Per sequence quality scores, Per base sequence content, Per base GC content, Per sequence GC content, Per base N content, Sequence Length Distribution, Sequence Duplication Levels, Overrepresented sequences, Kmer Content. Plot shown: Quality scores across all bases (Illumina >v1.3 encoding)]
Cleaning reads (Cutadapt)
• Adaptor trimming (miRNA)
• Quality trimming
• Length filtering
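The three cleaning steps above can be sketched in Python as follows. This is a simplified illustration with exact adapter matching and a naive 3' quality cut, not Cutadapt's actual algorithm (which tolerates mismatches and uses a different quality-trimming rule). The thresholds and the adapter sequence in the usage line are example values only.

```python
def trim_adapter(seq, qual, adapter, min_overlap=3):
    """Remove a 3' adapter: a full occurrence, or an adapter prefix at the read end."""
    pos = seq.find(adapter)
    if pos == -1:
        # Look for a partial adapter (its prefix) hanging off the 3' end.
        for k in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                pos = len(seq) - k
                break
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]

def trim_quality(seq, qual, threshold=20):
    """Drop bases from the 3' end until one reaches the quality threshold."""
    end = len(seq)
    while end > 0 and qual[end - 1] < threshold:
        end -= 1
    return seq[:end], qual[:end]

def clean_read(seq, qual, adapter, min_length=15, threshold=20):
    """Adapter trimming, then quality trimming, then length filtering.

    Returns (seq, qual), or None if the read is too short after trimming.
    """
    seq, qual = trim_adapter(seq, qual, adapter)
    seq, qual = trim_quality(seq, qual, threshold)
    if len(seq) < min_length:
        return None
    return seq, qual

ADAPTER = "TGGAATTCTCGGGTGCCAAGG"  # example 3' adapter, for illustration only
read = "ACGTACGTACGTACGTACGT" + "TGGAATTC"
quals = [30] * 18 + [10, 10] + [30] * 8
cleaned = clean_read(read, quals, ADAPTER)
```

In this example the partial adapter (8 bases) is removed first, then the two low-quality bases, leaving an 18-base read that passes the length filter.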
[Figure: library structure details - P5, Rd1 Seq Primer, Sequence of Interest, Index Seq Primer, Index, P7]
• Reads are usually mapped to a reference sequence (DNA/cDNA/16S/other) to find their corresponding location & differences (substitutions, insertions, deletions, inversions, etc.)
• Problem: too many reads and references billions of bp long - need for special algorithms (Burrows-Wheeler transform, hash table indexing)
• BWA, Bowtie, BFAST, SHRiMP (output stored in BAM format)
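To illustrate the hash table indexing idea, here is a toy seed-and-verify mapper in Python. It is a sketch only: real aligners such as those listed above use far more compact indexes, handle indels and score alignments. The reference string, `k` and the mismatch limit are assumptions chosen for the example.

```python
from collections import defaultdict

def build_index(reference, k):
    """Hash table: every k-mer of the reference -> list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_mismatches=2):
    """Seed with the read's first k-mer, then verify by counting mismatches.

    Returns a list of (position, mismatches); each mismatch is
    (reference_position, reference_base, read_base).
    """
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = [
            (pos + i, ref_base, read_base)
            for i, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base
        ]
        if len(mismatches) <= max_mismatches:
            hits.append((pos, mismatches))
    return hits

reference = "TATATTTAAGATGTTTTGCCTGAAAAGTGAGCGAACGATAAAG"  # toy reference
index = build_index(reference, k=8)
# A 20-base read copied from position 9, with one substitution (G -> T) at position 21.
read = reference[9:21] + "T" + reference[22:29]
hits = map_read(read, reference, index, k=8)
```

The lookup of the seed is a single dictionary access regardless of reference length, which is the point of indexing; the cost moves to building and storing the index.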
Example of read mapping
[Screenshot: reads mapped to contig 44 of an Illumina MiSeq assembly, with a coverage track; callouts mark a sudden coverage change and unaligned read ends]
Alignment is usually not perfect - false positive indels & substitutions => need for local indel realignment
[Figure: read alignments around an indel, shown again after local realignment]
Mapping, Coverage reports
• Repeat alignment/other steps with different criteria?
• Important check of the lab protocol
• Specificity of PCR
• Settings of variant calling threshold, CNV
• Target bed file (Browser Extensible Data), e.g. (bed format):
chr1 127471196 127472363
chr1 127472363 127473530
chr1 127473530 127474697
chr1 127474697 127475864
chr1 127475864 127477031
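A coverage report per target can be sketched like this in Python. It assumes alignments are already reduced to (chrom, start, end) intervals in the same 0-based, half-open convention as BED; the function names are illustrative, and a real report would be computed from the bam file with dedicated tools.

```python
def parse_bed(lines):
    """Read the first three BED columns: chrom, start (0-based), end (exclusive)."""
    targets = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        targets.append((chrom, int(start), int(end)))
    return targets

def mean_coverage(target, alignments):
    """Mean per-base depth over one target region."""
    chrom, t_start, t_end = target
    covered_bases = 0
    for a_chrom, a_start, a_end in alignments:
        if a_chrom != chrom:
            continue
        overlap = min(t_end, a_end) - max(t_start, a_start)
        if overlap > 0:
            covered_bases += overlap
    return covered_bases / (t_end - t_start)

targets = parse_bed(["track name=targets", "chr1 127471196 127472363"])
toy_alignments = [("chr1", 90, 150), ("chr1", 120, 220), ("chr2", 100, 200)]
depth = mean_coverage(("chr1", 100, 200), toy_alignments)
```

For the toy target chr1:100-200 the two overlapping alignments contribute 50 and 80 bases, giving a mean depth of 1.3.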
De novo assembly - alternative to mapping on a reference sequence
To uncover unknown genomes/transcriptomes
To detect large structural variants
[Figure: assembly workflow]
Reads -> Assemble -> Contigs
Map reads to contigs (Contig1 - Contig2)
Assemble contigs to scaffolds (gaps shown as NNNNNN)
Gap filling
Long sequence cluster and assembly - Unigene
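The "reads -> contigs" step can be illustrated with a toy greedy overlap assembler in Python. This is only a sketch of the principle: it assumes error-free reads from one strand, whereas real assemblers use overlap or de Bruijn graphs and must handle errors, repeats and both strands. The reads and `min_len` are example values.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)

toy_reads = ["TATATTTAAG", "TTAAGATGTT", "ATGTTTTGCC"]
contigs = greedy_assemble(toy_reads)
```

The three toy reads overlap by 5 bases each and collapse into a single 20-base contig; reads without sufficient overlap would remain as separate contigs.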
REMOVE PCR DUPLICATES
Each read represents 1 input molecule
THEORY:
E.g. in DNA re-sequencing, 1 diploid cell is represented by 2 reads at a given locus because it carries 2 chromosome copies
BUT
PCR is used to amplify the genetic material so that it can be analysed => 1 input molecule from 1 cell may be represented by several reads after PCR => biased variant allele frequency
How to solve it?
1) Molecular barcodes (very new method)
2) Identity of start-end positions of read pair
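Both approaches can be sketched with one small Python function. The dictionary keys (`chrom`, `start`, `end`, `mean_quality`, `barcode`) are hypothetical names for this example; real tools work on bam records and also take strand and mate information into account.

```python
def remove_pcr_duplicates(read_pairs, use_barcode=False):
    """Keep one read pair per fragment: pairs with identical start and end
    positions are treated as PCR duplicates, and the one with the highest
    mean quality is kept. With use_barcode=True, pairs must also share the
    molecular barcode to count as duplicates.
    """
    best = {}
    for pair in read_pairs:
        key = (pair["chrom"], pair["start"], pair["end"])
        if use_barcode:
            key += (pair["barcode"],)
        kept = best.get(key)
        if kept is None or pair["mean_quality"] > kept["mean_quality"]:
            best[key] = pair
    return list(best.values())

pairs = [
    {"chrom": "chr1", "start": 100, "end": 350, "mean_quality": 30, "barcode": "ACGT"},
    {"chrom": "chr1", "start": 100, "end": 350, "mean_quality": 35, "barcode": "ACGT"},
    {"chrom": "chr1", "start": 100, "end": 350, "mean_quality": 33, "barcode": "TTAG"},
    {"chrom": "chr1", "start": 120, "end": 370, "mean_quality": 31, "barcode": "ACGT"},
]
by_position = remove_pcr_duplicates(pairs)
by_barcode = remove_pcr_duplicates(pairs, use_barcode=True)
```

Position alone keeps 2 of the 4 pairs, while adding the barcode keeps 3: the pair tagged `TTAG` shares coordinates with others but comes from a different input molecule, which is the case position-based removal cannot distinguish.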
Introduction of molecular barcodes during library preparation
[Figure A: custom probes 1 and 2 with upstream and downstream targeting sequences; PCR rounds 1-3 and product of round 2; clean-up to remove P5-SMT; addition of P5 and P7-index; additional rounds of PCR; indexed and single-molecule-tagged (SMT) amplicon library ready for cluster generation and sequencing]
[Figure B: duplicate cluster size (times SMT observed per target, log scale)]
Smith et al. 2014
Mutation types:
Germline mutations
Somatic mutations
Substitutions
Insertions
Deletions
Complex variants
Inversions
Large structural variations (translocations, indels)
Copy number variations
[Screenshot: reads aligned against a consensus sequence around positions 36,661,601-36,661,660, with a coverage track]
Experimental designs (also depend on the types of samples available):
Normal only (genotyping)
Tumor only (genotyping, somatic mutations)
Tumor + related normal control
Tumor + unrelated normal controls
Tumor in time
Family (rare diseases, genotyping)
Program algorithms:
• Bayesian statistics (MuTect, deepSNV)
• Fisher's exact test (VarScan, VarDict)
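As an illustration of the second approach, the sketch below computes a two-sided Fisher's exact test on a 2x2 table of read counts (reference vs. variant reads in two samples). It is written from the textbook definition using only the standard library and is not the implementation used by any of the callers named above; the tumor/normal counts are made up.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def weight(x):
        # Number of ways to obtain x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return extreme / total

# Toy counts: tumor 30 reference / 10 variant reads, normal 40 reference / 0 variant reads.
p_somatic = fisher_exact_two_sided(30, 10, 40, 0)
```

Working with integer counts until the final division avoids floating-point ties when deciding which tables are "at least as extreme".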
The programs give a p-value based on different features
Options for many parameters & filters:
• Minimum coverage
• Variant allele frequency
• Base quality
• Genomic context (homopolymers)
• Position in read (errors at the reads end)
• Mapping quality
• Presence in both forward and reverse reads (strand bias)
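A sketch of how such filters might be applied to one candidate variant in Python. All field names and threshold values are assumptions for the example, not defaults of any particular caller, and the position-in-read filter is left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class CandidateVariant:
    depth: int                  # total reads covering the position
    alt_forward: int            # variant-supporting reads on the forward strand
    alt_reverse: int            # variant-supporting reads on the reverse strand
    mean_base_quality: float
    mean_mapping_quality: float
    in_homopolymer: bool

    @property
    def vaf(self):
        """Variant allele frequency = variant reads / all reads."""
        if self.depth == 0:
            return 0.0
        return (self.alt_forward + self.alt_reverse) / self.depth

def failed_filters(v, min_depth=20, min_vaf=0.05,
                   min_base_quality=20, min_mapping_quality=30):
    """Return the names of the filters the candidate fails (empty list = pass)."""
    failed = []
    if v.depth < min_depth:
        failed.append("coverage")
    if v.vaf < min_vaf:
        failed.append("allele_frequency")
    if v.mean_base_quality < min_base_quality:
        failed.append("base_quality")
    if v.mean_mapping_quality < min_mapping_quality:
        failed.append("mapping_quality")
    if v.alt_forward == 0 or v.alt_reverse == 0:
        failed.append("strand_bias")
    if v.in_homopolymer:
        failed.append("homopolymer")
    return failed

good = CandidateVariant(100, 12, 9, 32.0, 58.0, False)
bad = CandidateVariant(15, 3, 0, 18.0, 60.0, True)
```

Returning the list of failed filters instead of a single yes/no makes it easier to see which setting removes a variant when tuning the thresholds.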
To distinguish real mutations from ERRORS
(library preparation, sequencing, alignment)
Usually 1 approach is not enough => GATK to combine several variant callers (aligners) & different settings
A specific pipeline for each type of mutation (SNV, indels, CNV...)
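The idea of combining several callers can be sketched as simple voting in Python: keep a variant only if enough callers report it. This is an illustration of the principle, not how GATK merges call sets; the caller names and variants are placeholders.

```python
from collections import Counter

def consensus_calls(call_sets, min_callers=2):
    """Variants reported by at least min_callers of the given callers.

    call_sets maps a caller name to an iterable of variants, each variant
    being a (chrom, pos, ref, alt) tuple.
    """
    counts = Counter()
    for calls in call_sets.values():
        counts.update(set(calls))  # a caller counts once per variant
    return {variant for variant, n in counts.items() if n >= min_callers}

call_sets = {
    "caller_A": [("chr1", 1000, "G", "A"), ("chr2", 500, "C", "T")],
    "caller_B": [("chr1", 1000, "G", "A"), ("chr3", 42, "T", "G")],
    "caller_C": [("chr1", 1000, "G", "A"), ("chr2", 500, "C", "T")],
}
in_two = consensus_calls(call_sets, min_callers=2)
in_all = consensus_calls(call_sets, min_callers=3)
```

Raising `min_callers` trades sensitivity for specificity: the intersection of all callers is the most conservative set, the union (`min_callers=1`) the most permissive.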
[Figure: overlap of SNV calls between variant-calling pipelines, including SOAPsnp, SAMtools and SNVer]
O'Rawe, J. et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Medicine 5, 28 (2013).
VCF file
Example
##fileformat=VCFv4.0
##fileDate=20100707
##source=VCFtools
##reference=NCBI36
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality (phred score)">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##ALT=<ID=DEL,Description="Deletion">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2
[Figure: VCF body rows with genotype columns in GT:DP, GT:GQ and GT:GQ:DP format]
Mandatory header lines: ##fileformat and the #CHROM column line
Optional header lines (meta-data about the annotations in the VCF body)
Callouts on the body rows:
Deletion
Large SV (SVTYPE=DEL)
Other event
Phased data (G and C above are on the same chromosome)
Reference alleles (GT=0)
Alternate alleles (GT>0 is an index to the ALT column)
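How the GT field indexes into REF and ALT can be shown with a short Python sketch. The data line and sample names below are made up for the example (tab-separated, as the format requires), and the parser covers only the fields discussed here; for real work a dedicated VCF library is the safer choice.

```python
def parse_vcf_line(line, sample_names):
    """Decode the genotypes of one VCF body line.

    GT values are indexes: 0 = REF allele, 1 = first ALT allele, 2 = second ALT...
    '|' separates phased alleles, '/' unphased ones, '.' is a missing call.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt = fields[:5]
    alleles = [ref] + alt.split(",")
    format_keys = fields[8].split(":")
    genotypes = {}
    for name, column in zip(sample_names, fields[9:]):
        values = dict(zip(format_keys, column.split(":")))
        gt = values["GT"]
        indexes = gt.replace("|", "/").split("/")
        genotypes[name] = {
            "alleles": [alleles[int(i)] if i != "." else None for i in indexes],
            "phased": "|" in gt,
        }
    return chrom, int(pos), alleles, genotypes

# Illustrative line, not taken from the slide.
line = "\t".join(
    ["20", "14370", "rs6054257", "G", "A,T", "29", "PASS", "DP=14",
     "GT:GQ", "0|1:48", "1/2:43", "./.:0"]
)
chrom, pos, alleles, genotypes = parse_vcf_line(line, ["S1", "S2", "S3"])
```

Sample `S1` decodes to the phased genotype G|A, `S2` to the unphased A/T (both alternate alleles), and `S3` to a missing call.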
Visualization of genotypes by IGV
Input: vcf
[Screenshot: IGV variant track with per-sample genotype rows]
1) Each bar across the top of the plot shows the allele fraction for a single locus.
2) The genotypes for each locus in each sample. Dark blue = heterozygous, Cyan = homozygous variant, Grey = reference.
Annotation
From genomic coordinate to biological meaning
Provide links to various databases (RefSeq, dbSNP, etc.)
To distinguish significant variants from non-significant ones (synonymous vs. non-synonymous, gene, exon, intron, cDNA, codon, transcript, frequency in population, presence in other diseases...)
Annotation sources: RefSeq, dbSNP, Regulation, Comparative genomics, Repeats, Functional, Gene ontology, etc.
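One of the distinctions above, synonymous vs. non-synonymous, can be sketched in Python using the standard genetic code. The function works on a single codon and ignores everything an annotation tool must actually resolve first (transcript choice, strand, exon boundaries); names are illustrative.

```python
# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def classify_substitution(codon, position, alt_base):
    """Classify a single-base substitution at position 0, 1 or 2 of a codon."""
    mutated = codon[:position] + alt_base + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"   # new stop codon
    if ref_aa == "*":
        return "stop_lost"
    return "missense"       # non-synonymous amino acid change

examples = [
    classify_substitution("GAA", 2, "G"),  # Glu -> Glu
    classify_substitution("GAG", 1, "T"),  # Glu -> Val
    classify_substitution("TAC", 2, "A"),  # Tyr -> stop
]
```

The three examples evaluate to synonymous, missense and nonsense respectively.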
Sensitivity & Specificity as a matter of:
•  Experiment design
(library preparation + NGS technology + number of samples + amount of data)
•  Data processing
(pre-processing + alignment + variant calling + annotations + filtering)
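Sensitivity and specificity of a pipeline can be estimated by comparing its calls with a trusted truth set over the positions that were assessed. A minimal Python sketch, assuming variants and sites are represented as hashable identifiers and that both denominators are non-zero:

```python
def sensitivity_specificity(called, truth, all_sites):
    """Sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).

    called, truth and all_sites are sets; all_sites contains every assessed
    position, including those where no variant is present.
    """
    true_pos = len(called & truth)
    false_pos = len(called - truth)
    false_neg = len(truth - called)
    true_neg = len(all_sites - called - truth)
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity

all_sites = set(range(100))   # toy example: 100 assessed positions
truth = {1, 2, 3, 4}          # real variants
called = {1, 2, 3, 50}        # pipeline output: 3 correct, 1 false call
sens, spec = sensitivity_specificity(called, truth, all_sites)
```

In the toy example one of four real variants is missed (sensitivity 0.75) and one of 96 variant-free positions is called (specificity 95/96).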
Courses
http://meetings.embo.org/event/17-genome
http://www.embo.org/events/practical-courses