Modern Methods of Genome Analysis
Bioinformatics II
Karol Pál (Šárka Pospíšilová Research Group, Centre for Molecular Medicine, CEITEC)
27.11.17

Content of this lecture
• Recap
  • Sequencing reads
  • Data analysis pipeline (workflow)
• Quality control
  • Raw reads
  • Alignment
  • Variants
• Visualizing NGS data
  • IGV

Recap – Sequencing reads
http://nextgen.mgh.harvard.edu/CustomPrimer.html

Recap – Data analysis pipeline
Reads → Pre-alignment Quality Control → Alignment (Mapping) + QC → Variant calling → Annotation → Variants
Reference and databases: Reference, …, COSMIC, dbSNP
File formats: BCL, FASTQ, SAM/BAM, VCF/MAF

Quality Control
• Different steps
  • Reads summary statistics
  • Aligners
  • Post-alignment statistics
• Different tools
• Different kinds of outputs

BCL to FASTQ
BCL + Sample Sheet → FASTQ
• BCL: raw sequencing output
• Convert to FASTQ format
• Split into sample files
• May be automated

Sample sheet

Raw reads – bcl2fastq
MultiQC output

FastQC
• Summary statistics
• Two modes
  • Stand-alone program
  • Command line (output can be integrated into MultiQC)
• Input: FASTQ or BAM file
• [Demo]

Trimming
• Adaptors
• Low-quality ends of reads
• Tools:
  • Cutadapt
  • Trimmomatic

Trimming
> cutadapt \
    -a AGATCGGAAGAGC \
    -A AGATCGGAAGAGC \
    -o \
    BR_0296_I.trimmed.1.fastq.gz \
    -p BR_0296_I.trimmed.2.fastq.gz \
    BR_0296_I.R1.fq.gz BR_0296_I.R2.fq.gz
http://tucf-genomics.tufts.edu/documents/protocols/TUCF_Understanding_Illumina_TruSeq_Adapters.pdf

MultiQC output

FastQC II – after trimming

FastQC
• Overrepresented sequences are generally OK
  • Highly expressed genes?
  • Approx. 10,000
  • A count that is too high (> 100,000) may indicate a problem
  • rRNA not depleted?
*RNASeq

FastQC
• Nearly all RNA-Seq libraries have an intrinsic bias in the positions at which reads start. [1]
*RNASeq
[1] https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/4%20Per%20Base%20Sequence%20Content.html

DNA
• De novo assembly
  • Create a new reference
  • Find structural variants
• Map to an existing reference
  • Alignment (BWA)
• Map against several references
  • BLAST

De Novo Assembly
https://www.abmgood.com/marketing/knowledge_base/next_generation_sequencing_data_analysis.php
https://www.ecseq.com/support/ngs/what-is-mate-pair-sequencing-useful-for

Alignment
GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG   Reference sequence
 CTGATGTGCCGCCTCACTTCGGTGGT         Short read 1
  TGATGTG-CGCCTCACTACGGTGGTG        Short read 2
   GATGTG-CGCCTCACTTCGGTGGTGA       Short read 3
GCTGATGTGCCGCCTCACTACGGTG           Short read 4
GCTGATGTGCCGCCTCACTACGGTG           Short read 5

BED file:
Chr7 127471196 127472363 Pos1 0 +

BLAST
http://www.mrc-lmb.cam.ac.uk/rlw/text/bioinfo_tuto/sequence.html

No alignment?

Alignment to human genome
• GRCh37 (NCBI) vs hg19 (UCSC), released Feb 2009
• GRCh38 (NCBI) or hg38 (UCSC), released Dec 2013

FASTA format:
>seq1
ACGTCGTG
>seq2 additional info
TCGCAGCG
Unique sequence name: the text right after ">" (seq1, seq2)

Alignment to genome
• The NCBI/UCSC naming also applies to the mouse genome: GRCm38/mm10

Alignment to genome
• Ungapped
  • BWA
  • Novoalign
• Gapped
  • Bowtie
  • STAR

IGV
• Annotation track
• Navigation
• Data track

Alignment QC – Coverage statistics
• How many reads are aligned?
• How even is the overall coverage?
• Average insert size
• How many reads come from the region of interest?
  • On/off-target reads
  • BED file: defines the region of interest
• What is the average coverage?
• What percentage of target bases have at least X× coverage?

Alignment – Coverage statistics
• [MultiQC demo]

Alignment – Coverage statistics
Picard-Tools

Alignment
PCR library

DNA variant calling
• Single nucleotide variants (SNVs) + short indels
  • Somatic/germline
• Copy number variants (CNVs)
• Structural variants

IGV – inspect variants
Somatic

Compound heterozygote

CNV

IGV – soft clips

Variant annotation
MultiQC output

Variant annotation + report
https://www.kti.admin.ch/kti/en/home/ueber-uns/nsb-news/weitere-news/sophiagenetics.html
https://bioconductor.org/packages/devel/bioc/vignettes/maftools/inst/doc/maftools.html

RNA-Seq pipeline
Feature counts (+ normalization)
Volcano plot
Heat map
*RNASeq

RNA-Seq reads aligned to the reference genome
Annotation
https://www.melbournebioinformatics.org.au/tutorials/tutorials/rna_seq_dge_basic/rna_seq_basic_tutorial/
*RNASeq

Take away
• Terminology
• Interpreting different QC metrics
• Interpreting NGS data visually
• Basic intuition (reads, alignments, references, variants)

Thank you for your attention
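
Supplementary sketch: coverage statistics
The questions on the "Alignment QC – Coverage statistics" slide (on/off-target reads, average coverage, percentage of target bases with at least X× coverage) can be illustrated with a small toy calculation. This is a minimal sketch, not how Picard-Tools or MultiQC compute their metrics. The function name coverage_stats, the toy coordinates, and the rule that a read counts as on-target if it overlaps the target by at least one base are all assumptions made for illustration. Intervals are 0-based and half-open, as in a BED file.

```python
from collections import Counter

def coverage_stats(reads, targets, min_depth):
    """Toy coverage metrics.

    reads, targets: lists of (chrom, start, end), 0-based half-open like BED.
    A read is counted as on-target if it overlaps a target by >= 1 base
    (an assumption for this sketch; real tools may define this differently).
    """
    # Every base that belongs to the region of interest.
    target_bases = set()
    for chrom, start, end in targets:
        for pos in range(start, end):
            target_bases.add((chrom, pos))

    depth = Counter()   # (chrom, pos) -> number of reads covering that target base
    on_target = 0
    for chrom, start, end in reads:
        hit = False
        for pos in range(start, end):
            if (chrom, pos) in target_bases:
                depth[(chrom, pos)] += 1
                hit = True
        on_target += hit

    n = len(target_bases)
    covered_enough = sum(1 for base in target_bases if depth[base] >= min_depth)
    return {
        "on_target_fraction": on_target / len(reads),
        "mean_coverage": sum(depth.values()) / n,
        "pct_bases_at_least": 100.0 * covered_enough / n,
    }

# Hypothetical toy data: one 10-base target and four aligned reads.
targets = [("chr7", 100, 110)]
reads = [("chr7", 95, 105), ("chr7", 100, 110), ("chr7", 104, 114), ("chr7", 200, 210)]

stats_2x = coverage_stats(reads, targets, min_depth=2)
stats_3x = coverage_stats(reads, targets, min_depth=3)
print(stats_2x)
print(stats_3x)
```

In this toy data three of four reads touch the target, so the on-target fraction is 0.75. Raising the threshold from 2× to 3× shows how quickly the "at least X×" percentage can drop when coverage is uneven.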
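
Supplementary sketch: from alignment to candidate variants
The toy alignment on the "Alignment" slide (one reference, five short reads) can be turned into a per-position pileup to build intuition for variant calling. This is only a sketch under stated assumptions: the read offsets are not given on the slide and were reconstructed by matching each read against the reference, "-" is treated as a deleted reference base, and the function names pileup and candidate_variants are hypothetical. Real variant callers also use base qualities, mapping qualities, and statistical models, none of which appear here.

```python
from collections import Counter

# Sequences copied from the "Alignment" slide.
reference = "GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG"

# (0-based offset on the reference, aligned read).
# Offsets are an assumption: they were reconstructed by matching reads to the reference.
reads = [
    (1, "CTGATGTGCCGCCTCACTTCGGTGGT"),   # Short read 1
    (2, "TGATGTG-CGCCTCACTACGGTGGTG"),   # Short read 2
    (3, "GATGTG-CGCCTCACTTCGGTGGTGA"),   # Short read 3
    (0, "GCTGATGTGCCGCCTCACTACGGTG"),    # Short read 4
    (0, "GCTGATGTGCCGCCTCACTACGGTG"),    # Short read 5
]

def pileup(reference, reads):
    """Count the observed bases (or '-') at every reference position."""
    columns = [Counter() for _ in reference]
    for offset, seq in reads:
        for i, base in enumerate(seq):
            columns[offset + i][base] += 1
    return columns

def candidate_variants(reference, reads):
    """List (1-based position, ref base, observed base, supporting reads, depth)."""
    found = []
    for pos, (ref_base, column) in enumerate(zip(reference, pileup(reference, reads))):
        depth = sum(column.values())
        for base, count in sorted(column.items()):
            if base != ref_base:
                found.append((pos + 1, ref_base, base, count, depth))
    return found

variants = candidate_variants(reference, reads)
for variant in variants:
    print(variant)
```

With the offsets assumed above, the pileup reports a deleted C at position 10 supported by 2 of 5 reads and a T to A substitution at position 20 supported by 3 of 5 reads. These are the same differences one would look for visually in IGV.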