Bi5444 Analysis of sequencing data NGS quality control Eva Budinska The NGS analysis pipeline Exp.design Library prep. Sequencing Quality control .RS« Im. h- Visualization in browser Feature quantification Ref.guided transcrip ^assembly/quantification count matrix gtf file gene sets Pathway analysis Diff.expression Diff.splicing Exploration Step 0: base calling (Image analysis) + base quality control Diff.expression Step 0: base calling (image analysis) The identity of each base of a cluster is read off from sequential images One cycle -> one image TG CT AC GAT ... The PHRED score Qphred = -10 x log10P(error) • The Phred quality score is the negative ratio of the error probability to the reference level of P = 1 expressed in Decibel (dB). • The error estimate is based on statistical model providing measure of certainty of each base call in addition to the nucleotide itself • These statistical models base their error estimate on: • Signal intensities from the recorded image • Number of the sequencing cycle • Distance to other sequence colonies • Phred score is recoded using ASCII in fastq file Phred score Probability of incorrect base call Base call 10 1 in 10 90% 20 1 in 100 99% 30 1 in 1000 99.9% 40 1 in 10 000 99.99% 50 1 in 100 000 99.999% 60 1 in 1000 000 99.9999% II Phred score encoding in II ASCII https://en.wikipedia.org/wiki/FASTQ_format @D7MHBFN1:Z02:D1BUDACXX:4:1101:1340:1967 1:N:0:CATGCA NATCTTCGGATCACTTTGGTCAMTTGAMCGATACAGAGMGATTGTMGTMCMTATTTACCMGGTTCGAGTCATACTMCTCGTTGTCCTATAGT ^DDFFFHHHHHJJJJJJJHIJIJJJIJIJGIIIJJJJJJIIJIJJJHIIFGIIIIJJJJJIIEHJIIHHGFFF^ADFEDDEDCDDBDDEDCDDDDEC £££££££££££££££££££££££££££££££££££££££££..................................................... ..........................xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...................... ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...................... ................................. JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ...................... LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL.................................................... ! ()* + ,-. /0123456789s j<*>?fiftBCDEFCinJKLMHOPQRST!UVWXYZ [\ ] *_" abcdef ghij klinnopqrstLivwxyz{ | )~ i iii r i 33 55 64 13 104 126 0........................26. . .31.......40 -5____0........9.............................40 0........9.............................40 3.....9.............................40 0........................26. . .31........41 S - Sanger Phred+33, raw reads typically {0, 40) X - Solexa Solexa+64, raw reads typically {-5, 401 I - lllumina 1.3+ Phred+64, raw reads typically {0, 40) J - lllumina 1.5+ Phred+64, raw reads typically (3, 40} with Optimised, l^unused, 2'Read Segment Quality Control Indicator {bold} {Note: See discussion above}. L - lllumina 1.8+ Fhred+33, raw reads typically {0, 41} FASTQ format • Combines sequence and base call quality information • Typical file extension:. fastq @D7MHBFN1:202:D1BUDACXX:4:1101:1340:1967 1:N:0:CATGCA NATCTTCGGATCACTTTGGTCAMTTGAMCGATACAGAGMGATTGTMGTMC^ + #1=DDFFFHHHHHJJJJJJJHIJIJJJIJIJGIIIJJJJJJIIJIJJJHIIFGIIIIJJJJJIIEHJIIHHGFFF@?ADFEDDEDCDDBDDBDCDDDDEC • Four lines per sequence (read): • ID (starting with @) • Sequence line • Another ID line (starting with +) • Base qualities (one for each letter in the sequence) Step 1: Read quality control and data filtering Library prep. — Quality control 1M file reference gtflile count matrix gtf file vay analysis Diff.expression Diff.splicing Exploration Before we dive in... let's review few concepts and expressions r The steps of lllumina sequencing 1. Fragment genomic DNA, e.g. with a sonicator. 2. Ligate adapters to both ends of the fragments. 3. PCR amplify the fragments with adapters Spread DNA molecules across flowcells. Goal is to get exactly one DNA molecule per flowcell lawn of primers. This depends purely on probability, based on the concentration of DNA. Use bridge PCR to amplify the single molecule on each lawn so that you can get a strong enough signal to detect. Usually this requires several hundred or low thousands of molecules. Sequence by synthesis of complementary strand: reversible terminator chemistry. A. Library Preparation Genomfc DNA ^^^^S J Fragmentation Adapters J Ligation Seojencrig Library NSS Itrary ts prepared by fragmenting agDNA sample and iigatng specialzed adapters to both fragment ends. C . Sequencing Sequencing Cycles ( ) Data it mported to m odput fib | I Digital image Ctisbf l » Riad v QJOJ.. cuaat 2 -FiMd 2: man... C«jsfc«; - Riad ICTkQ. Ou9Qr4>Raad«AW:.. Taxi Fl* Sequencing reagents, incUdhg fluoresce ntly labeled nucleo-tides, are added and the flrat base is incorporated. The flow cell Is Imaged and tne emission from each cluster Is recorded. The emission wavelength and Intensity are used to Identify tne base. This cycle is repeated If tmes to create a read length of'n'bases. A. Cluster Amplification FlowCsll Qrtdge Amplification Cycles ■iVl'l CLiBterB Library Is loaded Into a flow eel and the fragments hybridize to the flow cell surface. Each bound fragment Is amplified Into a clonal cluster through bridge arnplfflcation. D. Alignment & Data Anaylsis ATGGCATTGCAATTTGACAT TGGGATTGCAATTTG AGATGG TATTG GATGGGATTGCAA GCATTGCAATTTGAC ATGGGATTGCAATT AGATGGCATTGCAATTTG AGATGG TATTGCAATTTGACAT Reads are aligned to a reference sequence with Worrformatlcs software. Alter alignment, dfferencea between the reference genome and the newly sequenced reads can be identified Sources of errors: adapters Sequencing random fragments of DNA is possible via the addition of short nucleotide sequences which allow any DNA fragment to: In step 2, adapters are ligated to the end of the fragments • Bind to a flow cell for next generation sequencing Flow Cell binding Region Read Primer 1 Index Primer Read Primer 2 Flow Cell binding Region Universal Adapter DNA Fragment to be Sequenced -idexed Adapter Index Region • Allow for PCR enrichment of adapter ligated DNA fragments only • Allow for indexing or 'barcoding' of samples so multiple DNA libraries can be mixed together into 1 sequencing lane (known as multiplexing) From: http://tucf-genomics.tufts.edu/documents/protocols/TUCF_Understanding_lllum Sources of errors: PCR duplicates • In step 3 we are intentionally creating multiple copies of each original genomic DNA molecule so that we have enough of them. • PCR duplicates occur when two copies of the same original molecule get onto different primer lawns in a flowcell. • In consequence we read the very same sequence twice! Higher rates of PCR duplicates e.g. 30% arise when you have too little starting material such that greater amplification of the library is needed in step 3, or when you have too great a variance in fragment size, such that smaller fragments, which are easier to PCR amplify end up over-represented. Adapter DNA fragment Dense lawn of primers I Find beautiful explanation of probabilities and much more at: https://www.cureffi.org/2012/12/ll/how-pcr-duplicates-arise-in-next-generation-sequencing/ Sources of errors: sequencing by synthesis the fluorescence • In step 5 we amplify the signal and detect the fluorescence of each base • The assumption is that in a cycle, every molecule on the flowcell is extended by one base • The reality: • Some molecules are not extended or their base has no fluorescent dye • The previous fluorescent dye is not cleaved -the signal from the cluster after a few cycles is a mix of signals from previous bases © © © © © © © © © © © © 3' 5' Cycle 1: Add sequencing reagents First base incorporated ct signal Cleave terminator and dye UNA O^N-" TP OH free 3' end Excitation- Cycle 2-n: Add sequencing reagents and repeat I Sequencing I coverage Coverage in DNA sequencing is the number of unique reads that include a given nucleotide in the reconstructed sequence. Reference genome www.metagenomics.wiki I Depth of coverage (coverage depth / mapping depth) How strongly is the genome "covered" by sequenced fragments (short reads)? Per-base coverage is the average number of times a base of a genome is sequenced (in other words, how many reads cover it). Average coverage of the genome (Av) Av = (NxL)/G G - length of the original genome N - number of reads L - average read length 5X CD C7> ro i— QJ > S 4X £ 3X JZ 2X a ix CD Q Reference genome www.metagenomics.wiki The coverage depth of a genome is calculated as the number of bases of all short reads that match a genome divided by the length of this genome. It is often expressed as IX, 2X, 3X,... {1, 2, or, 3 times coverage). Breadth of coverage (covered length) What proportion of the genome is "covered" by short reads? Are there regions that are not covered, even not by a single read? CD ro CD > o u CD-CD □ 5X 4X 3X 2X IX Breadth of coverage Reference genome www.metagenomics.wiki Breadth of coverage is the percentage of bases of a reference genome that are covered with a certain depth. For example: "90% of a genome is covered at IX depth; and still 70% is covered at 5X depth." Sequencing coverage • Deep sequencing refers to the general concept of aiming for high number of unique reads of each region of a sequence. Step 1: Read quality control and data filtering Uses the output file with information about the quality of base calls (.fastq) First step in the pipeline that deals with actual sequencing data in base or color space • Several metrics are evaluated, not all of them use the Phred score information: • Distribution of quality scores at each sequence, Sequence composition, Per-sequence and per-read distribution of GC content, Library complexity, Overrepresented sequences • Initial overview - already in base calling SW • More quality overview - SW solutions SolexaQA, FastQC Main quality control points 1. Base quality 2. Sequence composition - sequence content across bases should not change with cycle (exception are targeted sequencing SNP experiments) 3. Per-sequence and per-read distribution of GC content (shift from expected can indicate contamination by rRNA for instance) 4. Library complexity (too many duplicates?) 5. Overrepresented sequences -may represent highly expressed genes, or presence of adapters or rRNA contamination or PCR duplicates Base quality • Quality of bases (Phred score) should be good across all cycles (all the sequence) Base quality - an excellent example Quality scores across all bases (lllumina 1.5 encoding) • Shows distribution (boxplot) of quality of bases (Phred scores) across all reads in each cycle i 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 Position in read (bp) Base quality - a more common example Quality scores across all bases (Sanger / lllumina 1.9 encoding) — I • Decrease of quality towards the end of reads (late cycles) 1 5 6 7 8 9 15-19 25-29 35-39 45-49 55-59 65-69 75-79 85-89 95 Position in read (bp) Base quality - bad example ?ffiffiffiffiÖ Quality scores across all bases (lllumina 1.5 encoding) III Tin Hill 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 Position in read (bp) Base quality - sudden quality drop Indicates problems with flow cell, trimming needed 34 0 32.0 30.0 28.0 26.0 24.0 22.0 20.0 18.0 16 0 14.0 12 0 10.0 8.0 6 0 4.0 2.0 0 0 Quality scores across all bases (lllumina >vl.3 encoding) 11 13 15 17 19 21 23 25 Position in read (bp) 27 29 31 33 35 37 39 Base quality -targeted sequencing • The low quality extremes suggest a problem in the beginning of the reads • (primers?, NNNNN sequences...) Quality scores across all bases (Sanger / lllumina 1.9 encoding) 40 123456789 15-19 30-34 45-49 60-64 75-79 90-94 105-109 120-124 135-139 150-151 Position in read (bp) Quality scores across all bases (Sanger / lllumina 1.9 encoding) Base quality microbiome This is what it can look like with very small sample and sequence size ft 1234567B9 15-19 30-34 45-49 60-64 75-79 90-94 105-109 120-124 135-139 150-151 Position in read (bp) Base call errors in last cycles • Towards the end of sequencing, the quality drops, signal worse • We can see it for lllumina and SOLID • Not very important for RNAseq, but crucial for variant calling SNP calling TTAACAG TG TTCAGTA AG TTCAGTAAG ACAG TG TTCAG TAAG TGTTCAGTAAG CAGTGTTCAGTAAG GTTCAGTAAG G TG TTCAG TAAG TCAGTAAG AACAGTGTTCAGTAAG AGTAAG T 1 1 A T TTCCATGAGCT TTCC TTCCATGAG 1 ICCA TTCCATGAGC TTCCATGA TTCCATGAGCTC 1 1 TTCCATGAG CTCT CTTA AC AG TG TTC AG TAAG TTCCATGAG CTCT SNP calling TTA AC AG TG TTC AGTA A< A AC AG TG TTC AG TA A( AC AG TG TTC AG TA A( C AG TG TTC AG TA A< G TG TTC AG TA Ai TGTTCAGTAA( GTTCAGTAAi TTCAGTAA( TCAGTAAl AGTAAi T T IT TTCC ITCCA TTCCATGA TTCCATGAG TTCCATGAGC TTCCATGAGCT TTCCATGAGCTC TTCCATGAGCTCT CTTA AC AG TG TTC AG TA AGi ATTCC ATG AG CTCT SNP error dependent on cycle These errors are not random and look like SNPs (e.g. if there were randomly distributed T, C, G and A's, we would conclude it is error directly) We want the SNPs to be distributed evenly across cycles SNPs coming from towards end of the read are sign of false positive = 8 r T-A C-A G-A A-T C-T G-T A-C T-C G-C A-G T-G C-G 10 15 20 Sequencing cycle SNP error dependent on cycle These errors are not random and look like SNPs (e.g. if there were randomly distributed T, C, G and A's, we would conclude it is error directly) We want the SNPs to be distributed evenly across cycles SNPs coming from towards end of the read are sign of false positive 1.0 - c o 0.8 - 'c/5 o CL E 0.6 - o o CD 0.4 - s CD u □ Z 0.2 - 0.0 - Sequencing cycle Long fragments have lower base quality From: Long fragments achieve lower base quality in lllumina paired-end sequencing 0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5 Fraction of reads with average quality below 30 Fraction of reads with average mismatch rate above 0.01 We plot the fraction of low quality reads in the 138 samples analyzed in our study. Across all samples the R2 reads harbor more low quality reads than the Rl reads. We plot two alternative definitions of 'low quality'. Reads are called low quality if (A) the average Phred score is below 30, or (B) the average mismatch rate of the aligned bases is above 0.01. Both plots show that the R2 reads harbor more low quality reads and that the fraction of low quality reads is more variable across samples. Increase of R2 low quality reads as a function of the content of long fragments From: Long fragments achieve lower base quality in lllumina paired-end sequencing Fraction of reads with fragment length > 500nt Fraction of low quality R1 reads Increase of R2 low quality reads as a function of the content of long fragments. In (A) we plot for individual samples the difference in low quality read content among the R2 and the Rl reads versus the content of long fragments. The plot shows that the more long fragments a samples has the more prevalent are low quality reads among the R2 reads. In (B) we directly compare the fraction of low quality reads in R2 and Rl and color-code the content long fragments. Low quality reads are defined as reads having a mismatch rate above 0.01 in the bases after alignment. The plotted samples have been generated using various protocols on various sequencers in various labs. The dashed lines connect three samples each that have been processed identically except with an increasing targeted fragment length. Per base sequence content • Sequence content across bases should not change with cycle Per base sequence content - RNAseq -typical lllumina library • The primers used in the library are typically not removed Sequence content across all bases 1 2 3 4 5 6 7 8 15-19 25-29 35-39 45-49 55-59 65-69 75-79 85-39 95-! Position in read (bp) Per base sequence content - targeted sequencing Sequence content across all bases In targeted sequencing there is much less genes being sequenced so the base composition of reads is non-random %T %G \ \ L / A 0 A >-* --■ V 1 \ 123456789 15-19 30-34 45-49 60-64 75-79 90-94 105-109 120-124 135-139 150-15! Position in read (bp) Per base sequence content - a bad example? Sequence content across all bases • This suggests that a single sequence makes up a large part of the library - this can mean rRNA contamination in RNAseq Per base sequence content - microbiome Sequence content across all bases • ... however, it is excepted if we sequence 16S rRNA of microbiome where one or few bacteria strains are dominating Position in read (bp) Per sequence quality • All - or at least majority of the sequences should have good average quality (average Phred score across all read bases) Per sequence quality - RNAseq Per sequence quality - microbiome 16 17 IB 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 Mean Sequence Quality (Phred Score) Per sequence quality - targeted sequencing Quality score distribution over all sequences Small peaks in lower average quality can suggest low quality ends on part of sequences - attention, if small read diversity (e.g. microbiome), this can be due to highly duplicated reads due to too deep sequencing 2 3 4 5 6 7 3 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 Mean Sequence Quality (Phred Score) Per sequence and per read GC content Mean GC content across reads should correspond to the overall GC content of the genome Evan small shifts can indicate contamination with GC rich sequences (ribosomal RNA with high GC content for instance) 0.5 0.6 0.7 ribosomal RNA fraction Per sequence GC content - RNAseq Per sequence GC content - targeted sequencing GC distributio i over all sequences GC count per read Theoretical Distribution 100000 • This strange theoretical auuuu distribution is due to high 60000 \ I amount of NNNNNN sequences in the reads 40000 y ZUUUU / 0 2 4 6 8 11 15 19 23 27 31 35 39 43 47 51 55 59 63 67 71 75 79 83 87 91 95 99 Mean GC content (%) Per sequence GC content - targeted sequencing after trimming GC distribution over all sequences • The GC count per read is disturbed because of small number of genes sequenced! 0 2 4 6 8 11 15 19 23 27 31 35 39 43 47 51 55 59 63 67 71 75 79 83 87 91 95 Mean GC content (%) Per sequence GC content - microbiome GC distribution over all sequences • The GC count per read is disturbed because of small diversity of sequences 0 2 4 6 8 11 15 19 23 27 31 35 39 43 47 51 55 59 63 67 71 75 79 83 87 91 95 Mean GC content [%) Per read GC content - good example GC content across all bases • The GC count per read is disturbed because of small diversity of sequences 4 5 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 Position in read (bp) Per read GC content - typical RNAseq 90 80 70 • GC content different in first 8-10 bases, due to presence f m 40 of primers 30 20 10 0 123456789 15-19 25-29 35-39 45-49 55-59 65-69 75-79 85-89 95-99 Position in read (bp) Per read GC content -targeted sequencing GC content across all bases GC content across different base positions due to high duplication level of reads and small diversity 123456789 15-19 30-34 45-49 60-64 75-79 90-94 105-109 120-124 135-139 150-15! Position in read (bp) Per read GC content - microbiome Position in read (bp) Per base N content • In ideal case, there should be minimum of N calls in the reads • "The HiSeq2000 produces very few Ns. It is very rare to see N content greater than 30%. When Ns are produced it is usually the result of some temporary instrument issue. For example a small bubble in the flow cell may cause focus problems at a certain cycle. Downstream processing of Ns depends on your analysis software and strategy/' • Source: https://www.biotech.wisc.edu/services/dnaseq/seque ncing/lllumina_old/lllumina_QC_FAQs N content across all bases < 123456789 15-19 30-34 45-49 60-64 75-79 90-94 105-109 120-124 135-139 150-15! Position in read (bp) Per base N content -targeted sequencing N content across all bases Enrichment for N calls at the beginning of the sequence 123456739 15-19 30-34 45-49 60-64 75-79 90-94 105-109 120-124 135-139 150-15! Position in read (bp) Sequence duplication level and overrepresented sequences • Indicates the library complexity and possible contamination • (the less duplicates, the more complex) • Too many duplicated sequences means we sequenced "too much". • Overrepresented sequences may indicate: • Presence of adapters, presence of contamination (rRNA), PCR problems • This holds, however, mainly for WGS, WES or RNAseq Sequence duplication level - good example (RNAseq) Sequence Duplication Level >- 6.44% • Most sequences occur only once i: %Duplicate re ative to unique 4 5 6 Sequence Duplication Level 9 10+ Sequence duplication level - bad example (RNAseq)? • Over-amplification! May \ come from highly expressed ~ \ transcripts. \ 20 0 N. / 10 0 / 0 0 I__ 1.3456789 10 seauence Ouphcation Lewi Overrepresented sequences Overrepresented sequences Overrepresented sequences • Indicate remaining adapters, PCR duplicates, but also can be real sequences! Always judge based on type of data and check before filtering! Count Percentage Possible Source TGCACACGAGGC GGCTTCTGCAACTTCATGCATTT6AAGC C CATTTC CAG AC CTTGGGAGGGTTCAGA6AAGGCTGCATGTCAC CAAGGCCCATG CTAAC CTCTATCTTC C CTAGTGTGGTAAC CTCATTTC C C CATAAAGATTC AGAAC C C C CTC CTCAGCATCTTATC C GAGTGGAAGGAAATTTGC GTGTGGAGTAT TGAGCAGGAGGGGAAAAGTGCTAATTAC CATGACAAGAAC ATTGTATTAC C GTCTC CTGTTTTGTA6TC CAAC C CTGT6AT6ATT6ATG C CAAA6AAGTG GTCTTCATCTTATTGATAGTTTTGATGGTCTTCTTATC CAACACGCCGAG TGCAAGCTC CTGGTGGCAGCTCT6AAC G6TATTTAAAACAAAAT6AAATG TGCCCTGGCCCTGGGCTTGTGGGGCTGCCCAGCAGCTGCCC ATAAAGGAC CTGC C C C CAGGGAGCACTAA6C 6AG6TAAG CAAGCAGGACAAGAAGCGGT C CAGAT6TTCTTC GCTAATAAC CAC 6AC CAG6AATTTGT6AGTG C TGGG C C CTGTGTTATCTC CTAGGTTGGCTCTGACTGTAC CACCATCCACTACAAC GGCTCGGCCACGCGCTACCACACCTACCTGCCGCCGCCCTACCCCGGCTC TTCTCTTGGAAACTC C CATTTGAGATCATATTCATATTCTCTGAAATC AA TGCTCATGCCCACAGA6ACTTGCACAACATG CA6AATGG C AG C AC ATTGG AAAGGATGGAAAAGAGAA6AAGGCATGGGTGGGAAACTGTG C CTC C CATT TCTC GAGGAGGCAGTGACAGCAATGGCAGTTACTGTCAAC AGGTG6AC AT CTGGGTCTC CTCTCTTTC GTGTCAAAGGACTTCTTTGC CAAGTTCACAGA ACATC CTGTCTTACATC CTGGCAGGTAC G6ATCTAAACAGC GACTTTTTT CCTGCGGACCC GATGC CTCTTC CTGCT6AGATC C C TC C AGTTTTTCC CAG AGTGAGTGCAGTTGTTTAC CAT6ATAAC 6ACACAACAC AAAATAG C C GTA AMGATGGAACTCCACCCTTTGCTTGGTTTTAAGTATGTATGGAATGTTA ACTG6AAGAAATGGATTC CAAAGAG CAGTTCTCTTC CTTTAGTTGT6AAG GAGCTATGAGCTAC GGCCGCCCCCCTCCC 6ATGTG6AGGGTATGAC CTC C AAC C CAC CAATTTTTGGTAGCAGTGGAGAGC TACAGGACAACTGCCAGCA CCCATCCTCAC CATCATCACACTG6AA6ACTC CAGGTCAGGAGCCACTTGCCACCCTGCACAC TGG 110195 8442 7426 7261 7196 6962 6668 6521 6359 6352 6189 6878 6865 6858 5918 5905 5895 5773 5672 5574 5554 5511 5500 5425 5424 5355 5317 6.337201578736401 0.4854907729723917 0.427061653647593 0.4175726726548846 0.4138345892335147 0.39692695037377956 0.3834698500568476 9.3750160306269801 0.36569957656141183 0.36529701373141815 0.3513223326330657 0.3490794825802437 0.3487919377016768 0.3483893748716831 0.3398780464661022 0.33959O5O15875353 0.33961541183040146 0.3319993167933685 0.3261909102463167 0.320555030626405 0.31940485111213734 0.3169319651564618 0.31629936642361456 0.31198619324511073 0.31192868426939735 0.3079605649451738 0.3057752238680652 NO Hit NO Hit No Hit No Hit No Hit No Hit NO Hit No Hit NO Hit No Hit No Hit No Hit No Hit NO Hit NO Hit No Hit NO Hit No Hit No Hit NO Hit No Hit No Hit No Hit No Hit No Hit No Hit No Hit 10.1) Sequence duplication level - targeted sequencing Sequence Duplication Level >= 94.87% %Duplicate relative to unique 5 6 Sequence Duplication Level Some reads are present more than 10 times. This is due to NNNN sequences and due to few genes sequenced (longer genes get more reads) Overrepresented sequences TGCACACGAGGCGGCTTCTGCAACTTC ATGCATTTGAAGC C CATTTC CAG AC CTTGGGAGGGTTCAGAGAAGGCTGCATGTCAC CAAGGCCCATGCTAAC CTCTATCTTCCCTAGTGTGGTAAC CTCATTTC C C CATAAAGATTCAGAAC CCCCTCCTCAGCATCTTATCC GAGTGGAAGGAAATTTGC GTGTGGAGTAT TGAGCAGGAGGGGAAAAGTGCTAATTACCATGACAAGAACATTGTATTAC C GTCTC CTGTTTTGTAGTC CAAC C CTGTGATGATTGATGC CAAAGAAGTG GTCTTCATC TTATTGATAGTTTTGATGGTC TTC TTATC CAACAC GC C GAG TGCAAGCTC CTGGTGGCAGCTCTGAAC GGTATTTAAAACAAAATGAAATG TGCCCTGGCCCTGGGCTTGTGGGGCTGCCCAGCAGCTGCC CATAAAGGAC CTGC C C C CAGGGAGCACTAAGC GAGGTAAGCAAGCAGGACAAGAAGC GGT C CAGATGTTCTTC GCTAATAAC CAC GAC CAGGAATTTGTGAGTGCTGGGC CCTGTGTTATCTCCTAGGTTGGCTCTGACTGTACCACCATCCACTACAAC GGCTCGGCCACGCGCTACCACACCTACCTGCCGCCGCCCTACCCCGGCTC TTCTCTTGGAAACTC C CATTTGAGATCATATTCATATTCTCTGAAATCAA TGCTCATGCC CACAGAGACTTGCACAACATGCAGAATGGCAGCACATTGG AAAGGATGGAAAAGAGAAGAAGGCATGGGTGGGAAACTGTGCCTCCCATT TCTCGAGGAGGCAGTGACAGC AATGGC AGTT ACTGTC AACAGGTGGACAT CTGGGTCTC CTCTCTTTC GTGTCAAAGGACTTCTTTGC CAAGTTCACAGA ACATC CTGTCTTACATC CTGGCAGGTAC GGATCTAAACAGC GAC111111 CCTGCGGACCCGATGCCTCTTCCTGCTGAGATCCCTC CAGTTTTTC C CAG AGTGAGTGCAGTTGTTTAC CATGATAAC GACACAACACAAAATAGC C GTA AMGATGGMCTCCACCCTTTGCTTGGTTTTAAGTATGTATGGAATGTTA ACTGGAAGAAATGGATTC CAAAGAGCAGTTCTCTTC CTTTAGTTGTGAAG GAGCTATGAGCTACGGCCGCCCCCCTCCC GATGTGGAGGGTATGAC CTC C AAC C C AC CMTTTTTGGTAGCAGTGGAGAGCTAC AGGAC AACTGC C AGC A CCCATCCTCACCATCATCACAC TGGAAGAC TCCAGGTCAGGAGCCACTTGCCACCCTGCACACTGG 10.1) Count Percentage Possible Source 110195 8442 7426 7261 7196 6962 6666 6521 6359 6352 6109 6676 6065 6656 5910 5965 5895 5773 5672 5574 5554 5511 55ee 5425 5424 5355 5317 6.337201578736401 0.4854907729723917 0.427061653647593 0.4175726726548846 0.4138345892335147 0.39692695037377956 0.3834698500568476 0.3750160306269801 0.36569957656141183 0.36529701373141815 0.3513223326330657 0.3490794825802437 0.3487919377016768 0.3483893748716831 0.3398780464661022 0.3395905015875353 0.33901541183040146 0.3319993167933685 0.3261909102463167 0.320555030626405 0.31940485111213734 0.3169319651564618 0.31629936642361456 0.31198619324511073 0.31192868426939735 0.3079605649451738 0.3057752238680652 NO Hit NO Hit No Hit No Hit No Hit No Hit NO Hit No Hit NO Hit No Hit No Hit No Hit No Hit NO Hit NO Hit No Hit NO Hit No Hit No Hit NO Hit No Hit No Hit No Hit No Hit No Hit No Hit Quality control exercise • We continue in our exercise from l_Preprocessing.sh Step 1: Read quality control and data filtering Based on the quality measures, we decide to remove low quality bases and reads • Trimming - removes low quality or unwanted bases from reads, thus shortening them. Is applied to increase the number of mappable reads. • Filtering - removes whole reads that do not meet quality standards (e.g. too short etc) Trimming reads • Read trimming is applied to increase the number of mappable reads by: • Removing low quality bases at the end of the reads that are likely to contain sequencing errors • Removing adapter sequences Ill Removing adapters • Important mainly for very short read sequences of interest (when the input DNA fragment is less than the read length) e.g. for miRNA with 22nt length the adapter gets sequenced more often than for RNA sequences, which are much longer What is the sequence of adapters? Best option: ask which kit was used for preparing libraries Programs: cutadapt, trimmomatic TruSeq Universal Adapter: 5 AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT 3 TruSeq Indexed Adapter 5 GATCGGAAGAGCACACGTCTGAACTCCAGTCAC NNNNNN ATCTCGTATGCCGTCTTCTGCTTG 3 Here "N" is any nucleotide, and the 6 of them together are a unique sequence which can readily be identified as unique to 1 library. Ill Filtering reads • We can remove whole reads based on: • quality of its base calls • its length (too short reads) • level of duplication Ill Trimming and filtering - exercise Trimming and filtering - exercise • We continue in our exercise from l_Preprocessing.sh • We will use grep command to find adapter sequences and cutadapt to remove them • We will trim low quality bases • Independent work: find specific QC problems in your project data and suggest solutions (what to trim, filter, etc) Recommended literature • Fuller et al. 2009: The challenges of sequencing by synthesis http://arep.med.harvard.edu/pdf/Fuller 09.pdf • https://sequencing.qcfail.com/