Bi5444 Analysis of sequencing data
NGS quality control
Eva Budinska
The NGS
analysis
pipeline
Exp.design
Library prep.
Sequencing
Quality control
.RS«
Im. h-
Visualization in browser
Feature quantification
Ref.guided transcrip ^assembly/quantification
count matrix
gtf file
gene sets
Pathway analysis
Diff.expression
Diff.splicing Exploration
Step 0: base calling (Image analysis)
+ base quality control
Diff.expression
Step 0: base calling (image analysis)
The identity of each base of a cluster is read off from sequential images
One cycle -> one image
TG CT AC GAT ...
The PHRED score
Qphred = -10 x log10P(error)
• The Phred quality score is the negative ratio of the error probability to the reference level of P = 1 expressed in Decibel (dB).
• The error estimate is based on statistical model providing measure of certainty of each base call in addition to the nucleotide itself
• These statistical models base their error estimate on:
• Signal intensities from the recorded image
• Number of the sequencing cycle
• Distance to other sequence colonies
• Phred score is recoded using ASCII in fastq file
Phred score	Probability of incorrect base call	Base call
10	1 in 10	90%
20	1 in 100	99%
30	1 in 1000	99.9%
40	1 in 10 000	99.99%
50	1 in 100 000	99.999%
60	1 in 1000 000	99.9999%
II Phred score encoding in II ASCII
https://en.wikipedia.org/wiki/FASTQ_format
@D7MHBFN1:Z02:D1BUDACXX:4:1101:1340:1967 1:N:0:CATGCA
NATCTTCGGATCACTTTGGTCAMTTGAMCGATACAGAGMGATTGTMGTMCMTATTTACCMGGTTCGAGTCATACTMCTCGTTGTCCTATAGT
^DDFFFHHHHHJJJJJJJHIJIJJJIJIJGIIIJJJJJJIIJIJJJHIIFGIIIIJJJJJIIEHJIIHHGFFF^ADFEDDEDCDDBDDEDCDDDDEC
£££££££££££££££££££££££££££££££££££££££££.....................................................
..........................xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx......................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
................................. JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
! ()* + ,-. /0123456789s j<*>?fiftBCDEFCinJKLMHOPQRST!UVWXYZ [\ ] *_" abcdef ghij klinnopqrstLivwxyz{ | )~
i iii r i
33 55     64 13 104 126
0........................26. . .31.......40
-5____0........9.............................40
0........9.............................40
3.....9.............................40
0........................26. . .31........41
S - Sanger Phred+33, raw reads typically {0, 40)
X - Solexa Solexa+64, raw reads typically {-5, 401
I - lllumina 1.3+ Phred+64, raw reads typically {0, 40)
J - lllumina 1.5+ Phred+64, raw reads typically (3, 40}
with Optimised,  l^unused, 2'Read Segment Quality Control Indicator {bold}
{Note: See discussion above}.
L - lllumina 1.8+ Fhred+33, raw reads typically {0, 41}
FASTQ format
• Combines sequence and base call quality information
• Typical file extension:. fastq
@D7MHBFN1:202:D1BUDACXX:4:1101:1340:1967 1:N:0:CATGCA
NATCTTCGGATCACTTTGGTCAMTTGAMCGATACAGAGMGATTGTMGTMC^
+
#1=DDFFFHHHHHJJJJJJJHIJIJJJIJIJGIIIJJJJJJIIJIJJJHIIFGIIIIJJJJJIIEHJIIHHGFFF@?ADFEDDEDCDDBDDBDCDDDDEC
•   Four lines per sequence (read):
• ID (starting with @)
• Sequence line
• Another ID line (starting with +)
• Base qualities (one for each letter in the sequence)
Step 1:
Read quality control and data filtering
Library prep. —
	
Quality control	
	
1M file
reference gtflile
count matrix
gtf file
vay analysis
Diff.expression
Diff.splicing Exploration
Before we dive in...
let's review few concepts and expressions
r
The steps of lllumina sequencing
1. Fragment genomic DNA, e.g. with a sonicator.
2. Ligate adapters to both ends of the fragments.
3. PCR amplify the fragments with adapters
Spread DNA molecules across flowcells. Goal is to get exactly one DNA molecule per flowcell lawn of primers. This depends purely on probability, based on the concentration of DNA.
Use bridge PCR to amplify the single molecule on each lawn so that you can get a strong enough signal to detect. Usually this requires several hundred or low thousands of molecules.
Sequence by synthesis of complementary strand: reversible terminator chemistry.
A. Library Preparation Genomfc DNA ^^^^S
J Fragmentation
Adapters
J Ligation
Seojencrig Library
NSS Itrary ts prepared by fragmenting agDNA sample and iigatng specialzed adapters to both fragment ends.
C . Sequencing
Sequencing Cycles
( )
Data it mported to m odput fib |
I Digital image
Ctisbf l » Riad v QJOJ..
cuaat 2 -FiMd 2: man...
C«jsfc«; - Riad ICTkQ. Ou9Qr4>Raad«AW:..   Taxi Fl*
Sequencing reagents, incUdhg fluoresce ntly labeled nucleo-tides, are added and the flrat base is incorporated. The flow cell Is Imaged and tne emission from each cluster Is recorded. The emission wavelength and Intensity are used to Identify tne base. This cycle is repeated If tmes to create a read length of'n'bases.
A. Cluster Amplification
FlowCsll
Qrtdge Amplification Cycles
■iVl'l
CLiBterB
Library Is loaded Into a flow eel and the fragments hybridize to the flow cell surface. Each bound fragment Is amplified Into a clonal cluster through bridge arnplfflcation.
D. Alignment & Data Anaylsis
ATGGCATTGCAATTTGACAT TGGGATTGCAATTTG AGATGG TATTG GATGGGATTGCAA
GCATTGCAATTTGAC ATGGGATTGCAATT AGATGGCATTGCAATTTG
AGATGG TATTGCAATTTGACAT
Reads are aligned to a reference sequence with Worrformatlcs software. Alter alignment, dfferencea between the reference genome and the newly sequenced reads can be identified
Sources of errors: adapters
Sequencing random fragments of DNA is possible via the addition of short nucleotide sequences which allow any DNA fragment to:
In step 2, adapters are ligated to the end of the fragments
• Bind to a flow cell for next generation sequencing
Flow Cell binding Region
Read Primer 1
Index Primer
Read Primer 2
Flow Cell binding Region
Universal Adapter DNA Fragment to be Sequenced -idexed Adapter Index Region
• Allow for PCR enrichment of adapter ligated DNA fragments only
• Allow for indexing or 'barcoding' of samples so multiple DNA
libraries can be mixed together into 1 sequencing lane (known as multiplexing)
From:
http://tucf-genomics.tufts.edu/documents/protocols/TUCF_Understanding_lllum
Sources of errors: PCR duplicates
• In step 3 we are intentionally creating multiple copies of each original genomic DNA molecule so that we have enough of them.
• PCR duplicates occur when two copies of the same original molecule get onto different primer lawns in a flowcell.
• In consequence we read the very same sequence twice!
Higher rates of PCR duplicates e.g. 30% arise when you have too little starting material such that greater amplification of the library is needed in step 3, or when you have too great a variance in fragment size, such that smaller fragments, which are easier to PCR amplify end up over-represented.
Adapter
DNA fragment
Dense lawn of primers
I
Find beautiful explanation of probabilities and much more at: https://www.cureffi.org/2012/12/ll/how-pcr-duplicates-arise-in-next-generation-sequencing/
Sources of errors: sequencing by synthesis the fluorescence
• In step 5 we amplify the signal and detect the fluorescence of each base
• The assumption is that in a cycle, every molecule on the flowcell is extended by one base
• The reality:
• Some molecules are not extended or their base has no fluorescent dye
• The previous fluorescent dye is not cleaved -the signal from the cluster after a few cycles is a mix of signals from previous bases
	©
©	©
©	
	©
©	©
©	
	©
©	
	©
©	
3' 5'
Cycle 1: Add sequencing reagents First base incorporated
ct signal Cleave terminator and dye
UNA O^N-"
TP
OH free 3' end
Excitation-
Cycle 2-n: Add sequencing reagents and repeat
I Sequencing I coverage
Coverage in DNA sequencing is the number of unique reads that include a given nucleotide in the reconstructed sequence.
Reference genome
www.metagenomics.wiki
I
Depth of coverage
(coverage depth / mapping depth)
How strongly is the genome "covered" by sequenced fragments (short reads)?
Per-base coverage is the average number of times a base of a genome is sequenced (in other words, how many reads cover it).
Average coverage of the genome (Av)
Av = (NxL)/G
G - length of the original genome N - number of reads L - average read length
5X
CD
C7>
ro i—
QJ
>
S 4X £ 3X JZ 2X
a ix
CD
Q
Reference genome
www.metagenomics.wiki
The coverage depth of a genome is calculated as the number of bases of all short reads that match a genome divided by the length of this genome. It is often expressed as IX, 2X, 3X,... {1, 2, or, 3 times coverage).
Breadth of coverage
(covered length)
What proportion of the genome is "covered" by short reads? Are there regions that are not covered, even not by a single read?
CD
ro
CD >
o u
CD-CD □
5X 4X 3X 2X IX
Breadth of coverage
Reference genome
www.metagenomics.wiki
Breadth of coverage is the percentage of bases of a reference genome that are covered with a certain depth. For example: "90% of a genome is covered at IX depth; and still 70% is covered at 5X depth."
Sequencing coverage
• Deep sequencing refers to the general concept of aiming for high number of unique reads of each region of a sequence.
Step 1: Read quality control and data filtering
Uses the output file with information about the quality of base calls (.fastq)
First step in the pipeline that deals with actual sequencing data
in base or color space
• Several metrics are evaluated, not all of them use the Phred score information:
• Distribution of quality scores at each sequence, Sequence composition, Per-sequence and per-read distribution of GC content, Library complexity, Overrepresented sequences
• Initial overview - already in base calling SW
• More quality overview - SW solutions SolexaQA, FastQC
Main quality control points
1. Base quality
2. Sequence composition - sequence content across bases should not change with cycle (exception are targeted sequencing SNP experiments)
3. Per-sequence and per-read distribution of GC content (shift from expected can indicate contamination by rRNA for instance)
4. Library complexity (too many duplicates?)
5. Overrepresented sequences -may represent highly expressed genes, or presence of adapters or rRNA contamination or PCR duplicates
Base quality
• Quality of bases (Phred score) should be good across all cycles
(all the sequence)
Base quality - an excellent example
Quality scores across all bases (lllumina 1.5 encoding)
• Shows distribution (boxplot) of quality of bases (Phred scores) across all reads in each cycle
i   6   7   8   9 10     12     14     16     18     20     22     24     26     28     30     32     34 36
Position in read (bp)
Base quality - a more common example
Quality scores across all bases (Sanger / lllumina 1.9 encoding)
—				
				
I
• Decrease of quality towards the end of reads (late cycles)
1    5     6     7     8     9        15-19    25-29    35-39    45-49    55-59    65-69    75-79    85-89 95
Position in read (bp)
Base quality - bad example
?ffiffiffiffiÖ
Quality scores across all bases (lllumina 1.5 encoding) III
Tin
Hill
1   2  3 4  5 6  7 8  9 10     12     14     16     18     20     22     24     26     28     30     32     34     36     38 40
Position in read (bp)
Base quality - sudden quality drop
Indicates problems with flow cell, trimming needed
34 0
32.0
30.0
28.0
26.0
24.0
22.0
20.0
18.0
16 0
14.0
12 0
10.0
8.0
6 0
4.0
2.0
0 0
Quality scores across all bases (lllumina >vl.3 encoding)
11    13 15
17    19    21    23 25 Position in read (bp)
27    29    31     33     35     37 39
Base quality -targeted sequencing
• The low quality extremes suggest a problem in the beginning of the reads
• (primers?, NNNNN sequences...)
Quality scores across all bases (Sanger / lllumina 1.9 encoding)
40
123456789    15-19     30-34     45-49     60-64     75-79     90-94    105-109 120-124 135-139 150-151
Position in read (bp)
Quality scores across all bases (Sanger / lllumina 1.9 encoding)
Base quality
microbiome
This is what it can look like with very small sample and sequence size
				
ft				
1234567B9    15-19      30-34      45-49      60-64      75-79      90-94    105-109 120-124 135-139 150-151
Position in read (bp)
Base call errors in last cycles
• Towards the end of sequencing, the quality drops, signal worse
• We can see it for lllumina and SOLID
• Not very important for RNAseq, but crucial for variant calling
SNP calling
TTAACAG TG TTCAGTA AG TTCAGTAAG ACAG TG TTCAG TAAG TGTTCAGTAAG CAGTGTTCAGTAAG GTTCAGTAAG G TG TTCAG TAAG TCAGTAAG AACAGTGTTCAGTAAG AGTAAG	T 1 1 A	T TTCCATGAGCT TTCC TTCCATGAG 1 ICCA TTCCATGAGC TTCCATGA TTCCATGAGCTC 1 1 TTCCATGAG CTCT
CTTA AC AG TG TTC AG TAAG		TTCCATGAG CTCT
SNP calling
TTA AC AG TG TTC AGTA A< A AC AG TG TTC AG TA A( AC AG TG TTC AG TA A( C AG TG TTC AG TA A< G TG TTC AG TA Ai TGTTCAGTAA( GTTCAGTAAi TTCAGTAA( TCAGTAAl AGTAAi
T
T
IT
TTCC ITCCA TTCCATGA TTCCATGAG TTCCATGAGC TTCCATGAGCT
TTCCATGAGCTC TTCCATGAGCTCT
CTTA AC AG TG TTC AG TA AGi ATTCC ATG AG CTCT
SNP error dependent on cycle
These errors are not random and look like SNPs (e.g. if there were randomly distributed T, C, G and A's, we would conclude it is error directly)
We want the SNPs to be distributed evenly across cycles
SNPs coming from towards end of the read are sign of false positive
= 8
r
T-A C-A G-A A-T C-T G-T A-C T-C G-C A-G T-G C-G
10 15 20
Sequencing cycle
SNP error dependent on cycle
These errors are not random and look like SNPs (e.g. if there were randomly distributed T, C, G and A's, we would conclude it is error directly)
We want the SNPs to be distributed evenly across cycles
SNPs coming from towards end of the read are sign of false positive
	1.0 -
c	
o	0.8 -
'c/5	
o	
CL	
E	0.6 -
o	
o	
CD	
	0.4 -
s	
CD u	
□ Z	0.2 -
	0.0 -
Sequencing cycle
Long fragments have lower base quality
From: Long fragments achieve lower base quality in lllumina paired-end sequencing
0.0 0.1 0.2 0.3 0.4 0.5 0.0 0.1 0.2 0.3 0.4 0.5
Fraction of reads with average quality below 30 Fraction of reads with average mismatch rate above 0.01
We plot the fraction of low quality reads in the 138 samples analyzed in our study. Across all samples the R2 reads harbor more low quality reads than the Rl reads. We plot two alternative definitions of 'low quality'. Reads are called low quality if (A) the average Phred score is below 30, or (B) the average mismatch rate of the aligned bases is above 0.01. Both plots show that the R2 reads harbor more low quality reads and that the fraction of low quality reads is more variable across samples.
Increase of R2 low quality reads as a function of the content of long fragments
From: Long fragments achieve lower base quality in lllumina paired-end sequencing
Fraction of reads with fragment length > 500nt Fraction of low quality R1 reads
Increase of R2 low quality reads as a function of the content of long fragments. In (A) we plot for individual samples the difference in low quality read content among the R2 and the Rl reads versus the content of long fragments. The plot shows that the more long fragments a samples has the more prevalent are low quality reads among the R2 reads. In (B) we directly compare the fraction of low quality reads in R2 and Rl and color-code the content long fragments. Low quality reads are defined as reads having a mismatch rate above 0.01 in the bases after alignment. The plotted samples have been generated using various protocols on various sequencers in various labs. The dashed lines connect three samples each that have been processed identically except with an increasing targeted fragment length.
Per base sequence content
• Sequence content across bases should not change with cycle
Per base sequence content - RNAseq -typical lllumina library
• The primers used in the library are typically not removed
Sequence content across all bases
1    2    3    4    5    6    7 8
15-19    25-29    35-39    45-49    55-59    65-69    75-79    85-39 95-! Position in read (bp)
Per base sequence content - targeted sequencing
Sequence content across all bases
In targeted sequencing there is much less genes being sequenced so the base composition of reads is non-random
																																					%T %G
																																					
																																					
																																					
																																					
\																																					
\	L																																				/
			A			0	A																				>-*	--■									
					V																																
1																																					\
																																					
123456789    15-19      30-34      45-49      60-64      75-79      90-94    105-109 120-124 135-139 150-15!
Position in read (bp)
Per base sequence content - a bad example?
Sequence content across all bases
• This suggests that a single sequence makes up a large part of the library - this can mean rRNA contamination in RNAseq
Per base sequence content - microbiome
Sequence content across all bases
• ... however, it is excepted if we sequence 16S rRNA of microbiome where one or few bacteria strains are dominating
Position in read (bp)
Per sequence quality
• All - or at least majority of the sequences should have good average quality (average Phred score across all read bases)
Per sequence quality - RNAseq
Per sequence quality - microbiome
16    17    IB    19    20    21    22    23    24    25    26    27    28    29    30    31    32    33    34    35    36 37 Mean Sequence Quality (Phred Score)
Per sequence quality - targeted sequencing
Quality score distribution over all sequences
Small peaks in lower average quality can suggest low quality ends on part of sequences - attention, if small read diversity (e.g. microbiome), this can be due to highly duplicated reads due to too deep sequencing
2  3  4  5  6  7  3   9 10     12     14     16     18     20     22     24     26     28     30     32     34     36 38 Mean Sequence Quality (Phred Score)
Per sequence and per read GC content
Mean GC content across reads should correspond to the overall GC content of the genome
Evan small shifts can indicate contamination with GC rich sequences (ribosomal RNA with high GC content for instance)
0.5     0.6 0.7 ribosomal RNA fraction
Per sequence GC content - RNAseq
Per sequence GC content - targeted sequencing
		
	GC distributio	i over all sequences GC count per read Theoretical Distribution
	100000	
• This strange theoretical	auuuu	
distribution is due to high		
	60000                                                                                               \ I	
amount of NNNNNN		
sequences in the reads	40000 y	
	ZUUUU /	
0 2 4 6 8 11   15   19   23  27  31   35  39  43   47  51   55  59  63  67   71   75  79  83  87  91   95 99
Mean GC content (%)
Per sequence GC content - targeted sequencing after trimming
GC distribution over all sequences
• The GC count per read is disturbed because of small number of genes sequenced!
0 2 4 6 8 11   15  19  23   27   31   35  39  43  47  51   55  59  63  67  71   75   79  83  87  91 95
Mean GC content (%)
Per sequence GC content - microbiome
GC distribution over all sequences
• The GC count per read is disturbed because of small diversity of sequences
0 2 4 6 8 11   15   19   23   27   31   35   39   43   47   51   55   59   63   67   71   75   79   83   87   91 95
Mean GC content [%)
Per read GC content - good example
GC content across all bases
• The GC count per read is disturbed because of small diversity of sequences
4 5 6 7 8 9 10    12    14     16    18    20    22    24    26    28    30    32    34    36    38 40 Position in read (bp)
Per read GC content - typical RNAseq
90 80 70
• GC content different in first 8-10 bases, due to presence
f m 40
of primers
30 20 10 0
123456789       15-19    25-29    35-39    45-49    55-59    65-69    75-79    85-89 95-99
Position in read (bp)
Per read GC content
-targeted sequencing
GC content across all bases
GC content across different base positions due to high duplication level of reads and small diversity
123456789    15-19      30-34      45-49      60-64      75-79      90-94    105-109 120-124 135-139 150-15!
Position in read (bp)
Per read GC content - microbiome
Position in read (bp)
Per base N content
• In ideal case, there should be minimum of N calls in the reads
• "The HiSeq2000 produces very few Ns. It is very rare to see N content greater than 30%. When Ns are produced it is usually the result of some temporary instrument issue. For example a small bubble in the flow cell may cause focus problems at a certain cycle. Downstream processing of Ns depends on your analysis software and strategy/'
• Source:
https://www.biotech.wisc.edu/services/dnaseq/seque ncing/lllumina_old/lllumina_QC_FAQs
N content across all bases
																																				<	
																																					
																																					
																																					
																																					
																																					
																																					
																																					
																																					
																																					
123456789    15-19      30-34      45-49      60-64      75-79      90-94    105-109 120-124 135-139 150-15!
Position in read (bp)
Per base N content
-targeted sequencing
N content across all bases
Enrichment for N calls at the beginning of the sequence
123456739    15-19     30-34     45-49     60-64     75-79     90-94   105-109 120-124 135-139 150-15!
Position in read (bp)
Sequence duplication level and overrepresented sequences
• Indicates the library complexity and possible contamination
• (the less duplicates, the more complex)
• Too many duplicated sequences means we sequenced "too much".
• Overrepresented sequences may indicate:
• Presence of adapters, presence of contamination (rRNA), PCR problems
• This holds, however, mainly for WGS, WES or RNAseq
Sequence duplication level - good example (RNAseq)
Sequence Duplication Level >- 6.44%
• Most sequences occur only once
i:
					%Duplicate re ative to unique
					
					
					
					
					
					
					
					
					
4 5 6
Sequence Duplication Level
9 10+
Sequence duplication level - bad example (RNAseq)?
• Over-amplification! May \ come from highly expressed ~ \
transcripts. \
20 0 N. /
10 0 /
0 0 I__
1.3456789 10
seauence Ouphcation Lewi
Overrepresented sequences
Overrepresented sequences
Overrepresented sequences
• Indicate remaining adapters, PCR duplicates, but also can be real sequences!
Always judge based on type of data and check before filtering!
Count    Percentage    Possible Source
TGCACACGAGGC GGCTTCTGCAACTTCATGCATTT6AAGC C CATTTC CAG AC CTTGGGAGGGTTCAGA6AAGGCTGCATGTCAC CAAGGCCCATG CTAAC CTCTATCTTC C CTAGTGTGGTAAC CTCATTTC C C CATAAAGATTC AGAAC C C C CTC CTCAGCATCTTATC C GAGTGGAAGGAAATTTGC GTGTGGAGTAT TGAGCAGGAGGGGAAAAGTGCTAATTAC CATGACAAGAAC ATTGTATTAC C GTCTC CTGTTTTGTA6TC CAAC C CTGT6AT6ATT6ATG C CAAA6AAGTG GTCTTCATCTTATTGATAGTTTTGATGGTCTTCTTATC CAACACGCCGAG TGCAAGCTC CTGGTGGCAGCTCT6AAC G6TATTTAAAACAAAAT6AAATG TGCCCTGGCCCTGGGCTTGTGGGGCTGCCCAGCAGCTGCCC ATAAAGGAC CTGC C C C CAGGGAGCACTAA6C 6AG6TAAG CAAGCAGGACAAGAAGCGGT C CAGAT6TTCTTC GCTAATAAC CAC 6AC CAG6AATTTGT6AGTG C TGGG C C CTGTGTTATCTC CTAGGTTGGCTCTGACTGTAC CACCATCCACTACAAC GGCTCGGCCACGCGCTACCACACCTACCTGCCGCCGCCCTACCCCGGCTC TTCTCTTGGAAACTC C CATTTGAGATCATATTCATATTCTCTGAAATC AA TGCTCATGCCCACAGA6ACTTGCACAACATG CA6AATGG C AG C AC ATTGG AAAGGATGGAAAAGAGAA6AAGGCATGGGTGGGAAACTGTG C CTC C CATT TCTC GAGGAGGCAGTGACAGCAATGGCAGTTACTGTCAAC AGGTG6AC AT CTGGGTCTC CTCTCTTTC GTGTCAAAGGACTTCTTTGC CAAGTTCACAGA ACATC CTGTCTTACATC CTGGCAGGTAC G6ATCTAAACAGC GACTTTTTT CCTGCGGACCC GATGC CTCTTC CTGCT6AGATC C C TC C AGTTTTTCC CAG AGTGAGTGCAGTTGTTTAC CAT6ATAAC 6ACACAACAC AAAATAG C C GTA AMGATGGAACTCCACCCTTTGCTTGGTTTTAAGTATGTATGGAATGTTA ACTG6AAGAAATGGATTC CAAAGAG CAGTTCTCTTC CTTTAGTTGT6AAG GAGCTATGAGCTAC GGCCGCCCCCCTCCC 6ATGTG6AGGGTATGAC CTC C AAC C CAC CAATTTTTGGTAGCAGTGGAGAGC TACAGGACAACTGCCAGCA CCCATCCTCAC CATCATCACACTG6AA6ACTC CAGGTCAGGAGCCACTTGCCACCCTGCACAC TGG
110195
8442
7426
7261
7196
6962
6668
6521
6359
6352
6189
6878
6865
6858
5918
5905
5895
5773
5672
5574
5554
5511
5500
5425
5424
5355
5317
6.337201578736401
0.4854907729723917
0.427061653647593
0.4175726726548846
0.4138345892335147
0.39692695037377956
0.3834698500568476
9.3750160306269801
0.36569957656141183
0.36529701373141815
0.3513223326330657
0.3490794825802437
0.3487919377016768
0.3483893748716831
0.3398780464661022
0.33959O5O15875353
0.33961541183040146
0.3319993167933685
0.3261909102463167
0.320555030626405
0.31940485111213734
0.3169319651564618
0.31629936642361456
0.31198619324511073
0.31192868426939735
0.3079605649451738
0.3057752238680652
NO Hit NO Hit No Hit No Hit No Hit No Hit NO Hit No Hit NO Hit No Hit No Hit No Hit No Hit NO Hit NO Hit No Hit NO Hit No Hit No Hit NO Hit No Hit No Hit No Hit No Hit No Hit No Hit No Hit
10.1)
Sequence duplication level - targeted sequencing
Sequence Duplication Level >= 94.87%
							%Duplicate relative to unique		
									
									
									
									
									
									
									
									
									
5 6 Sequence Duplication Level
Some reads are present more than 10 times. This is due to NNNN sequences and due to few genes sequenced (longer genes get more reads)
Overrepresented sequences
TGCACACGAGGCGGCTTCTGCAACTTC ATGCATTTGAAGC C CATTTC CAG AC CTTGGGAGGGTTCAGAGAAGGCTGCATGTCAC CAAGGCCCATGCTAAC CTCTATCTTCCCTAGTGTGGTAAC CTCATTTC C C CATAAAGATTCAGAAC CCCCTCCTCAGCATCTTATCC GAGTGGAAGGAAATTTGC GTGTGGAGTAT TGAGCAGGAGGGGAAAAGTGCTAATTACCATGACAAGAACATTGTATTAC C GTCTC CTGTTTTGTAGTC CAAC C CTGTGATGATTGATGC CAAAGAAGTG GTCTTCATC TTATTGATAGTTTTGATGGTC TTC TTATC CAACAC GC C GAG TGCAAGCTC CTGGTGGCAGCTCTGAAC GGTATTTAAAACAAAATGAAATG TGCCCTGGCCCTGGGCTTGTGGGGCTGCCCAGCAGCTGCC CATAAAGGAC CTGC C C C CAGGGAGCACTAAGC GAGGTAAGCAAGCAGGACAAGAAGC GGT C CAGATGTTCTTC GCTAATAAC CAC GAC CAGGAATTTGTGAGTGCTGGGC CCTGTGTTATCTCCTAGGTTGGCTCTGACTGTACCACCATCCACTACAAC GGCTCGGCCACGCGCTACCACACCTACCTGCCGCCGCCCTACCCCGGCTC TTCTCTTGGAAACTC C CATTTGAGATCATATTCATATTCTCTGAAATCAA TGCTCATGCC CACAGAGACTTGCACAACATGCAGAATGGCAGCACATTGG AAAGGATGGAAAAGAGAAGAAGGCATGGGTGGGAAACTGTGCCTCCCATT TCTCGAGGAGGCAGTGACAGC AATGGC AGTT ACTGTC AACAGGTGGACAT CTGGGTCTC CTCTCTTTC GTGTCAAAGGACTTCTTTGC CAAGTTCACAGA ACATC CTGTCTTACATC CTGGCAGGTAC GGATCTAAACAGC GAC111111 CCTGCGGACCCGATGCCTCTTCCTGCTGAGATCCCTC CAGTTTTTC C CAG AGTGAGTGCAGTTGTTTAC CATGATAAC GACACAACACAAAATAGC C GTA AMGATGGMCTCCACCCTTTGCTTGGTTTTAAGTATGTATGGAATGTTA ACTGGAAGAAATGGATTC CAAAGAGCAGTTCTCTTC CTTTAGTTGTGAAG GAGCTATGAGCTACGGCCGCCCCCCTCCC GATGTGGAGGGTATGAC CTC C AAC C C AC CMTTTTTGGTAGCAGTGGAGAGCTAC AGGAC AACTGC C AGC A CCCATCCTCACCATCATCACAC TGGAAGAC TCCAGGTCAGGAGCCACTTGCCACCCTGCACACTGG
10.1)
Count    Percentage    Possible Source
110195
8442
7426
7261
7196
6962
6666
6521
6359
6352
6109
6676
6065
6656
5910
5965
5895
5773
5672
5574
5554
5511
55ee
5425 5424
5355 5317
6.337201578736401
0.4854907729723917
0.427061653647593
0.4175726726548846
0.4138345892335147
0.39692695037377956
0.3834698500568476
0.3750160306269801
0.36569957656141183
0.36529701373141815
0.3513223326330657
0.3490794825802437
0.3487919377016768
0.3483893748716831
0.3398780464661022
0.3395905015875353
0.33901541183040146
0.3319993167933685
0.3261909102463167
0.320555030626405
0.31940485111213734
0.3169319651564618
0.31629936642361456
0.31198619324511073
0.31192868426939735
0.3079605649451738
0.3057752238680652
NO Hit NO Hit No Hit No Hit No Hit No Hit NO Hit No Hit NO Hit No Hit No Hit No Hit No Hit NO Hit NO Hit No Hit NO Hit No Hit No Hit NO Hit No Hit No Hit No Hit No Hit No Hit No Hit
Quality control exercise
• We continue in our exercise from l_Preprocessing.sh
Step 1: Read quality control and data filtering
Based on the quality measures, we decide to remove low quality bases and reads
• Trimming - removes low quality or unwanted bases from reads, thus shortening them. Is applied to increase the number of mappable reads.
• Filtering - removes whole reads that do not meet quality standards (e.g. too short etc)
Trimming reads
• Read trimming is applied to increase the number of mappable reads by:
• Removing low quality bases at the end of the reads that are likely to contain sequencing errors
• Removing adapter sequences
Ill Removing adapters
• Important mainly for very short read sequences of interest (when the input DNA fragment is less than the read length)
e.g. for miRNA with 22nt length the adapter gets sequenced more often than for RNA sequences, which are much longer
What is the sequence of adapters?
Best option: ask which kit was used for preparing libraries Programs: cutadapt, trimmomatic
TruSeq Universal Adapter: 5
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT 3 TruSeq Indexed Adapter 5
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC NNNNNN ATCTCGTATGCCGTCTTCTGCTTG 3
Here "N" is any nucleotide, and the 6 of them together are a unique sequence which can readily be identified as unique to 1 library.
Ill  Filtering reads
• We can remove whole reads based on:
• quality of its base calls
• its length (too short reads)
• level of duplication
Ill Trimming and filtering - exercise
Trimming and filtering - exercise
• We continue in our exercise from l_Preprocessing.sh
• We will use grep command to find adapter sequences and cutadapt to remove them
• We will trim low quality bases
• Independent work: find specific QC problems in your project data and suggest solutions (what to trim, filter, etc)
Recommended literature
• Fuller et al. 2009: The challenges of sequencing by synthesis http://arep.med.harvard.edu/pdf/Fuller 09.pdf
• https://sequencing.qcfail.com/