Quality control (QC) is a crucial step in processing sequencing data to ensure that the data obtained is accurate and reliable before further analysis. In sequencing experiments, errors can arise due to factors like sequencing machine errors, adapter contamination, or low-quality reads. Therefore, QC aims to identify and remove these artifacts to maintain the integrity of downstream analyses, such as variant calling or transcript quantification.
Typically, QC involves evaluating read quality scores, filtering out low-quality bases, trimming adapters, and identifying potential contamination. Tools like FastQC are commonly used to generate reports on sequence quality metrics such as base quality distribution, GC content, and overrepresented sequences. After QC, the cleaned dataset is more robust, which helps avoid false-positive findings and ensures that the biological conclusions drawn are more accurate.
Download the file female_oral2.fastq-4143.gz from Zenodo. This is a microbiome sample from a snake, referenced in St. Jacques et al. (2021).
wget https://zenodo.org/record/3977236/files/female_oral2.fastq-4143.gz
Rename it:
mv female_oral2.fastq-4143.gz female_oral2.fastq.gz
Inspect the contents of the file:
zcat female_oral2.fastq.gz | head
Questions:
What do the individual rows represent?
What is the Phred quality score (Illumina 1.8+) of the 5th nucleotide of the 1st sequence?
What is the accuracy of this 5th nucleotide?
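The relation between a quality character, its Phred score, and the base-call accuracy can be sketched in a few lines of shell. This is a minimal illustration that assumes the Phred+33 encoding used by Illumina 1.8+; the helper names `phred_score` and `phred_accuracy` are made up for this example, and the character `I` is an arbitrary input, not the answer for the file above.

```shell
# Phred+33: score Q = ASCII code of the quality character - 33
phred_score() {
    local ascii
    # a leading quote makes printf return the ASCII code of the character
    ascii=$(printf '%d' "'$1")
    echo $(( ascii - 33 ))
}

# Error probability P = 10^(-Q/10), so accuracy = 1 - P
phred_accuracy() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 1 - 10^(-q / 10) }'
}

q=$(phred_score 'I')
echo "Q = $q"
echo "accuracy = $(phred_accuracy "$q")"
```

To answer the questions above, replace `I` with the 5th character of the quality line of the first read.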
To assess sequence quality across all reads, we can use FASTQE, an open-source tool that adds a fun twist by displaying quality control results as emojis. It offers a quick, intuitive way to spot potential issues in your raw sequence data before proceeding with further analysis. Documentation can be found at https://github.com/fastqe/fastqe/
fastqe --min --mean --max --long 500 female_oral2.fastq.gz > fastqe_report.html
Question: What is the lowest mean score in this dataset?
The first step in analyzing sequencing data is to assess the quality of the reads using a tool like FastQC. FastQC provides a detailed summary of key quality metrics such as per-base quality scores, GC content, sequence length distribution, and potential adapter contamination. By examining these metrics, you can identify any issues that might affect downstream analysis, such as low-quality bases or overrepresented sequences. Performing this initial quality check helps ensure that the data is clean and reliable before proceeding with further analysis steps like alignment or variant calling.
Run fastqc on your data:
module add fastqc
fastqc female_oral2.fastq.gz
Inspect the resulting html report.
The Basic Statistics section in FastQC provides an overview of key metrics, including the quality encoding, the total number of sequences, the number of sequences flagged as poor quality, the sequence length, and the GC content. This summary gives a quick snapshot of the dataset and helps assess whether it is suitable for further analysis.
Questions:
Which Phred quality encoding version was used?
How many reads does your sample contain?
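The read count reported by FastQC can be cross-checked on the command line. The sketch below assumes that every FASTQ record spans exactly four lines (no wrapped sequences); the function name `count_reads` and the two-read demo file are invented for illustration.

```shell
# Build a tiny gzipped FASTQ file with two made-up reads
demo=$(mktemp)
trap 'rm -f "$demo"' EXIT
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n##II\n' | gzip -c > "$demo"

# Each record is four lines, so number of reads = number of lines / 4
count_reads() {
    gzip -dc < "$1" | awk 'END { print NR / 4 }'
}

count_reads "$demo"
```

Running `count_reads female_oral2.fastq.gz` should give the same number as the "Total Sequences" field of the FastQC report.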
The Per Base Sequence Quality section in FastQC displays the quality scores for each base position across all sequences. It provides a boxplot that shows the distribution of quality scores at each position, highlighting any regions of the reads with consistently low quality. This allows you to quickly identify if there are any issues with specific portions of the reads, such as at the beginning or end, which may require trimming or filtering before further analysis.
Questions:
What is on the x-axis?
How does the mean quality score change along the sequence? Why?
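To see where the numbers behind this plot come from, the mean quality at each read position can be computed directly from the quality lines. The following is a rough sketch, assuming Phred+33 encoding and four-line FASTQ records; `mean_quality_per_position` is a hypothetical helper, and FastQC additionally groups positions into bins for long reads, which is not done here.

```shell
mean_quality_per_position() {
    awk '
    BEGIN {
        # awk has no ord() function, so build a character -> ASCII code table
        for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    }
    NR % 4 == 0 {                        # every 4th line is a quality string
        for (p = 1; p <= length($0); p++) {
            sum[p] += ord[substr($0, p, 1)] - 33
            n[p]++
        }
        if (length($0) > maxlen) maxlen = length($0)
    }
    END {
        # one line per position: position <TAB> mean Phred score
        for (p = 1; p <= maxlen; p++) printf "%d\t%.1f\n", p, sum[p] / n[p]
    }'
}

# Two made-up reads: the second one starts with two low-quality bases
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n##II\n' | mean_quality_per_position
```

On real data the function reads from standard input, for example `zcat female_oral2.fastq.gz | mean_quality_per_position`.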
The Per Tile Sequence Quality section in FastQC identifies potential spatial issues on the sequencing flow cell by analyzing the quality scores across different tiles. It provides a heatmap showing the variation in quality across these tiles, which can help detect localized problems, such as lower-quality sequences from specific regions of the flow cell. This allows for the identification of systematic issues that may not be visible in overall quality scores but could still affect the data quality.
Questions:
The Per Sequence Quality Scores section in FastQC provides an overview of the average quality scores across all sequences. It shows how many reads have particular average quality scores, helping to identify if there are subsets of reads with consistently low quality. Ideally, most reads should have high average quality scores, indicating that the sequencing run produced reliable data. If a large proportion of reads have low scores, this may indicate a problem with the sequencing process or the sample quality.
Questions:
The Per Base Sequence Content section in FastQC displays the proportion of each nucleotide (A, T, G, C) at every base position across all sequences. In a good-quality dataset, the distribution of nucleotides should be relatively stable, with no significant biases, especially after the first few bases. Significant deviations or imbalances in nucleotide representation at specific positions can indicate issues like adapter contamination, amplification bias, or uneven base calling, which may require further investigation or corrective measures such as trimming.
Questions:
What type of experiment do you have?
How does the type of experiment affect this metric?
The Per Sequence GC Content section in FastQC shows the distribution of GC content (the percentage of guanine and cytosine bases) across all sequences. In a normal dataset, the GC content should follow a roughly normal distribution, reflecting the expected GC content for the organism or sample type. Deviations from this expected distribution, such as a bimodal or skewed curve, can indicate contamination, sequencing artifacts, or issues with the library preparation, signaling that further investigation may be needed.
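The per-read GC percentage that underlies this distribution is simple to compute by hand. The sketch below assumes four-line FASTQ records; `gc_per_read` is an illustrative name, and the three reads are made up.

```shell
gc_per_read() {
    awk 'NR % 4 == 2 {                   # every 2nd line of a record is the sequence
        len = length($0)
        gc = gsub(/[GCgc]/, "")          # gsub returns the number of G/C characters replaced
        printf "%.1f\n", (len > 0 ? 100 * gc / len : 0)
    }'
}

# Three made-up reads with 50%, 100% and 0% GC
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nAATT\n+\nIIII\n' | gc_per_read
```

Piping the output through `sort -n | uniq -c` gives a crude text version of the histogram that FastQC plots.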
Questions:
The Per Base N Content section in FastQC displays the percentage of “N” bases (ambiguous bases where the sequencer was unable to determine the correct nucleotide) at each position across all sequences. Ideally, this percentage should be very low or zero throughout the read. A high presence of “N” bases in certain regions may indicate poor sequencing quality, typically toward the end of reads, and may require trimming or additional filtering to improve the data quality for downstream analysis.
The Sequence Length Distribution section in FastQC displays the distribution of read lengths within the dataset. Ideally, for a uniform dataset, the graph should show a narrow peak at the expected read length. A broader or irregular distribution can indicate problems such as variable-length reads, incomplete sequences, or the presence of adapter sequences that haven’t been trimmed. This section is especially important in datasets with variable-length reads, such as those generated by amplicon sequencing or metagenomics.
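A read-length table equivalent to this plot can be produced with a short pipeline. This is a sketch that assumes four-line FASTQ records; `length_distribution` is a hypothetical helper name and the input reads are invented.

```shell
length_distribution() {
    # count how many reads have each length, then sort numerically by length
    awk 'NR % 4 == 2 { count[length($0)]++ }
         END { for (len in count) print len "\t" count[len] }' | sort -n
}

# Three made-up reads: one of length 3, two of length 4
printf '@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nIII\n@r3\nTTGA\n+\nIIII\n' | length_distribution
```

Each output line is a read length followed by the number of reads of that length.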
The Sequence Duplication Levels section in FastQC measures how often individual sequences are duplicated within the dataset. In high-quality datasets, a majority of sequences should be unique, with minimal duplication, unless the sample is highly enriched for specific sequences (e.g., in PCR amplification or targeted sequencing). High levels of duplication can indicate over-sequencing, PCR bias, or contamination. This section helps identify whether the dataset contains an unusual number of duplicates, which could skew downstream analysis and require deduplication or other corrective actions.
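The idea of counting duplicates can be illustrated with standard Unix tools. The sketch below counts exact, full-length duplicate sequences, which is a simplification and not necessarily how FastQC estimates duplication; `duplication_summary` is an illustrative name and the reads are made up.

```shell
duplication_summary() {
    # keep sequence lines, count identical ones, most frequent first
    awk 'NR % 4 == 2' | sort | uniq -c \
        | awk '{ print $1 "\t" $2 }' \
        | sort -k1,1nr -k2,2
}

# Three made-up reads, two of which are identical
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' | duplication_summary
```

Each output line shows how many times a sequence occurs, followed by the sequence itself.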
The Overrepresented Sequences section in FastQC identifies sequences that appear more frequently than expected in the dataset. Ideally, there should be few or no highly overrepresented sequences in a random sequencing library. The presence of such sequences may indicate contamination (e.g., adapter sequences or primers), PCR amplification bias, or other artifacts in the library preparation process. Detecting and addressing overrepresented sequences is important for ensuring the accuracy of downstream analyses, such as alignment or variant calling.
The Adapter Content section in FastQC shows the proportion of adapter sequences present at each position in the reads. Adapters are short sequences added during library preparation, which should normally be trimmed during preprocessing. If adapter content is detected, especially toward the ends of reads, it suggests that the reads are longer than the original fragments, and trimming may be required. High adapter content can affect downstream analysis by introducing bias or errors, so it’s important to identify and remove these sequences to ensure clean, high-quality data.
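A quick manual check for adapter read-through is to count the reads that contain a known adapter sequence. The sketch below looks for exact matches only; the helper name `adapter_fraction` is hypothetical, and the adapter `AGATCGGAAGAG` (the start of the Illumina universal adapter) is an assumption that should be replaced with the adapter of your own library kit.

```shell
adapter_fraction() {
    # $1 = adapter sequence to search for (exact match, no mismatches)
    awk -v adapter="$1" '
        NR % 4 == 2 { total++; if (index($0, adapter) > 0) hits++ }
        END { if (total > 0) printf "%d/%d reads contain the adapter\n", hits, total }'
}

# Two made-up reads; only the first one runs into the adapter
printf '@r1\nACGTAGATCGGAAGAGCA\n+\nIIIIIIIIIIIIIIIIII\n@r2\nACGTACGTACGTACGTAC\n+\nIIIIIIIIIIIIIIIIII\n' \
    | adapter_fraction AGATCGGAAGAG
```

Unlike this sketch, FastQC reports the adapter content per read position, which shows where in the reads the adapter starts to appear.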
Trimming and filtering are essential steps in preprocessing sequencing data to improve its overall quality and ensure reliable downstream analysis. Trimming involves removing unwanted portions of the reads, such as low-quality bases at the ends, adapter sequences, or regions with ambiguous “N” bases. By trimming low-quality regions, you prevent errors and biases that could affect tasks like alignment, variant calling, or quantification. Tools like Trimmomatic and Cutadapt are commonly used for this purpose, allowing you to set thresholds for trimming based on base quality scores or sequence length.
Filtering, on the other hand, focuses on removing entire reads that do not meet specific quality criteria. This might include discarding reads that are too short, have too many ambiguous bases, or fall below a certain average quality threshold. Filtering ensures that only high-quality, informative reads are retained for analysis, reducing noise and improving the accuracy of subsequent analyses. Together, trimming and filtering form a critical part of data cleanup, helping to maximize the reliability and integrity of sequencing data.
Trimming:
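A minimal sketch of quality trimming at the 3' end is shown here to make the idea concrete. It removes trailing bases one by one while their Phred score is below a threshold, assuming Phred+33 encoding and four-line records. This is deliberately simpler than the sliding-window approaches used by dedicated trimmers, and `trim_tail` is a made-up helper name.

```shell
trim_tail() {
    # $1 = minimum Phred score; trailing bases below it are removed
    awk -v minq="$1" '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 3 { plus = $0 }
    NR % 4 == 0 {
        keep = length($0)
        # walk back from the 3-prime end while the base quality is too low
        while (keep > 0 && ord[substr($0, keep, 1)] - 33 < minq) keep--
        print header
        print substr(seq, 1, keep)
        print plus
        print substr($0, 1, keep)       # quality string trimmed to the same length
    }'
}

# One made-up read whose last two bases have Phred score 2
printf '@r1\nACGTACGT\n+\nIIIIII##\n' | trim_tail 20
```

A read that consists only of low-quality bases ends up empty, which is why trimming is normally followed by a length filter.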
Filtering:
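Read filtering can be sketched in the same way: a read is kept only if it is long enough and its mean quality is high enough. The example assumes Phred+33 encoding and four-line records; `filter_reads` and its two thresholds are illustrative and do not reproduce the exact rules of any particular tool.

```shell
filter_reads() {
    # $1 = minimum read length, $2 = minimum mean Phred score
    awk -v minlen="$1" -v minq="$2" '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 3 { plus = $0 }
    NR % 4 == 0 {
        len = length($0); sum = 0
        for (p = 1; p <= len; p++) sum += ord[substr($0, p, 1)] - 33
        # keep the whole record only if both criteria are met
        if (len > 0 && len >= minlen && sum / len >= minq) {
            print header; print seq; print plus; print $0
        }
    }'
}

# Three made-up reads: one good, one too short, one with low quality
printf '@long_good\nACGTAC\n+\nIIIIII\n@short\nACG\n+\nIII\n@long_bad\nACGTAC\n+\n######\n' \
    | filter_reads 5 20
```

Only the record `@long_good` passes both the length and the quality criterion.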
Fastp is a fast, all-in-one tool for preprocessing high-throughput sequencing data, designed for quality control, trimming, and filtering. It offers features like adapter trimming, quality filtering, and base correction, all while providing detailed quality control reports. Fastp is highly efficient, using multithreading to handle large datasets quickly, making it a popular choice for researchers looking to clean and improve their sequencing data before analysis. Its simplicity and speed, combined with comprehensive functionality, make fastp an excellent tool for data preprocessing. Documentation is available at https://github.com/OpenGene/fastp
Question: Based on the quality report from FastQC, what should be trimmed and filtered?
Let's run fastp on our data.
module add fastp
fastp -i female_oral2.fastq.gz -o female_oral2_clean.fastq.gz --detect_adapter_for_pe --cut_tail -M 20 -e 30 -l 150
Questions:
What do all of the additional parameters mean?
How many reads were filtered out?
Paired-end files are generated in paired-end sequencing, where DNA fragments are sequenced from both ends, resulting in two sets of reads for each fragment: one from the forward direction and one from the reverse direction. These files are typically stored in two separate FASTQ files, one for each read direction. Paired-end sequencing provides more accurate alignment and greater coverage of the DNA fragment, especially for detecting structural variants or resolving repetitive regions in genomes. Proper handling of paired-end files during preprocessing and analysis is essential for maintaining the integrity of the paired read relationships.
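One way to check that the pairing is intact is to compare the read names of the two files in order. The sketch below assumes uncompressed four-line FASTQ files whose read names end in `/1` and `/2` or carry the mate number after a space; `read_names` and `check_pairs` are hypothetical helpers, and the comparison holds all names in memory, so it is only meant for small files.

```shell
# Print read names without the leading @, any comment after a space, and a trailing /1 or /2
read_names() {
    awk 'NR % 4 == 1 { sub(/^@/, ""); sub(/[ \t].*$/, ""); sub(/\/[12]$/, ""); print }' "$1"
}

check_pairs() {
    # $1 = forward reads, $2 = reverse reads
    if [ "$(read_names "$1")" = "$(read_names "$2")" ]; then
        echo "paired files are in sync"
    else
        echo "paired files are OUT OF SYNC"
        return 1
    fi
}

# Two tiny made-up files with matching read names
r1=$(mktemp); r2=$(mktemp)
trap 'rm -f "$r1" "$r2"' EXIT
printf '@frag1/1\nACGT\n+\nIIII\n@frag2/1\nGGCC\n+\nIIII\n' > "$r1"
printf '@frag1/2\nTTAA\n+\nIIII\n@frag2/2\nCCGG\n+\nIIII\n' > "$r2"
check_pairs "$r1" "$r2"
```

Tools that process both files together, such as fastp with `-i`/`-I` and `-o`/`-O`, keep the two outputs in sync, whereas filtering each file separately can break the pairing.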
Download the example paired-end dataset provided by Grüning et al. (2016).
wget https://zenodo.org/record/61771/files/GSM461178_untreat_paired_subset_1.fastq
wget https://zenodo.org/record/61771/files/GSM461178_untreat_paired_subset_2.fastq
Task 1: Use FastQC to assess the quality of both files.
Task 2: Use fastp to perform quality control.
MultiQC is a versatile tool that aggregates and visualizes results from multiple quality control tools, such as FastQC, Trimmomatic, and fastp, into a single comprehensive report. It simplifies the process of reviewing QC data by compiling all metrics and summaries in one place, allowing for easier comparison across multiple samples. MultiQC is especially useful in large-scale projects where handling QC outputs from numerous samples would otherwise be time-consuming. Its intuitive visualizations make it easier to spot trends or issues that might require further attention.
module add conda-modules
conda activate multiqc_v1.12_py3.7
multiqc .
Task 1: Inspect the MultiQC report.
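#!/bin/bash
#PBS -l walltime=1:0:0
#PBS -l select=1:ncpus=2:mem=4gb:scratch_local=1gb
#PBS -N quality_control
# create folder for this example:
mkdir -p qc_example
cd qc_example || exit 1
# download data for single-end
wget https://zenodo.org/record/3977236/files/female_oral2.fastq-4143.gz
# rename file and inspect
mv female_oral2.fastq-4143.gz female_oral2.fastq.gz
zcat female_oral2.fastq.gz | head
# download data for paired-end
wget https://zenodo.org/record/61771/files/GSM461178_untreat_paired_subset_1.fastq
wget https://zenodo.org/record/61771/files/GSM461178_untreat_paired_subset_2.fastq
# processing of single file
module add fastqc
module add fastp
fastqc female_oral2.fastq.gz
# --html and --json expect a file name; separate names keep the second fastp run from overwriting these reports
fastp -i female_oral2.fastq.gz -o female_oral2_clean.fastq.gz \
--html female_oral2_fastp.html --json female_oral2_fastp.json \
--detect_adapter_for_pe --cut_tail -M 20 -e 30 -l 150
# processing of paired-end file
fastqc GSM461178_untreat_paired_subset_1.fastq GSM461178_untreat_paired_subset_2.fastq
# write the cleaned reads to new files so that the input files are not overwritten
# note: -l 150 discards every read shorter than 150 bp; lower it if the reads in this dataset are shorter
fastp \
-i GSM461178_untreat_paired_subset_1.fastq \
-I GSM461178_untreat_paired_subset_2.fastq \
-o GSM461178_untreat_paired_subset_1_clean.fastq \
-O GSM461178_untreat_paired_subset_2_clean.fastq \
--html GSM461178_fastp.html --json GSM461178_fastp.json \
--detect_adapter_for_pe \
--cut_tail -M 20 -e 30 -l 150
# run multiqc
module add conda-modules
conda activate multiqc_v1.12_py3.7
multiqc .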
Task: Edit this bash script to also organize the data.
Question: When should you use a batch job, and when an interactive job?