Mapping is the process of aligning sequencing reads to a reference genome or transcriptome to identify where each read originated. This step is essential for understanding the genomic or transcriptomic context of the sequences, allowing researchers to detect genes, identify variants, and measure expression levels. Mapping tools, such as BWA, HISAT2, or STAR, use sophisticated algorithms to efficiently align millions of reads, even allowing for mismatches or small gaps due to sequencing errors or biological variation. High-quality mapping results are crucial for downstream analyses like variant calling and differential expression analysis.
Download subsampled files from Zenodo. This is RNA-Seq data. Full dataset is available under GSE18508
wget https://zenodo.org/record/6457007/files/GSM461177_1_subsampled.fastqsanger
wget https://zenodo.org/record/6457007/files/GSM461177_2_subsampled.fastqsanger
wget https://zenodo.org/record/6457007/files/GSM461180_1_subsampled.fastqsanger
wget https://zenodo.org/record/6457007/files/GSM461180_2_subsampled.fastqsanger
Or get 7 of the original datasets:
wget https://zenodo.org/record/6457007/files/GSM461177_1.fastqsanger
wget https://zenodo.org/record/6457007/files/GSM461177_2.fastqsanger
Inspect the contents of the file:
cat GSM461177_1_subsampled.fastqsanger | head
Question: What can you tell about the experiment?
Use previous exercises to assess the quality of the samples and preprocess them accordingly.
module add fastqc
mkdir fastqc
fastqc -o fastqc *.fastqsanger
module add conda-modules
conda activate multiqc_v1.12_py3.7
multiqc .
module add fastp
for sample in GSM461177 GSM461180; do
fastp \
-i ${sample}_1_subsampled.fastqsanger \
-I ${sample}_2_subsampled.fastqsanger \
-o ${sample}_clean_1_subsampled.fastqsanger \
-O ${sample}_clean_2_subsampled.fastqsanger \
--detect_adapter_for_pe \
--cut_tail -M 20 -e 30 -l 25 \
--html ${sample}.fastp.html \
--json ${sample}.fastp.json
done
multiqc .
Questions:
How would you describe quality of data?
Were you able to increase quality of data? Which parameters would you like to tune?
STAR (Spliced Transcripts Alignment to a Reference) is a highly efficient and widely used tool for mapping RNA sequencing reads to a reference genome. Designed to handle the unique challenges of RNA-seq data, STAR can align reads across exon-exon splice junctions, which is crucial for accurately mapping transcripts in eukaryotic genomes. STAR achieves high speed and precision by using a two-pass alignment mode, which helps improve the detection of novel splice junctions. With its ability to handle large datasets and produce accurate results, STAR is an excellent choice for transcriptome analysis and is commonly used in gene expression and differential expression studies.
module add star
# generate index file
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir STAR --genomeFastaFiles GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna --sjdbGTFfile genomic.gtf
# run mapping to the indexed reference
for sample in GSM461177 GSM461180; do
STAR \
--outSAMattributes All \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--runThreadN 4 \
--sjdbGTFfile genomic.gtf \
--outReadsUnmapped Fastx \
--outMultimapperOrder Random \
--outWigType wiggle \
--genomeDir STAR \
--readFilesIn ${sample}_clean_1_subsampled.fastqsanger ${sample}_clean_2_subsampled.fastqsanger \
--outFileNamePrefix ${sample}
done
multiqc .
Questions:
Why there is a need to create index first?
What the parameter outMultimapperOrder mean?
Mapping quality assessment is an important step to evaluate the reliability of the read alignments generated during mapping. High mapping quality scores indicate that reads have been accurately aligned to unique locations in the reference genome, while low scores suggest ambiguous or potentially incorrect alignments, often in repetitive regions. Assessing mapping quality involves reviewing metrics such as the percentage of mapped reads, the distribution of mapping quality scores, and the number of uniquely mapped reads. Tools like SAMtools, Qualimap, and Picard provide detailed reports on mapping quality, helping ensure that only high-confidence alignments are used in downstream analyses like variant calling and gene expression quantification.
module add samtools
for sample in GSM461177 GSM461180; do
samtools index ${sample}Aligned.sortedByCoord.out.bam
samtools stats ${sample}Aligned.sortedByCoord.out.bam > ${sample}.stats
samtools flagstat ${sample}Aligned.sortedByCoord.out.bam > ${sample}.flagstat
samtools idxstats ${sample}Aligned.sortedByCoord.out.bam > ${sample}.idxstats
done
multiqc .
module add qualimap
for sample in GSM461177 GSM461180; do
qualimap rnaseq \
-bam ${sample}Aligned.sortedByCoord.out.bam \
-gtf genomic.gtf \
-outdir ${sample}_qualimap
done
multiqc .
Questions:
Which sample have better quality of mapping?
How does quality of raw reads and mapping quality correlate?
What portion of reads mapped ambiguously?