DSIB01 Autumn 2021
04 Alignment
Mgr. Eliška Chalupová
375973@mail.muni.cz
Practicals overview
● STAR
● Repetitive elements removal
● Reads alignment
● Deduplication
● SAM ﬁle format & samtools
Preliminary quality check
● Not necessary if you did one at the end of the last practicals
● Use loops when running the same command for multiple ﬁles
for file in /home/user/encode/*.fastq
do
fastqc -o /home/user/encode/fastqc/01 $file
Done
● Use multiQC to compare multiple results
conda install -c bioconda multiqc
multiqc /home/user/fastqc/01 -o /home/user/multiqc/01
Technical tips
● Environment variables
○ There are no spaces when deﬁning environment variables
○ Use ‘$’ sign to reference deﬁned variables
○ You can manipulate them through ${READ1}
○ More information e.g. here
READ1=ENCFF708YAL echo $READ1 echo ${READ1#ENCFF}
READ2=ENCFF959XKN echo $READ2 echo ${READ2#ENCFF}
OUT_DIR=/home/user/output
● When specifying an output directory, ﬁrst make sure it exists
○ Use option -p to create multiple nested directories at the same time
mkdir -p /home/user/output/star/repeats
Repositioning UMIs
● UMIs will be explained later in this practicals, for now we just need to do one step before alignment
● Check the length of UMIs in your fastq ﬁles and adjust the length (l=10) parameter accordingly
○ Use the head command head -n 4 /home/user/output/cutadapt/round2/$READ1.adapterTrim.fastq
● Run the following command for both fastq ﬁles to reposition UMIs
awk -v l=10 'BEGIN{OFS=FS=" "} substr($1, 1, 1) == "@" {print "@" substr($1, (l+3), 500) "_"
substr($1, 2, l) }; substr($1, 1, 1) != "@" {print}; '
/home/user/output/cutadapt/round2/$READ1.adapterTrim.fastq >
/home/user/output/umis/$READ1.adapterTrim.umi.fastq
Alignment
● There are multiple alignment tools available
● Each tool has many parameters with many options
● The choice of the tool and parameters is crucial
● We have to understand our data to make the right choices
● We have to understand the tool and its options
● E.g., if we do not allow any mismatches, it is not possible to detect SNPs. If we
allow too many mismatches, we get too many false SNPs and wrong alignments.
STAR
● Installation
conda activate Environment
conda install -c bioconda star
● Manual
https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf
● Original paper at https://pubmed.ncbi.nlm.nih.gov/23104886/
STAR - Indexing
● First generate genome index, unless available, by running STAR in runMode ‘genomeGenerate’
● Genome ﬁles comprise binary genome sequence, suﬃx arrays, text chromosome names/lengths,
splice junctions coordinates, and transcripts/genes information. Most of these ﬁles use internal
STAR format and are not intended to be utilized by the end user.
● --sjdbGTFﬁle - Optional, STAR will extract splice junctions from the GTF ﬁle and use them to
improve accuracy of the mapping.
● --genomeSAindexNbases - For small genomes, this parameter must be scaled down, with a
typical value of min(14, log2(GenomeLength)/2 - 1). For example, for 1 megaBase genome, this is
equal to 9, for 100 kiloBase genome, this is equal to 7.
● --genomeChrBinNbits - If you are using a genome with a large number of references (>5,000
chromosomes/scaﬀolds), you may need to lower this parameter to reduce RAM consumption.
For example, for 3 gigaBase genome with 100,000 chromosomes/scaﬀolds, this is equal to 15.
STAR - Alignment
● STAR (Spliced Transcripts Alignment to a Reference) is an aligner designed to speciﬁcally
address many of the challenges of RNA data mapping by accounting for spliced alignments
● Outperforms other aligners in mapping speed, but it is memory intensive
● The algorithm achieves this highly eﬃcient mapping by performing a two-step process: 1) Seed
searching, and 2) Clustering, stitching, and scoring
Images taken from: https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/03_alignment.html
STAR - Alignment
● Note that “STAR’s default parameters are optimized for mammalian genomes. Other
species may require signiﬁcant modiﬁcations of some parameters; in particular, the
maximum and minimum intron sizes have to be reduced for organisms with smaller
introns”.
● --outFilterMultimapNmax - Default ﬁltering allows maximum of 10 multiple alignments
for a read. If it is exceeded, no alignment is outputted.
● The logs and other output ﬁles are created by STAR at the current working directory by
default - make sure to be at the right place cd /home/user/output/star/repeats or use option
--outFileNamePreﬁx /home/user/output/star/repeats/
STAR - ENCODE options
● --outFilterType BySJout - reduces the number of “spurious” junctions
● --outFilterMultimapNmax 20 - max number of multiple alignments allowed for a read: if exceeded, the read is
considered unmapped
● --alignSJoverhangMin 8 - minimum overhang for unannotated junctions
● --alignSJDBoverhangMin 1 - minimum overhang for annotated junctions
● --outFilterMismatchNmax 999 - maximum number of mismatches per pair, large number switches oﬀ this ﬁlter
● --outFilterMismatchNoverReadLmax 0.04 - max number of mismatches per pair relative to read length: for
2x100b, max number of mismatches is 0.04*200=8 for the paired read
● --alignIntronMin 20 - minimum intron length
● --outSAMunmapped Within - output unmapped reads within the main SAM ﬁle
● --alignEndsType EndToEnd - In eCLIP the cross linking position should be at the beginning of the second read. If
we would enable soft-clipping, we would add potential bases with low quality at the end of our second reads that
would blur our cross linking position.
Repetitive elements - RepBase
● “A substantial portion of eukaryotic genomes is composed of multiple DNA copies referred to as
“repetitive DNA”, which can be divided into two major groups” - tandem repeats and transposable
(selﬁsh) elements
● Over 40% of the human genome is still composed of recognizable interspersed repeats of which some are
over 200 million years old
Read more at Repbase Update, a database of eukaryotic repetitive elements
● Recommendation from eCLIP-seq Processing Pipeline - “Removing repetitive elements helps control for
spurious artifacts from rRNA (and other) repetitive reads”
● Case against ﬁltering out the repetitive elements - “By focusing on only a fraction of the genome, only a
fraction of discoveries can be made.”
Repetitive elements
1. Generate index - apply the option for small genomes
STAR \
--runMode genomeGenerate \
--genomeSAindexNbases 5 \
--runThreadN 2 \
--genomeDir /home/user/ref/repeats \
--genomeFastaFiles /home/user/ref/repeats/RepBase_hs_shared_11272018.fasta
Repetitive elements
2. Align the reads
STAR --runThreadN 2 \
--genomeDir /home/user/ref/repeats \
--readFilesIn home/user/output/cutadapt/round2/ ${READ1%.fastq}.adapterTrim.fastq \
home/user/output/cutadapt/round2/ ${READ1%.fastq}.adapterTrim.fastq \ \
-outSAMunmapped Within \
--outSAMattributes All \
--outStd BAM_Unsorted \
--outSAMtype BAM SortedByCoordinate \
--outFilterType BySJout \
--outReadsUnmapped Fastx \
--outFileNamePrefix /home/user/output/star/repeats/ \
--alignEndsType EndToEnd
Repetitive elements
● Reads corresponding to the repetitive elements got aligned
● However, we are interested in those, that did not align
● Further on, we will be working with the unmapped ﬁles Unmapped.out.mate1 and
Unmapped.out.mate2 - those are the reads with the repetitive elements removed
● We can give them more meaningful names, e.g.
cd /home/user/output/star/repeats/
mv Unmapped.out.mate1 $READ1.rm_rep.fastq
mv Unmapped.out.mate2 $READ2.rm_rep.fastq
Reads alignment
● We will align against the newest human genome assembly - Hg38
○ We will use only chromosome 1 for the purposes of this practicals
○ Using only chromosome 1 we need to lower the parameter --genomeSAindexNbases to 12
○ You can get all the genome ﬁles e.g. at USCS https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
● First, we need to index the genome again
○ For the purposes of the practicals, the prepared indexed ﬁles were sent to your email
STAR --runMode genomeGenerate \
--runThreadN 2 \
--genomeSAindexNbases 12 \
--genomeDir /home/user/ref/chr1 \
--genomeFastaFiles /home/user/ref/chr1/chr1.fa
Reads alignment
STAR --runThreadN 2 \
--genomeDir /home/user/ref/chr1 \
--readFilesIn /home/user/output/star/repeats/$READ1.rm_rep.fastq \
/home/user/output/star/repeats/$READ2.rm_rep.fastq \
--outSAMunmapped Within \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 99 \
--outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 \
--outSAMattributes All \
--outSAMtype BAM SortedByCoordinate \
--outFilterType BySJout \
--outReadsUnmapped Fastx \
--outFileNamePrefix /home/user/output/star/chr1/ \
--alignEndsType EndToEnd
STAR - Output
● Aligned.out.sam / Aligned.out.bam / Aligned.sortedByCoord.out.bam - alignments in standard
SAM/BAM format
● Log.out - main log ﬁle with a lot of detailed information about the run
● Log.progress.out - reports job progress statistics, such as the number of processed reads
● Log.ﬁnal.out - summary mapping statistics, useful for quality control
● SJ.out.tab - contains high conﬁdence collapsed splice junctions in tab-delimited format
● Unmapped.out.mate1 & Unmapped.out.mate1 - unmapped reads in original fastq format (thanks
to option --outReadsUnmapped set to Fastx)
Images taken from: https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/03_alignment.html
Deduplication
● PCR duplicates are reads that are made from the same original cDNA molecule
via PCR.
● A common practice to eliminate PCR duplicates is to remove all but one read of
identical sequences.
● For example, a large number of PCR duplicates containing an
ampliﬁcation-induced error may cause a variant calling algorithm to misidentify
the error as a true variant.
● However, several studies have shown that retaining PCR- and Illumina clustering
duplicates does not cause signiﬁcant artifacts as long as the library complexity is
suﬃcient (e.g. here).
● PCR duplicates are thus mostly a problem for very low input or for extremely
deep RNA-sequencing projects.
Deduplication - UMIs
● UMIs (Unique Molecular Identiﬁers) should be used to prevent the removal of
natural duplicates.
● UMIs, or molecular barcodes, are short sequences used to uniquely tag each
molecule in a sample library.
● UMIs are added before PCR ampliﬁcation, and can be used to reduce errors and
quantitative bias introduced by the ampliﬁcation.
Deduplication - UMI-tools
● UMI-tools contains tools for dealing with Unique Molecular Identiﬁers (UMIs)/Random
Molecular Tags (RMTs) and single cell RNA-Seq cell barcodes.
● Installation
conda create -n umi python=3.7 # UMI-tools does not work with the latest python version
conda activate umi
conda install -c bioconda -c conda-forge umi_tools
● Usage
--dedup - Use this when you want to remove the PCR duplicates
--group - This is useful when you want to manually interrogate the PCR duplicates or perform
bespoke downstream processing such as generating consensus sequences
--count - Use this when you want to obtain a matrix with unique molecules per gene, per cell, for
scRNA-Seq
Deduplication
● UMI-tools require the input BAM ﬁle to be indexed
● To do that, we will use samtools
samtools index /home/user/output/star/hg38_chr1/Aligned.sortedByCoord.out.bam
● Now we do the deduplication
umi_tools dedup --stdin=/home/user/output/star/hg38_chr1/Aligned.sortedByCoord.out.bam \
--log=/home/user/output/dedup/chr1_dedup.log \
> /home/user/output/dedup/chr1_dedup.bam
● We can check the results of the deduplication in the log ﬁle
● The deduplicated ﬁle (chr1_dedup.bam) will be used for the following steps
Sequence Alignment Map (SAM)
● It is a TAB-delimited text format consisting of a header and an alignment section
● The alignment section contains the information for each sequence about
where/how it aligns to the reference genome
● Each alignment line has 11 mandatory ﬁelds for essential alignment information
such as mapping position, and variable number of optional ﬁelds for ﬂexible or
aligner speciﬁc information.
● BAM - binary version, compressed, not human-readable, required by some tools
for downstream analysis
● See more information at https://genome.sph.umich.edu/wiki/SAM or
http://samtools.github.io/hts-specs/SAMv1.pdf
Sequence Alignment Map (SAM)
● Each alignment has:
○ query name - used to group/identify alignments that are together, like paired alignments or a
read that appears in multiple alignments
○ a bitwise set of information describing the alignment, FLAG. Provides the following
information:
■ are there multiple fragments?
■ are all fragments properly aligned?
■ is this fragment unmapped?
■ is the next fragment unmapped?
■ is this query the reverse strand?
■ is the next fragment the reverse strand?
■ is this the 1st fragment?
■ is this the last fragment?
■ is this a secondary alignment?
■ did this read fail quality controls?
■ is this read a PCR or optical duplicate?
Samtools
● Set of utilities that manipulate alignments in the SAM, BAM, and CRAM formats
● Converts between the formats, does sorting, merging and indexing, and can retrieve
reads in any regions swiftly
● Documentation http://www.htslib.org/doc/samtools.html
● Installation conda install -c bioconda samtools
● index - index a sorted SAM or BAM ﬁle for fast random access
● ﬂagstat - calculates statistics based primarily on the bit ﬂags (see the ﬂags description here)
● view - with no options or regions speciﬁed, prints all alignments in the speciﬁed input
alignment ﬁle to the stdout in the SAM format. Use of region speciﬁcations requires a
coordinate-sorted and indexed input ﬁle.
Finalization
● We can look how at the ﬁrst alignment as an example of SAM format
samtools view /home/user/output/dedup/chr1_dedup.bam | head -n 1
● Index the deduplicated ﬁle
samtools index /home/user/output/dedup/chr1_dedup.bam
● Calculate statistics
samtools flagstat /home/user/output/dedup/chr1_dedup.bam
● You can save them to the ﬁle using ‘>’ sign
samtools flagstat /home/user/output/dedup/chr1_dedup.bam \
> /home/user/output/dedup/chr1_dedup.bam.flagstat
Project task
1. Map both read ﬁles to the repetitive elements
2. Use the unmapped ﬁles to map them to the chromosome 1 of human genome
3. Perform deduplication of the mapped reads
4. Perform quality check of aligned deduplicated bam ﬁle
5. Get statistics about the deduplicated ﬁle using samtools
● Mark and discuss all the results in your project report
● Push the Alignment.sh script to Your GitHub repository