E5444 Analysis of sequencing data Feature quantification Autumn 2023 Eva Budinska The NGS analysis pipeline Step 4: Feature detection (quantification) Once we finish the alignment, we can continue with the quantification of the features Important – there are important differences within this and previous steps (alignment) in case of targeted gene sequencing experiments or metagenomics. These will be discussed more in detail later in separate (focused) lectures. Step 4: Feature detection (quantification) • Creates the final table with read counts for further statistical analyses • A feature of interest differs based on the experiment: • gene, exon, intron… (WGS, WES) • transcript, isoform (RNA-seq) • variant - SNP, insertion, deletion, CNV - (WGS, WES, targeted sequencing) • promotor sequence (ChIP-Seq) • In transcriptomics NGS experiments, the emphasis is on quantification of known transcripts (unless the aim is to get new isoforms) – we quantify the abundanceof the RNA. • In genomic NGS experiments, the emphasis is more on the detection of structural changes (the quantification is the % of alternative alleles found). Step 4: Feature detection (quantification) • Creates the final table with read counts for further statistical analyses • The final output of this step is always a matrix with: • Information about the feature (ID, name, variant…) - annotation • Quantification of this feature in each of the samples Featureannotation • Gathering all the information about the feature • Based on the feature type, we are using different information in the annotation files • RNAseq – ID and name of the gene/transcript, positionon the chromosome… • Variants - specific format.vcf - includes referenceallele, variant, annotation, …. • Meteagenomics– taxonomical assignment of the OTUs (ASVs) • We are using GTF/GFF files (remember from the previous alignment lecture) • List of genes,transcripts, exons, introns, CDS, … • We are using known databases for feature annotation • Important! Feature annotation can change using different versions of dbs! GTF/GFFfiles • General feature format (.gff) or Gene transfer format (.gtf) Extensions: • Gtf/.gff2/.gff3 • GTF files can describe a varietyof genomicfeatures,such as genes, transcripts,exons,introns,and more.Each feature is represented as a separate line in the GTF file, with the relevant attributes specified. • We need information about what and where it is located in a reference genome/transcriptome • This information is usually stored in an annotation file • Contains information about each feature and its location • They are very similar but differ in the “strictness” and/or syntax • Is often used for gene expression counting, localization of genes, etc. • There is also .gff1 but it is very rare GTF/GFFfiles • General feature format (.gff) or Gene transfer format (.gtf) Extensions: • gtf/.gff2/.gff3 • Structure: A GTF file is a tab-delimited text file with various columns that provide information about genomic features. The minimum required columns typically include the following: • Chromosome/contig name • Source (the program or database that generated the annotation) • Feature type (e.g., gene, transcript, exon, etc.) • Start position (the beginning of the feature) • End position (the end of the feature) • Strand (either "+" for the positive strand or "-" for the negative strand) • Attributes (a set of key-value pairs describing additional information about the feature, including gene and transcript identifiers) Positivevs negative strand • There are several conventions for labeling the two strands of a piece of DNA depending on the frame of reference. • In the human genome, the National Center for Biotechnology Information (NCBI) website uses the term "positive strand" to refer specifically to the strand whose 5' end begins at the end of the p arm. Basepair numbering starts at the 5' end of this strand. https://researchguides.library.vanderbilt.edu/c.php?g=69346&p=816436 GTF/GFFfiles- example • General feature format (.gff) or Gene transfer format (.gtf) Extensions: • gtf/.gff2/.gff3 • The files are usually TAB delimited and 1 numbering based • .gff2–Sanger Institute http://www.sanger.ac.uk/resources/software/gff/spec.html#t_2 • .gtf – Modification of .gff2, sometimes called gff2.5 http://mblab.wustl.edu/GTF22.html • .gff3 – Sequence Ontology Project http://www.sequenceontology.org/gff3.shtml GTF/GFFfiles– comparisonof differentformats • General feature format (.gff) or Gene transfer format (.gtf) Extensions: • gtf/.gff2/.gff3 .gff2 .gtf .gff3 GTF/GFFfiles– comparisonof differentformats • General feature format (.gff) or Gene transfer format (.gtf) Extensions: • gtf/.gff2/.gff3 • GFF3 is preferred to GFF2, its newer and comprises hierarchical structure GTF/GFFfiles– the hierarchy • Suppose you havea gene, which is a parent feature. • This gene may have multiple child features, such as exons and introns. • Exons can further contain subfeatures,like codingsequences (CDS) and untranslated regions (UTRs). • In GFF3, this hierarchical relationshipis represented usingthe "Parent" and "ID" attributes in the attributes column. This hierarchical structure is useful for accurately representing the organization of genes and their constituent elements in a genome, which is crucial for various bioinformatics analyses, including gene prediction, functional annotation, and visualization of genomic data. Where to get GTF files !IMPORTANT - GTF files downloaded from the UCSC Table Browser have the same entries for gene id and transcript id. (not suitable for analyses of different isoforms) To get the correct formatting, select in the output format field “all fields from selected table“ and apply UCSC tool genePredToGtf on the resulting file. Downloadingfrom UCSC Genome Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables) BED files • Could be considered as another type of annotation file but is much simpler • Most often used for visualization in genomic browsers (UCSC Genome Browser: https://genome.ucsc.edu/, IGV https://www.broadinstitute.org/igv/, Tablet https://ics.hutton.ac.uk/tablet/, …) • Very often as a results of ChIP-Seq or similar experiments • Used for targeted experiments to import target regions IMPORTANT – in comparison with .gft/.gff2/.gff3 is zero-based • Has only three mandatory columns – chromosome, start, end • Everything else is optional (gtf/.gff2/.gff3 has 8/9 mandatory columns) • Browser Extensible Data (.bed) BED file example • Browser Extensible Data (.bed) https://genome.ucsc.edu/FAQ/FAQformat.html#format1 VCF/BCF files • VariantCall Format (.vcf) • Binary Call Format (.bcf) is compressed version of .vcf • Used in targeted sequencing while calling SNPs/Indels/… • Contains information about “genetic variations” • Officially: It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome. The format also has the ability to contain genotype information on samples for each position. • Defined format but again not every tool provide all “necessary” fields in the output • Several versions, current at 4.3 • http://www.1000genomes.org/node/101 • http://samtools.github.io/hts-specs/VCFv4.3.pdf Feature quantification • Counting the numbers of reads that aligned to the feature • The concept seems simple – lets have a look where the read mapped and then count all the reads within the feature • For gene-level quantifications, 2 possibilities: 1. Directly count reads overlapping with the gene loci 2. In the case of transcriptomics, we can quantify on the level of transcripts and then aggregatethe values per gene • But, what about multimapped reads – this is a problem especially for isoforms in transcriptome sequencing! Feature quantification – the mostusedSW • GenomicRanges, IRanges- packages developed by core team of the Bioconductor project (R language) - include functions for counting reads that overlap genomic features • HTSeq-count - function of the HT-Seq Python framework for processing RNA-seq or DNAseq data • BEDTools - a popular tool for finding overlaps between genomic features that can be used to count overlaps between reads and features, in C++, much faster, but not specifically designed for RNA-seq data, so can count reads for exons or interval features only, similar to countOverlaps. • featureCounts – optimized read count program, fast and flexible, used to quantify reads generated from either RNA or DNA sequencing technologies in terms of any type of genomic feature • available either as a Unix command or as a function in the R package Rsubread (the core coded in C) Usually count uniquely mapped reads and relies on counting scheme Available in different languages (R, Python, C++, …) Important / The SW (usually) evolves over time and gain more functions and get faster Feature quantification – selectingthe SW • When counting reads, make sure you know how the program handles the following: • overlap size (full read vs. partial overlap); • multi-mapping reads, i.e. reads with multiple hits in the genome; • reads overlapping multiple genomic features of the same kind; • reads overlapping introns • The gene quantification will be strongly affected by the underlying gene models that are usually supplied to the quantification programs via GTF or BED(-like) HTSeq gene-based counting schemes http://www- huber.embl.de/users/anders/ HTSeq/doc/count.html Three different modes to tune its behavior with respect to: - the multimapping - the gaps featureCounts Also allows to count reads overlapping with individual exons. If an exon is part of more than one isoform in the annotation file, featureCounts will return the read counts for the same exon multiple times (n = number of transcripts with that exon). If we want to assess differential expression of exons, it is highly recommended to create an annotation file where overlapping exons of different isoforms are split into artificially disjoint bins before applying featureCounts (e.g. using dexseq_prepare_annotation.py script of the DEXSeq package) Isoform mapping • Simple count-based approaches tend to ignore reads that overlap with more than one feature • This is a problem if the aim is to quantify different isoforms (multiple isoform transcripts of the same gene may overlap) • Special SW: Cufflinks, eXpress, DEXseq, RSEM • RSEM is the one that tends to perform best in most comparisons and the statistical interpretations and assumptions to handle transcript structures have been widely adopted. Quantifying gene vs transcript abundance https://www.semanticscholar.org/paper/Differential-analysis- of-gene-regulation-at-with-Trapnell- Hendrickson/55507aba54a92b31a200e8112f088f5c356ed7bc Cuffdiff SW The alignment free methods for transcript quantification! • “Special” counting – RSEM; (Salmon, Kallisto) • Mapping to transcriptome and dividing multi mapped reads between all isoforms • Expression by transcripts summarized to gene expression • https://f1000research.com/articles/4-1521/v2 Alignment is the most computationally expensive step in the whole pipeline. But what if we skipped this step?? Threemainstepsin the alignmentfree transcript quantification! 1. The sequences for comparison (reads, reference) are sliced up into collections of unique (!) k-mers of a given length k. 2. For each pairwise comparison, we count the number of times a specic k-mer appears in both sequence strings that are being compared. 3. To assess the similarity between the two strings, some sort of distance function is employed (Euclidean distance; identical sequences have a distance of zero) Zielezinski A, Vinga S, Almeida J, and Karlowski WM. Alignment-free sequence comparison: Benefits,applications, and tools. Genome Biology, 2017. doi:10.1186/s13059-017-1319-7. The alignment-free transcript quantification! • In practice, Salmon and Kallisto will first generate an index of k-mers from all known transcript sequences. • These transcript k-mers will then be compared with the k-mers of the sequenced reads, yielding a pseudoalignment that describes how many kmers a read shares with a set of compatible transcripts (based on the distances) • By grouping all pseudoalignments that belong to the same set of transcripts, they can then estimate the expression level of each transcript model. The alignment-free transcript quantification! • The pseudoaligners rely absolutely on a precise and comprehensive transcript annotation. • If a sequenced fragment originates from an intron or an unannotated transcript, it can be falsely mapped to a transcript since the relevant genomic sequence is not available. • Alignment-based tools will discard reads if their edit distance becomes too large, pseudoalignment currently does not entail a comparable scoring system to validate the compatibility; therefore there is no safeguard againstspuriousalignments. • For example, that a 100-bp-read can pseudoalign with a transcript with which it shares only a single k-mer { if no better match can be found within the universe of the pre-generated cDNA index.) The speed is increased but to a cost! Comparison of different algorithms for counting features in RNAseq data Feature featureCounts HTSeq-count Cufflinks eXpress DEXSeq RSEM Developed by Bioconductorteam Simon Anders Cole Trapnell's Lab Lior Pachter's Lab Simon Anders'Lab Bo Li's Lab Compatibility SAM, BAM, CRAM formats Primarily works with SAM SAM, BAM, GTF formats SAM, BAM formats BAM, GFF formats SAM, BAM formats Parallelization Supports parallel processing Does not support parallelization Supports parallel processing Supports parallel processing Supports parallel processing Supports parallel processing Annotation Formats GTF, GFF, SAF formats Requires GFF featuretype format GTF format GTF format GFF format GFF format Output Format Tab-delimited text file with counts Tab-delimited textfilewith counts Various files and tables Various files Tab-delimited text files and tables Various files and tables Speed Known for speed and efficiency Efficient but may be slower for largedatasets Slightly slower for large datasets Known for speed and efficiency Slightly slower for large datasets Known for speed and efficiency Handling Options Various options for read counting, handling multimapped reads, and strandedness Options for secondary alignments, ambiguous reads, andstranded data Performs transcript assembly and quantification Efficient and accurate quantification Differentialexon usage analysis Accurate quantification, support for alternative isoforms Statistical Methods Employs a counting algorithmthat considers multimapping reads Employs a counting algorithmfor read assignment Utilizes a likelihood-based method for transcript quantification Utilizes a Bayesian framework for quantification Employs statistical models for identifying differentially expressed exons Utilizes an Expectation- Maximization (EM) algorithm for quantification Supports Isoform Counts No No Yes Yes No Yes Additional Features Flexibleannotation format compatibility Primarily designedfor SAMfiles and GFF feature types Transcript assembly and differential expression analysis Speed and memory efficiency Differentialexon usage analysis Support for estimating isoform expression Documentation Well-documented and actively maintained Well-documented and widely used Comprehensive documentation Documentation available Documentation available Documentation available Generated with help of ChatGPT