FeatureCounts
Ashebir Gogile
Uco540613

Feature Quantification
• htseq-count
• featureCounts
• BEDTools
• mmquant

• NGS technologies generate millions of short sequence reads, which are usually aligned to a reference genome.
• For many downstream analyses, the key quantity is the number of reads mapping to each genomic feature (e.g. exons, genes).
• The process of counting reads is called read summarization.

• featureCounts:
  - an ultrafast and accurate read summarization program
  - requires far less computer memory than existing methods
• A highly efficient, general-purpose read summarization program that counts mapped reads for genomic features.
• It can be used to count both RNA-seq and genomic DNA-seq reads (SAM/BAM files).
• It works with either single-end or paired-end reads.

featureCounts
• Uses genomic annotations in GTF or SAF format to count reads for genomic features (e.g. exons) and meta-features (e.g. genes).

• For differential gene expression analysis, it is convenient to have the counts for all samples in a single file (a gene count matrix).
• To produce a gene count matrix file directly, run featureCounts on all mapped files at once.

• However, when you run featureCounts on each sample individually (for example, with a large number of samples), the counts for each sample end up in a separate text file.

• To get the merged gene count matrix from all individual count files, you can use bioinfokit v2.0.5.

Input and output

Inputs
• Takes as input Sequence Alignment/Map (SAM) or Binary Alignment/Map (BAM) files, and
• an annotation file including the chromosomal coordinates of features.

• The annotation file should be in either GTF format or the simplified annotation format (SAF), as shown below:

  GeneID   Chr    Start     End       Strand
  497097   chr1   3204563   3207049   -
  497097   chr1   3411783   3411982   -
  497097   chr1   3660633   3661579   -

Outputs
• The number of reads assigned to each feature (or meta-feature).
• Summary statistics for the overall summarization results: the number of successfully assigned reads and the number of reads that failed to be assigned, for various reasons.

• After merging, the output file gene_matrix_count.csv is written to the same folder and contains the counts merged for all samples.

ALGORITHM
• Overlap of reads with features: featureCounts takes account of any gaps (indels, exon–exon junctions) found in the read.
• Multiple overlaps: featureCounts lets users either exclude multi-overlap reads or count them once for each overlapped feature.

• Chromosome hashing: a hash table is built for the reference sequence names, so that reads and features can be matched quickly by reference sequence.
• Genome bins and feature blocks: a two-level hierarchy is created for each reference sequence. The use of a hierarchical data structure (features within blocks within bins) is a key component of the featureCounts algorithm.

• The query read is compared first with genomic bins, then with the feature blocks within any overlapping bins, and finally with the features in any overlapping blocks.
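
The hierarchical lookup described above (hash on the reference name, then bins, then blocks, then features) can be illustrated with a small Python sketch. This is only a toy model of the idea, not the featureCounts implementation. The names `build_index` and `overlapping_features`, the bin width `BIN_SIZE`, the rule that a block holds roughly the square root of the number of features in its bin, and the choice to register a feature in every bin it spans are all assumptions made for illustration. A gapped read would be queried once per aligned segment, which the sketch leaves out.

```python
import math
from collections import defaultdict

BIN_SIZE = 128 * 1024  # assumed bin width in bases (illustrative choice)


def build_index(features):
    """Build {chrom: {bin_number: [(block_start, block_end, [features])]}}.

    `features` is an iterable of (gene_id, chrom, start, end) tuples with
    1-based inclusive coordinates, as in a SAF file.
    """
    per_bin = defaultdict(lambda: defaultdict(list))
    for gene_id, chrom, start, end in features:
        # Simplification: register the feature in every bin it spans, so a
        # query only has to look at the bins the read itself overlaps.
        for b in range(start // BIN_SIZE, end // BIN_SIZE + 1):
            per_bin[chrom][b].append((start, end, gene_id))

    index = {}  # Python dict = hash table keyed by reference sequence name
    for chrom, bins in per_bin.items():
        index[chrom] = {}
        for b, feats in bins.items():
            feats.sort()  # sort by start position
            size = max(1, math.isqrt(len(feats)))  # assumed block size
            blocks = []
            for i in range(0, len(feats), size):
                chunk = feats[i:i + size]
                block_start = chunk[0][0]
                block_end = max(f_end for _, f_end, _ in chunk)
                blocks.append((block_start, block_end, chunk))
            index[chrom][b] = blocks
    return index


def overlapping_features(index, chrom, start, end):
    """Return the set of (gene_id, start, end) features overlapping a read."""
    hits = set()
    bins = index.get(chrom, {})  # level 0: hash lookup by reference name
    for b in range(start // BIN_SIZE, end // BIN_SIZE + 1):  # level 1: bins
        for block_start, block_end, feats in bins.get(b, []):  # level 2: blocks
            if block_start > end or block_end < start:
                continue  # whole block skipped without touching its features
            for f_start, f_end, gene_id in feats:  # level 3: features
                if f_start <= end and f_end >= start:
                    hits.add((gene_id, f_start, f_end))
    return hits


# The three SAF rows from the annotation example, plus one hypothetical
# long feature ("geneB") that spans several bins.
features = [
    ("497097", "chr1", 3204563, 3207049),
    ("497097", "chr1", 3411783, 3411982),
    ("497097", "chr1", 3660633, 3661579),
    ("geneB", "chr2", 100, 300000),
]
index = build_index(features)
print(overlapping_features(index, "chr1", 3204600, 3204650))
```

The point of the hierarchy is that most reads are resolved after inspecting one bin and a handful of blocks, so the cost of a query stays small even when the annotation contains many features.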