Metagenomics processing
E5444 Analysis of Sequencing Data
Vojtěch Bartoň, 2023

Metagenomics
• Microbial Community Genetics: Metagenomics studies the genetics of entire microbial communities in environmental samples.
• Genomic Snapshot: It doesn't require isolating or culturing individual microorganisms, providing a holistic view of diverse microbes.
• Health Insights: It helps us study the human microbiome and its impact on health without traditional culturing methods.
• Applications in Ecology and Biotech: Metagenomics informs ecosystem understanding and aids in biotech discoveries like enzymes and antibiotics.

Metagenomics: WMGS vs. 16S
WMGS (whole-metagenome shotgun sequencing)
• Captures the entire genetic content of all microorganisms.
• High-resolution data for functional analysis and taxonomic identification.
• Suitable for complex ecosystems and novel gene discovery.
• Used in environmental and clinical metagenomics.
16S
• Targets the 16S rRNA gene, a marker gene in bacteria and archaea.
• Lower resolution, primarily for taxonomic identification.
• Commonly used for microbial community profiling and diversity studies.
• Cost-effective method for taxonomic analysis.

MetaPhlAn
• Metagenomic Phylogenetic Analysis
• Taxonomic Profiling: MetaPhlAn identifies and quantifies microbes in samples.
• Marker Gene Approach: It uses unique genetic markers for speedy and accurate identification.
• Efficiency: Known for fast analysis of large datasets.
• Applications: Widely used in microbiome research and clinical metagenomics.
• https://github.com/biobakery/MetaPhlAn

MetaPhlAn: input data
• Shotgun whole-metagenome sequencing
  • Sequences (FASTA, FASTQ)
• Database of taxonomically known sequences
  • Unique regions

MetaPhlAn: marker genes
• Clades
  • Groups of genomes (organisms) believed to have evolved from a common ancestor
• Clade-specific marker genes
  • Strongly conserved within the clade's genomes
  • Not similar to any sequence in other clades (of the same level)
• Unique markers change as the clade level grows
• They also accumulate in a way (direct vs indirect)...

MetaPhlAn: Overview
• Reference genomes and their taxonomy
• Find clade-specific marker genes
• Sequence your sample
• Map to marker genes
• Count taxonomic units

MetaPhlAn: Reference database
• ChocoPhlAn
• Acquire reference genomes
  • De novo assembly
  • Cultured species
  • UniProt core data
• Acquire taxonomy
  • Hierarchical clustering tree based on similarity
  • NCBI taxonomy

MetaPhlAn: Reference database
• The general process:
  • Each genome → bag-of-genes representation
  • Only genes conserved within the clade are kept
  • Elimination based on an inter-clade uniqueness index
  • Single-copy genes were preferred over multi-copy genes
• Properties of the markers
  • Gene level
  • 5.1M filtered genes
  • 27K species-level genome bins
  • Not necessarily continuous (bag-of-genes)
  • ~4% of the total genome length
  • ~260 markers per species

MetaPhlAn: Taxonomic profiling
• Map reads against the reference database of marker genes
• Calculate relative abundance
  • Sum the total reads mapped to clade markers
  • Divide by the markers' total length
  • Abundances at every clade level sum up to 100%

MetaPhlAn: Taxonomic profiling
• Unclassified case
  • Move up in the taxonomy tree

MetaPhlAn: outputs and visualisations
• bugs_list.tsv
• Utility scripts

MetaPhlAn: application
• Shotgun sequencing
• Microbiome profiling
• Metagenomics
• Metatranscriptomics
• As input for HUMAnN
  • Profiling the abundance of microbial metabolic pathways and other molecular functions

MetaPhlAn: pros & cons
Pros
• Rapid profiling
• Accuracy
• Versatile
• Quantitative output
Cons
• Limited functional information
• Reference-dependent
• Computational resources
• Interpreting unknowns

QIIME2

QIIME2: Output artifacts and visualisations
• *.qza - zip archive containing data and metadata
• *.qzv - zip archive containing data, metadata and visualisations

QIIME2: pros & cons
Pros
• Comprehensive pipeline
• Plugins
• GUI
• Modularity
Cons
• Own data types
• Learning curve
• Reference-dependent

nf-core/Ampliseq pipeline

Ampliseq: quality control
• Data preprocessing
• Check read quality
• Perform filtering and trimming

Ampliseq: ASV calculation
• DADA2
• Error estimation
• Chimera removal
• Contamination removal
• Filtering

Ampliseq: Taxonomic classification
• Database-dependent
• Infer species
• Confidence intervals
• Multiple assignment

Ampliseq: Taxonomic filtering
• Filter specific taxa
• Abundance filtering

Ampliseq: Post processing
• Visualisation
• Diversity computation
• Functional analysis

Ampliseq: Visualisation

Ampliseq: functional profiling
• PICRUSt2
  • Phylogenetic Investigation of Communities by Reconstruction of Unobserved States
  • KEGG and COG databases
• Based on phylogeny
  • Genes present in microbial genomes are similar amongst relatives
  • When sufficient genome sequences are available, it is possible to predict which gene families are present in a given microbial OTU from phylogeny alone.

Ampliseq: Report

Ampliseq: pros & cons
Pros
• Standardized
• Easy to run
• Comprehensive analysis
• Community-driven
Cons
• Learning curve
• Resource-intensive
• Needs setup
• Dependent on software versions
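
MetaPhlAn: relative abundance (illustrative sketch)
The taxonomic profiling steps listed earlier for MetaPhlAn (sum the reads mapped to a clade's markers, divide by the markers' total length, rescale so that abundances at one clade level sum up to 100%) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not MetaPhlAn's actual implementation: the function name, the two input dictionaries, and the clade names and numbers are all hypothetical, and the real tool's per-marker statistics may differ.

```python
from typing import Dict


def relative_abundance(
    mapped_reads: Dict[str, int],
    marker_length: Dict[str, int],
) -> Dict[str, float]:
    """Return per-clade relative abundance (%) at a single clade level.

    mapped_reads:  reads mapped to each clade's markers (hypothetical input)
    marker_length: total length of each clade's markers in bp (hypothetical
                   input, assumed to be positive for every clade)
    """
    # Reads summed per clade, normalised by that clade's total marker length.
    coverage = {
        clade: mapped_reads[clade] / marker_length[clade]
        for clade in mapped_reads
    }
    total = sum(coverage.values())
    if total == 0:
        # No reads mapped to any marker: report zero for every clade.
        return {clade: 0.0 for clade in coverage}
    # Rescale so that all clades at this level sum up to 100%.
    return {clade: 100.0 * cov / total for clade, cov in coverage.items()}


# Made-up example values, for illustration only.
reads = {"s__Species_A": 300, "s__Species_B": 100}
lengths = {"s__Species_A": 30_000, "s__Species_B": 20_000}
profile = relative_abundance(reads, lengths)
for clade, pct in profile.items():
    print(f"{clade}\t{pct:.2f}")
```

With these made-up counts, Species_A receives about two thirds of the abundance: its three-fold higher read count outweighs its longer markers once reads are normalised by marker length.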