Metagenomics processing
E5444 Analysis of Sequencing Data
Vojtěch Bartoň, 2023
Metagenomics
• Microbial Community Genetics: Metagenomics studies the genetics of entire microbial
communities in environmental samples.
• Genomic Snapshot: It doesn't require isolating or culturing individual microorganisms,
providing a holistic view of diverse microbes.
• Health Insights: It helps us study the human microbiome and its impact on health
without traditional culturing methods.
• Applications in Ecology and Biotech: Metagenomics informs ecosystem understanding
and aids biotech discoveries such as enzymes and antibiotics.
Metagenomics: WMGS vs. 16S
WMGS
Captures the entire genetic content of all
microorganisms.
High-resolution data for functional analysis and
taxonomic identification.
Suitable for complex ecosystems and novel gene
discovery.
Used in environmental and clinical metagenomics.
16S
Targets the 16S rRNA marker gene in bacteria and
archaea.
Lower resolution, primarily for taxonomic
identification.
Commonly used for microbial community profiling
and diversity studies.
Cost-effective method for taxonomic analysis.
MetaPhlAn
• Metagenomic Phylogenetic Analysis
• Taxonomic Profiling: MetaPhlAn identifies and quantifies microbes in samples.
• Marker Gene Approach: It uses unique genetic markers for speedy and accurate
identification.
• Efficiency: Known for fast analysis of large datasets.
• Applications: Widely used in microbiome research and clinical metagenomics.
• https://github.com/biobakery/MetaPhlAn
Metaphlan: input data
• Shotgun Whole Metagenome Sequencing
• Sequences (fasta, fastq)
• Database of taxonomically known sequences
• Unique regions
Metaphlan: marker genes
• Clades
• Groups of genomes (organisms)
believed to have evolved from a
common ancestor
• Clade-specific marker genes
• Strongly conserved within the
clade’s genomes
• Not similar to any sequence in
other clades (of the same level)
• Unique markers change as the
clade level grows
• They also accumulate in a way
(direct vs indirect)...
Metaphlan: Overview
• Reference genomes and their
taxonomy
• Find clade-specific marker genes
• Sequence your sample
• Map to marker genes
• Count taxonomic units
Metaphlan: Reference database
• ChocoPhlAn
• Acquire reference genomes
• De novo assembly
• Cultured species
• UniProt core data
• Acquire taxonomy
• Hierarchical clustering tree based on similarity
• NCBI taxonomy
Metaphlan: Reference database
• The general process:
• Each genome → bag-of-genes
representation
• Only conserved genes in the clade
are saved
• Inter-clade uniqueness index
elimination
• Single-copy genes were preferred
over multi-copy genes
• Properties of the markers
• Gene level
• 5.1M filtered genes
• 27K species-level genome bins
• Not necessarily continuous (bag-
of-genes)
• ~4% of the total genome length
• ~260 markers per species
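The selection logic above (bag-of-genes per genome, keep genes conserved within the clade, discard genes seen in other clades) can be illustrated with plain sets. This is a minimal sketch and not the ChocoPhlAn implementation: the function name `clade_markers`, the `min_fraction` parameter, and the toy genomes are hypothetical, and exact gene-identifier matching stands in for the sequence-similarity and uniqueness-index checks used in practice.

```python
def clade_markers(genomes, clades, min_fraction=1.0):
    """Pick candidate marker genes per clade from bag-of-genes genomes.

    genomes: genome id -> set of gene family ids (bag-of-genes)
    clades:  clade id -> list of genome ids belonging to that clade
    min_fraction: hypothetical conservation threshold within the clade
    """
    # Step 1: keep only genes conserved across the clade's genomes.
    core = {}
    for clade, members in clades.items():
        counts = {}
        for genome in members:
            for gene in genomes[genome]:
                counts[gene] = counts.get(gene, 0) + 1
        core[clade] = {
            gene for gene, n in counts.items()
            if n / len(members) >= min_fraction
        }

    # Step 2: drop genes that also occur in any genome of another clade.
    markers = {}
    for clade, genes in core.items():
        foreign = set()
        for other, members in clades.items():
            if other != clade:
                for genome in members:
                    foreign |= genomes[genome]
        markers[clade] = genes - foreign
    return markers


# Hypothetical toy data: gene "b" is shared by both clades.
genomes = {
    "g1": {"a", "b", "c"},
    "g2": {"a", "b", "d"},
    "g3": {"b", "e"},
    "g4": {"b", "e", "f"},
}
clades = {"A": ["g1", "g2"], "B": ["g3", "g4"]}
markers = clade_markers(genomes, clades)
```

In the toy data, gene "b" is conserved in both clades and is therefore rejected as a marker for either, while "c", "d", and "f" fail the conservation step.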
Metaphlan: Taxonomic profiling
• Map reads against reference database of marker genes
• Calculate relative abundance
• Sum the total reads mapped to clade markers
• Divide by the markers' total length
• Abundances at every clade level sum to 100%
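The calculation above can be written out as a short sketch: sum the reads mapped to a clade's markers, divide by the total marker length, and normalise so the values at one clade level add up to 100%. The function name `relative_abundance`, the input dictionaries, and the toy numbers are assumptions for illustration; MetaPhlAn's own estimate applies further robust statistics that are not shown here.

```python
from collections import defaultdict


def relative_abundance(read_counts, marker_lengths, marker_to_clade):
    """Return clade -> relative abundance (%) from per-marker read counts."""
    reads = defaultdict(int)
    length = defaultdict(int)
    for marker, clade in marker_to_clade.items():
        # Sum the reads mapped to this clade's markers.
        reads[clade] += read_counts.get(marker, 0)
        # Sum the lengths of this clade's markers.
        length[clade] += marker_lengths[marker]

    # Length-normalised coverage per clade.
    coverage = {clade: reads[clade] / length[clade] for clade in reads}

    total = sum(coverage.values())
    if total == 0:
        return {clade: 0.0 for clade in coverage}
    # Rescale so that all clades at this level sum to 100 %.
    return {clade: 100.0 * value / total for clade, value in coverage.items()}


# Hypothetical toy data for two species-level clades.
marker_to_clade = {"m1": "s__A", "m2": "s__A", "m3": "s__B"}
marker_lengths = {"m1": 1000, "m2": 1000, "m3": 500}
read_counts = {"m1": 30, "m2": 10, "m3": 20}
profile = relative_abundance(read_counts, marker_lengths, marker_to_clade)
```

Species A receives more reads in total (40 vs 20), but its markers are four times longer, so after length normalisation species B comes out as the more abundant clade.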
Metaphlan: Taxonomic profiling
• Unclassified case
• Move up in the taxonomy tree
Metaphlan: outputs and visualisations
• bugs_list.tsv
• Utility scripts
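As a rough illustration of working with the profile table, the sketch below reads a tab-separated profile and keeps one taxonomic rank. It assumes that comment lines start with `#`, that the first column holds a `|`-separated lineage with rank prefixes such as `s__`, and that the third column holds the relative abundance; the column layout differs between MetaPhlAn versions, so check the header of your own file. The function name `read_profile` and the example rows are hypothetical.

```python
import io


def read_profile(handle, rank_prefix="s__", abundance_column=2):
    """Return {clade: relative abundance} for one rank of a profile table."""
    profile = {}
    for line in handle:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # The last element of the lineage is the clade this row describes.
        clade = fields[0].split("|")[-1]
        if clade.startswith(rank_prefix):
            profile[clade] = float(fields[abundance_column])
    return profile


# Hypothetical example content; real files contain full lineages and tax IDs.
example = io.StringIO(
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species\n"
    "k__Bacteria\tNA\t100.0\t\n"
    "k__Bacteria|g__GenusA\tNA\t75.0\t\n"
    "k__Bacteria|g__GenusA|s__GenusA_sp1\tNA\t75.0\t\n"
    "k__Bacteria|g__GenusB\tNA\t25.0\t\n"
    "k__Bacteria|g__GenusB|s__GenusB_sp1\tNA\t25.0\t\n"
)
species = read_profile(example)
```

Passing `rank_prefix="g__"` instead returns the genus-level rows, which sum to 100% in the same way.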
Metaphlan: application
• Shotgun sequencing
• Microbiome Profiling
• Metagenomics
• Metatranscriptomics
• As input for HUMAnN
• profiling the abundance of microbial metabolic pathways and other molecular
functions
Metaphlan: pros & cons
Pros
• Rapid profiling
• Accuracy
• Versatile
• Quantitative output
Cons
• Limited functional information
• Reference-dependent
• Computational resources
• Interpreting unknowns
QIIME2
QIIME2: Output artifacts and visualisations
• *.qza - zip folder, containing data and metadata
• *.qzv - zip folder, containing data, metadata and visualisations
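Because both file types are zip archives, their contents can be inspected with Python's standard `zipfile` module. The sketch below builds a small in-memory stand-in instead of a real artifact; the internal file names (`example-uuid/metadata.yaml`, `example-uuid/data/...`) are assumptions for illustration, and the helper `list_archive` is hypothetical.

```python
import io
import zipfile


def list_archive(source):
    """Return the sorted member names of a zip-based artifact.

    source may be a path to a .qza/.qzv file or a file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return sorted(archive.namelist())


# Build a tiny in-memory zip as a stand-in for a real artifact
# (the layout below is assumed for illustration only).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example-uuid/metadata.yaml", "type: hypothetical\n")
    archive.writestr("example-uuid/data/feature-table.biom", "")
buffer.seek(0)

members = list_archive(buffer)
```

With a real file, `list_archive("table.qza")` (a hypothetical file name) would list the members in the same way.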
QIIME2: pros & cons
Pros
• Comprehensive pipeline
• Plugins
• GUI
• Modularity
Cons
• Own data types
• Learning curve
• Reference-dependent
nf-core/Ampliseq pipeline
Ampliseq: quality control
• Data preprocessing
• Check reads quality
• Perform filtering and trimming
Ampliseq: ASVs calculation
• DADA2
• Error estimation
• Chimera removal
• Contamination removal
• Filtering
Ampliseq: Taxonomic classification
• Database dependent
• Infer species
• Confidence intervals
• Multiple assignment
Ampliseq: Taxonomic filtering
• Filter specific taxa
• Abundance filtering
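Both filtering steps can be sketched on a small ASV count table. This is an illustrative sketch, not the pipeline's own code: the function name `filter_asvs`, the excluded terms, the `min_total` threshold, and the toy table are all assumptions.

```python
def filter_asvs(counts, taxonomy, exclude=("mitochondria", "chloroplast"),
                min_total=2):
    """Drop ASVs assigned to unwanted taxa or with too few reads overall.

    counts:   ASV id -> list of read counts, one per sample
    taxonomy: ASV id -> taxonomy string
    """
    kept = {}
    for asv, per_sample in counts.items():
        label = taxonomy.get(asv, "").lower()
        # Taxonomic filter: skip ASVs whose label contains an excluded term.
        if any(term in label for term in exclude):
            continue
        # Abundance filter: skip ASVs below the total read threshold.
        if sum(per_sample) < min_total:
            continue
        kept[asv] = per_sample
    return kept


# Hypothetical toy table with two samples.
counts = {"asv1": [10, 5], "asv2": [1, 0], "asv3": [7, 3]}
taxonomy = {
    "asv1": "Bacteria;Firmicutes",
    "asv2": "Bacteria;Proteobacteria",
    "asv3": "Bacteria;Cyanobacteria;Chloroplast",
}
kept = filter_asvs(counts, taxonomy)
```

Here `asv3` is removed by the taxonomic filter and `asv2` by the abundance filter, leaving only `asv1`.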
Ampliseq: Post processing
• Visualisation
• Diversity computation
• Functional analysis
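As an example of what a diversity computation involves, the sketch below calculates the Shannon index, one common alpha-diversity measure, from the ASV counts of a single sample. Which indices a given pipeline run reports depends on its configuration; the function name `shannon` and the toy counts are assumptions for illustration.

```python
import math


def shannon(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log(c / total)
        for c in counts
        if c > 0
    )


# Hypothetical samples: four equally abundant ASVs vs. one dominant ASV.
even_sample = shannon([10, 10, 10, 10])
skewed_sample = shannon([37, 1, 1, 1])
```

The evenly distributed sample reaches the maximum for four ASVs, ln(4), while the sample dominated by one ASV scores lower.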
Ampliseq: Visualisation
Ampliseq: functional profiling
• PICRUSt2
• Phylogenetic Investigation of
Communities by Reconstruction
of Unobserved States
• KEGG and COG databases
• Based on phylogeny
• Genes present in microbial
genomes are similar amongst
relatives
• When sufficient genome
sequences are available, it is
possible to predict which gene
families are present in a given
microbial OTU from phylogeny
alone.
Ampliseq: Report
Ampliseq: pros & cons
Pros
• Standardized
• Easy to run
• Comprehensive analysis
• Community-driven
Cons
• Learning curve
• Resource intensive
• Needs setup
• Dependent on software versions