MUNI RECETOX Research infrastructure

Workflow & Computational Environment
E5444 Analysis of sequencing data

Vojtěch Bartoň
vojtech.barton@recetox.muni.cz
RECETOX, Masaryk University
October 2, 2024

Table of Contents
Bioinformatics Workflow
Bioinformatics Workflow Managers
Workflow Environments
Grid Computing
Metacentrum
Linux Commands
Hands-on

V. Barton • Workflow & Environment • October 2, 2024

Bioinformatics workflow
[Figure: workflow diagram with the labels library preparation, SAM/BAM file, quality control, variant calling, visualisation.]

Bioinformatics workflow
[Figure: workflow diagram ending in feature counting.]

Bioinformatics workflow
[Figure: workflow diagram with the labels quality control, filtering, alignment, feature counts, novel transcript discovery.]

Bioinformatics Workflow Managers
[Figure, three panels.
a Analysis workflow: transcript expression quantification. Inputs: FASTQ and a reference sequence (GRCh38, Ensembl 91). Step 1: quality control (FastQC v0.11.9), followed by Salmon. Outputs: QC report and transcript expression.
b Traditional pipeline: platform-specific requirements (FastQC, Salmon, pipeline code); re-entrance through local checkpoints; fixed versions and a local compute environment.
c Workflow manager: platform-independent requirements (workflow manager, pipeline code); portability across local, HPC, and cloud execution; scalability; containerized steps; automatic resource management; execution report; re-entrancy; data provenance.]
Figure: Wratten, L., Wilm, A. & Göke, J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat Methods 18, 1161-1168 (2021). https://doi.org/10.1038/s41592-021-01254-9
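The re-entrancy idea from the figure (checkpoints that let a pipeline resume instead of restarting) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular workflow manager is implemented: the step names, file names, and the `run_step` helper are hypothetical, and a step counts as "done" simply when its output file exists.

```python
from pathlib import Path
import tempfile


def run_step(name, output, action, log):
    """Run `action` only if its output file is missing (a local checkpoint)."""
    if output.exists():
        log.append(f"skip {name}")
        return
    action(output)
    log.append(f"run {name}")


def pipeline(workdir, log):
    """Two hypothetical steps, each writing one output file."""
    qc = workdir / "qc_report.txt"
    quant = workdir / "expression.txt"
    run_step("quality_control", qc, lambda p: p.write_text("QC ok\n"), log)
    run_step("quantification", quant, lambda p: p.write_text("tx1\t42\n"), log)


workdir = Path(tempfile.mkdtemp())
log = []
pipeline(workdir, log)                  # first run: both steps execute
(workdir / "expression.txt").unlink()   # simulate a lost output of step 2
pipeline(workdir, log)                  # re-entry: only the missing step reruns
print(log)
```

Workflow managers typically go further than an existence check, for example by comparing timestamps or content hashes of inputs, which is part of what the figure groups under data provenance.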
Bioinformatics Workflow Managers
■ Modularity
■ Scalability
■ Reusability & Reproducibility
■ Stability
■ Logging, debugging

Bioinformatics Workflow Managers
■ CWL
■ Snakemake
■ Nextflow
■ Galaxy project

CWL (Common Workflow Language)
■ General workflow definition language
■ Format: YAML

    #!/usr/bin/env cwl-runner
    cwlVersion: v1.0
    class: CommandLineTool
    baseCommand: [bwa, mem]
    requirements:
      DockerRequirement:
        dockerPull: biocontainers/bwa:v0.7.17_cv1
    inputs:
      reference_fasta:
        type: File
        # bwa mem also needs the index files created by `bwa index`
        secondaryFiles: [.amb, .ann, .bwt, .pac, .sa]
        inputBinding:
          position: 1
      input_reads:
        type: File
        inputBinding:
          position: 2
    # bwa mem writes SAM to standard output, so capture it in a file
    stdout: aligned.sam
    outputs:
      aligned_sam:
        type: stdout
    # Sorting (samtools sort: in aligned_sam, out sorted_bam) would be a
    # second step in a separate `class: Workflow` document.

Snakemake
■ General workflow manager
■ Pythonic
■ Large community

    # Default target: the final sorted BAM file
    rule all:
        input:
            "results/sorted.bam"

    # Align reads to the reference
    rule bwa_mem:
        input:
            reference="data/reference.fasta",
            reads="data/reads.fastq"
        output:
            "results/aligned.sam"
        shell:
            "bwa mem {input.reference} {input.reads} > {output}"

    # Sort the alignment and convert it to BAM
    rule samtools_sort:
        input:
            "results/aligned.sam"
        output:
            "results/sorted.bam"
        shell:
            "samtools sort {input} -o {output}"

    # Not required by rule all, so it runs only when
    # results/sorted.bam.bai is requested as a target
    rule index_bam:
        input:
            "results/sorted.bam"
        output:
            "results/sorted.bam.bai"
        shell:
            "samtools index {input}"
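Snakemake decides what to run by matching the requested target file against rule outputs and walking back through their inputs. The Python sketch below illustrates that idea on the three rules from the slide. It is a simplified model, not Snakemake's actual implementation: the `RULES` dictionary and the `plan` function are hypothetical names, and wildcards, timestamps, and parallel execution are ignored.

```python
# Toy model of target-driven rule resolution (hypothetical, for illustration).
RULES = {
    "bwa_mem": {
        "inputs": ["data/reference.fasta", "data/reads.fastq"],
        "output": "results/aligned.sam",
    },
    "samtools_sort": {
        "inputs": ["results/aligned.sam"],
        "output": "results/sorted.bam",
    },
    "index_bam": {
        "inputs": ["results/sorted.bam"],
        "output": "results/sorted.bam.bai",
    },
}


def plan(target, rules=RULES):
    """Return the rule names, in execution order, needed to build `target`."""
    producers = {rule["output"]: name for name, rule in rules.items()}
    order = []

    def visit(path):
        name = producers.get(path)
        if name is None or name in order:
            # No rule produces this file (source data) or it is already planned.
            return
        for dependency in rules[name]["inputs"]:
            visit(dependency)
        order.append(name)

    visit(target)
    return order


print(plan("results/sorted.bam"))
print(plan("results/sorted.bam.bai"))
```

Asking for `results/sorted.bam` plans only the alignment and sorting rules, while asking for the index file also pulls in `index_bam`, which mirrors how the choice of target in `rule all` controls which rules run.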
Nextflow
■ Bioinformatics manager
■ Language: Groovy
■ Large community
■ Designed for bioinformatics

    #!/usr/bin/env nextflow

    params.reads = "data/reads.fastq"
    params.reference = "data/reference.fasta"

    // Align reads to the reference
    process BwaAlign {
        input:
        path reference
        path reads

        output:
        path "aligned.sam"

        script:
        """
        bwa mem ${reference} ${reads} > aligned.sam
        """
    }

    // Sort the alignment and convert it to BAM
    process SamtoolsSort {
        input:
        path "aligned.sam"

        output:
        path "sorted.bam"

        script:
        """
        samtools sort aligned.sam -o sorted.bam
        """
    }

    workflow {
        BwaAlign(file(params.reference), file(params.reads))
        // Pass the alignment produced by BwaAlign to the sorting step
        SamtoolsSort(BwaAlign.out)
    }

Nextflow - nf-core
■ Open community project
■ Large repository of pipelines
[Figure: nf-core/rnaseq pipeline map. Stages: 1. Pre-processing, 2. Genome alignment & quantification, 3. Pseudo-alignment & quantification, 4. Post-processing, 5. Final QC. Methods: aligner STAR with Salmon quantification (default); aligner STAR with RSEM quantification; aligner HISAT2 without quantification; pseudo-aligner Salmon with Salmon quantification; pseudo-aligner Kallisto with Kallisto quantification. Tools shown include MultiQC, dupRadar, SAMtools (sort, index, stats), picard MarkDuplicates, bedGraphToBigWig, Preseq, BEDtools genomecov, StringTie, DESeq2 (PCA only), Qualimap rnaseq, and RSeQC (multiple modules).]

Galaxy project
■ Bioinformatics manager
■ Graphical
■ Large community
■ usegalaxy.eu | usegalaxy.cz
[Screenshot: a BWA-MEM2 step in Galaxy with the inputs "Select first set of reads" and "Select second set of reads", connected to FASTQ (fastqsanger) datasets.]

Workflow Environments
■ Conda:
  ■ Lightweight package/environment management.
  ■ Ensures reproducibility by creating isolated environments with specific dependencies.
■ Docker:
  ■ Containerization for full software environments.
  ■ Ensures portability and reproducibility across different systems.
■ Virtual Machines (VMs):
  ■ Full operating system virtualization.
  ■ Ideal for running workflows with complex or legacy dependencies on isolated OS environments.
■ Grid Computing:
  ■ Distributes tasks across a network of computers.
  ■ Facilitates high-throughput and large-scale computations.

Orchestration levels
[Figure: your analysis or application runs inside nested layers: Conda environments, Docker containers, virtual machines.]

Orchestration levels
[Figure: data science development and deployment. Analyses 1-3 each run in their own conda environment (conda env. 1-3), which can be packaged in a Docker container, on a laptop, server, or cloud instance.]

Metacentrum
High-Performance Computing for Research
■ What is MetaCentrum?
  ■ National grid infrastructure operated by CESNET.
  ■ Provides computational resources for research and education in the Czech Republic.
■ Key Features:
  ■ High-Performance Computing (HPC): Access to powerful computing clusters.
  ■ Cloud Services: Virtual machines, storage, and customized environments for specific workflows.
  ■ Grid Computing: Distributed computing resources across multiple sites for large-scale projects.

Metacentrum
■ Storage capacity:
  ■ Provides storage capacity for data.
  ■ Several types of storage.
■ Supported Domains:
  ■ Bioinformatics, Physics, Chemistry, Climate Modeling, Machine Learning, and more.
■ User Access:
  ■ Free for academic institutions in the Czech Republic.
  ■ Web-based interface, SSH access, and job scheduling via PBS.

On Demand
■ Web-based platform for accessing and running applications on MetaCentrum resources without needing command-line expertise.
■ Pre-configured Applications: Access to a wide range of scientific applications (e.g., RStudio, Jupyter, MATLAB).
■ ondemand.metacentrum.cz

Command-Line access
■ Front-end
■ Storages
■ PBS Scheduler
■ Batch job | Interactive job

Basic Linux Commands
Overview of Linux Commands
■ Command-line interface (CLI) used for interacting with the operating system.
■ Efficient for managing files, directories, and processes.
■ Essential for bioinformatics workflows and high-performance computing.

File and Directory Navigation
Common Commands:
■ pwd - Print current working directory.
■ ls - List files and directories.
■ cd - Change directory.
■ mkdir - Create a new directory.
■ rmdir - Remove an empty directory.

File Manipulation
Common Commands:
■ cp - Copy files and directories.
■ mv - Move or rename files and directories.
■ rm - Remove files or directories.
■ touch - Create an empty file or update file timestamps.
■ cat - Concatenate and display file content.

Working with Compressed Files
■ gzip, gunzip - Compress or decompress files (common with FASTQ and VCF formats).
■ tar - Archive and extract multiple files (tar -xvf, tar -czvf).
■ zcat, zgrep - View or search within compressed files without uncompressing them.
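The same "read without uncompressing to disk" idea behind zcat and zgrep is available from Python's standard `gzip` module. The sketch below is an illustration under stated assumptions: the function names and the demo file are hypothetical, each FASTQ record is assumed to span exactly four lines, and `grep_gz` does plain substring matching rather than the regular expressions zgrep supports.

```python
import gzip
import os
import tempfile


def count_fastq_reads(path):
    """Count records in a gzip-compressed FASTQ file (assumes 4 lines per record)."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def grep_gz(path, pattern):
    """Return the lines containing `pattern` (substring match, zgrep-like)."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in handle if pattern in line]


if __name__ == "__main__":
    # Write a tiny two-read FASTQ file so the example is self-contained.
    demo = os.path.join(tempfile.mkdtemp(), "demo.fastq.gz")
    with gzip.open(demo, "wt") as out:
        out.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
    print(count_fastq_reads(demo))
    print(grep_gz(demo, "@read"))
```

Both functions stream the file line by line, so memory use stays small even for large sequencing files.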
File Permissions
Common Commands:
■ chmod - Change file permissions (read, write, execute).
■ chown - Change file owner or group.
■ ls -l - List files with detailed permissions.
■ umask - Set default file permissions.

Searching and Finding Files
Common Commands:
■ find - Search for files in a directory hierarchy.
■ grep - Search for text patterns within files.
■ locate - Quickly find file locations using a database.
■ which - Show the location of an executable command.
■ man - Display manual pages for command help.

Networking Commands
Common Commands:
■ ping - Check connectivity to a host.
■ ifconfig - Display or configure network interfaces.
■ ssh - Secure shell access to a remote machine.
■ scp - Securely copy files between hosts.
■ wget - Download files from the web.

Software Management
■ conda - Manage bioinformatics software environments.
■ module avail - List the software modules available on HPC systems like MetaCentrum.
■ module load - Load software modules on HPC systems like MetaCentrum.
■ apt-get, yum - Install software on Linux systems (depends on the package manager).

Job Scheduling (PBS)
■ qsub - Submit jobs on PBS-based HPC systems.
■ qstat - Monitor jobs on PBS-based HPC systems.
■ qdel - Delete (cancel) jobs on PBS-based HPC systems.

Hands-on
■ docs.metacentrum.cz/computing/concepts/
■ Try some commands in terminal
■ Submit interactive job
■ Submit batch job
■ Explore outputs
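For the batch-job step of the hands-on, a job is usually described in a small shell script whose `#PBS` comment lines carry the resource request. The Python sketch below only assembles such a script as text. It is a minimal illustration under assumptions: the helper name `pbs_script`, the job name, the resource values, and the `samtools` module name are hypothetical, and the exact resource syntax accepted on a given cluster should be checked against docs.metacentrum.cz.

```python
def pbs_script(job_name, command, ncpus=1, mem_gb=4,
               walltime="01:00:00", modules=()):
    """Build the text of a PBS batch script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",                              # job name
        f"#PBS -l select=1:ncpus={ncpus}:mem={mem_gb}gb",   # one chunk of resources
        f"#PBS -l walltime={walltime}",                     # maximum run time
        "",
        'cd "$PBS_O_WORKDIR"',   # directory from which qsub was called
    ]
    lines += [f"module load {name}" for name in modules]
    lines.append(command)
    return "\n".join(lines) + "\n"


script = pbs_script(
    "sort_bam",
    "samtools sort aligned.sam -o sorted.bam",
    ncpus=2,
    modules=["samtools"],   # assumed module name
)
print(script)
```

Saved to a file such as `job.sh`, the script would be submitted with `qsub job.sh` and followed with `qstat`; an interactive job is requested with `qsub -I` plus the same `-l` resource options on the command line.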