Bioinformatics workflow management tools
Vojta Bystry
vojtech.bystry@ceitec.muni.cz

NGS data analysis
[Diagram; labels in the order they appear]
• Raw data (.fastq)
• Genome/transcriptome reference
• Mapping (.bam)
• Interaction analysis: ChIP-seq
• Expression analysis: RNA-seq
• Variant analysis: WES
• De-multiplexing
• Not known reference
• QC (appears at two points in the diagram)
• Experiment design
• Not "classic" reference
• Metagenomics
• Reference assembly
• Immunogenetics: VDJ genes
• CRISPR: sgRNA
• Methylation: bisulfite-seq
• …

Bioinformatics workflow (pipeline)
[Diagram; steps shown: trim primers, alignment, variant calling, CNV, report]

Bioinformatic workflow management
• Reusability and reproducibility
• Parallelization and scale
• Error solving / debugging

Bioinformatic workflow managers

Common Workflow Language (CWL)
• Pushed by EU projects
• No large grassroots community
• Scripts in .yaml format

Galaxy project
• Workflow manager with GUI
• Biologists can do their own analysis ???
• It can work, e.g. at EMBL

Nextflow
• Great deployability
• Great existing workflow repository

Snakemake
• A workflow is a set of rules: bcl_to_fastq, primer_trimming, alignment, VC, CNV, results_report

    rule primer_trimming:
        input:
            fastq="{run_name}/raw_fastq/{sample}.fastq"
        output:
            fastq="{run_name}/trimmed/{sample}.trimmed.fastq"
        shell:
            "cutadapt {input.fastq} > {output.fastq}"

• Rule header: the input and output declarations
• Rule body: the shell command
• Wildcards ({run_name}, {sample}): filled during the run
• Variables defined in the header ({input.fastq}, {output.fastq}) are used in the body

Snakemake: connecting rules
• Rules are linked through file names: the output of one rule is the input of another

    rule CNV:
        output:
            "{run_name}/CNV/{sample}.CNV.tsv"

    rule results_report:
        input:
            vcf = "{run_name}/VC/{sample}.vcf",
            cnv = "{run_name}/CNV/{sample}.CNV.tsv"

• Configuration: config_file.json

Snakemake: rule body options
► Simple shell script

    shell:
        "mv {input} {output}"

► Combine languages

    run:
        if params.cluster:
            R("cutree(hclust({input}), h = 7)")
        else:
            shell("mv {input} {output}")

► Wrap it in a separate script

    script:
        "my_script.py"

  ‒ Separation of logic and functionality
  ‒ Organization
  ‒ Re-usability
• Languages: shell, R, Python

Conda / Anaconda / Bioconda
Bioconda is a distribution of bioinformatics software realized as a channel for the versatile Conda package manager.

Conda
[Diagram: Conda environment, Docker image, virtual machine, compared by independence, computational overhead, and ease of configuration]

Conda
• Easy installation and management:

    conda install vardict
    conda update vardict
    conda remove vardict

• Installation recipes (myenv.yaml):

    channels:
      - conda-forge
      - defaults
    dependencies:
      - pandas ==0.20.3
      - statsmodels ==0.8.0
      - r-dplyr ==0.7.0
      - r-base ==3.4.1

• Isolated environments:

    conda env create -f myenv.yaml -n myenv

Conda
• Cheat sheet
  ‒ https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf
• Google it
  ‒ conda [bioinformatics tool name]

Computational resources and execution
• Snakemake is quite flexible in cluster execution
• https://snakemake.readthedocs.io/en/stable/executing/cloud.html
• ! Nothing works as advertised :)
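
Snakemake wildcards: a sketch
The "wildcards: filled during the run" idea from the primer_trimming rule can be illustrated in plain Python. This is only a simplified sketch of the principle, not Snakemake's actual implementation: the function names pattern_to_regex and resolve are hypothetical, and the example target run01/trimmed/sampleA.trimmed.fastq is made up. The requested file is matched against the rule's output pattern, and the captured wildcard values are then substituted into the input pattern.

```python
import re

# Patterns copied from the primer_trimming rule on the slides.
INPUT_PATTERN = "{run_name}/raw_fastq/{sample}.fastq"
OUTPUT_PATTERN = "{run_name}/trimmed/{sample}.trimmed.fastq"


def pattern_to_regex(pattern):
    """Compile a '{name}'-style path pattern into a regex with named groups."""
    # re.split with a capturing group alternates: literal, name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    seen = set()
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)       # literal path text
        elif part in seen:
            regex += f"(?P={part})"        # repeated wildcard: same value
        else:
            seen.add(part)
            regex += f"(?P<{part}>.+)"     # first use: capture the value
    return re.compile(regex)


def resolve(target, output_pattern, input_pattern):
    """Return (wildcards, input_path) if the rule can produce target, else None."""
    match = pattern_to_regex(output_pattern).fullmatch(target)
    if match is None:
        return None
    wildcards = match.groupdict()
    return wildcards, input_pattern.format(**wildcards)


if __name__ == "__main__":
    # Hypothetical target file, as if requested on the command line.
    print(resolve("run01/trimmed/sampleA.trimmed.fastq",
                  OUTPUT_PATTERN, INPUT_PATTERN))
```

Because the match starts from the requested output, one rule serves any run and any sample: asking for a different target file changes the wildcard values, not the rule.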