Bioinformatics workflow management tools
Vojta Bystry
vojtech.bystry@ceitec.muni.cz
NGS data analysis
• Experiment design
• De-multiplexing
• Raw data (.fastq), QC
• Genome/Transcriptome reference mapping (.bam), QC
‒ Interaction analysis: ChIP-seq
‒ Expression analysis: RNAseq
‒ Variant analysis: WES
• Not known reference
‒ Metagenomics
‒ Reference assembly
• Not "classic" reference
‒ Immunogenetic: VDJ-genes
‒ CRISPR: sgRNA
‒ Methylation: Bisulfite-seq…
Bioinformatics workflow (pipeline)
• trim primers
• alignment
• variant calling
• CNV
• report
Bioinformatic workflow management
• Reusability and Reproducibility
• Parallelization and Scale
• Error solving / debugging
Bioinformatic workflow managers
Common Workflow Language (CWL)
• Pushed by EU projects
• No large grassroots community
• Workflows written in YAML (.yaml files)
Galaxy project
• Workflow manager with GUI
• Biologists can do their own analysis ???
• It can work - EMBL
Nextflow
• Great deployability
• Great existing workflow repository
Snakemake
rule VC:
rule bcl_to_fastq:
rule results_report:
rule alignment:
rule CNV:
rule primer_trimming:
    input: fastq="{run_name}/raw_fastq/{sample}.fastq"
    output: fastq="{run_name}/trimmed/{sample}.trimmed.fastq"
    shell: "cutadapt {input.fastq} > {output.fastq}"
Snakemake
rule primer_trimming:
    # rule header: wildcards such as {run_name} and {sample} are filled during the run
    input: fastq="{run_name}/raw_fastq/{sample}.fastq"
    output: fastq="{run_name}/trimmed/{sample}.trimmed.fastq"
    # rule body: variables such as {input.fastq} and {output.fastq} are defined in the header
    shell: "cutadapt {input.fastq} > {output.fastq}"
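To make the wildcard mechanics concrete, here is a minimal Python sketch of how a requested output path can be matched against a rule's output pattern, and how the recovered wildcard values are then filled into the input and the shell command. It is not Snakemake's own implementation. The helper names `pattern_to_regex` and `match_wildcards` and the target path `run42/trimmed/patientA.trimmed.fastq` are made up for illustration, and the sketch assumes each wildcard appears only once per pattern.

```python
import re

def pattern_to_regex(pattern):
    """Turn '{run_name}/trimmed/{sample}.trimmed.fastq' into a regex
    with one named group per wildcard (assumes no wildcard is repeated)."""
    # re.split with a capturing group alternates: literal, name, literal, name, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)        # literal text of the path
        else:
            regex += f"(?P<{part}>.+)"      # a wildcard matches any non-empty text
    return re.compile(regex + "$")

def match_wildcards(output_pattern, target):
    """Return the wildcard values if `target` fits the pattern, else None."""
    m = pattern_to_regex(output_pattern).match(target)
    return m.groupdict() if m else None

# Patterns taken from the primer_trimming rule
INPUT = "{run_name}/raw_fastq/{sample}.fastq"
OUTPUT = "{run_name}/trimmed/{sample}.trimmed.fastq"

# Hypothetical file the user asks for
wildcards = match_wildcards(OUTPUT, "run42/trimmed/patientA.trimmed.fastq")

# The same values are substituted into the input and the shell command
input_path = INPUT.format(**wildcards)
output_path = OUTPUT.format(**wildcards)
command = f"cutadapt {input_path} > {output_path}"

print(wildcards)
print(command)
```

The point of the sketch is that the user only names the file they want; the values of `run_name` and `sample` are inferred from that path rather than passed explicitly.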
Snakemake
rule VC:
rule bcl_to_fastq:
rule results_report:
    input: vcf = "{run_name}/VC/{sample}.vcf",
           cnv = "{run_name}/CNV/{sample}.CNV.tsv"
rule alignment:
rule CNV:
    output: "{run_name}/CNV/{sample}.CNV.tsv"
config_file.json
rule primer_trimming:
    input: fastq="{run_name}/raw_fastq/{sample}.fastq"
    output: fastq="{run_name}/trimmed/{sample}.trimmed.fastq"
    shell: "cutadapt {input.fastq} > {output.fastq}"
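The rules above are connected only through file names: the input of one rule matches the output of another. A rough Python sketch of that idea follows. It is a toy resolver, not Snakemake's scheduler, and it ignores files that already exist on disk, timestamps and the config file. Only the paths for `primer_trimming`, the `results_report` inputs and the `CNV` output come from the slides; the `mapped/*.bam` and `report/*.html` paths, the input of `CNV`, and the function names are assumptions made for illustration.

```python
import re

# rule name -> (input patterns, output pattern)
RULES = {
    "bcl_to_fastq": ([], "{run_name}/raw_fastq/{sample}.fastq"),
    "primer_trimming": (["{run_name}/raw_fastq/{sample}.fastq"],
                        "{run_name}/trimmed/{sample}.trimmed.fastq"),
    # hypothetical BAM path, not given in the slides
    "alignment": (["{run_name}/trimmed/{sample}.trimmed.fastq"],
                  "{run_name}/mapped/{sample}.bam"),
    "VC": (["{run_name}/mapped/{sample}.bam"], "{run_name}/VC/{sample}.vcf"),
    # assumed to read the BAM file
    "CNV": (["{run_name}/mapped/{sample}.bam"], "{run_name}/CNV/{sample}.CNV.tsv"),
    # hypothetical report path, not given in the slides
    "results_report": (["{run_name}/VC/{sample}.vcf",
                        "{run_name}/CNV/{sample}.CNV.tsv"],
                       "{run_name}/report/{sample}.html"),
}

def to_regex(pattern):
    """Compile an output pattern into a regex with named wildcard groups."""
    parts = re.split(r"\{(\w+)\}", pattern)
    pieces = [re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+)"
              for i, p in enumerate(parts)]
    return re.compile("".join(pieces) + "$")

def plan_jobs(target, jobs=None):
    """Return (rule, file) pairs in an order that satisfies all dependencies."""
    jobs = [] if jobs is None else jobs
    for name, (inputs, output) in RULES.items():
        m = to_regex(output).match(target)
        if m is None:
            continue
        # First plan whatever this rule needs as input
        for pattern in inputs:
            plan_jobs(pattern.format(**m.groupdict()), jobs)
        # Schedule each job only once, even if two rules need its output
        if (name, target) not in jobs:
            jobs.append((name, target))
        return jobs
    raise ValueError(f"no rule produces {target}")

jobs = plan_jobs("run42/report/patientA.html")
for rule, path in jobs:
    print(f"{rule:16s} -> {path}")
```

Asking for the final report is enough: the resolver walks backwards through the inputs, and `alignment` is planned once even though both `VC` and `CNV` depend on it.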
Snakemake
► Simple shell script
shell: "mv {input} {output}"
► Combine languages (shell, R, python)
run:
    if params.cluster:
        R("cutree(hclust({input}), h = 7)")
    else:
        shell("mv {input} {output}")
► Wrap it in separate script
script: "my_script.py"
► Separation of logic and functionality
► Organization
► Re-usability
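As an illustration of the separate-script option, here is one possible shape for `my_script.py`. When a rule uses `script:`, Snakemake makes an object named `snakemake` available inside the script, with attributes such as `input`, `output` and `params`. The function `move_results` and what it does are assumptions chosen to mirror the `mv` example, not the content of the original script.

```python
"""my_script.py: hypothetical body for a rule that uses script: "my_script.py"."""
import shutil
from pathlib import Path

def move_results(src, dst):
    """Move a file or directory, creating the destination's parent folder first."""
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))

# Snakemake defines a global `snakemake` object for scripts run via `script:`.
# Looking it up through globals() lets the file also be imported and tested
# outside of a workflow, where that object does not exist.
smk = globals().get("snakemake")
if smk is not None:
    move_results(smk.input[0], smk.output[0])
```

Keeping the logic in a plain function is what gives the separation of logic and functionality: the Snakefile decides which files are involved, and the script can be tested without running the workflow.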
Conda / Anaconda / Bioconda
Bioconda is a distribution of bioinformatics software realized as a channel
for the versatile Conda package manager.
Conda
[Figure: Conda environment, Docker image and virtual machine compared by independence, computational overhead and ease of configuration]
Conda
• Easy installation and management
• Installation recipes:
• Isolated environments:
conda install vardict
conda update vardict
conda remove vardict
conda env create -f myenv.yaml -n myenv
channels:
- conda-forge
- defaults
dependencies:
- pandas ==0.20.3
- statsmodels ==0.8.0
- r-dplyr ==0.7.0
- r-base ==3.4.1
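The environment file above pins every package with `==`, which is what makes the environment reproducible. As a small, hedged sketch, the Python below checks a list of dependency specs and separates exact pins from looser ones. The regular expression is a simplification written for this example and does not cover the full conda spec syntax; the names `EXACT_PIN` and `split_pins` are hypothetical.

```python
import re

# Dependency lines copied from the environment file above
ENV_DEPENDENCIES = [
    "pandas ==0.20.3",
    "statsmodels ==0.8.0",
    "r-dplyr ==0.7.0",
    "r-base ==3.4.1",
]

# Simplified pattern: package name, '==', then a version without wildcards
EXACT_PIN = re.compile(
    r"^(?P<name>[A-Za-z0-9_.\-]+)\s*==\s*(?P<version>[0-9][^\s,|*]*)$"
)

def split_pins(dependencies):
    """Return ({name: version} for exact pins, [specs that are not exact pins])."""
    pinned, loose = {}, []
    for spec in dependencies:
        m = EXACT_PIN.match(spec.strip())
        if m:
            pinned[m["name"]] = m["version"]
        else:
            loose.append(spec)
    return pinned, loose

pinned, loose = split_pins(ENV_DEPENDENCIES)
print("exactly pinned:", pinned)
print("not pinned:", loose)
```

A spec such as `samtools` or `pandas >=1.0` would end up in the second list, which is a quick way to spot dependencies that may resolve to different versions on another machine.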
Conda
• Cheat sheet
‒ https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf
• Google it
‒ conda [bioinformatics tool name]
Computational resources and execution
• Snakemake is quite flexible in cluster execution
• https://snakemake.readthedocs.io/en/stable/executing/cloud.html
• ! Nothing works as advertised :-)