A front-end is the computer you use to log in to MetaCentrum and from which you submit jobs. You should never run computations on it unless you are sure they will finish within a few minutes. Each front-end has a default home directory, but of course you can simply `cd` to another home directory once you log in.
List of front-ends:
| Front-end | Home directory/storage |
|---|---|
| skirit.ics.muni.cz | /storage/brno2/home/ |
| alfrid.meta.zcu.cz | /storage/plzen1/home/ |
| tarkil.grid.cesnet.cz | /storage/praha1/home/ |
| nympha.zcu.cz | /storage/plzen1/home/ |
| minos.zcu.cz | /storage/plzen1/home/ |
| perian.ncbr.muni.cz | /storage/brno2/home/ |
| onyx.ncbr.muni.cz | /storage/brno2/home/ |
| zuphux.cerit-sc.cz | /storage/brno3-cerit/home/ |
Now let's log in to a front-end.
ssh username@zuphux.metacentrum.cz
where `username` is your MetaCentrum username.
What do we have here?
ls
We can also move to a different storage volume. There are more storage volumes than just the home directories, and you can access all of them. You should see your quota limits right after you log in to MetaCentrum.
cd /storage/brno9-ceitec/home/username
We will try to do an [example analysis](https://github.com/ekg/alignment-and-variant-calling-tutorial) at MetaCentrum. Now we would like to start working with our data. But first, we should create a directory to hold our project. It is necessary to keep the project well organized, otherwise you will soon get lost.
mkdir -p projects/dnaseq/data/raw
Thanks to the `-p` option, this creates all the directories in the path even if some of them (the parent directories of the `raw` directory) do not exist yet.
Now we can download the data. If we know the files are very small, we can download them directly from the front-end. Please open the ENA link first and check what the data actually are.
cd projects/dnaseq/data/raw
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR177/003/SRR1770413/SRR1770413_1.fastq.gz &
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR177/003/SRR1770413/SRR1770413_2.fastq.gz &
The `&` character sends a task to the background so we can do more things in one terminal at the same time. To see all the running jobs, type `jobs`. You can also "kill" a background job with `kill %1`, where `%1` is the number assigned to the task, as shown by the `jobs` command. But be careful: once you send a job to the background and close the terminal, the job will stop! To avoid this, run `disown %1`, which 'disconnects' the job from the current terminal so it keeps running in the background even after the terminal is closed.
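As a quick sketch of this background-job workflow (using `sleep` as a harmless stand-in for a real download):

```shell
# 'sleep 60' stands in for a long-running download such as wget
sleep 60 &            # '&' sends the command to the background
jobs                  # list the shell's background tasks
disown %1             # detach job 1 so it survives closing the terminal
kill $! 2>/dev/null   # clean up the demo task (normally you'd let it run)
```

Note that after `disown`, the job no longer appears in `jobs`, which is why the cleanup uses `$!` (the PID of the last background task) instead of `%1`.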
If the data are big, or you have more files than in this example, you should use regular job submission (see below) rather than the front-end.
We will also need a reference genome, because we will align the data to it. Let's make another directory to hold the reference.
Make a directory to store the reference sequences (genomes).
mkdir -p /storage/brno9-ceitec/home/username/projects/dnaseq/data/references
cd /storage/brno9-ceitec/home/username/projects/dnaseq/data/references
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
Also make a directory to store the scripts we will write once we know how to process the data.
mkdir /storage/brno9-ceitec/home/username/projects/dnaseq/src
Once we have the data and the reference, we can start processing. We will request an interactive job. This is suitable when we want to compute something quick or test our commands.
qsub -l walltime=1:0:0 -l select=1:ncpus=6:mem=4gb:scratch_local=10gb -I
We ask for a 1-hour job with 6 CPUs, 4 GB of RAM and 10 GB of scratch space. The `-I` flag tells the submission system that we want an interactive job. If the command above seems too complex to write by hand, try the PBSPro 'assembler' application, which generates the command for you.
Once the job starts (this can take some time), we can continue with the computation.
It is best to specify variables at the beginning. We will specify the input and output directories and the reference we will use for the analysis.
INPUT_DIR=/storage/brno9-ceitec/home/username/projects/dnaseq/data/raw
OUTPUT_DIR=/storage/brno9-ceitec/home/username/projects/dnaseq/results
REFERENCE=/storage/brno9-ceitec/home/username/projects/dnaseq/data/references/GCF_000005845.2_ASM584v2_genomic.fna.gz
Now we will copy all the files to the assigned fast storage (scratch). This lets the computation run at full speed. The path to this directory is automatically stored in the `$SCRATCH` variable.
cp $INPUT_DIR/*.gz $SCRATCH/ &
cp $REFERENCE $SCRATCH/
wait
The `wait` command blocks until everything running in the background has finished, and only then lets you move on. This is handy when you send many commands to the background and don't want to check each one individually to see whether it has finished.
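A minimal illustration of `wait`, with `sleep` standing in for the copy commands:

```shell
# Two 1-second 'copies' run in parallel; wait blocks until both finish.
START=$(date +%s)
sleep 1 &          # stands in for the first cp
sleep 1 &          # stands in for the second cp
wait               # returns only after both background tasks are done
END=$(date +%s)
echo "all background tasks finished after $((END - START)) s"
```

Because the two tasks run in parallel, the total elapsed time is roughly one second, not two.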
Now we can move to `$SCRATCH`.
cd $SCRATCH
The first step of an experiment should always be an initial quality check. FastQC is one of the nice tools for this purpose. To use it at MetaCentrum, we first need to add a module with the tool. The list of all tools installed at MetaCentrum is here.
To add the module, execute the following command:
module add fastQC-0.11.5
MetaCentrum often has several versions of a single tool, so be sure to always use the 'full' name including the version.
Once FastQC is added, we can analyze our fastq files. But first, we create an output directory for the results to keep `$SCRATCH` clean.
mkdir -p $SCRATCH/qc/fastqc/raw
fastqc -t $PBS_NUM_PPN -o $SCRATCH/qc/fastqc/raw *fastq.gz
This executes FastQC on all files matching `*fastq.gz`. The `*` character represents any string (consisting of both numbers and letters). `$PBS_NUM_PPN` is a special variable assigned by MetaCentrum to each job submission (similar to the `$SCRATCH` variable, which is also unique to each job); it holds the number of assigned CPUs (cores). This way we use all the cores that were assigned to the job. Using the full path to the output directory ensures you save the results exactly where you want them.
Once the analysis is finished, we can summarize the FastQC results (we have two) using the MultiQC tool. Again, look for MultiQC at MetaCentrum. MultiQC is a Python package and is therefore 'hidden' in a not-so-nice module name, `python36-modules-gcc`. We can add the whole module and launch MultiQC.
module add python36-modules-gcc
multiqc -o $SCRATCH/qc/fastqc/raw $SCRATCH/qc/fastqc/raw/
In case you get an error saying 'This system supports the C.UTF-8 locale which is recommended...', you have to change the locale settings:
export LC_ALL=C.UTF-8
export LANG=C.UTF-8
But how do we visualize the MultiQC results? It is an HTML report, but it is stored somewhere on the MetaCentrum storage. We can use the `scp` command to copy files from a remote server to our computer. To do this, we have to specify what we want to copy and where.
Run the following command in your computer:
scp -r username@zuphux.metacentrum.cz:/storage/brno9-ceitec/home/username/projects/dnaseq/results/qc/fastqc/raw .
This command copies the directory `/storage/brno9-ceitec/home/username/projects/dnaseq/results/qc/fastqc/raw` to the current directory on your computer (`.`).
Please note that the `-r` option is necessary when copying whole directories; otherwise `scp` would refuse to copy and warn you that you are trying to copy a directory (a safety precaution).
The next step of the analysis could be raw read preprocessing, but we will skip it for now and go straight to the alignment.
There are many tools for alignment. For a DNA experiment, `BWA MEM` is often used.
Again, add the module first.
module add bwa-0.7.15
`BWA MEM` (like most other aligners) needs a reference index first. Indexing 'prepares' the reference for easier access by the aligner. However, `BWA MEM` cannot work with `gz` files, so we have to uncompress the reference first.
If you remember, we stored the path of the reference in the `$REFERENCE` variable and copied the file to MetaCentrum. But can we work directly with this variable? Let's see where it points.
echo $REFERENCE
It points to the original location of the reference on the storage, which we don't want to use; we want the copy in `$SCRATCH`. We can use a simple trick to get just the file name of the reference and store it in the variable: the `basename` command.
REFERENCE=$(basename $REFERENCE)
echo $REFERENCE
This executes the command `basename $REFERENCE` (the `$(...)` brackets are command substitution) and stores the result back under the same variable name.
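To see how this works on a made-up path (the path below is only an example, not one of the real storage paths):

```shell
# Hypothetical path, for illustration only
REF=/storage/somewhere/home/username/references/genome.fna.gz
NAME=$(basename "$REF")   # command substitution: capture basename's output
echo "$NAME"              # prints: genome.fna.gz
```

`basename` strips everything up to and including the last `/`, leaving just the file name.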
But the reference is still in `gz` format. We have to uncompress it and 'adjust' the variable accordingly.
gunzip $SCRATCH/$REFERENCE
REFERENCE=${REFERENCE%.gz}
echo $REFERENCE
The expression with 'curly' brackets in this form removes the '.gz' suffix from the reference name stored in the `$REFERENCE` variable. The syntax is as follows:
${variable_name%what_I_want_to_remove_from_the_end_of_the_string}
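A few throwaway examples of this suffix removal (the file name is made up):

```shell
FILE=sample_1.fastq.gz                   # made-up file name
echo "${FILE%.gz}"                       # prints: sample_1.fastq
echo "${FILE%_1.fastq.gz}"               # prints: sample
echo "${FILE%_1.fastq.gz}_2.fastq.gz"    # prints: sample_2.fastq.gz
```

The last line is exactly the trick we will use below to derive the name of the second read file from the first.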
Now we can finally make the index.
bwa index $SCRATCH/$REFERENCE
But how do we align the files to the reference without typing the names by hand (so we can automate it next time)? We can use a `for` loop together with the curly-bracket replacement. A `for` loop goes through all the given files/values and stops only after it has processed them all.
mkdir $SCRATCH/alignment
for READ1 in *_1.fastq.gz
do
READ2=${READ1%_1.fastq.gz}_2.fastq.gz
echo $READ1
echo $READ2
bwa mem -t $PBS_NUM_PPN $SCRATCH/$REFERENCE $SCRATCH/$READ1 $SCRATCH/$READ2 > $SCRATCH/alignment/${READ1%_1.fastq.gz}.sam
done
It is good practice to indent the commands inside a loop with a tab; it is much easier to read. However, this notebook style of tutorials has trouble with tabs, so we were not able to use them here.
So what we did: we scanned the current directory for all files ending with `_1.fastq.gz` and stored each in the `$READ1` variable. Then we removed the `_1.fastq.gz` suffix from the name and appended `_2.fastq.gz`, storing the result as `$READ2`. Finally, we aligned `$READ1` and `$READ2` to the reference `$REFERENCE` and stored the output in a `sam` file.
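You can dry-run the pairing logic without running bwa at all, for example with made-up empty files in a temporary directory:

```shell
# Dry run of the read-pairing logic with made-up empty files
cd "$(mktemp -d)"                              # throwaway directory
touch sampleA_1.fastq.gz sampleA_2.fastq.gz    # fake paired-end files
for READ1 in *_1.fastq.gz
do
    READ2=${READ1%_1.fastq.gz}_2.fastq.gz
    echo "$READ1 pairs with $READ2"
done
```

This prints `sampleA_1.fastq.gz pairs with sampleA_2.fastq.gz`, confirming the loop derives the second file name correctly before you spend CPU time on the real alignment.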
The `sam` format is a text file which can be further compressed into a `bam` (binary) file. This can be done using Samtools. As usual, find Samtools at MetaCentrum and add the module.
module add samtools-1.6
Now we can do the conversion, and we can also look at some basic alignment statistics.
mkdir $SCRATCH/qc/flagstat
cd $SCRATCH/alignment/
for file in *.sam
do
samtools view -@ $PBS_NUM_PPN -b $file > ${file%.sam}.bam
samtools flagstat ${file%.sam}.bam > $SCRATCH/qc/flagstat/${file%.sam}.flagstat.txt
rm $file # We don't need the sam file anymore
done
As you can see, Samtools does not use `-t` or `--threads` to specify the number of threads for the calculation, but `-@`. Always check the tool's manual for the correct name of each parameter!
We also often need to sort the alignment before further processing and index the sorted bam file.
for file in *.bam
do
samtools sort -@ $PBS_NUM_PPN -o ${file%.bam}.sorted.bam $file
samtools index ${file%.bam}.sorted.bam
rm $file # We don't need the unsorted bam file anymore
done
We should not forget that we are doing the analysis in `$SCRATCH`. We have to copy the results back to our storage, otherwise they will be deleted within a few hours/days.
Copy the results back to the storage
mkdir -p $OUTPUT_DIR
cp -r $SCRATCH/qc $OUTPUT_DIR/ &
cp -r $SCRATCH/alignment $OUTPUT_DIR/ &
wait
It is also good to clean up after ourselves once the analysis is done, and then exit the job.
rm -r $SCRATCH/*
exit
Hurray, you have just aligned your first experiment!
We can also visualize the alignment using IGV. This tool is better used directly on your computer (it is available in the VirtualBox image).
Download the aligned and sorted `bam` file and the reference genome to your computer. Then open a new terminal and launch the visualization. Do this on your computer.
scp username@zuphux.metacentrum.cz:/storage/brno9-ceitec/home/username/projects/dnaseq/data/references/*.gz .
scp -r username@zuphux.metacentrum.cz:/storage/brno9-ceitec/home/username/projects/dnaseq/results .
We can logout from MetaCentrum or we can just open a new terminal.
Before we continue with the visualization, we have to decompress (`gunzip`) the reference and index it, this time with `samtools`.
On your computer, decompress and index the reference.
First, go to the directory and decompress:
gunzip GCF_000005845.2_ASM584v2_genomic.fna.gz
And index with samtools
samtools faidx GCF_000005845.2_ASM584v2_genomic.fna
Open IGV by opening your terminal and executing the command:
igv
For the visualization itself, please see the IGV manual. You will need the reference genome in `fasta` format and the aligned `bam` file together with its `bai` index file (in the same directory as the `bam` file).
Be careful: IGV is quite "sensitive" to available RAM. Loading a lot of `BAM` files will jam your computer.
We should always check the quality of our alignment before any further processing. One of the tools to get summarized statistics is Qualimap. If you are using the virtualbox image, the tool is already included and stored in ~/Tools/qualimap/qualimap. To run it, simply execute this path as a command in your terminal:
~/Tools/qualimap/qualimap
If you do not have Qualimap installed, you can download it here (it's a Java tool that needs no installation) and run the QC.
wget https://bitbucket.org/kokonech/qualimap/downloads/qualimap_v2.2.1.zip
unzip qualimap_v2.2.1.zip
Then go to the newly created folder (you can also move it to a directory where you keep your 'installed' tools) and run the analysis:
cd qualimap_v2.2.1/
./qualimap -h
We can see that Qualimap has quite a lot of different modules for QC. We will use `bamqc`.
~/Tools/qualimap/qualimap bamqc -bam SRR1770413.sorted.bam -nt 2
Note: you could run this in a loop to get all the results at once. The results are stored in the same directory as the input `BAM` file; they can also be put in a separate directory using Qualimap's `-outdir` flag.
Once you have tested all the commands, you can put them in a script and submit the whole workflow as a 'regular' job. You can either create the script file on your computer and then upload it to MetaCentrum:
scp dnaseq.sh username@zuphux.metacentrum.cz:/storage/brno9-ceitec/home/username/projects/dnaseq/src/
You have to do this on your computer, not at the MetaCentrum.
Or you can create it right away at the MetaCentrum.
cd /storage/brno9-ceitec/home/username/projects/dnaseq/src
nano dnaseq.sh
and insert the commands you want the job to perform (all the way from specifying the variables to copying the results back and cleaning `$SCRATCH`). Then press `Ctrl+O` to save the file under the name you gave to the `nano` command, or change the output file name.
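A minimal skeleton of such a script might look like the sketch below. It simply strings together the steps from this tutorial (same paths, same module versions); it is an untested outline, so adjust `username`, paths and modules to your own setup before submitting.

```shell
#!/bin/bash
# dnaseq.sh -- sketch of the whole workflow as a batch job (adjust paths!)

INPUT_DIR=/storage/brno9-ceitec/home/username/projects/dnaseq/data/raw
OUTPUT_DIR=/storage/brno9-ceitec/home/username/projects/dnaseq/results
REFERENCE=/storage/brno9-ceitec/home/username/projects/dnaseq/data/references/GCF_000005845.2_ASM584v2_genomic.fna.gz

module add bwa-0.7.15
module add samtools-1.6

# copy inputs to the fast scratch storage
cp $INPUT_DIR/*.gz $SCRATCH/ &
cp $REFERENCE $SCRATCH/
wait
cd $SCRATCH

# prepare the reference
REFERENCE=$(basename $REFERENCE)
gunzip $REFERENCE
REFERENCE=${REFERENCE%.gz}
bwa index $REFERENCE

# align all read pairs
mkdir -p alignment
for READ1 in *_1.fastq.gz
do
    bwa mem -t $PBS_NUM_PPN $REFERENCE $READ1 ${READ1%_1.fastq.gz}_2.fastq.gz > alignment/${READ1%_1.fastq.gz}.sam
done

# copy results back and clean up
mkdir -p $OUTPUT_DIR
cp -r $SCRATCH/alignment $OUTPUT_DIR/
rm -r $SCRATCH/*
```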
Once you have all the commands you can submit the job by 'regular' job submission.
cd /storage/brno9-ceitec/home/username/projects/dnaseq/src
qsub -l walltime=1:0:0 -l select=1:ncpus=6:mem=4gb:scratch_local=10gb dnaseq.sh
which submits the job. You can close the terminal, come back within an hour, and hope your results have been copied to the output directory you specified in the script.
To see queued, running, finished or failed jobs, you can visit your MetaCentrum account and open the list of jobs. You should see all jobs that finished within the past 24/48 hours.
On Windows, you will need PuTTY to work in a terminal and WinSCP to download the data. Both are quite simple to use, so don't worry.