C2115 Practical introduction to supercomputing
Lesson 9

Petr Kulhánek
kulhanek@chemi.muni.cz
National Centre for Biomolecular Research, Faculty of Science
Masaryk University, Kamenice 5, CZ-62500 Brno

Version 2


Batch systems (getting started)


Batch processing

Batch processing is the execution of a series of programs (so-called batches) on a computer without the participation of the user. Batches are prepared in advance: all input data is stored in files (scripts) or passed as parameters on the command line. Batch processing is the opposite of interactive processing, where the user provides the required inputs while the program is running.

Advantages of batch processing
▪ sharing computer resources between many users and programs
▪ postponing batch processing until the computer is less busy
▪ eliminating delays caused by waiting for user input
▪ maximizing computer utilization, which improves the return on investment (especially for more expensive computers)

source: www.wikipedia.cz, modified


Tools for batch processing

➢ OpenPBS http://www.mcs.anl.gov/research/projects/openpbs/
➢ Oracle Grid Engine https://en.wikipedia.org/wiki/Oracle_Grid_Engine
➢ Open Grid Scheduler http://gridscheduler.sourceforge.net/
➢ Torque https://en.wikipedia.org/wiki/TORQUE
➢ PBSPro https://www.openpbs.org/ (open source), https://www.altair.com/

PBSPro is used as the batch system on our local clusters (WOLF, …), in MetaCentrum VO, and at IT4I.


PBSPro

Documentation: https://www.altair.com/pbs-works-documentation/


Necessary condition

Second authentication without password

It is necessary to set up login between the compute nodes and the server (and vice versa) using ssh without explicit entry of a password.
• our local clusters (WOLF, ...), MetaCentrum: you must have a valid Kerberos ticket when submitting a job into the batch system with the qsub command
• IT4I: using ssh authorized keys


Architecture

[Figure: batch system architecture. Users submit jobs with qsub to the server (pbs_server), where they wait in queues (short, normal, long). The scheduler (pbs_sched) assigns jobs to computing nodes according to the required resources. Each computing node runs pbs_mom (node#1 np=2, node#2 np=1). The state of the nodes is shown by pbsnodes -a; queues and jobs are shown by qstat -q, qstat, and qstat -u. Legend: ready job, pending job, running job.]


Torque - commands, job states

qsub        submits a job to the batch system
qstat       prints information about the batch system (job list, queue list)
pbsnodes    prints information about computing nodes
qrls        releases a job from the held state (if circumstances allow)

Job states:
Q (queued)     job is waiting in a queue to run on a computing node
R (running)    job is running on computing nodes
C (completed)  job has been completed (information about completed jobs is displayed only for a limited time, most often 24 hours)
H (held)       job has been put on hold; it can be released with the qrls command
E (exiting)    job is ending
F (finished)   job is completed, either successfully or unsuccessfully


Submitting jobs

The qsub command is used to submit jobs into the batch system.

    $ qsub -q default job.sh
    1.ubuntu
    $ ls
    job.sh job.sh.o1 job.sh.e1

default      name of the queue to which we want to submit the job
job.sh       job script, e.g.
                 #!/bin/bash
                 echo "Hello world from `hostname`!"
1.ubuntu     job ID, printed by qsub if the submission is successful
job.sh.o1    standard output of the job
job.sh.e1    standard error output of the job

The output files are not available until the job is completed.


Exercise 1

1. What queues of the batch system are available on the WOLF cluster? Use the qstat command with a suitable option chosen from the documentation.
2. What is the difference between the -Q and -q options of the qstat command?
3. What jobs are already submitted to the batch system of the WOLF cluster?
4. Place the job script from the previous example in a separate directory and submit it into the batch system. Use the queue default.
5. On which computing node did the job run?
6. Create a new job script and place it in a different directory. The script prints the name of the computing node and then pauses for 10 minutes. Submit the job into the queue default.
7. Monitor the state of the batch system with the commands qstat and pbsnodes.
8. On which computing node did the job run this time?


Exercise 2

1. Log in to the MetaCentrum front node perian.ncbr.muni.cz.
2. What batch system queues are available? Use the qstat command.
3. How many jobs are currently in the batch system?
4. Place the job script from the previous example into a separate directory and submit it to the batch system. Use the queue default.
5. On which computing node did the job run? How long did it take for the job to start?


Exercise 3

1. Log in to the MetaCentrum front node zuphux.cerit-sc.cz.
2. What batch system queues are available? Use the qstat command. Why do they differ from the queues you saw on perian.ncbr.muni.cz?
3. How many jobs are currently in the batch system?
4. Place the job script from the previous example into a separate directory and submit it to the batch system. Use the queue default. How does the job identifier differ from the identifier of the job submitted from the front node onyx.ncbr.muni.cz?
5. On which computing node did the job run? How long did it take for the job to start?
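
Exercise 3 asks how job identifiers differ between front nodes. In the qsub example earlier in this lesson the identifier (1.ubuntu) has the form number.server, so the two parts can be separated with plain bash parameter expansion. The following is a minimal sketch under that assumption; job_seq and job_server are hypothetical helper names, not part of the batch system.

```shell
#!/bin/bash
# Split a job ID of the assumed form <number>.<server> (e.g. 1.ubuntu).
# job_seq and job_server are hypothetical helper names.

job_seq() {
    # remove the longest suffix that starts at the first dot
    printf '%s\n' "${1%%.*}"
}

job_server() {
    # remove the shortest prefix that ends at the first dot
    printf '%s\n' "${1#*.}"
}

job_seq 1.ubuntu       # prints 1
job_server 1.ubuntu    # prints ubuntu
```

Inside a running job the identifier is available in the PBS_JOBID environment variable, so job_server "$PBS_JOBID" would show which batch server accepted the job.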


Resource allocation

Resources are specified using the -l option of the qsub command. Multiple specifications can be given at the same time, e.g.:

    $ qsub -l select=1:ncpus=1:mem=400mb:scratch_local=10gb \
           script.sh

or

    $ qsub -l select=1:ncpus=1:mem=400mb:scratch_local=10gb \
           -l walltime=10:00 script.sh

https://wiki.metacentrum.cz/wiki/PBS_Professional


Number and type of nodes and CPUs

    select=[N1:]chunk_specification1[+[N2:]chunk_specification2]

N1, N2                  number of blocks (chunks)
chunk_specification     block specification

The specification only reserves computing resources. It does not mean that the job will automatically run on all allocated resources; this must be ensured by the job script.

Example:
    select=1:ncpus=1:mem=400mb:scratch_local=10gb


Number and type of nodes and CPUs, II

The list of allocated CPUs is available as a list of compute nodes in the file whose name is stored in the environment variable PBS_NODEFILE. This variable is available in the running job:

    #!/bin/bash
    echo $PBS_NODEFILE
    cat $PBS_NODEFILE

Example:
    $ qsub -l select=1:ncpus=2+1:ncpus=1 script.sh

Result:
    /var/spool/torque/aux//10312644.arien-pro.ics.muni.cz
    zubat2.ncbr.muni.cz
    zubat2.ncbr.muni.cz
    mandos2.ics.muni.cz

The list of CPU slots is also available in the full job description, item exec_host:
    $ qstat -f


Number and type of nodes and CPUs, III

Properties: Computing nodes can have specified properties. These are short strings whose meaning is defined by the system administrators. The properties of nodes are listed by the pbsnodes command under the resources_available items. In the specification of computing resources, the user can request only those computing nodes that have the specified properties.

Examples:
    select=1:ncpus=1:brno=True
    select=1:ncpus=1:os=debian80
    select=1:ncpus=1:cl_tarkil=True
    select=1:ncpus=1:cluster=tarkil
    select=1:ncpus=1:vnode=zubat1
    select=1:ncpus=1:vnode=^zubat1     (exclusion)


Additional resource specification

Resource         Description
mem              memory size, units mb, gb
scratch_local    size of local data storage, units mb, gb
scratch_ssd      size of local data storage on SSD, units mb, gb
walltime         maximum job run time (in conjunction with the queue default)

A job with insufficiently specified resource requirements may be terminated prematurely in MetaCentrum.


Copying files

Torque/PBSPro has internal support for copying files using the stagein and stageout directives. However, this method is practically unusable, and the user should handle all operations related to copying data to the local data storage within the job itself (commands cp, scp, rsync). This approach is described in the documentation of MetaCentrum VO.

[Figure: files are transferred with scp, cp, or rsync between the job input directory on the frontend (User Interface, UI; /job/input/dir) and the scratch directory on the computing node (Worker Node, WN; /scratch/job_id/).]

On MetaCentrum VO and NCBR clusters, the local working directory is set up by the batch system. The path to the local storage is available in the SCRATCHDIR environment variable.
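
The copy-in, compute, copy-out pattern described above can be sketched as a small job script. This is only an illustration under stated assumptions: SCRATCHDIR is provided by the batch system, the file names input.dat and output.dat and the function name run_in_scratch are hypothetical, and tr stands in for the real program. PBS_JOBID and PBS_O_WORKDIR are environment variables that PBS sets inside a job.

```shell
#!/bin/bash
# Sketch: stage data through the local scratch directory inside a job.
# Assumptions: SCRATCHDIR is set by the batch system; the file names
# input.dat/output.dat and the function name are hypothetical.

run_in_scratch() {
    local data_dir="$1"    # directory with the input data (shared storage)

    # refuse to continue if the batch system did not provide scratch
    if [ -z "${SCRATCHDIR:-}" ]; then
        echo "SCRATCHDIR is not set" >&2
        return 1
    fi

    # 1) copy the input to the local storage of the computing node
    cp "$data_dir/input.dat" "$SCRATCHDIR/" || return 1

    # 2) compute in scratch; tr is only a placeholder for the real program
    ( cd "$SCRATCHDIR" && tr 'a-z' 'A-Z' < input.dat > output.dat ) || return 1

    # 3) copy the result back and remove the staged files from scratch
    cp "$SCRATCHDIR/output.dat" "$data_dir/" || return 1
    rm -f "$SCRATCHDIR/input.dat" "$SCRATCHDIR/output.dat"
}

# Inside a PBS job, stage from the directory where qsub was called.
if [ -n "${PBS_JOBID:-}" ]; then
    run_in_scratch "$PBS_O_WORKDIR"
fi
```

Such a script would be submitted with a scratch request, for example qsub -l select=1:ncpus=1:mem=400mb:scratch_local=10gb script.sh as in the resource allocation examples. If the data directory is not mounted on the computing node, cp would have to be replaced by scp or rsync.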


MetaCentrum


Batch systems

MetaCentrum VO consists of three separate batch systems:
➢ meta-pbs.metacentrum.cz handles computing nodes from MetaCentrum; default on all front nodes except zuphux.cerit-sc.cz
➢ cerit-pbs.cerit-sc.cz handles computing nodes from CERIT-SC; default on the front node zuphux.cerit-sc.cz
➢ elixir-pbs.elixir-czech.cz handles the computing nodes of the ELIXIR project, to which jobs can be moved from arien or wagap if the nodes are not busy

The systems are compatible from the user's point of view (same options); differences are described in the documentation of MetaCentrum VO.

The default server can be changed by setting the PBS_SERVER variable, e.g.

    [kulhanek@zuphux ~]$ qstat                                          # lists jobs from CERIT-SC
    [kulhanek@zuphux ~]$ export PBS_SERVER=meta-pbs.metacentrum.cz
    [kulhanek@zuphux ~]$ qstat                                          # lists jobs from MetaCentrum


Running the gaussian program in MetaCentrum

https://wiki.metacentrum.cz/wiki/Gaussian-GaussView
http://gaussian.com/


Exercise 4

The goal of this exercise is to create a model of the C60 molecule and calculate its molecular vibrations with the semiempirical quantum-chemical method PM6 in the gaussian program, version 16. Report the results of this exercise in the protocol in summary form; provide only the important information.

1. Load the structure of the C60 molecule into Nemesis (File → Import structure from → XYZ).
2. Create an input file for the gaussian program (File → Export Structure as ... → Gaussian Input). Choose the PM6 method and geometry optimization. Then add the keyword FREQ (after the keyword Opt) and save the file with the .com extension.
3. Move the created input file to a MetaCentrum front node, prepare the job script, and submit the job into the batch system. Follow the documentation of MetaCentrum; the job must use local data storage on the computing node.
4. Transfer the result of the job (file with the .log extension) to your workstation and display the calculated molecular vibrations in the Nemesis program, following the instructions below.

https://wiki.metacentrum.cz/wiki/Gaussian-GaussView


Nemesis

Starting the program:
    $ module add nemesis
    $ nemesis

Mouse:
    Left button      selection
    Middle button    rotation
    Right button     translation
    Wheel            zoom

Modifiers:
    Shift    XZ -> Y moves
    Ctrl     toggles between secondary and primary manipulator


Build Project

[Screenshot labels: layers; graphic models; molecule building/editing; geometry measurement; geometry optimization using a force field]

Force field settings for optimization: menu Geometry -> Optimizer Setup


Trajectory: Vibration visualization

1) Project: Trajectory
2) File -> Import Trajectory as -> Gaussian Vibrations

[Screenshot labels: double click; double click; select vibration; start animation]
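
Before a .log file is imported into Nemesis, it can be worth checking that the calculation actually finished. The sketch below assumes that a successfully completed Gaussian step writes a line containing "Normal termination of Gaussian" to the log; gaussian_ok is a hypothetical helper name, and the check says nothing about the quality of the result.

```shell
#!/bin/bash
# Sketch: check a Gaussian .log file before importing it into Nemesis.
# Assumption: a successful Gaussian step writes a line containing
# "Normal termination of Gaussian". gaussian_ok is a hypothetical name.

gaussian_ok() {
    local log="$1"
    # -q: print nothing, only set the exit status (0 if the line is found)
    grep -q "Normal termination of Gaussian" "$log"
}
```

A possible use, with c60.log as a hypothetical file name: gaussian_ok c60.log && echo "c60.log looks complete".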