How to handle MetaCentrum in a few steps
Jiří Vorel, Cesnet, MetaCentrum User Support, 15. 9. 2021

▪ MetaCentrum is…
▪ … a part of the National Grid Infrastructure (NGI),
▪ … a provider of computational resources, application tools (commercial and free/open source) and data storage,
▪ … completely free of charge.
▪ MetaCentrum is available for…
▪ … employees and students of Czech universities, the Czech Academy of Sciences, non-commercial research facilities, etc.,
▪ … industry users (only for academic, non-profit and public research).

https://metavo.metacentrum.cz
https://wiki.metacentrum.cz
https://metacentrum.cz
https://wiki.metacentrum.cz/wiki/Pruvodce_pro_zacatecniky
https://wiki.metacentrum.cz/wiki/FAQ/Grid_computing
https://wiki.metacentrum.cz/wiki/Reseni_problemu

Frontend servers and PBS
▪ 11 frontends, three PBS servers
▪ All users' home directories are available from all frontends
▪ Frontends: skirit.ics.muni.cz, alfrid.meta.zcu.cz, zuphux.cerit-sc.cz, elmo.elixir-czech.cz, nympha.zcu.cz, charon.nti.tul.cz, …
▪ PBS servers: meta-pbs.metacentrum.cz, cerit-pbs.cerit-sc.cz, elixir-pbs.elixir-czech.cz
https://wiki.metacentrum.cz/wiki/Frontend

my_local_pc:~$ ssh vorel@elmo.metacentrum.cz
vorel@elmo.metacentrum.cz's password:
vorel@elmo:~$ pwd
/storage/praha5-elixir/home/vorel

my_local_pc:~$ ssh vorel@skirit.metacentrum.cz
vorel@skirit.metacentrum.cz's password:
vorel@skirit:~$ pwd
/storage/brno2/home/vorel
vorel@skirit:~$ cd /storage/praha5-elixir/home/vorel
vorel@skirit:~$ pwd
/storage/praha5-elixir/home/vorel

▪ SSH keys are not fully supported!
https://wiki.metacentrum.cz/wiki/Kerberos_authentication_system
https://wiki.metacentrum.cz/wiki/Beginners_guide#Log_on_a_frontend_machine
https://wiki.metacentrum.cz/wiki/Frontend#Login_notes
https://wiki.metacentrum.cz/wiki/NFS4_Servery

Queues
▪ Only a limited number of visible queues is suitable for direct use
▪ Which queues are most relevant for me?
▪ Go to metavo.metacentrum.cz → Current state → Personal view → Qsub assembler for PBSPro (Stav zdrojů → Osobní pohled → Sestavovač qsub pro PBSPro) and click on it
▪ You will be able to assemble the qsub command and check whether the requested resources are available
https://metavo.metacentrum.cz/pbsmon2/person

1. Transfer large amounts of data
▪ Do not use frontend servers; copy data directly to the storage and work with compressed files (.tar, .zip, .gz, etc.)
https://wiki.metacentrum.cz/wiki/Pruvodce_pro_zacatecniky
https://wiki.metacentrum.cz/wiki/Prace_s_daty
https://wiki.metacentrum.cz/wiki/NFS4_Servers

local PC → frontend → storage:
scp my_data.gz vorel@skirit.metacentrum.cz:/storage/praha5-elixir/home/vorel

local PC → storage directly:
scp my_data.gz vorel@storage-praha5-elixir.metacentrum.cz:~

2. Do not run long calculations on frontends
▪ It is not appropriate to run long and demanding calculations directly on frontends and/or on clusters outside of PBS
▪ Ask for an interactive job instead
 

▪ You can minimise the time lags in interactive jobs and get email alerts (the -m flag)
https://wiki.metacentrum.cz/wiki/Pruvodce_pro_zacatecniky

qsub -I -l select=1:ncpus=2:mem=4gb:scratch_local=10gb \
     -l walltime=1:00:00 -m abe

▪ In batch jobs, define resources, set the job name and email alerts; you can define as many variables as you want; the scratch directory will be cleaned (more information below)
https://wiki.metacentrum.cz/wiki/Pruvodce_pro_zacatecniky#D.C3.A1vkov.C3.A9_.C3.BAlohy

3. Use the scratch directory
▪ Very intensive I/O operations can cause network overload and a slowdown of the central storage (/storage/city/…)
▪ Copy the input data into the scratch directory on a dedicated machine; it is faster and more stable
▪ $SCRATCHDIR will be set automatically
▪ Scratch types: scratch_local, scratch_shared (on the cluster, slower), scratch_ssd (faster, not available everywhere)
https://wiki.metacentrum.cz/wiki/Pruvodce_pro_zacatecniky#Typy_scratch_adres.C3.A1.C5.99.C5.AF

qsub -I -l select=1:ncpus=1:mem=4gb:scratch_local=10gb -l walltime=1:00:00
cp my_input_data.txt $SCRATCHDIR
…
cp $SCRATCHDIR/my_results.txt /storage/city/home/user_name/

4. Clean the scratch directory
▪ Do not forget to clean the scratch directory when your calculation has finished or has been killed by PBS
▪ You can do it manually after each finished job or activate the clean_scratch utility

trap 'clean_scratch' TERM EXIT
cp my_input_data.txt $SCRATCHDIR
…
cp my_results.txt /storage/city/home/… || export CLEAN_SCRATCH=false

5. A high number of very short jobs
▪ From the point of view of performance (the PBS overhead needed to run every single job), an ideal job runs for at least 30 minutes (#PBS -l walltime=00:30:00 or more)
▪ Startup overhead may be a significant part of the whole processing time
▪ Try to imagine what happens when you submit 10k individual jobs at once, each with a real calculation time of two minutes
▪ Aggregate short jobs into bigger groups with longer walltime
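The scratch workflow and the aggregation advice above can be sketched as one batch-style script. On a real compute node, PBS sets $SCRATCHDIR and provides the clean_scratch utility; in this local sketch both are stubbed (a temporary directory and a shell function), and the "central storage" directory, file names and the toy calculation are illustrative only, so the control flow can be followed and run anywhere.

```shell
#!/bin/bash
# --- local stubs: on a MetaCentrum node PBS provides SCRATCHDIR and clean_scratch ---
SCRATCHDIR=$(mktemp -d)
clean_scratch() { rm -rf "$SCRATCHDIR"; }

# clean the scratch directory even when the job ends or is killed by PBS (TERM)
trap 'clean_scratch' TERM EXIT

# illustrative "central storage" with several small inputs (stands in for /storage/city/...)
STORAGE=$(mktemp -d)
for i in 1 2 3; do echo "input $i" > "$STORAGE/task_$i.txt"; done

# copy inputs to scratch, then aggregate many short tasks into one longer job
cp "$STORAGE"/task_*.txt "$SCRATCHDIR"
for f in "$SCRATCHDIR"/task_*.txt; do
    # a real calculation would run here; we just transform each input
    tr 'a-z' 'A-Z' < "$f" > "${f%.txt}.out"
done

# copy the results back to storage; on the cluster, CLEAN_SCRATCH=false would
# tell clean_scratch to keep the data if the copy back failed
cp "$SCRATCHDIR"/*.out "$STORAGE"/ || export CLEAN_SCRATCH=false

ls "$STORAGE"
```

Processing all three inputs inside one script is exactly the aggregation idea: one job with a longer walltime instead of three tiny jobs, each paying its own startup overhead.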
6. Writing outside of the scratch directory
▪ Computing nodes have very limited quotas (only 1 GB) outside of the scratch directory
▪ The most common problems are caused by:
▪ writing to /tmp/,
▪ very large stdout and stderr streams.
▪ The check-local-quota utility can be executed on each node; email notifications are sent as well

export TMPDIR=$SCRATCHDIR
my_app < input … 1>$SCRATCHDIR/stdout 2>$SCRATCHDIR/stderr

7. Avoid non-effective calculations
▪ Optimise your calculations (hardware usage)
▪ Reserving too many resources decreases your fairshare score and reduces the priority of your future calculations
▪ You can increase your fairshare score by acknowledging MetaCentrum in your publications
▪ Effectivity can be checked on the computation node with standard Linux tools (top, htop) or on the metavo.metacentrum.cz web portal
https://wiki.metacentrum.cz/wiki/Fairshare
https://wiki.metacentrum.cz/wiki/Usage_rules/Acknowledgement

8. Backup and archiving
▪ MetaCentrum storage capacities are dedicated mainly to data in active use
▪ Unnecessary data should be removed or moved to the Cesnet Storage Department for long-term archiving
▪ MetaCentrum users can use the following path for archiving:
/storage/du-cesnet/home/user_name/VO_metacentrum-tape_tape-archive/
▪ And for backup:
/storage/du-cesnet/home/user_name/VO_metacentrum-tape_tape/
https://du.cesnet.cz/en/start
https://wiki.metacentrum.cz/wiki/Working_with_data#Data_archiving_and_backup

9. Parallel computing and IB acceleration
▪ Parallel computing can significantly shorten the time of your job
▪ OpenMP (multiple threads) and MPI (a set of nodes)
▪ Remember that not all applications can utilise parallel computing; a typical MetaCentrum machine has 32 or more CPUs
▪ You can request special nodes interconnected by a low-latency InfiniBand (IB) network to accelerate your job
https://wiki.metacentrum.cz/wiki/Parallelization

qsub -l select=8:ncpus=10:mpiprocs=10:ompthreads=1:mem=100gb:scratch_local=10gb \
     -l walltime=24:00:00 -l place=group=infiniband

Final notes
▪ All users can install software on their own
▪ Python, Perl and R libraries, the Conda package manager, pre-compiled binaries, new compilations (gcc, intel, aocc), etc.
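As a minimal sketch of such a user-level installation (no root rights needed), Python packages can go either into the per-user site directory or into a virtual environment in your home directory; the environment name and example package below are illustrative, not prescribed by MetaCentrum.

```shell
# Option 1: the per-user prefix used by 'pip install --user'
python3 -m site --user-base                 # prints a path inside your home

# Option 2: an isolated virtual environment ("my_tools" is an illustrative name)
python3 -m venv "$HOME/my_tools"
source "$HOME/my_tools/bin/activate"
python -c "import sys; print(sys.prefix)"   # now points inside $HOME/my_tools
# pip install biopython                     # example package; needs network access
deactivate
```

Because the environment lives in your home directory, it is visible from every frontend and compute node that mounts your home.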
▪ If for some reason the grid infrastructure does not fulfil your expectations, maybe the MetaCentrum Cloud service would be a better choice
https://cloud.metacentrum.cz/
cloud@metacentrum.cz
https://wiki.metacentrum.cz/wiki/How_to_install_an_application

Take-home message
▪ There is no reason to be afraid to use MetaCentrum
▪ By your activity, you are not able to "destroy" anything
▪ You can find plenty of information and instructions on our wiki
▪ Are you really lost? Send an email!
▪ Registration form for new users:
https://wiki.metacentrum.cz, https://wiki.metacentrum.cz/wiki/FAQ
meta@cesnet.cz
https://metavo.metacentrum.cz/en/application/index.html

Thank you for your attention
Jiří Vorel, vorel@cesnet.cz, meta@cesnet.cz