KUBERNETES CONTAINER ORCHESTRATOR: SCHEDULING PROBLEMS AND CHALLENGES
CESNET MetaCentrum

Past and Future Kubernetes Talks
■ RNDr. Lukáš Hejtmánek, Ph.D.
■ 4. 5. + 11. 5. 2022 - Kubernetes Tutorial (1 & 2), here at Sitola
■ Online webinar (past)
  ■ https://metavo.metacentrum.cz/cs/seminars/Webinar_2022/kubernetes2022.ht
  ■ "Platforma Kubernetes - Úvodní seznámení" (Kubernetes Platform - An Introduction), Lukáš Hejtmánek, CERIT-SC, 06.04.2022

What is the Big Deal with K8s?
■ Containerized applications are popular
■ Containers hide the complexities of modern SW
■ K8s is a "container orchestrator"
  ■ Deploys containers (in so-called "Pods")
  ■ Handles network and storage access
  ■ Checks their status (availability and scalability)
  ■ Organizes them w.r.t. given rules (Pod-to-node mapping)
  ■ Kills/restarts Pods when needed
■ CERIT-SC K8s installation
  ■ 2,560 CPUs in 20 nodes (128 cores, 512 GB RAM, 1 GPU, 7 TB local SSD each)
  ■ Web and interactive applications
  ■ JupyterHub, BinderHub, Ansys, MATLAB, RStudio, WordPress...

Scheduling Challenges in K8s
■ We know standard batch-oriented HPC scheduling
■ We cannot easily reuse the same techniques in K8s
■ Examples, comparisons & discussion

Batch vs K8s Workloads
■ Batch workloads
  ■ Scripted executables
  ■ Non-interactive (mostly)
  ■ Waiting in a queue is OK
  ■ Resource intensive
  ■ Rather accurate resource requirements
  ■ Strict maximum runtime limit
■ K8s workloads
  ■ Interactive usage is common
  ■ GUI-based work
  ■ Long-running services
  ■ Waiting is not OK
  ■ Overestimated resource requirements
  ■ Usually no runtime limit
[Figure: resource usage over time for a typical batch workload vs. a typical K8s workload]

Batch vs K8s Scheduling Concepts
■ Batch scheduling basics
  ■ The organization owns the resources
  ■ Resources are provided for free
  ■ So fairness is important
■ How does the scheduler work?
  ■ Jobs wait in queue(s)
  ■ The queue is ordered by priority
  ■ User priority is dynamic (fairness)
    ■ User waiting = priority goes up
    ■ User computing = priority goes down
  ■ Over a long time period, a user's "share" is balanced against other active users
■ The scheduler decides who gets the resources and when
[Figure: per-user resource usage over time under fairshare scheduling]

Batch vs K8s Scheduling Concepts
■ K8s workloads
  ■ Interactive, no waiting, no maximum runtime...
■ Scheduling basics (cloud, K8s) in the commercial world
  ■ Users "own" the infrastructure
  ■ Pay-per-use model
    ■ Perfect motivation to release resources
  ■ Unused allocations? Overcommitted
    ■ Used by low-QoS workloads
    ■ Can be terminated if needed
  ■ There is no "scheduling" needed
    ■ Instead, "capacity planning" is crucial
    ■ You are the "scheduler"
[Figure: resource usage over time with overcommitted allocations]

Batch vs K8s Scheduling Concepts
■ Scheduling = capacity planning
  ■ Load prediction (Black Friday, Christmas, the Super Bowl, a new season of The Mandalorian...)
  ■ Clever aggregation of different workloads
  ■ Resource pool can be increased (thanks to the revenue)
■ Good scheduler/capacity planner = money
  ■ You aggregate better
  ■ You sell more with fewer resources
■ The main difference between batch and K8s scheduling
  ■ The user who gives you the money is the "scheduler"
  ■ So there are no sophisticated schedulers available

(Non-commercial) Use of Kubernetes
■ We are not commercial providers
  ■ We have strictly limited resources
■ Yet our users expect a similar experience as in the commercial world
  ■ Partly because we advertise our installation that way
■ K8s offers basic mechanisms for scheduling (a minimal sketch follows below)
  ■ Resource quotas
    ■ Constraints that limit aggregate resource consumption per namespace
  ■ Pod resource requests and limits
    ■ Guaranteed requests + best-effort upper-bound limits
  ■ Static priority classes
    ■ A higher-priority Pod evicts a lower-priority Pod if needed
  ■ Pods with limited runtime (called Jobs)
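The following is a minimal sketch of how these four mechanisms can be expressed with the official Kubernetes Python client (`pip install kubernetes`). The namespace `demo-ns`, all object names, images, and resource values are illustrative assumptions, not our production configuration.

```python
# Sketch: ResourceQuota, Pod requests/limits with a priority class, and a
# runtime-limited Job, via the official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run inside a Pod

core = client.CoreV1Api()
batch = client.BatchV1Api()

NS = "demo-ns"  # hypothetical namespace

# 1) ResourceQuota: caps the aggregate consumption of the whole namespace.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="demo-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={"requests.cpu": "64", "requests.memory": "256Gi",
              "limits.cpu": "128", "limits.memory": "512Gi"}),
)
core.create_namespaced_resource_quota(namespace=NS, body=quota)

# 2) Pod with guaranteed requests, a best-effort upper bound (limits), and a
#    static priority class referenced by name (assumes that class exists).
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="rstudio-demo"),
    spec=client.V1PodSpec(
        priority_class_name="normal-priority",
        containers=[client.V1Container(
            name="rstudio",
            image="rocker/rstudio:4.2.0",  # illustrative image
            resources=client.V1ResourceRequirements(
                requests={"cpu": "500m", "memory": "2Gi"},  # guaranteed
                limits={"cpu": "4", "memory": "8Gi"},       # burst ceiling
            ),
        )],
    ),
)
core.create_namespaced_pod(namespace=NS, body=pod)

# 3) Job with a limited runtime: activeDeadlineSeconds terminates it after 2 h.
job = client.V1Job(
    metadata=client.V1ObjectMeta(name="batch-demo"),
    spec=client.V1JobSpec(
        active_deadline_seconds=7200,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="worker", image="python:3.10",
                    command=["python", "-c", "print('hello')"],
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "1", "memory": "1Gi"},
                        limits={"cpu": "1", "memory": "1Gi"},
                    ),
                )],
            ),
        ),
    ),
)
batch.create_namespaced_job(namespace=NS, body=job)
```

The same objects are more commonly written as YAML manifests and applied with kubectl; the Python form is used here only to keep all examples in one language.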
PROBLEMS?

Common Problems - Bursty Workloads
■ Bursty workloads
  ■ E.g., long-running services that scientists use "three times a week for 2 hours"
  ■ Such services are mostly idle, but will have peaks
  ■ Overestimated requests
[Figure: CPU usage of a Pod running a bursty workload, with its CPU request and CPU limit marked]

Common Problems - Resource Wasting
■ What is the problem?
  ■ In general, overestimated requests (and zombies)
  ■ Requests are guaranteed, thus overestimation means resource wasting
[Figure: relative usage (%) of allocated CPU hours per K8s namespace, plotted against reference lines at 100%, 50%, and 5% of the allocation]

Partial Solution
■ Some problems can be addressed quite efficiently
■ Free resources can be used by "scavenger" jobs
  ■ Jobs that are small and can be evicted/restarted easily
  ■ Help to utilize free resources
■ Pod requests must be "low"
  ■ And we must allow the affected Pod to "scale up"
[Figure: during a Pod's idle period, the gap between its idle usage and its CPU limit is filled by scavenger jobs]

Dynamic Changes
■ It is impossible to modify Pod priority dynamically
  ■ Or to adjust too generous/too tight Pod allocations
  ■ A Pod restart is required
    ■ No problem for "stateless" microservices
    ■ Usually a bigger deal for "scientific computations"
■ There is a "workaround"
  ■ Enables the Pod to use more/fewer resources

Workaround: Placeholder Jobs
■ Pod scaling can be achieved by running a "placeholder" job (sketched below)
  ■ The placeholder evicts scavengers
  ■ Best effort
  ■ Manual process
[Figure: during the idle period, the Pod's CPU request plus the placeholder job's request equals the Pod's CPU limit; the placeholder job remains idle and reserves that capacity, so the Pod can burst up to its limit during the peak period]
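A minimal sketch of the scavenger/placeholder pattern described on the previous slides, again with the official Kubernetes Python client. The priority-class names and values, the namespace `demo-ns`, and all images and sizes are illustrative assumptions; the placeholder is shown as a bare Pod for brevity, although the same idea applies when it is wrapped in a Job.

```python
# Sketch: a low-priority "scavenger" Job and a high-priority "placeholder"
# Pod that reserves capacity by evicting scavengers when it is scheduled.
from kubernetes import client, config

config.load_kube_config()

sched = client.SchedulingV1Api()
batch = client.BatchV1Api()
core = client.CoreV1Api()

NS = "demo-ns"  # hypothetical namespace

# Two static priority classes: scavengers sit at the bottom so that anything
# else (including the placeholder) can preempt them.
for name, value, desc in [
    ("scavenger", 100, "Small, restartable jobs that soak up idle capacity"),
    ("placeholder", 10000, "Idle workload reserving capacity for a bursting Pod"),
]:
    sched.create_priority_class(client.V1PriorityClass(
        metadata=client.V1ObjectMeta(name=name),
        value=value,
        global_default=False,
        description=desc,
    ))

# Scavenger: low request, low priority, limited runtime, safe to restart.
scavenger = client.V1Job(
    metadata=client.V1ObjectMeta(name="scavenger-demo"),
    spec=client.V1JobSpec(
        active_deadline_seconds=3600,
        backoff_limit=10,  # tolerate evictions and restarts
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                priority_class_name="scavenger",
                restart_policy="OnFailure",
                containers=[client.V1Container(
                    name="worker", image="python:3.10",
                    command=["python", "-c", "print('scavenging')"],
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "250m", "memory": "256Mi"},
                        limits={"cpu": "1", "memory": "512Mi"},
                    ),
                )],
            ),
        ),
    ),
)
batch.create_namespaced_job(namespace=NS, body=scavenger)

# Placeholder: request == limit, high priority; it does nothing but hold the
# capacity (evicting scavengers if needed), so the bursty Pod can scale up.
placeholder = client.V1Pod(
    metadata=client.V1ObjectMeta(name="placeholder-demo"),
    spec=client.V1PodSpec(
        priority_class_name="placeholder",
        containers=[client.V1Container(
            name="reserve", image="registry.k8s.io/pause:3.9",  # just sleeps
            resources=client.V1ResourceRequirements(
                requests={"cpu": "4", "memory": "8Gi"},  # equal to limits
                limits={"cpu": "4", "memory": "8Gi"},
            ),
        )],
    ),
)
core.create_namespaced_pod(namespace=NS, body=placeholder)
```

Both steps are manual, which is exactly the limitation the next slides discuss: Kubernetes itself offers no automation that would size or schedule such placeholders for the user.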
OPEN PROBLEMS

Open Problems - HPC vs. K8s Comparison
■ Common HPC batch scheduler
  ■ When the system is full and a new user arrives, you can always:
    ■ Tell the user what his/her priority is
    ■ Estimate (roughly) when the running jobs of other users will terminate
    ■ Or even provide him/her a non-destructive reservation
  ■ This is all automatic
■ In K8s...
  ■ Impossible to estimate Pod wait time (when we are out of resources)
  ■ No guarantees: the Pod either starts immediately or... never?
    ■ Unless we "manually" adjust the priority of the new Pod to evict some running Pod(s)
  ■ Resource reclaiming is not solved => no Pod life-cycle management
  ■ There is no such thing as "fairshare" in K8s
  ■ No automation

Alternate Solutions & Future Work
■ Partition the infrastructure into clusters with different "rules"
  ■ E.g., a cluster with time-limited access only
  ■ Dedicated schedulers for each such cluster (either our own or third-party)
  ■ Still, the infrastructure will suffer from fragmentation
■ The need for a long-term solution remains
  ■ How to offer the service?
  ■ What "QoS" do we want to guarantee?
  ■ Definition of an overall usage policy
  ■ Mechanisms to implement this policy
    ■ Either "by hand" or through some automated scheduling policy

QUESTIONS? SUGGESTIONS?
■ More at:
  ■ Sitola seminars in May (4. + 11. 5. 2022)
  ■ JSSPP 2022 paper "Using Kubernetes in Academic Environment: Problems and Approaches"
  ■ Future talk at Kubernetes Batch + HPC Day