KUBERNETES CONTAINER ORCHESTRATOR: SCHEDULING PROBLEMS AND CHALLENGES
CESNET MetaCentrum

Past and Future Kubernetes Talks
■ RNDr. Lukáš Hejtmánek, Ph.D.
■ 4. 5. + 11. 5. 2022 - Kubernetes Tutorial (1 & 2), here at Sitola
■ Online webinar (past)
  ■ https://metavo.metacentrum.cz/cs/seminars/Webinar_2022/kubernetes2022.ht
  ■ "Platforma Kubernetes - Úvodní seznámení" (Kubernetes Platform - An Introduction), Lukáš Hejtmánek, CERIT-SC, 06.04.2022

What is the Big Deal with K8s?
■ Containerized applications are popular
■ Containers hide the complexities of modern SW
■ K8s is a "container orchestrator"
  ■ Deploys containers (in so-called "Pods")
  ■ Handles network and storage access
  ■ Checks their status (availability and scalability)
  ■ Organizes them w.r.t. given rules (Pod-to-node mapping)
  ■ Kills/restarts Pods when needed
■ CERIT-SC K8s installation
  ■ 2,560 CPUs in 20 nodes (128 cores, 512 GB RAM, 1 GPU, 7 TB local SSD each)
  ■ Web and interactive applications
  ■ JupyterHub, BinderHub, Ansys, MATLAB, RStudio, WordPress...

Scheduling Challenges in K8s
■ We know standard batch-oriented HPC scheduling
■ We cannot easily reuse the same techniques in K8s
■ Examples, comparisons & discussion

Batch vs K8s Workloads
■ Batch workloads
  ■ Scripted executables
  ■ Non-interactive (mostly)
  ■ Waiting in a queue is OK
  ■ Resource intensive
  ■ Rather accurate resource requirements
  ■ Strict maximum runtime limit
■ K8s workloads
  ■ Interactive usage is common
  ■ GUI-based work
  ■ Long-running services
  ■ Waiting is not OK
  ■ Overestimated resource requirements
  ■ Usually no runtime limit
[Figure: resource usage over time for a typical batch workload vs. a typical K8s workload]

Batch vs K8s Scheduling Concepts
■ Batch scheduling basics
  ■ The organization owns the resources
  ■ Resources are provided for free
  ■ So fairness is important
■ How does the scheduler work?
  ■ Jobs wait in queue(s)
  ■ The queue is ordered by priority
  ■ User priority is dynamic (fairness)
    ■ User waiting = priority goes up
    ■ User computing = priority goes down
  ■ Over a long time period, a user's "share" is balanced against other active users
■ The scheduler decides who gets the resources and when
[Figure: per-user resource usage over time under fairshare scheduling]

Batch vs K8s Scheduling Concepts
■ K8s workloads
  ■ Interactive, no waiting, no maximum runtime...
■ Scheduling basics (cloud, K8s) in the commercial world
  ■ Users "own" the infrastructure
  ■ Pay-per-use model
    ■ Perfect motivation to release resources
  ■ Unused allocations? Overcommitted
    ■ Used by low-QoS workloads
    ■ Can be terminated if needed
  ■ There is no "scheduling" needed
    ■ Instead, "capacity planning" is crucial
    ■ You are the "scheduler"
[Figure: resource usage over time with overcommitted allocations]

Batch vs K8s Scheduling Concepts
■ Scheduling = capacity planning
  ■ Load prediction (Black Friday, Christmas, the Super Bowl, a new season of The Mandalorian...)
  ■ Clever aggregation of different workloads
  ■ Resource pool can be increased (thanks to the revenue)
■ Good scheduler/capacity planner = money
  ■ You aggregate better
  ■ You sell more with fewer resources
■ The main difference between batch and K8s scheduling
  ■ The user who gives you the money is the "scheduler"
  ■ So there are no sophisticated schedulers available

(Non-commercial) Use of Kubernetes
■ We are not commercial providers
  ■ We have strictly limited resources
■ Yet our users expect a similar experience as in the commercial world
  ■ Partly because we advertise our installation that way
■ K8s offers basic mechanisms for scheduling (a minimal sketch follows below)
  ■ Resource quotas
    ■ Constraints that limit aggregate resource consumption per namespace
  ■ Pod resource requests and limits
    ■ Guaranteed requests + best-effort upper-bound limits
  ■ Static priority classes
    ■ A higher-priority Pod evicts a lower-priority Pod if needed
  ■ Pods with limited runtime (called Jobs)
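The following is a minimal sketch of how these four mechanisms can be expressed with the official Kubernetes Python client (`pip install kubernetes`). The namespace `demo-ns`, all object names, images, and resource values are illustrative assumptions, not our production configuration.

```python
# Sketch: ResourceQuota, Pod requests/limits with a priority class, and a
# runtime-limited Job, via the official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run inside a Pod

core = client.CoreV1Api()
batch = client.BatchV1Api()

NS = "demo-ns"  # hypothetical namespace

# 1) ResourceQuota: caps the aggregate consumption of the whole namespace.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="demo-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={"requests.cpu": "64", "requests.memory": "256Gi",
              "limits.cpu": "128", "limits.memory": "512Gi"}),
)
core.create_namespaced_resource_quota(namespace=NS, body=quota)

# 2) Pod with guaranteed requests, a best-effort upper bound (limits), and a
#    static priority class referenced by name (assumes that class exists).
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="rstudio-demo"),
    spec=client.V1PodSpec(
        priority_class_name="normal-priority",
        containers=[client.V1Container(
            name="rstudio",
            image="rocker/rstudio:4.2.0",  # illustrative image
            resources=client.V1ResourceRequirements(
                requests={"cpu": "500m", "memory": "2Gi"},  # guaranteed
                limits={"cpu": "4", "memory": "8Gi"},       # burst ceiling
            ),
        )],
    ),
)
core.create_namespaced_pod(namespace=NS, body=pod)

# 3) Job with a limited runtime: activeDeadlineSeconds terminates it after 2 h.
job = client.V1Job(
    metadata=client.V1ObjectMeta(name="batch-demo"),
    spec=client.V1JobSpec(
        active_deadline_seconds=7200,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="worker", image="python:3.10",
                    command=["python", "-c", "print('hello')"],
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "1", "memory": "1Gi"},
                        limits={"cpu": "1", "memory": "1Gi"},
                    ),
                )],
            ),
        ),
    ),
)
batch.create_namespaced_job(namespace=NS, body=job)
```

The same objects are more commonly written as YAML manifests and applied with kubectl; the Python form is used here only to keep all examples in one language.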
PROBLEMS?

Common Problems - Bursty Workloads
■ Bursty workloads
  ■ E.g., long-running services that scientists use "three times a week for 2 hours"
  ■ Such services are mostly idle, but will have peaks
  ■ Overestimated requests
[Figure: CPU usage of a Pod running a bursty workload, with its CPU request and CPU limit marked]

Common Problems - Resource Wasting
■ What is the problem?
  ■ In general, overestimated requests (and zombies)
  ■ Requests are guaranteed, thus overestimation means resource wasting
[Figure: relative usage (%) of allocated CPU hours per K8s namespace, plotted against reference lines at 100%, 50%, and 5% of the allocation]

Partial Solution
■ Some problems can be addressed quite efficiently
■ Free resources can be used by "scavenger" jobs
  ■ Jobs that are small and can be evicted/restarted easily
  ■ Help to utilize free resources
■ Pod requests must be "low"
  ■ And we must allow the affected Pod to "scale up"
[Figure: during a Pod's idle period, the gap between its idle usage and its CPU limit is filled by scavenger jobs]

Dynamic Changes
■ It is impossible to modify Pod priority dynamically
  ■ Or to adjust too generous/too tight Pod allocations
  ■ A Pod restart is required
    ■ No problem for "stateless" microservices
    ■ Usually a bigger deal for "scientific computations"
■ There is a "workaround"
  ■ Enables the Pod to use more/fewer resources

Workaround: Placeholder Jobs
■ Pod scaling can be achieved by running a "placeholder" job (sketched below)
  ■ The placeholder evicts scavengers
  ■ Best effort
  ■ Manual process
[Figure: during the idle period, the Pod's CPU request plus the placeholder job's request equals the Pod's CPU limit; the placeholder job remains idle and reserves that capacity, so the Pod can burst up to its limit during the peak period]
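A minimal sketch of the scavenger/placeholder pattern described on the previous slides, again with the official Kubernetes Python client. The priority-class names and values, the namespace `demo-ns`, and all images and sizes are illustrative assumptions; the placeholder is shown as a bare Pod for brevity, although the same idea applies when it is wrapped in a Job.

```python
# Sketch: a low-priority "scavenger" Job and a high-priority "placeholder"
# Pod that reserves capacity by evicting scavengers when it is scheduled.
from kubernetes import client, config

config.load_kube_config()

sched = client.SchedulingV1Api()
batch = client.BatchV1Api()
core = client.CoreV1Api()

NS = "demo-ns"  # hypothetical namespace

# Two static priority classes: scavengers sit at the bottom so that anything
# else (including the placeholder) can preempt them.
for name, value, desc in [
    ("scavenger", 100, "Small, restartable jobs that soak up idle capacity"),
    ("placeholder", 10000, "Idle workload reserving capacity for a bursting Pod"),
]:
    sched.create_priority_class(client.V1PriorityClass(
        metadata=client.V1ObjectMeta(name=name),
        value=value,
        global_default=False,
        description=desc,
    ))

# Scavenger: low request, low priority, limited runtime, safe to restart.
scavenger = client.V1Job(
    metadata=client.V1ObjectMeta(name="scavenger-demo"),
    spec=client.V1JobSpec(
        active_deadline_seconds=3600,
        backoff_limit=10,  # tolerate evictions and restarts
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                priority_class_name="scavenger",
                restart_policy="OnFailure",
                containers=[client.V1Container(
                    name="worker", image="python:3.10",
                    command=["python", "-c", "print('scavenging')"],
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "250m", "memory": "256Mi"},
                        limits={"cpu": "1", "memory": "512Mi"},
                    ),
                )],
            ),
        ),
    ),
)
batch.create_namespaced_job(namespace=NS, body=scavenger)

# Placeholder: request == limit, high priority; it does nothing but hold the
# capacity (evicting scavengers if needed), so the bursty Pod can scale up.
placeholder = client.V1Pod(
    metadata=client.V1ObjectMeta(name="placeholder-demo"),
    spec=client.V1PodSpec(
        priority_class_name="placeholder",
        containers=[client.V1Container(
            name="reserve", image="registry.k8s.io/pause:3.9",  # just sleeps
            resources=client.V1ResourceRequirements(
                requests={"cpu": "4", "memory": "8Gi"},  # equal to limits
                limits={"cpu": "4", "memory": "8Gi"},
            ),
        )],
    ),
)
core.create_namespaced_pod(namespace=NS, body=placeholder)
```

Both steps are manual, which is exactly the limitation the next slides discuss: Kubernetes itself offers no automation that would size or schedule such placeholders for the user.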
OPEN PROBLEMS

Open Problems - HPC vs. K8s Comparison
■ Common HPC batch scheduler
  ■ When the system is full and a new user arrives, you can always:
    ■ Tell the user what his/her priority is
    ■ Estimate (roughly) when the running jobs of other users will terminate
    ■ Or even provide him/her a non-destructive reservation
  ■ This is all automatic
■ In K8s...
  ■ Impossible to estimate Pod wait time (when we are out of resources)
  ■ No guarantees: the Pod either starts immediately or... never?
    ■ Unless we "manually" adjust the priority of the new Pod to evict some running Pod(s)
  ■ Resource reclaiming is not solved => no Pod life-cycle management
  ■ There is no such thing as "fairshare" in K8s
  ■ No automation

Alternate Solutions & Future Work
■ Partition the infrastructure into clusters with different "rules"
  ■ E.g., a cluster with time-limited access only
  ■ Dedicated schedulers for each such cluster (either our own or third-party)
  ■ Still, the infrastructure will suffer from fragmentation
■ The need for a long-term solution remains
  ■ How to offer the service?
  ■ What "QoS" do we want to guarantee?
  ■ Definition of an overall usage policy
  ■ Mechanisms to implement this policy
    ■ Either "by hand" or through some automated scheduling policy

QUESTIONS? SUGGESTIONS?
■ More at:
  ■ Sitola seminars in May (4. + 11. 5. 2022)
  ■ JSSPP 2022 paper "Using Kubernetes in Academic Environment: Problems and Approaches"
  ■ Future talk at Kubernetes Batch + HPC Day