# Processes, Threads & Scheduling

In this lecture, we will look at the two basic resources that the computer offers, and which every program needs: CPU and memory. The main question then will be how the operating system achieves the illusion that every thread seemingly has its own processor and every process has its own memory (this is what we mean by multiplexing – turning a single physical resource into a larger set of virtual instances of the same).

│ Lecture Overview
│
│  1. processes and virtual memory
│  2. thread scheduling
│  3. interrupts and clocks

There will be 3 parts. We will first look at virtual memory, and the unit of memory isolation in the operating system: a process. We will then look at CPU (processor) sharing and the unit of CPU allocation, a «thread». Finally, we will look in more detail at how CPU sharing is implemented in terms of hardware.

## Processes and Virtual Memory

│ Prehistory: Batch Systems
│
│  • first computers ran «one program» at a time
│  • programs were scheduled «ahead of time»
│  • we are talking punch cards &c.
│  • and computers that took an entire room

The first generation of computers was strictly sequential: one program could execute at any given time, and that was that. To execute another program, the first one had to finish.

│ History: Time Sharing
│
│  • “mini” computers could run programs «interactively»
│  • teletype «terminals», screens, keyboards
│  • «multiple users» at the same time
│  • hence, «multiple programs» at the same time

Computers were quite expensive at the time, and programs started to offer more interactivity. That is, a program could interact with the operator by asking questions and then waiting for inputs to be provided, for instance. Together, those two considerations made the sequential, one program at a time paradigm quite impractical.

This is the origin of «time sharing» systems, where a computer could execute multiple programs at seemingly the same time, by quickly switching from one program to another. As soon as this is possible, it makes sense to allow multiple users to interact with the same computer (each user through a different program).

│ Processes: Early View
│
│  • process is an «executing» program
│  • there can be «multiple processes»
│  • various «resources» belong to a process
│  • each process belongs to a particular «user»

Since traditional programs were internally sequential, the original notion of a process encompassed both types of basic resources: memory and processor. In normal operation, a process P gets to execute on the processor for some amount of time, before it is interrupted and the system switches to executing some other process. At some later point, process P will be woken up and will continue execution where it left off.

│ Process Resources
│
│  • «memory» (address space)
│  • «processor» time
│  • open files (descriptors)
│    ◦ also working directory
│    ◦ also network connections

The most basic resource associated with a process is «memory»: this is where the program itself (the instructions) is stored, and also where its data (both static and dynamic) resides. This memory is typically private: one process cannot see the memory of another process.

│ Process Memory Segments
│
│  • program «text»: contains instructions
│  • data: static and dynamic «data»
│    ◦ with a separate read-only section
│  • «stack» memory: execution stack
│    ◦ return addresses
│    ◦ automatic variables

The memory of a program is organized into functionally distinct «segments». Traditionally, those segments were contiguous chunks of address space, but in modern operating systems, this is not quite true (and the whole idea of a segment is losing its appeal). Nonetheless, the traditional split is into program «text», which contains the instructions of the program (i.e. the executable code, typically produced by a compiler), a «stack», which holds the C stack (i.e. the values of local variables and return addresses, along with other bookkeeping), and finally a «data segment», which holds, unsurprisingly, data.

In a modern system, none of those segments needs to be contiguous: the «text» consists of multiple mappings – one for the main program, and one for each shared library it uses. Those mappings do not need to be adjacent in the virtual address space (much less so in physical memory). Likewise, since one process can contain multiple threads (more on that later), there may be multiple stacks, again, not necessarily allocated next to each other. Finally, static (and especially read-only) data comes from executable images too, hence there might be one mapping per executable, just like with text. The dynamic data, then, is entirely unconstrained in its shape: the implementation of ‹malloc› in ‹libc› will request memory from the operating system as needed, with virtual addresses often being assigned randomly to each new region of memory.
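The split into text, data, stack and heap can be observed from inside a running program by printing a few addresses. The following small experiment is not part of the lecture text and is system-dependent: the exact values (and even their relative order) will differ between systems and between runs, due to address space layout randomisation.

```c
/* Print addresses from the text, data, stack and heap regions of the
 * running process.  Purely illustrative – the output depends on the
 * compiler, the OS and address space randomisation. */
#include <stdio.h>
#include <stdlib.h>

static const int constant = 1;   /* read-only static data  */
static int initialised = 2;      /* writable static data   */

int main( void )
{
    int automatic = 3;                        /* stack              */
    int *dynamic = malloc( sizeof( int ) );   /* heap (via malloc)  */

    printf( "text  %p\n", ( void * ) main );  /* function = text    */
    printf( "ro    %p\n", ( void * ) &constant );
    printf( "data  %p\n", ( void * ) &initialised );
    printf( "stack %p\n", ( void * ) &automatic );
    printf( "heap  %p\n", ( void * ) dynamic );

    free( dynamic );
    return 0;
}
```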
│ Process Memory
│
│  • each process has its own «address space»
│  • this means processes are «isolated» from each other
│  • requires that the CPU has an MMU
│  • implemented via «paging» (page tables)

The defining characteristic of a modern process is its «address space»: the virtual addresses allocated to a process are private to that process and as such invisible to any other process. The data is mostly stored in RAM, but each process (typically) only sees a small portion of the physical memory – the majority is simply not visible in its virtual address space. Behind the scenes, this separation of address spaces is built on paging, which is, in turn, provided by the MMU (memory management unit) of the CPU.

│ Process Switching
│
│  • switching processes means switching «page tables»
│  • physical addresses do «not» change
│  • but the «mapping» of virtual addresses does
│  • large part of physical memory is «not mapped»
│    ◦ could be completely unallocated (unused)
│    ◦ or belong to «other processes»

In a typical time-sharing operating system (this covers pretty much every modern general-purpose OS), an illusion of virtually unlimited concurrency (the ability to run an arbitrary number of processes at the same time, regardless of the number of available CPU cores) is achieved by rapidly switching the execution from one process to the next. Or, to be more precise, execution is switched from one «thread» to another (which we will discuss shortly), but often, the threads will be part of different processes. In those cases, the operating system must instruct the processor to switch into the virtual address space of the new process.

Of course, during this switch, the physical memory is not rearranged in any way – that would be way too expensive. Instead, the MMU is instructed to start using a different set of mapping rules (page tables) that map virtual addresses to physical addresses. Usually, the overlap between physical addresses that have been visible in the old virtual address space and those visible in the new one is minimal: as outlined earlier, the virtual address space is almost entirely private to each process.
Thus, most of the physical memory is, from the point of view of any given process, inaccessible: it is either unused, or it belongs to a different process.

│ Paging and TLB
│
│  • address translation is «slow»
│  • recently-used pages are stored in a TLB
│    ◦ short for Translation Look-aside Buffer
│    ◦ very fast «hardware» cache
│  • the TLB needs to be «flushed» on process switch
│    ◦ this is fairly «expensive» (microseconds)

An important consideration is the «cost» associated with switching processes. On the surface, this involves a thread switch (which, as we will see later, consists of storing the content of user-accessible registers in memory and loading a new set of values from a different location in memory) and an additional register write (to load the new page table into the MMU).

Unfortunately, under the surface, that single register write causes a cascade of other effects: in particular, all modern processors use a very fast «cache» for storing recent address translations (i.e. which physical addresses correspond to a small set of recently-used virtual addresses), known as the TLB (translation look-aside buffer). The TLB is required because otherwise, translating a virtual address to a physical one involves multiple round-trips into the main memory, where page tables are stored. In modern computers, each of those round-trips will take hundreds or even thousands of processor cycles – clearly, repeating this process on every address translation would make computation extremely slow.

Nonetheless, upon a process switch, the mapping of virtual addresses to physical addresses changes, and the information stored in the TLB becomes invalid, since it refers to the previous address space. The simplest remedy is simply erasing everything that is stored in the TLB. This is, in itself, a quick operation, since the TLB is implemented using very fast on-die memory. Most of the price comes from the subsequent address translations, which cannot be answered from the (now empty) TLB and must perform multiple round-trips into main memory (or, depending on cache pressure, into one of the general-purpose caches – L1, L2 or L3). Incidentally, modern processors often use tagged TLBs, which make the entire process more efficient, but this is well outside of the scope of this course.

│ Threads
│
│  • the modern unit of CPU scheduling
│  • each thread runs sequentially
│  • one process can have «multiple threads»
│    ◦ such threads share a single address space

While processes are the basic unit of memory management in the operating system, computation is captured by «threads». Each process has at least one thread, but might have more than one. While processor time «accounting» is usually done at process level, «execution» is tied to threads, since threads are what the processor ultimately executes.

│ What is a Thread?
│
│  • thread is a «sequence» of instructions
│    ◦ instructions depend on results of previous instructions
│  • different threads run different instructions
│    ◦ as opposed to SIMD or many-core units (GPUs)
│  • each thread has its own «stack»

A thread can be thought of as an instruction stream: essentially what we imagine a traditional, sequential program to be. Within a thread, an instruction can freely use results of previous instructions (data dependencies) and make decisions (based on those results) as to which instruction to execute next (branching, or control flow). This is how a processor core executes a program (at least from the point of view of the programmer). Hence at any given time, each processor core can execute a single thread.

The threads running concurrently on different cores or different CPUs can be completely unrelated – they don't even need to belong to the same process: each core has its own MMU, each of which can use a different page table, and hence, each core can be executing a thread in a different address space. Of course, this also means that each thread needs its own execution stack (i.e. the memory that stores local variables in a C program, along with return addresses and other execution-related book-keeping).
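As an illustration of the above (using POSIX threads, which are just one possible interface, chosen here as an example), the sketch below starts two threads in a single process: they share the address space of the process – here, a global counter – but each has its own stack, holding its own copy of the local variable. Compile with ‹-pthread›.

```c
/* Two threads of one process: the global 'shared' is visible to both,
 * while 'local' lives on each thread's private stack. */
#include <pthread.h>
#include <stdio.h>

static int shared = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker( void *arg )
{
    int local = 0;                     /* private: on this thread's stack */

    for ( int i = 0; i < 1000; ++i )
    {
        ++local;
        pthread_mutex_lock( &lock );
        ++shared;                      /* shared: must be synchronised */
        pthread_mutex_unlock( &lock );
    }

    printf( "thread %s: local = %d, stack near %p\n",
            ( const char * ) arg, local, ( void * ) &local );
    return NULL;
}

int main( void )
{
    pthread_t a, b;
    pthread_create( &a, NULL, worker, "a" );
    pthread_create( &b, NULL, worker, "b" );
    pthread_join( a, NULL );
    pthread_join( b, NULL );
    printf( "shared = %d\n", shared ); /* 2000 */
    return 0;
}
```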
│ Processor Time Sharing
│
│  • CPU time is sliced into «time shares»
│  • time shares (slices) are like memory «frames»
│  • process «computation» is like memory «pages»
│  • processes are allocated into «time shares»

Besides memory, the other main commodity in a computer is «computation time». Like memory, there is a limited amount, and it needs to be distributed among multiple threads (and processes). Since different instructions take wildly different amounts of time to execute, allocation of computation is based on «time» instead of instruction count. As a bonus, time is easier to measure.

│ Multiple CPUs
│
│  • execution of a thread is «sequential»
│  • one CPU = one «instruction sequence» at a time
│  • physical limits on CPU speed → «multiple cores»
│  • more CPU cores = more throughput

With time sharing, it is possible to execute any number of threads on a single CPU core. However, physical limits in CPU design have made it, in the last decade, much easier to build processors which can execute twice as many threads (using twice as many processor cores), instead of making them execute a given number of threads twice as fast. As long as there are always enough threads ready to execute, the overall computational throughput of the system is doubled either way. Of course, this puts strain on software, which needs to be separated into more and more threads to saturate the computational power of a modern processor.

│ Modern View of a Process
│
│  • in a modern view, process is an «address space»
│  • threads are the right «scheduling abstraction»
│
│  • «process» is a unit of «memory management»
│  • «thread» is a unit of «computation»
│  • old view: one process = one thread

To recap, there are two main abstractions, centred on memory and computation. While historically, operating systems made the assumption that 1 process = 1 thread (and hence older textbooks about operating systems might follow the same design), this no longer makes sense in modern systems. Instead, threads are the abstraction that covers computation itself (the sequence of instructions to be executed), while processes cover memory and most other resources.

│ Fork
│
│  • how do we create «new processes»?
│  • by ‹fork›-ing existing processes
│  • fork creates an «identical copy» of a process
│  • execution continues in both processes
│    ◦ each of them gets a different «return value»

Let's now look at processes from the perspective of programs (and users). The first thing to consider is how processes are created: on POSIX-like systems, this is almost exclusively through the use of the ‹fork› system call, which simply makes an identical copy of the current process. The processes are then very slightly tweaked: in one, ‹fork› returns a different value, and in the process table, one is taken to be a «parent» and the other to be a «child». The return value of ‹fork› tells each process which of the two it is.
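A minimal sketch of what this looks like in C (error handling kept to a bare minimum): both branches of the ‹if› execute, but each in a different process.

```c
/* After fork, two processes run the same program; the return value of
 * fork is the only thing that tells them apart. */
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <unistd.h>

int main( void )
{
    pid_t pid = fork();

    if ( pid < 0 )                     /* fork failed: still one process  */
        return 1;

    if ( pid == 0 )                    /* child: fork returned 0          */
        printf( "child:  pid %d, parent %d\n",
                ( int ) getpid(), ( int ) getppid() );
    else                               /* parent: fork returned child pid */
    {
        printf( "parent: pid %d, child %d\n", ( int ) getpid(), ( int ) pid );
        waitpid( pid, NULL, 0 );       /* wait for the child to finish    */
    }

    return 0;
}
```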
│ Lazy Fork
│
│  • paging can make ‹fork› quite efficient
│  • we start by copying the «page tables»
│  • initially, all pages are marked «read-only»
│  • the processes start out «sharing» memory

Making a copy of an entire process would be a comparatively expensive exercise. However, as we have seen with file systems before, there are tricks (based on the copy-on-write implementation technique) which make it possible to make ‹fork› itself quite efficient. At the time of ‹fork›, the only thing that needs to be copied is the «page table», which is much smaller than the entire address space of the process in question. At the same time, all pages in both copies of the page table are marked as read-only, since they are now shared by two processes, which should not have access to each other's address space. Of course, as long as they are only reading from memory, whether there are two physical copies or a single shared copy does not make a difference.

│ Lazy Fork: Faults
│
│  • the shared memory becomes «copy on write»
│  • «fault» when either process tries to write
│    ◦ remember the memory is marked as read-only
│  • the OS checks if the memory is «supposed» to be writable
│    ◦ if yes, it makes a «copy» and allows the write

As soon as either of the processes attempts to issue a memory write, this will cause a fault: the pages are marked read-only in the page tables. When this happens, the processor will invoke the fault handler (which is part of the kernel) to resolve the situation. The fault handler will then consult its data structures (which may be partially embedded in the page table itself) to distinguish actual faults (i.e. the process tried to write into truly read-only memory) from copy-on-write events. In the latter case, the page is semantically read-write, but is marked read-only because multiple processes are using a single physical copy.

At this point, the fault handler will split the mapping: it will make a new physical copy of the data (i.e. it will allocate a new frame for it) and adjust the page table of the process which attempted to write, so that the offending virtual address is translated to point into this new physical copy. Finally, the CPU is instructed to restart the offending instruction: since the page table entry is now marked as read-write (pointing to the new physical location), the instruction will execute without further impediment.

│ Init
│
│  • on UNIX, ‹fork› is the only way to make a process
│  • but ‹fork› splits existing processes into 2
│  • the «first process» is special
│  • it is directly spawned by the kernel on boot

Earlier, we claimed that ‹fork› is essentially the only way to create a new process. Of course, this cannot be entirely true, since ‹fork› may only be executed by an existing process. For this reason, there is one special process, also called ‹init› or ‹pid 1›, which is directly created by the kernel upon boot. All other processes in a running system, however, descend from this original process by a sequence of forks.

│ Process Identifier
│
│  • processes are assigned «numeric identifiers»
│  • also known as PID (Process ID)
│  • those are used in «process management»
│  • used in calls like ‹kill› or ‹setpriority›

To facilitate process management, each process is assigned a numeric identifier (known as PID, short for process identifier). Process management syscalls, then, take this number as an argument.
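For instance (a small illustration, not part of the lecture text), a parent can use the PID returned by ‹fork› as the argument of the management call ‹kill›, here sending ‹SIGTERM› to the child:

```c
/* Process identifiers in practice: the parent refers to the child by its
 * PID when sending it a signal and when waiting for it. */
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main( void )
{
    pid_t child = fork();

    if ( child < 0 )
        return 1;

    if ( child == 0 )                  /* child: report PID, wait for a signal */
    {
        printf( "child running as pid %d\n", ( int ) getpid() );
        while ( 1 )
            pause();
    }

    sleep( 1 );                        /* give the child time to start      */
    kill( child, SIGTERM );            /* management call takes the PID     */
    waitpid( child, NULL, 0 );
    printf( "child %d terminated\n", ( int ) child );
    return 0;
}
```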
Traditionally, the ‘namespace’ of processes is global: a PID identifies a process globally (unlike, for instance, a file descriptor, which is local to each process). That is, for all users and all processes, the given number always represents the same process (until it is terminated). Please note that in container-based virtualisation, this is no longer strictly true, since different containers will get different PID namespaces even though they share the same kernel.

│ Process vs Executable
│
│  • process is a «dynamic» entity
│  • executable is a «static» file
│  • an executable contains an initial «memory image»
│    ◦ this sets up memory layout
│    ◦ and content of the «text» and «data» segments

In previous lectures, we have touched on the subject of executables: those are «files» which contain programs. It is very important to understand the difference between an executable (a program, so to speak, at rest) and a process (a program in the course of being executed). An executable contains, essentially, an initial image of process memory: when a program starts executing, the kernel populates its virtual address space with data from the executable, most importantly the «text» segment which contains instructions, but also the static parts of the «data» segment. Additionally, the executable contains the (virtual) address of the «entry point»: this is the address of the first instruction of the program that should be executed.

│ Exec
│
│  • on UNIX, processes are created via ‹fork›
│  • how do we «run programs» though?
│  • ‹exec›: load a new «executable» into a process
│    ◦ this completely «overwrites» process memory
│    ◦ execution starts from the «entry point»
│  • running programs: ‹fork› + ‹exec›

However, when a process is created, its memory is «not» populated from an executable: instead, it is a clone of an existing process. To execute a new program, an existing process needs to call an ‹exec› system call, which then simply replaces the entire memory of the current process with the content of the executable. Only environment variables and command-line arguments (which are both stored in process memory) are copied over to the new address space. The common ‘start a new program’ operation (e.g. what happens when the user runs a command in the shell) is then implemented as a ‹fork› followed by an ‹exec› in the child process.
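The ‹fork› + ‹exec› sequence can be sketched as follows (‹/bin/ls› is chosen purely as an example program; a shell would substitute whatever command the user typed):

```c
/* Run a program the UNIX way: fork a child, replace its memory image with
 * a new executable, and let the parent wait for the result. */
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <unistd.h>

int main( void )
{
    pid_t pid = fork();

    if ( pid < 0 )
        return 1;

    if ( pid == 0 )                    /* child */
    {
        char *argv[] = { "ls", "-l", NULL };
        execv( "/bin/ls", argv );      /* replaces the entire address space */
        perror( "execv" );             /* only reached if exec failed       */
        return 127;
    }

    int status;                        /* parent */
    waitpid( pid, &status, 0 );
    printf( "child exited with status %d\n", WEXITSTATUS( status ) );
    return 0;
}
```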
## Thread Scheduling

In this section, we will look in more detail at how operating systems decide which thread to run, when, and for how long.

│ What is a Scheduler?
│
│  • scheduler has two related tasks
│    ◦ «plan» when to run which thread
│    ◦ actually «switch» threads and processes
│  • usually part of the «kernel»
│    ◦ even in micro-kernel operating systems

The «scheduler» is a component of the kernel which fills two basic roles: planning and thread switching (including preemption).

│ Switching Threads
│
│  • threads of the «same process» share an address space
│    ◦ a «partial» context switch is needed
│    ◦ only «register state» has to be saved and restored
│  • no TLB flushing – lower «overhead»

To switch execution from one thread to another, within a single process, it is sufficient to store the current state of user-visible registers into memory, and load the state which corresponds to the other thread (the state of which was saved when it was last interrupted and suspended). Since the address of the (top of the) stack is stored in a register, simply replacing values in registers from a backup also has the effect of switching the active stack. In this case, the TLB does not need to be flushed, making the switch quite efficient, even if not entirely free (there is still a small penalty due to branch prediction and speculative execution).

│ Fixed vs Dynamic Schedule
│
│  • «fixed» schedule = all processes known «in advance»
│    ◦ only useful in special / embedded systems
│    ◦ can «conserve resources»
│    ◦ planning is not part of the OS
│  • most systems use «dynamic scheduling»
│    ◦ what to run next is «decided periodically»

When it comes to the «schedule», that is, the plan of which thread to execute in which time slot(s), there are two basic types:

 1. «static» (fixed) schedules, which are computed ahead of time, for a fixed set of threads with pre-determined priorities and with their relative computational costs also known in advance,
 2. «dynamic» schedules, in which the above information is not known.

Running on a static schedule makes the runtime scheduler particularly simple and robust. However, this approach is unsuitable for all but the simplest use cases, and is usually only found in high-assurance embedded systems.

A dynamic scheduler, on the other hand, allows threads and processes to be created in an ad-hoc manner, during execution. Likewise, the priorities of threads can be determined (and adjusted) at will. The scheduler periodically evaluates the situation and decides what to run either now, or in the very near future.

│ Preemptive Scheduling
│
│  • tasks (threads) just run as if they owned the CPU
│  • the OS forcibly «takes the CPU» away from them
│    ◦ this is called «preemption»
│  • pro: a faulty program «cannot block» the system
│  • somewhat «less efficient» than cooperative

In most modern operating systems, the scheduler is «preemptive», that is, it can suspend threads at its own discretion, without cooperation from the threads themselves. In this approach, the user-space software does not, in principle, need to be aware that it is running on a time-sharing system: when a thread is preempted and later resumed, the execution continues where it left off, without any immediately detectable effect on the executing program. A major advantage of preemptive scheduling is that uncooperative (or faulty) programs cannot endanger the system as a whole: the operating system can, at any time, decide to suspend a process, regardless of what code it executes at the time.

│ Cooperative Scheduling
│
│  • threads (tasks) «cooperate» to share the CPU
│  • each thread has to explicitly «yield»
│  • this can be «very efficient» if designed well
│  • but a «bad program» can easily «block the system»

A different approach is known as «cooperative scheduling». In this case, there is no preemption: the currently running thread has to «yield» (give up) the processor voluntarily. This makes it possible to optimize task switching, but it also means that a program which does not yield the processor will simply continue to run unimpeded.

While general-purpose operating systems no longer use cooperative scheduling globally, some programs implement so-called «green threads» which are scheduled cooperatively – scheduling of those threads is implemented entirely in the user-space. Usually, a large number of such cooperative ‘green’ threads is dispatched onto a small number of ‘real’ OS threads (which are preemptively scheduled by the kernel).
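User-space switching of this kind can be sketched with the POSIX ‹ucontext› interface (obsolescent, but still widely available on common systems): ‹swapcontext› stores the register state of the current context – including the stack pointer – and loads that of another, which is essentially the partial context switch described above, performed without entering the kernel. The ping-pong structure and the stack size below are arbitrary choices made for illustration.

```c
/* A cooperative ('green') thread yielding back to its caller, using
 * getcontext / makecontext / swapcontext. */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;
static char task_stack[ 64 * 1024 ];   /* each green thread needs its own stack */

static void task( void )
{
    for ( int i = 0; i < 3; ++i )
    {
        printf( "task: step %d, yielding\n", i );
        swapcontext( &task_ctx, &main_ctx );   /* voluntarily give up the CPU */
    }
}

int main( void )
{
    getcontext( &task_ctx );
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &main_ctx;              /* where to go if task returns */
    makecontext( &task_ctx, task, 0 );

    for ( int i = 0; i < 3; ++i )
    {
        printf( "main: resuming task\n" );
        swapcontext( &main_ctx, &task_ctx );   /* the 'scheduler' picks the task */
    }

    printf( "main: done\n" );
    return 0;
}
```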
│ Scheduling in Practice
│
│  • cooperative on Windows 3.x for everything
│  • «cooperative» for «threads» on classic Mac OS
│    ◦ but «preemptive» for «processes»
│  • «preemptive» on pretty much every modern OS
│    ◦ including real-time and embedded systems

The last mainstream operating system to use cooperative scheduling at the operating system level was ‘classic’ Mac OS (before OS X), though it used preemptive scheduling for processes. This way, switching threads within a process could take advantage of the improved efficiency, while processes were unable to block each other. The last mainstream system with fully cooperative scheduling was MS Windows 3.11, released in 1993.

│ Waiting and Yielding
│
│  • threads often need to «wait» for resources or «events»
│    ◦ they could also use software timers
│  • a waiting thread should «not consume» CPU time
│  • such a thread will «yield» the CPU
│  • it is put on a list and later «woken up» by the kernel

In most programs, it is common that a thread cannot continue its execution until some event takes place, or a resource becomes available. In this case, it is undesirable that the thread should wait actively, that is, spin in a loop that checks whether it can continue. Instead, kernels provide mechanisms which allow threads to be «suspended» until an event arrives, or a resource becomes available (at which point the thread is resumed by the kernel). This process is not completely free from overhead, but unless the wait is very short (a few dozen CPU cycles), suspending the thread will give a much better overall system throughput.

│ Run Queues
│
│  • «runnable» (non-waiting) threads are «queued»
│  • could be «priority», round-robin or other queue types
│  • scheduler «picks» threads from the run queue
│  • «preempted» threads are put back

There are two reasons why a thread is suspended (that is, no longer running): it is either waiting for an event (see above), or its time slot has ended and it was preempted. In the latter case, the thread is put on a «run queue», which is a list of threads that are ready to run, usually sorted by a «dynamic priority» – a number, computed by the scheduler, which indicates how soon the thread should run. Whenever a thread is suspended, the scheduler picks the next thread to execute from a run queue.

│ Priorities
│
│  • what «share» of the CPU should a thread get?
│  • «priorities» are static and dynamic
│  • «dynamic» priority is adjusted as the thread runs
│    ◦ this is done by the system / scheduler
│  • a «static» priority is assigned by the «user»

Not all threads are equally important, and some should take priority over others whenever they need to run. This is achieved through a combination of priorities: a «static» priority is assigned to each thread by the user. The static priority then feeds into the computation of a «dynamic» priority, which governs scheduling: while a thread is running, its dynamic priority drops, and while it is suspended, its dynamic priority increases. (A toy model of this scheme is sketched below, after the discussion of fairness.)

│ Fairness
│
│  • «equal» (or priority-based) share per «thread»
│  • what if one process has many «more threads»?
│  • what if one user has many «more processes»?
│  • what if one user group has many «more active users»?

In the system that we have outlined above, scheduling is only concerned with individual threads. Such a system is clearly unfair: processes and users with many threads will get a much higher share of the CPU than processes and users with fewer threads. To mitigate this effect, most operating systems implement some form of fair scheduling.
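The per-thread scheme above can be made concrete with a small toy model (purely illustrative – no real kernel works exactly like this): a run queue of three threads, a selection rule based on dynamic priority, and a simple decay/recovery rule. Note that it considers threads in isolation, which is exactly the situation the fairness concerns apply to.

```c
/* Toy run queue: pick the thread with the highest dynamic priority, let
 * running cost priority and waiting recover it (up to a cap). */
#include <stdio.h>

struct thread
{
    const char *name;
    int static_prio;   /* assigned by the user                 */
    int dynamic_prio;  /* adjusted by the scheduler as it runs */
};

static struct thread *pick( struct thread *rq, int n )
{
    struct thread *best = &rq[ 0 ];
    for ( int i = 1; i < n; ++i )
        if ( rq[ i ].dynamic_prio > best->dynamic_prio )
            best = &rq[ i ];
    return best;
}

int main( void )
{
    struct thread rq[] = { { "editor", 10, 10 },
                           { "compiler", 5, 5 },
                           { "backup", 1, 1 } };
    const int n = sizeof rq / sizeof rq[ 0 ];

    for ( int slice = 0; slice < 8; ++slice )
    {
        struct thread *t = pick( rq, n );
        printf( "slice %d: running %s\n", slice, t->name );

        t->dynamic_prio -= 3;                       /* running costs priority  */
        for ( int i = 0; i < n; ++i )               /* waiting threads recover */
            if ( &rq[ i ] != t && rq[ i ].dynamic_prio < rq[ i ].static_prio + 5 )
                ++rq[ i ].dynamic_prio;
    }
    return 0;
}
```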
│ Fair Share Scheduling
│
│  • we can use a «multi-level» scheduling scheme
│  • CPU is sliced fairly first among «user groups»
│  • then among «users»
│  • then among «processes»
│  • and finally among «threads»

There are different levels on which a fair scheduler may operate: almost all systems will consider processes, while many will also consider users. In this scheme, the scheduler strives to achieve the following two goals:

 1. as long as the given entity (process, user, group) has enough runnable threads, those threads get equal time when compared to any other entity with the same property (i.e. if both process A and process B have runnable threads, the total time slice of process A is equal to that of process B, assuming equal priorities),
 2. resources are allocated efficiently: if an entity does not have sufficient runnable threads, the resources it cannot use are distributed among the remaining entities of the same type (i.e. if process A has 1 runnable thread, process B has 3 runnable threads and there are 4 processor cores, process B should not be hampered by process A only being able to use one of them – the system should use all 4 cores in this scenario).

│ Scheduling Strategies
│
│  • first in, first served («batch» systems)
│  • earliest «deadline» first (realtime)
│  • round robin
│  • «fixed priority» preemptive
│  • «fair share» scheduling (multi-user)

When it comes to computing a dynamic schedule, there are a number of approaches. The simplest of all is a batch scheduler, which is basically no scheduler at all: when a program is executed, it runs until it is done; afterwards, the next program is executed until termination, and so on.

In real-time systems, a fairly simple scheduler is known as ‘earliest deadline first’: in such systems, each task comes with a «deadline» which tells the system when the task must be done, at the latest. In a preemptive setting, then, it is an optimal strategy to first execute the task with the earliest deadline.

A naive general-purpose, preemptive scheduler is known as round robin: it runs all available threads in a fixed order, switching from one to the next as they run out of their time slice. In this system, the per-thread throughput can be prioritised (by tweaking the relative sizes of time slices of different threads), but latency cannot. A preemptive scheduler with priorities fixes the latency problem: if a high-priority thread wakes up, it will be able to run very soon (with low latency), since it can ‘jump the queue’, unlike in a round-robin scheduler.
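The ‘earliest deadline first’ selection rule is simple enough to sketch directly. The following is a toy model with made-up tasks and integer time; a real EDF scheduler would also handle admission control and preemption of the running task.

```c
/* Earliest deadline first: among the tasks that still have work left,
 * always run the one whose deadline is nearest. */
#include <stdio.h>

struct task
{
    const char *name;
    int deadline;   /* time by which the task must be finished */
    int remaining;  /* time slices of work left                */
};

static struct task *earliest( struct task *tasks, int n )
{
    struct task *best = NULL;
    for ( int i = 0; i < n; ++i )
        if ( tasks[ i ].remaining > 0 &&
             ( !best || tasks[ i ].deadline < best->deadline ) )
            best = &tasks[ i ];
    return best;
}

int main( void )
{
    struct task tasks[] = { { "sensor read", 4, 1 },
                            { "actuator",    2, 1 },
                            { "logging",     9, 3 } };
    const int n = sizeof tasks / sizeof tasks[ 0 ];

    for ( int now = 0; ; ++now )
    {
        struct task *t = earliest( tasks, n );
        if ( !t )
            break;                      /* everything is done */
        printf( "t = %d: running %s (deadline %d)\n", now, t->name, t->deadline );
        --t->remaining;
    }
    return 0;
}
```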
│ Interactivity
│
│  • «throughput» vs «latency»
│  • «latency» is more important for «interactive» workloads
│    ◦ think phone or desktop systems
│    ◦ but also web servers
│  • «throughput» is more important for «batch» systems
│    ◦ think render farms, compute grids, simulation

A scheduler will often need to make a compromise between maximising total computational «throughput» and minimising «latency». In interactive settings, the dominant concern is latency, while in computational contexts, it's the total throughput that is usually a priority.

│ Reducing Latency
│
│  • «shorter» time slices
│  • more willingness to switch tasks (more «preemption»)
│  • «dynamic» priorities
│  • priority boost for «foreground» processes

When optimising for low latency, the scheduler should use short time slices (so that threads that wake up do not need to wait too long for the processor to free up, even if they are not high-priority threads). Likewise, the scheduler should be willing to switch to a higher-priority thread whenever one wakes up, preempting running threads (even if they have not consumed their current time slice yet). Both decisions of course lead to more context switches, which are expensive and hence negatively affect total throughput (less total useful computation is performed in any given time, since more time is spent on overhead – both in the scheduler, but mainly in switching threads and processes).

│ Maximising Throughput
│
│  • «longer» time slices
│  • «reduce context switches» to minimum
│  • cooperative multitasking

The opposite is true for high-throughput systems: the time slice should be as long as is practical, to minimize the number of context switches the CPU needs to do. Likewise, the scheduler should be unwilling to switch tasks before their current time slice runs out, pushing back threads that woke up in reaction to an event. Whenever possible, preemption should be avoided in favour of cooperation (though this is, nowadays, mainly an application-level concern).

│ Multi-Core Schedulers
│
│  • traditionally one CPU, many threads
│  • nowadays: many threads, many CPUs (cores)
│  • more complicated «algorithms»
│  • more complicated & «concurrent-safe» data structures

In traditional designs, an operating system would only expect a single processor to be available, in which case the scheduler would be comparatively simple. However, essentially all modern computers are multi-core systems, and schedulers need to take this into account in two major ways:

 1. the algorithm which decides which threads run when «and on which core» is much more complicated,
 2. the scheduler must be able to run concurrently with itself: multiple cores must be able to decide which thread to run next in parallel, which means that the data structures used by the scheduler to keep track of threads must be concurrent-safe.

│ Scheduling and Caches
│
│  • threads can «move» between CPU «cores»
│    ◦ important when a different «core is idle»
│    ◦ and a runnable thread is «waiting for CPU»
│  • but there is a «price» to pay
│    ◦ thread / process data is extensively «cached»
│    ◦ caches are typically «not shared» by all cores

It is entirely possible to suspend a thread on one core, and wake it up later on another core. Since processor cores are all identical, this does not make a semantic difference – the thread will not be able to directly detect that it has moved. However, if some of the data associated with that thread was still in the per-core part of the cache (typically at least L1, but also L2 and L3 if the new core is part of a different package), this will cause a performance hit, as the new core has to re-fetch the data that was already cached by the other core.

│ Core Affinity
│
│  • modern schedulers try to «avoid moving threads»
│  • threads are said to have an «affinity» to a core
│  • an extreme case is «pinning»
│    ◦ this prevents the thread from being «migrated» altogether
│  • in practice, this «improves throughput»
│    ◦ even if nominal core «utilisation» may be lower

For the above reason, schedulers will try to avoid moving threads between cores. Of course, there are competing concerns: if a thread is runnable and a core other than its preferred core is idle, inefficiency creeps in: the system keeps a processor core idle while a thread is waiting to run. A realistic scheduler needs to strike a balance between those two concerns: if the preferred core is about to be freed up, it will wait, otherwise it will migrate the thread.

Of course, this will sometimes go wrong: a higher-priority thread with affinity to the same core might wake up and the original waiting thread will have to be migrated anyway. For this reason, stronger affinity will cause a drop in core utilisation, but in a well-tuned scheduler, this should not cause a drop in throughput.
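Pinning is also exposed to user space; for instance, on Linux (a non-portable interface, shown purely as an illustration), a thread can restrict itself to a single core:

```c
/* Pin the calling thread to core 0 using the Linux-specific
 * sched_setaffinity interface; the scheduler will never migrate it. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main( void )
{
    cpu_set_t set;
    CPU_ZERO( &set );
    CPU_SET( 0, &set );                            /* allow only core 0 */

    if ( sched_setaffinity( 0, sizeof set, &set ) != 0 )  /* 0 = this thread */
    {
        perror( "sched_setaffinity" );
        return 1;
    }

    printf( "now pinned, running on core %d\n", sched_getcpu() );
    return 0;
}
```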
│ NUMA Systems
│
│  • «non-uniform memory» architecture
│    ◦ different memory is attached to different CPUs
│    ◦ each SMP block within a NUMA system is called a «node»
│  • «migrating» a process to a «different node» is expensive
│    ◦ thread vs node ping-pong can kill performance
│    ◦ threads of «one process» should live on one node

In a traditional SMP (symmetric multiprocessing) system, each CPU has equal access to all RAM. However, in large computers with many CPUs, it is advantageous (from the point of view of hardware design) to violate this principle, and allow RAM to be attached locally to a single processor package. Cores from another such «node» then need to access this memory indirectly, adding a penalty both in terms of throughput (how many bytes per unit of time can be transferred from memory to the CPU cache) and in terms of latency (how many cycles it takes from the time data is requested to the time it is available to the processor).

A scheduler needs to take this into account: the affinity of threads (and entire processes) to a particular NUMA node is usually much stronger than the affinity of threads to individual cores within an SMP system. Unlike with purely cache-related overhead, the performance hit of running on a different NUMA node is permanent: the data is not automatically migrated into a ‘closer’ block of memory.

## Interrupts and Clocks

In this section, we will look at how preemptive scheduling is actually implemented, through periodic hardware interrupts.

│ Interrupt
│
│  • a way for hardware to «request attention»
│  • CPU «mechanism» to divert execution
│  • partial (CPU state only) «context switch»
│  • switch to «privileged» (kernel) CPU mode

Interrupts allow peripherals to request the attention of the operating system. This is a low-level CPU feature, where asserting an interrupt line (in hardware) causes the processor to stop executing instructions and execute an «interrupt handler». When an interrupt comes in, the CPU will consult a table of interrupt handlers, which is set up by the operating system. Additionally, the CPU will store its state (the values stored in registers) into memory (usually on the active stack).

Usually, only a partial context switch is done by the CPU (i.e. the page table is unaffected), but the interrupt routine will execute in privileged mode (hence it can, if it needs to, adjust page tables). In traditional designs, the kernel memory is mapped (but inaccessible from user mode) into all processes, and hence the interrupt handler can directly and immediately read and write kernel memory. When the interrupt routine is finished, it uses a special instruction (‹iret› on ‹x86›) which instructs the CPU to restore its state from the stack (which it stored upon entering the interrupt handler) and drop back into user (unprivileged) mode.
│ Hardware Interrupts
│
│  • «asynchronous», unlike software interrupts
│  • triggered via «bus signals» to the CPU
│  • IRQ = interrupt request
│    ◦ just a different «name» for hardware interrupts
│  • PIC = programmable «interrupt controller»

We have already talked about software interrupts earlier: interrupt handlers are quite alike in both cases (that is, for hardware and software interrupts), but there are other differences. In particular, a hardware interrupt is «asynchronous»: the CPU will invoke the handler regardless of the instruction that is currently being executed. Hardware-wise, interrupts are realized through bus signalling – a peripheral (or rather an interrupt controller) will send a signal down a wire to the CPU, which will react by starting its interrupt handling sequence.

│ Interrupt Controllers
│
│  • PIC: «simple circuit», typically with 8 input lines
│    ◦ peripherals connect to the PIC with wires
│    ◦ PIC delivers prioritised signals to the CPU
│  • APIC: advanced programmable interrupt controller
│    ◦ split into a shared «IO APIC» and per-core «local APIC»
│    ◦ typically 24 incoming «IRQ lines»
│  • OpenPIC, MPIC: similar to APIC, used by e.g. Freescale

An interrupt controller is essentially a hub which provides a number of interrupt lines to peripherals, and signals the CPU using a single interrupt line whenever any of the peripherals raises an interrupt. The interrupt controller of course informs the CPU which peripheral is the origin of the interrupt.

│ Timekeeping
│
│  • PIT: «programmable interval timer»
│    ◦ crystal oscillator + divider
│    ◦ «IRQ line» to the CPU
│  • local APIC timer: built-in, «per-core clock»
│  • HPET: high-precision event timer
│  • RTC: real-time clock

A (programmable) timer is another traditionally discrete component that has long been integrated into the main CPU die. Its role is to keep time (historically using a quartz crystal), and to raise an interrupt at regular, configurable intervals.

│ Timer Interrupt
│
│  • generated by the PIT or the local APIC
│  • the OS can «set the frequency»
│  • a hardware interrupt happens on each «tick»
│  • this creates an opportunity for bookkeeping
│  • and for «preemptive scheduling»

Every time the configured interval expires, the timer will generate an interrupt. When this happens, the OS kernel has a chance to run, regardless of the currently executing process. Among other things, it is possible to do a context switch by doing an ‹iret› (or equivalent) into a different thread or process from the one that was interrupted. This is how preemptive scheduling is usually implemented.

│ Timer Interrupt and Scheduling
│
│  • measure how much time the current thread took
│  • if it ran out of its slice, «preempt it»
│    ◦ «pick» a new «thread» to execute
│    ◦ perform a context switch
│  • checks are done on each tick
│    ◦ «rescheduling» is usually less frequent

The scheduler part of the interrupt handler for the timer interrupt quickly checks whether anything needs to be done: that is, it looks up how much of its time slice the currently running thread has left. If it ran out, then a «reschedule» is in order: the handler will return into a different thread (that is, perform a context switch – if needed, this includes replacing the active page table). The slice check is done on every tick, since it is cheap. On most ticks, however, no rescheduling will take place, since a typical slice is many ticks long (and a reschedule is much more expensive).
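The per-tick logic can be sketched as follows. This is a toy simulation in ordinary user-space C: in a real kernel, ‹timer_tick› would be called from the timer interrupt handler and ‹reschedule› would perform an actual context switch rather than print a message.

```c
/* Every tick is cheap (decrement a counter); only when the slice runs out
 * do we pay for a reschedule. */
#include <stdio.h>

#define SLICE_TICKS 10      /* e.g. a 10 ms quantum with a 1 kHz tick */

static int remaining = SLICE_TICKS;
static int current = 0;     /* index of the running 'thread' */

static void reschedule( void )
{
    current = ( current + 1 ) % 3;      /* round robin over 3 threads */
    remaining = SLICE_TICKS;
    printf( "context switch -> thread %d\n", current );
}

static void timer_tick( void )
{
    if ( --remaining == 0 )             /* slice used up: preempt */
        reschedule();
}

int main( void )
{
    for ( int tick = 0; tick < 35; ++tick )  /* simulate 35 timer interrupts */
        timer_tick();
    return 0;
}
```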
│ Timer Interrupt Frequency
│
│  • typical is 100 Hz
│  • this means a 10 ms «scheduling tick» (quantum)
│  • 1 kHz is also possible
│    ◦ harms throughput but «improves latency»

The traditional timer frequency on UNIX systems is 100 Hz, or one tick every 10 milliseconds. This means that time slices available to the scheduler are multiples of 10 milliseconds. For interactive systems, the frequency can be increased, with 1 kHz being the highest commonly used setting (yielding a quantum just 1 ms long, cutting down on scheduling latency quite a bit). Of course, the CPU will spend a lot more time in the timer interrupt handler, reducing the amount of useful work it can perform, thus harming throughput.

│ Tickless Kernels
│
│  • the timer interrupt «wakes up» the CPU
│  • this can be «inefficient» if the system is «idle»
│  • alternative: use «one-off» timers
│    ◦ allows the CPU to «sleep longer»
│    ◦ this «improves power efficiency» on light loads

Modern processors implement aggressive power management, shutting down parts of the CPU that are not in use. This technique is of course more efficient if the idle circuits do not need to be woken up very often. If nothing at all is going on, then an entire core can be put to sleep. Unfortunately, the timer interrupt interferes with this somewhat: while the system is completely idle (no threads are ready to run), the processor will still need to process the interrupt, 100 times every second. This has a considerable energy cost, just to figure out that nothing needs to be done.

Instead of using a periodic timer, it's possible to configure a one-off, non-repeating timer, and when it fires, compute when the next scheduling interrupt is needed. This way, during idle times, the processor can sleep undisturbed for much longer periods of time.

│ Tickless Scheduling
│
│  • «quantum length» becomes part of the planning
│  • if a core is idle, wake up on next «software timer»
│    ◦ synchronisation of software timers
│  • other interrupts are «delivered as normal»
│    ◦ network or disk activity
│    ◦ keyboard, mice, ...

This is done by simply doing away with the fixed scheduling quantum. The scheduler can compute when the next reschedule needs to happen, and not waste any time in the timer interrupt unless the slice is over.

Sleeping threads may be blocked for any number of reasons, but most commonly they are waiting for events or resources – and one of the events that are commonly awaited is a timer: the thread has some periodic work to do, and may ask the kernel to be woken up in a second, then do some of its work, and then sleep for another second, and so on. If, at a given moment, all processes are sleeping, then the next timer interrupt needs to happen no later than when the first of the threads sleeping on a timer is due to wake up.

Of course, other interrupts are still delivered as usual, so if a thread is blocked waiting for data to arrive from disk or network, the respective interrupts can cause that thread to be woken up and scheduled for execution.
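Under such a scheme, the next one-shot timer is simply the earliest of the pending obligations: the end of the current slice (if a thread is running at all) and the soonest software timer of a sleeping thread. A toy sketch of that computation (all names and numbers are made up for illustration):

```c
/* Compute when the next timer interrupt should be programmed, for a busy
 * core (a slice is running out) and for an idle one (only software timers
 * matter). */
#include <limits.h>
#include <stdio.h>

struct sleeper { const char *name; long wakeup_at; };   /* absolute time, ms */

static long next_interrupt( long now, long slice_end,
                            const struct sleeper *s, int n )
{
    long next = slice_end;              /* preemption point of the running thread */
    for ( int i = 0; i < n; ++i )
        if ( s[ i ].wakeup_at > now && s[ i ].wakeup_at < next )
            next = s[ i ].wakeup_at;    /* an earlier software timer wins */
    return next;
}

int main( void )
{
    struct sleeper sleepers[] = { { "network poll", 250 },
                                  { "cron",         60000 } };
    long now = 100;

    long busy = next_interrupt( now, now + 10, sleepers, 2 ); /* thread running */
    long idle = next_interrupt( now, LONG_MAX, sleepers, 2 ); /* core is idle   */

    printf( "busy: next interrupt at %ld ms\n", busy );   /* end of slice: 110  */
    printf( "idle: next interrupt at %ld ms\n", idle );   /* first sleeper: 250 */
    return 0;
}
```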
│ Other Interrupts
│
│  • serial port
│    ◦ «data is available» on the port
│  • «network» hardware
│    ◦ data is available in a packet queue
│  • keyboards, mice
│    ◦ «user» pressed a key, moved the mouse
│  • USB devices in general

Besides the timer interrupt, there are many others that come up, and many of them indirectly affect scheduling too (usually because one of the threads was waiting for the event that the interrupt signalled).

│ Interrupt Routing
│
│  • not all CPU cores need to see all interrupts
│  • APIC can be told how to deliver IRQs
│    ◦ the OS can «route IRQs» to CPU cores
│  • multi-core systems: IRQ «load balancing»
│    ◦ useful to «spread out» IRQ overhead
│    ◦ especially useful with «high-speed networks»

Finally, let us consider systems with multiple cores (probably the majority of computers nowadays) and how this affects interrupts: clearly, if a packet comes in from the network, it does not make sense to interrupt all cores – we only need a single core to execute the handler to fetch the data or otherwise deal with the event. What is more, this does not need to be the same core every time the interrupt happens: if there are many interrupts, we would like them spread out somewhat evenly across all the cores. This is what IRQ load balancing does: it will route interrupts to cores in such a way that each core has approximately the same amount of work handling them.

Lastly, since scheduling is done per-core, it is useful to have a separate timer interrupt for each core. Those interrupts are of course not subject to load-balancing or other re-routing – the interrupt is, in fact, generated locally on each core (in the local APIC), and the per-core timer which generates the interrupt is programmed separately.

│ Review Questions
│
│  17. What is a thread and a process?
│  18. What is a (thread, process) scheduler?
│  19. What do ‹fork› and ‹exec› do?
│  20. What is an interrupt?