# The Kernel

This lecture is about the kernel, the lowest layer of an operating system. It will be in 5 parts:

│ Lecture Overview
│
│  1. privileged mode
│  2. booting
│  3. kernel architecture
│  4. system calls
│  5. kernel-provided services

First, we will look at processor modes and how they mesh with the layering of the operating system. We will move on to the boot process, because it somewhat illustrates the relationship between the kernel and other components of the operating system, and also between the firmware and the kernel.

We will look in more detail at kernel architecture: at things that we already hinted at in previous lectures, and also at exokernels and unikernels, in addition to the architectures we already know (micro and monolithic kernels).

The fourth part will focus on system calls and their binary interface – i.e. how system calls are actually implemented at the machine level. This is closely related to the first part of the lecture about processor modes, and builds on the knowledge we gained last week about how system calls look at the C level.

Finally, we will look at the services that kernels provide to the rest of the operating system and at their responsibilities, and we will also look more closely at how microkernel and hybrid operating systems work.

│ Reminder: Software Layering
│
│  • → the «kernel» ←
│  • system «libraries»
│  • system services / «daemons»
│  • utilities
│  • «application» software

## Privileged Mode

│ CPU Modes
│
│  • CPUs provide a «privileged» (supervisor) and a «user» mode
│  • this is the case with all modern «general-purpose» CPUs
│    ◦ not necessarily with micro-controllers
│  • x86 provides 4 distinct privilege levels
│    ◦ most systems only use «ring 0» and «ring 3»
│    ◦ Xen paravirtualisation uses ring 1 for guest kernels

There are a number of operations that only programs running in supervisor mode can perform. This allows kernels to enforce boundaries between user programs.

Sometimes, there are intermediate privilege levels, which allow finer-grained layering of the operating system. For instance, drivers can run at a less privileged level than the ‘core’ of the kernel, providing a level of protection for the kernel from its own device drivers. You might remember that device drivers are the most problematic part of any kernel. In addition to device drivers, multi-layer privilege systems in CPUs can be used in certain virtualisation systems. More about this towards the end of the semester.

│ Privileged Mode
│
│  • many operations are «restricted» in «user mode»
│    ◦ this is how «user programs» are executed
│    ◦ also «most» of the operating system
│  • software running in privileged mode can do ~anything
│    ◦ most importantly it can program the «MMU»
│    ◦ the «kernel» runs in this mode

The kernel executes in the privileged mode of the processor. In this mode, the software is allowed to do anything that is possible. In particular, it can (re)program the memory management unit (MMU, see next slide). Since the MMU is how program separation is implemented, code executing in privileged mode is allowed to change the memory of any program running on the computer. This explains why we want to reduce the amount of code running in supervisor (privileged) mode to a minimum.

The way most operating systems operate, the kernel is the only piece of software that is allowed to run in this mode. The code in system libraries, daemons and so on, including application software, is restricted to the user mode of the processor. In this mode, the MMU cannot be programmed, and the software can only do what the MMU allows based on the instructions it got from the kernel.
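As a concrete aside (an illustration, not part of the slides): on x86, a program can check which ring it is currently running in, because the lowest two bits of the ‹cs› segment register hold the current privilege level. The sketch below assumes an x86 machine and a GCC-compatible compiler; run as an ordinary program, it prints ring 3.

    #include <stdio.h>

    int main( void )
    {
        unsigned short cs;

        /* read the code segment register: its two lowest bits hold
         * the Current Privilege Level (0 = kernel, 3 = user) */
        __asm__ volatile ( "mov %%cs, %0" : "=r"( cs ) );

        printf( "running in ring %d\n", cs & 3 );
        return 0;
    }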
│ Memory Management Unit
│
│  • is a subsystem of the processor
│  • takes care of «address translation»
│    ◦ user software uses «virtual addresses»
│    ◦ the MMU translates them to «physical addresses»
│  • the mappings can be managed by the OS kernel

Let's have a closer look at the MMU. Its primary role is «address translation». Addresses that programs refer to are «virtual» – they do not correspond to fixed physical locations in memory chips. Whenever you look at, say, a pointer in C code, that pointer's numeric value is an address in some virtual address space. The job of the MMU is to translate that virtual address into a physical one – one which has a fixed relationship with some physical capacitor or other electronic device that remembers information.

How those addresses are mapped is programmable: the kernel can tell the MMU how the translation goes, by providing it with translation tables. We will discuss how page tables work in a short while; what is important now is that it is the job of the kernel to build them and hand them over to the MMU.

│ Paging
│
│  • physical memory is split into «frames»
│  • virtual memory is split into «pages»
│  • pages and frames have the same size (usually 4KiB)
│  • frames are places, pages are the content
│  • «page tables» map between pages and frames

Before we get to virtual addresses, let's have a look at the other major use for the address translation mechanism, and that is «paging». We do so because it perhaps better illustrates how the MMU works.

In this viewpoint, we split physical memory (the physical address space) into «frames», which are «storage areas»: places where we can put data and retrieve it later. Think of them as shelves in a bookcase. The virtual address space is then split into «pages»: actual pieces of data of some fixed size. Pages do not physically exist, they just represent some bits that the program needs stored. You could think of a page as a really big integer. Or you can think of a page as a bunch of books that fits into a single shelf.

The page table, then, is a catalog, or an address book of sorts. Programs attach names to pages – the books – but the hardware can only locate shelves. The job of the MMU is to take the name of a book and find the physical shelf where the book is stored. Clearly, the operating system is free to move the books around, as long as it keeps the page table – the catalog – up to date. The remaining software won't know the difference.

│ Swapping Pages
│
│  • RAM used to be a scarce resource
│  • paging allows the OS to «move pages» out of RAM
│    ◦ a page (content) can be written to disk
│    ◦ and the frame can be used for another page
│  • not as important with contemporary hardware
│  • useful for «memory-mapping files» (cf. next lecture)

If we are short on shelf space, we may want to move some books into storage. Then we can use the shelf we freed up for some other books. However, the hardware can only retrieve information from shelves, and therefore, if a program asks for a book that is currently in storage, the operating system must arrange things so that it is moved from storage to a shelf before the program is allowed to continue. This process is called swapping: the OS, when pressed for memory, will evict pages from RAM onto disk or some other high-capacity (but slow) medium. It will only page them back in when they are required.

In contemporary computers, memory is not very scarce, and this use case is not very important. However, the same mechanism allows another neat trick: instead of opening a file and reading it using ‹open› and ‹read› system calls, we can use so-called memory-mapped files. This basically provides the content of the file as a chunk of memory that can be read or even written to, and the changes are sent back to the filesystem. We will discuss this in more detail next week.
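To illustrate, here is a minimal sketch of reading a file through a memory mapping, using the POSIX ‹mmap› interface (the file name is made up, and error handling is kept to a bare minimum):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main( void )
    {
        int fd = open( "example.txt", O_RDONLY );
        struct stat st;

        if ( fd < 0 || fstat( fd, &st ) < 0 )
            return 1;

        /* ask the kernel to map the file into our address space;
         * there are no read() calls -- pages are loaded on demand */
        char *data = mmap( NULL, st.st_size, PROT_READ,
                           MAP_PRIVATE, fd, 0 );

        if ( data == MAP_FAILED )
            return 1;

        fwrite( data, 1, st.st_size, stdout ); /* use it like memory */

        munmap( data, st.st_size );
        close( fd );
        return 0;
    }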
│ Look Ahead: Processes
│
│  • process is primarily defined by its «address space»
│    ◦ address space meaning the valid «virtual» addresses
│  • this is implemented via the MMU
│  • when changing processes, a different page table is loaded
│    ◦ this is called a «context switch»
│  • the «page table» defines what the process can see

We will deal with processes later in the course, but let me quickly introduce the concept now, so that we can appreciate how important the MMU is for a modern operating system. Each process has its own «address space», which describes which addresses are valid for that process. Barring additional restrictions, the process can write to any of its valid addresses and then read back the stored value from that address.

The fact that the address space of a process is «abstract», not tied to any particular physical layout of memory, is quite important. Another important observation is that the address space does not need to be contiguous, and that not all physical memory has to be visible in that address space.

│ Memory Maps
│
│  • different view of the same principles
│  • the OS «maps» physical memory into the process
│  • multiple processes can have the same RAM area mapped
│    ◦ this is called «shared memory»
│  • often, a piece of RAM is only mapped in a «single process»

We can look at the same thing from another point of view. Physical memory is a resource, and the operating system can ‘hand out’ a piece of physical memory to a process. This is done by «mapping» that piece of memory into the address space of the process. There is nothing that, in principle, prevents the operating system from mapping the same physical piece of RAM into multiple processes. In this case, the data is only stored once, but each of those processes can read it using an address in its virtual address space (possibly a different address in each process).

For ‘working’ memory – memory that is both read and written by the program – it is most common that any given area of physical memory is only mapped into a single process. Instructions are, however, shared much more often: for instance, a shared library is often mapped into multiple different processes. An executable itself may likewise be mapped into a number of processes if they all run the same program.

│ Page Tables
│
│  • the MMU is programmed using «translation tables»
│    ◦ those tables are stored in RAM
│    ◦ they are usually called «page tables»
│  • and they are fully in the management of the kernel
│  • the kernel can ask the MMU to replace the page table
│    ◦ this is how processes are isolated from each other

The actual implementation mechanism of virtual memory is known as «page tables»: those are the translation tables that tell the MMU which virtual address maps to which physical address. Page tables are stored in memory, just like any other data, and can be created and changed by the kernel.
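To make the translation mechanism more concrete, here is the walk the MMU performs, written out in C for a two-level, 32-bit x86-style page table. This is a simplified sketch: real entries carry more flags, the code pretends that page tables can be addressed directly, and on real hardware the walk is of course done by the MMU itself.

    #include <stdint.h>

    #define PTE_PRESENT 0x1u /* the entry maps a frame */

    /* The top 10 bits of a virtual address index the page directory,
     * the next 10 bits index a page table, and the low 12 bits are
     * the offset within the 4 KiB page. */
    uint32_t translate( const uint32_t *page_dir, uint32_t virt, int *ok )
    {
        uint32_t pde = page_dir[ virt >> 22 ];

        if ( !( pde & PTE_PRESENT ) )
        {
            *ok = 0; /* on real hardware, this raises a page fault */
            return 0;
        }

        const uint32_t *page_table =
            ( const uint32_t * )( uintptr_t )( pde & ~0xfffu );
        uint32_t pte = page_table[ ( virt >> 12 ) & 0x3ffu ];

        if ( !( pte & PTE_PRESENT ) )
        {
            *ok = 0; /* page fault: not mapped, or swapped out */
            return 0;
        }

        *ok = 1;
        return ( pte & ~0xfffu ) | ( virt & 0xfffu ); /* frame + offset */
    }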
The kernel usually keeps a separate set of page tables for each process, and when a context switch happens, it asks the MMU to replace the active page table with a new one (the one that belongs to the process being activated). This is usually achieved by storing the physical address of the first level of the new page table (the page directory, in x86 terminology) in a special register.

Often, the writable physical memory referenced by the first set of page tables will be unreachable from the second set, and vice versa. Even if there is overlap, it will be comparatively small (processes can request shared memory for communicating with each other). Therefore, whatever data the previous process wrote into its memory becomes completely invisible to the new process.

│ Kernel Protection
│
│  • kernel memory is usually mapped into «all processes»
│    ◦ this «improves performance» on many CPUs
│    ◦ (until «meltdown» hit us, anyway)
│  • kernel pages have a special 'supervisor' flag set
│    ◦ code executing in user mode «cannot touch them»
│    ◦ else, user code could «tamper» with kernel memory

Replacing the page tables is usually a rather expensive operation, and we want to avoid doing it as much as possible. We especially want to avoid it in the «system call» path (you probably remember system calls from last week; we will talk about them in more detail later today). For this reason, it is a commonly employed trick to map the kernel into «each process», but make that memory inaccessible to user-space code. Unfortunately, there have been some CPU bugs which make this less secure than we would like.
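The ‘supervisor’ protection mentioned on the slide is, on 32-bit x86, just another bit in each page table entry, next to the present and writable bits. A sketch (the helper names are invented, but the bit positions are the architectural ones):

    #include <stdint.h>

    enum
    {
        PTE_PRESENT  = 1u << 0, /* the entry maps a frame */
        PTE_WRITABLE = 1u << 1, /* writes are allowed */
        PTE_USER     = 1u << 2, /* accessible from user mode */
    };

    /* kernel mapping: present and writable, but PTE_USER is clear,
     * so user-mode code touching the page causes a fault */
    uint32_t make_kernel_pte( uint32_t frame )
    {
        return ( frame & ~0xfffu ) | PTE_PRESENT | PTE_WRITABLE;
    }

    /* user mapping: the same, but also accessible from ring 3 */
    uint32_t make_user_pte( uint32_t frame )
    {
        return ( frame & ~0xfffu ) | PTE_PRESENT | PTE_WRITABLE
                                   | PTE_USER;
    }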
## Booting

The boot process is a sequence of steps which starts with the computer powered off and ends when the computer is ready to interact with the user (via the operating system).

│ Starting the OS
│
│  • upon power on, the system is in a «default state»
│    ◦ mainly because «RAM is volatile»
│  • the entire «platform» needs to be «initialised»
│    ◦ this is first and foremost «the CPU»
│    ◦ and the «console» hardware (keyboard, monitor, ...)
│    ◦ then the rest of the devices

Computers can be turned off and on (clearly). When they are turned off, power is no longer available, and dynamic RAM will, without active refresh, quickly forget everything it held. Hence, when we turn the computer on, there is nothing in RAM, the CPU is in some sort of default state, and variations of the same are true of pretty much every sub-device in the computer. Except for the content of persistent storage, the computer is in the same state as when it left the factory. The computer in this state is, to put it bluntly, not very useful.

│ Boot Process
│
│  • the process starts with a built-in hardware init
│  • when ready, the hardware hands off to the «firmware»
│    ◦ this was BIOS on 16 and 32 bit systems
│    ◦ replaced with EFI on current ‹amd64› platforms
│  • the firmware then loads a «bootloader»
│  • the bootloader «loads the kernel»

We will not get into the hardware part of the sequence. The switch is flipped, the hardware powers up and does its thing. At some point, the firmware takes over and does some more things. The hardware and firmware are finally put into a state where they can begin loading the operating system. There is usually a piece of software that the firmware loads from persistent storage, called a «bootloader».

This bootloader is, more or less, a part of the operating system: its purpose is to find and load the kernel (from persistent storage, usually by using firmware services to identify said storage and load data from it). It may or may not understand file systems and similar high-level things. In the simplest case, the bootloader has a list of disk blocks in which the kernel is stored, and requests those from the firmware. In modern systems, both the firmware and the bootloader are quite sophisticated, and understand complicated, high-level things (including e.g. encrypted drives).

│ Boot Process (cont'd)
│
│  • the kernel then initialises «device drivers»
│  • and the «root filesystem»
│  • then it hands off to the ‹init› process
│  • at this point, the «user space» takes over

We are finally getting to familiar ground. The bootloader has loaded the kernel into RAM and jumped to a pre-arranged address inside the kernel image. The instructions stored at that address kickstart the kernel initialization sequence.

The first part is usually still rather low-level: it puts the CPU and some basic peripherals (console, timers and so on) into a state in which the operating system can use them. Then it hands off control to C code, which sets up the basic data structures used by the kernel. Afterwards, the kernel starts initializing individual peripheral devices – this task is performed by the respective device drivers. When the peripherals are initialized, the kernel can start looking for the «root filesystem» – it is usually stored on one of the attached persistent storage devices (which should now be operational and available to the kernel via their device drivers).

After mounting the root filesystem, the kernel can set up an empty process, load the ‹init› program into that process, and hand over control. At this point, the kernel stops behaving like a sequential program with ‹main› in it and fades into the background: all action is driven by user-space processes from now on (or by hardware interrupts, but we will talk about those much later in the course).

│ User-mode Initialisation
│
│  • ‹init› mounts the remaining file systems
│  • the ‹init› process starts up user-mode «system services»
│  • then it starts «application services»
│  • and finally the ‹login› process

We are far from done. The ‹init› process now needs to hunt down all the other file systems and mount them, start a whole bunch of «system services» and perhaps some «application services» (daemons which are not part of the operating system – things like web servers). Once all the essential services are ready, ‹init› starts the ‹login› process, which then presents the familiar login screen, asking the user to type in their name and password. At this point, the boot process is complete, but we will have a quick look at one more step.

│ After Log-In
│
│  • the ‹login› process initiates the «user session»
│  • loads «desktop» modules and «application software»
│  • drops the user in a (text or graphical) «shell»
│  • now you can start using the computer

When the user logs in, another initialization sequence starts: the system needs to set up a «session» for the user. Again, this involves a number of steps, but at the end, it is finally possible to interact with the computer.

│ CPU Init
│
│  • this depends on both «architecture» and «platform»
│  • on ‹x86›, the CPU starts in «16-bit» mode
│  • on legacy systems, BIOS & bootloader stay in this mode
│  • the kernel then switches to «protected mode» during its boot

Let's go back to the start and fill in some additional details.
First of all, what is the state of the CPU at boot, and why does the operating system need to do anything? This has to do with backward compatibility: a CPU usually starts up in its most-compatible mode – in the case of 32-bit x86 processors, this is a 16-bit mode with the MMU disabled. Since the entire platform keeps backward compatibility, the firmware keeps the CPU in this mode, and it is the job of either the bootloader or the kernel itself to fix this. This is not always the case, though: modern 64-bit x86 processors still start up in 16-bit mode, but the firmware puts them into «long mode» – the 64-bit one – before handing off to the bootloader.

│ Bootloader
│
│  • historically limited to tens of «kilobytes» of code
│  • the bootloader locates the kernel «on disk»
│    ◦ may allow the operator to choose different kernels
│    ◦ «limited» understanding of «file systems»
│  • then it «loads the kernel» image into «RAM»
│  • and hands off control to the kernel

A bootloader is a short, platform-specific program which loads the kernel from persistent storage (usually a file system on a disk) and hands off execution to the kernel. The bootloader might do some very basic hardware initialization, but most of that is done by the kernel itself in a later stage.

│ Modern Booting on ‹x86›
│
│  • the bootloader nowadays runs in «protected mode»
│    ◦ or even the long mode on 64-bit CPUs
│  • the firmware understands the ‹FAT› filesystem
│    ◦ it can «load files» from there into memory
│    ◦ this vastly «simplifies» the boot process

The boot process has been considerably simplified on x86 computers in the last decade or so. Much higher-level APIs have been added to the standardized firmware interface, making the boot code considerably simpler.

│ Booting ARM
│
│  • on ARM boards, there is «no unified firmware» interface
│  • U-boot is as close as one gets to unification
│  • the bootloader needs «low-level» hardware knowledge
│  • this makes writing bootloaders for ARM quite «tedious»
│  • current U-boot can use the «EFI protocol» from PCs

Unlike the x86 world, the ARM ecosystem is far less standardized, and each system on a chip needs a slightly different boot process. This is extremely impractical, since there are dozens of SoC models from many different vendors, and new ones come out regularly. Fortunately, U-boot has become a de-facto standard, and while U-boot itself still needs to be adapted to each new SoC or even each board, the operating system is, nowadays, mostly insulated from this complexity.

## Kernel Architecture

In this section, we will look at different architectures (designs) of kernels: the main distinction we will talk about is which services and components are part of the kernel proper, and which are outside of the kernel.

│ Architecture Types
│
│  • «monolithic» kernels (Linux, *BSD)
│  • microkernels (Mach, L4, QNX, NT, ...)
│  • «hybrid» kernels (macOS)
│  • type 1 «hypervisors» (Xen)
│  • exokernels, rump kernels

We have already mentioned the two main kernel types earlier in the course. Those types represent the extremes of mainstream kernel design: microkernels are the smallest (most exclusive) mainstream design, while monolithic kernels are the biggest (most inclusive). Systems with «hybrid» kernels are a natural compromise between those two extreme designs: they have two components, a microkernel and a so-called «superserver», which is essentially a gutted monolithic kernel – that is, one with the functionality covered by the microkernel removed. Besides ‘mainstream’ kernel designs, there are a few more exotic choices.
We could consider type 1 (bare metal) hypervisors to be a special type of operating system kernel, where the «applications» are simply virtual machines – i.e. ‘normal’ operating systems (more on this later in the course). Then there are «exokernel» operating systems, which drastically cut down on the services provided to applications, and «unikernels», which are basically libraries for running entire applications in kernel mode.

│ Microkernel
│
│  • handles «memory protection»
│  • (hardware) interrupts
│  • task / process «scheduling»
│  • «message passing»
│
│  • everything else is «separate»

A microkernel handles only the essential services – those that cannot be reasonably done outside of the kernel (that is, outside of the privileged mode of the CPU). This obviously includes programming the MMU (i.e. management of address spaces and memory protection), handling interrupts (those switch the CPU into privileged mode, so at least the initial interrupt routine needs to be part of the kernel), thread and process switching (and typically also scheduling), and finally some form of inter-process communication mechanism (typically message passing). With those components in the kernel, almost everything else can be realized outside the kernel proper (though device drivers do need some additional low-level services from the kernel not listed here, like DMA programming and delegation of hardware interrupts).

│ Monolithic Kernels
│
│  • all that a microkernel does
│  • plus device drivers
│  • file systems, volume management
│  • a network stack
│  • data encryption, ...

A monolithic kernel needs to include everything that a microkernel does (even though some of the bits typically take a slightly different form: inter-process communication is present, but may be of a different kind, and driver integration looks different). However, there are many additional responsibilities: many device drivers (those that need interrupts or DMA, or are otherwise performance-critical) are integrated into the kernel, as are file systems and volume (disk) management. A complete TCP/IP stack is almost a given. A number of additional bits and pieces might be part of the kernel, like cryptographic services (key management, disk encryption, etc.), packet filtering, a kitchen sink and so on. Of course, all that code runs in privileged mode, and as such has complete power over the operating system and the computer as a whole.

│ Microkernel Redux
│
│  • we need a lot more than a microkernel provides
│  • in a “true” microkernel OS, there are many modules
│  • each «device driver» runs in a «separate process»
│  • the same for «file systems» and networking
│  • those modules / processes are called «servers»

The question that now arises is who is responsible for all the services listed on the previous slide (those that are part of a monolithic kernel, but are missing from a microkernel). In a ‘true’ microkernel operating system, those services are individually covered, each by a separate process (also known as a «server» in this context).

│ Hybrid Kernels
│
│  • based around a microkernel
│  • «and» a gutted monolithic kernel
│
│  • the monolithic kernel is a big server
│    ◦ takes care of stuff not handled by the microkernel
│    ◦ easier to implement than true microkernel OS
│    ◦ strikes middle ground on performance

In a hybrid kernel, most of the services are provided by a single large server, which is somewhat isolated from the hardware. It is often the case that the server is based on a monolithic OS kernel, with the lowest-level layers removed and replaced with calls to the microkernel as appropriate. Hybrid kernels are both cheaper to design and theoretically perform better than ‘true’ (multi-server) microkernel systems.
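Since everything in a multi-server system is built on message passing, it may help to see what the client side could look like. The API below is invented for illustration (real microkernels – L4, QNX, Mach – each have their own, considerably more refined interfaces): a file system request becomes a synchronous send/receive pair.

    #include <stdint.h>
    #include <string.h>

    struct message /* a hypothetical fixed-size message */
    {
        uint32_t type;
        char payload[ 60 ];
    };

    /* hypothetical kernel primitives: both block until the message
     * has been handed over to the other side */
    int send( int channel, const struct message *msg );
    int receive( int channel, struct message *msg );

    enum { FS_OPEN = 1 }; /* a made-up file server protocol */

    int fs_open( int fs_channel, const char *path )
    {
        struct message req = { .type = FS_OPEN }, reply;

        strncpy( req.payload, path, sizeof( req.payload ) - 1 );
        send( fs_channel, &req );      /* ask the file system server */
        receive( fs_channel, &reply ); /* wait for its answer */
        return ( int ) reply.type;     /* say, a handle or an error */
    }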
│ Micro vs Mono
│
│  • microkernels are more «robust»
│  • monolithic kernels are more «efficient»
│    ◦ less context switching
│  • what is easier to implement is debatable
│    ◦ in the short view, monolithic wins
│  • hybrid kernels are a «compromise»

The main advantage of microkernels is their robustness in the face of software bugs. Since the kernel itself is small, the chances of a bug in the kernel proper are much diminished, compared to the relatively huge code base of a monolithic kernel. The impact of bugs outside the kernel (in servers) is considerably smaller, since those are isolated from the rest of the system, and even if they provide vital services, the system can often recover from a failure by restarting the failed server.

On the other hand, monolithic kernels offer better performance, mainly through reduced context switching, which is still fairly expensive even on modern, virtualisation-capable processors. However, as monolithic kernels adopt technologies such as kernel page table isolation to improve their security properties, the performance difference becomes smaller.

Implementation-wise, monolithic kernels offer two advantages: in many cases, code can be written in a direct, synchronous style, and different parts of the kernel can share data structures without additional effort. In contrast, a proper multi-server system often has to use asynchronous communication (message passing) to achieve the same goals, making the code harder to write and harder to understand. In the long term, the improved modularity and isolation of components could outweigh the short-term gains in programming efficiency due to the more direct programming style.

│ Exokernels
│
│  • smaller than a microkernel
│  • far «fewer abstractions»
│    ◦ applications only get «block» storage
│    ◦ networking is much reduced
│  • only «research systems» exist

Operating systems based on microkernels still provide the full suite of services to their applications, including file systems, network stacks and so on. The difference lies in where this functionality is implemented: in the kernel proper, or in a user-mode server. With exokernels, this is no longer true: the services provided by the operating system are severely cut down. The resulting system is somewhere between a paravirtualized computer (we will discuss this concept in more detail near the end of the course) and a ‘standard’ operating system. Unlike with virtual machines (and unikernels), process-based application isolation is still available, and plays an important role. No production systems based on this architecture currently exist.

│ Type 1 Hypervisors
│
│  • also known as «bare metal» or «native» hypervisors
│  • they resemble microkernel operating systems
│    ◦ or exokernels, depending on the viewpoint
│  • “applications” for a hypervisor are «operating systems»
│    ◦ hypervisor can use «coarser abstractions» than an OS
│    ◦ entire storage devices instead of a filesystem

A bare metal hypervisor is similar to an exokernel or a microkernel operating system (depending on the particular hypervisor, and on our point of view).
Typically, a hypervisor provides interfaces and resources that are traditionally implemented in hardware: block devices, network interfaces and a virtual CPU, including a virtual MMU that allows the ‘applications’ (i.e. the guest operating systems) to take advantage of paging.

│ Unikernels
│
│  • kernels for running a «single application»
│    ◦ makes little sense on real hardware
│    ◦ but can be very useful on a «hypervisor»
│  • bundle applications as «virtual machines»
│    ◦ without the overhead of a general-purpose OS

Unikernels constitute a different strand of minimalist operating system design (compared to exokernels). In this case, process-level multitasking and address space isolation are not part of the kernel: instead, the kernel exists to support a single application by providing (a subset of) traditional OS abstractions, like a networking stack, a hierarchical file system and so on. When an application is bundled with a compatible unikernel, the result can be executed directly on a hypervisor (or an exokernel).

│ Exo vs Uni
│
│  • an exokernel runs «multiple applications»
│    ◦ includes process-based isolation
│    ◦ but «abstractions» are very «bare-bones»
│  • unikernel only runs a «single application»
│    ◦ provides more-or-less «standard services»
│    ◦ e.g. standard hierarchical file system
│    ◦ socket-based network stack / API

## System Calls

In the remainder of this lecture, we will focus on monolithic kernels, since the more progressive designs do not use the traditional system call mechanism. In those systems, most ‘system calls’ are implemented through message passing, and only the services provided directly by the microkernel use a mechanism that resembles the system calls described in this section.

│ Reminder: Kernel Protection
│
│  • kernel executes in «privileged» mode of the CPU
│  • kernel memory is protected from user code
│
│ But: Kernel Services
│
│  • user code needs to ask kernel for «services»
│  • how do we «switch the CPU» into privileged mode?
│  • «cannot» be done arbitrarily (security)

The main purpose of the system call interface is to allow secure transfer of control between a user-space application and the kernel. Recall that each executes with a different level of privilege (at the CPU level). A viable system call mechanism must allow the application to switch the CPU into privileged mode (so that the CPU can execute kernel code), but in a way that does not allow the application to execute its own code in this mode.

│ System Calls
│
│  • hand off execution to a «kernel routine»
│  • pass «arguments» into the kernel
│  • obtain «return value» from the kernel
│  • all of this must be done «safely»

We would like system calls to behave more or less like standard subroutines (e.g. those provided by system libraries): this means that we want to pass arguments to the subroutine and obtain its return value. As with the transfer of control flow, we need the argument passing to be safe: the user-space side of the call must not be able to read or modify kernel memory.

│ Trapping into the Kernel
│
│  • there are a few possible mechanisms
│  • details are very «architecture-specific»
│  • in general, the kernel sets a fixed «entry address»
│    ◦ an instruction changes the CPU into privileged mode
│    ◦ while «at the same time» jumping to this address

Security from execution of arbitrary code by the application is achieved by tying the privilege escalation (i.e. the entry into the privileged CPU mode) to a simultaneous transfer of execution to a fixed address, which the application is unable to change. The exact mechanism is highly architecture-dependent, but the principle outlined here is universal.
│ Trap Example: ‹x86›
│
│  • there is an ‹int› instruction on those CPUs
│  • this is called a «software interrupt»
│    ◦ interrupts are normally a «hardware» thing
│    ◦ interrupt «handlers» run in «privileged mode»
│  • it is also synchronous
│  • the handler is set in ‹IDT› (interrupt descriptor table)

On traditional (32-bit) x86 CPUs, the preferred method of implementing the system call trap was through «software interrupts». In this case, the application uses an ‹int› instruction, which causes the CPU to perform a process analogous to a hardware interrupt. The two important aspects are:

 1. the CPU switches into privileged mode to execute the «interrupt handler»,
 2. the address to jump to is read from an «interrupt handler table», which is a data structure stored in RAM, at an address given by a special register.

The kernel sets up the interrupt handler table in such a way that user-level code cannot change it (via standard MMU-based memory protection). The register which holds its address cannot be changed outside of privileged mode.

│ Software Interrupts
│
│  • those are available on a range of CPUs
│  • generally «not very efficient» for system calls
│  • extra level of indirection
│    ◦ the handler address is retrieved from memory
│    ◦ a «lot of CPU state» needs to be saved

A similar mechanism is available on many other processor architectures. There are, however, some downsides to using this approach for system calls, the main one being poor performance. Since the mechanism piggy-backs on the hardware variety of interrupts, the CPU usually saves a lot more computation state than would be required. As an additional inconvenience, there are multiple entry points, which must therefore be stored in RAM (instead of a register), causing additional delays when the CPU needs to read the interrupt table. Finally, arguments must be passed through memory, since registers are reset by the interrupt, again contributing to increased latency.

│ Aside: SW Interrupts on PCs
│
│  • those are used even in «real mode»
│    ◦ legacy 16-bit mode of 80x86 CPUs
│    ◦ BIOS (firmware) routines via ‹int 0x10› & ‹0x13›
│    ◦ MS-DOS API via ‹int 0x21›
│  • and on older CPUs in 32-bit «protected mode»
│    ◦ Windows NT uses ‹int 0x2e›
│    ◦ Linux uses ‹int 0x80›

On the ubiquitous x86 architecture, software interrupts were the preferred mechanism for providing services to application programs until the end of the 32-bit x86 era. Interestingly, x86 CPUs since the 80386 offer a mechanism that was directly intended to implement operating system services (i.e. syscalls), but it was rather complex and largely ignored by operating system programmers.

│ Trap Example: ‹amd64› / ‹x86_64›
│
│  • ‹sysenter› and ‹syscall› instructions
│    ◦ and corresponding ‹sysexit› / ‹sysret›
│  • the entry point is stored in a «machine state register»
│  • there is only «one entry point»
│    ◦ unlike with software interrupts
│  • quite a bit «faster» than interrupts

When x86 switched to a 64-bit address space, many new instructions found their way into the instruction set. Among those was a simple, single-entry-point privilege escalation instruction. This mechanism avoids most of the overhead associated with software interrupts: computation state is managed in software, allowing compilers to only save and restore a small number of registers across the system call (instead of having the CPU automatically save its entire state into memory).
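On Linux/amd64, the user-space side of this mechanism is small enough to write out by hand. The following sketch assumes GCC inline assembly on Linux/amd64 – normally ‹libc› does this for us. Note the number 1 placed in ‹rax›: it selects which system call we want, which brings us to the next question.

    /* Linux/amd64 convention: system call number in rax, arguments
     * in rdi, rsi, rdx (then r10, r8, r9); the result comes back in
     * rax; the syscall instruction clobbers rcx and r11 */
    long raw_write( int fd, const void *buf, unsigned long count )
    {
        long ret;
        __asm__ volatile ( "syscall"
                           : "=a"( ret )
                           : "a"( 1L /* SYS_write */ ),
                             "D"( ( long ) fd ), "S"( buf ), "d"( count )
                           : "rcx", "r11", "memory" );
        return ret;
    }

    int main( void )
    {
        raw_write( 1, "hello, kernel\n", 14 );
        return 0;
    }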
│ Which System Call?
│
│  • often there are «many» system calls
│    ◦ there are more than 300 on 64-bit Linux
│    ◦ about 400 on 32-bit Windows NT
│  • but there is only a «handful of interrupts»
│    ◦ and only one ‹sysenter› address

Usually, there is only a single entry point (address) shared by all system calls. However, the kernel needs to be able to figure out which service the application program requested.

│ Reminder: System Call Numbers
│
│  • each system call is assigned a «number»
│  • available as ‹SYS_write› &c. on POSIX systems
│  • for the “universal” ‹int syscall( int sys, ... )›
│  • this number is passed in a CPU register

This is achieved by simply sending the «syscall number» as an argument in a specific CPU register. The kernel can then decide, based on this number, which kernel routine to execute on behalf of the program.

│ System Call Sequence
│
│  • first, ‹libc› prepares the system call «arguments»
│  • and puts the system call «number» in the correct register
│  • then the CPU is switched into «privileged mode»
│  • this also transfers control to the «syscall handler»

The first stage of a system call is executed in user mode, and is usually implemented in ‹libc›.

│ System Call Handler
│
│  • the handler first picks up the system call «number»
│  • and decides where to continue
│  • you can imagine this as a giant ‹switch› statement
│
│      switch ( sysnum ) /* C */
│      {
│          case SYS_write: return syscall_write();
│          case SYS_read:  return syscall_read();
│          /* many more */
│      }

After the switch to privileged mode, the kernel needs to make sense of the arguments that the user program provided, and most importantly, decide which system call was requested. The code to do this in the kernel might look like the ‹switch› statement above.

│ System Call Arguments
│
│  • each system call has «different arguments»
│  • how they are passed to the kernel is «CPU-dependent»
│  • on 32-bit ‹x86›, most of them are passed «in memory»
│  • on ‹amd64› Linux, all arguments go into «registers»
│    ◦ 6 registers available for arguments

Since different system calls expect different arguments, the specific argument processing is done after the system call is dispatched based on its number. In modern systems, arguments are passed in CPU registers, but this was not possible with protocols based on software interrupts (instead, arguments would be passed through memory, usually at the top of the user-space stack).
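All of the above is directly observable from C on POSIX systems: the “universal” wrapper from the earlier slide really exists, and the following sketch is equivalent to calling ‹write( 1, ... )›:

    #include <sys/syscall.h> /* SYS_write and friends */
    #include <unistd.h>      /* the generic syscall() wrapper */

    int main( void )
    {
        /* libc puts SYS_write into the right register, moves the
         * remaining arguments into place and traps into the kernel */
        syscall( SYS_write, 1, "forty-two\n", 10 );
        return 0;
    }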
## Kernel Services

Finally, we will revisit the services offered by monolithic kernels, and look at how they are realized in microkernel operating systems.

│ What Does a Kernel Do?
│
│  • «memory» & process management
│  • task (thread) «scheduling»
│  • device drivers
│    ◦ SSDs, GPUs, USB, bluetooth, HID, audio, ...
│  • file systems
│  • networking

The first two points are a core responsibility of the kernel: those are rarely ‘outsourced’ into external services. The remaining services are a core part of an «operating system», but not necessarily of a kernel. However, it is hard to imagine a modern, general-purpose operating system which would omit any of them. In traditional (monolithic) designs, they are all part of the kernel.

│ Additional Services
│
│  • inter-process «communication»
│  • timers and time keeping
│  • process tracing, profiling
│  • security, sandboxing
│  • cryptography

A monolithic kernel may provide a number of additional services, with varying importance. Not all systems provide all of the services, and the implementations can look quite different across operating systems. Out of this (incomplete) list, IPC (inter-process communication) is the only item that is quite universally present, in some form, in microkernels. Moreover, while dedicated IPC mechanisms are common in monolithic kernels, they are much more important in microkernels.

│ Reminder: Microkernel Systems
│
│  • the kernel proper is «very small»
│  • it is accompanied by «servers»
│  • in “true” microkernel systems, there are «many servers»
│    ◦ each device, filesystem, etc. is separate
│  • in «hybrid» systems, there is one, or a few
│    ◦ a “superserver” that resembles a monolithic kernel

Recall that a microkernel is small: it only provides services that cannot be reasonably implemented outside of it. Of course, the operating system as a whole still needs to implement those services. Two basic strategies are available:

 1. a single program, running in a single process, implements all the missing functionality: this program is called a superserver, and internally has an architecture that is rather similar to that of a standard monolithic kernel,
 2. each service is provided by a separate, specialized program, running in its own process (and hence, address space) – this is characteristic of so-called ‘true’ microkernel systems.

There are, of course, different trade-offs involved in those two basic designs. A hybrid system (i.e. one with a superserver) is easier to design and implement initially (for instance, the persistent storage drivers, the block layer and the file system all share the same address space, simplifying the implementation) and is often considerably faster, since communication between components does not involve context switches. On the other hand, a true microkernel system, with services and drivers all strictly separated into individual processes, is more robust, and in theory also easier to scale to large SMP systems.

│ Kernel Services
│
│  • we usually don't care «which server» provides what
│    ◦ each system is different
│    ◦ for services, we take a «monolithic» view
│  • the services are used through «system libraries»
│    ◦ they abstract away many of the details
│    ◦ e.g. whether a service is a «system call» or an «IPC call»

From the user-space point of view, the specifics of kernel architecture should not matter. Applications use system libraries to talk to the kernel in either case: it is up to the libraries in question to implement the protocol for locating the relevant servers and interacting with them.
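As a sketch of that abstraction (with entirely hypothetical names – no real system library is quite this simple), the same ‹write›-like function could be backed by either transport:

    long trap( int sysnum, long a, long b, long c ); /* system call */
    long ipc_call( int server, int op,
                   long a, long b, long c );         /* message */

    enum { SYS_WRITE = 1, FS_SERVER = 3, FS_WRITE = 7 }; /* made up */

    long os_write( int fd, const void *buf, long count )
    {
    #ifdef MONOLITHIC /* a single kernel: trap directly */
        return trap( SYS_WRITE, fd, ( long ) buf, count );
    #else             /* microkernel: talk to the file server */
        return ipc_call( FS_SERVER, FS_WRITE, fd, ( long ) buf, count );
    #endif
    }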
│ User-Space Drivers in Monolithic Systems
│
│  • not «all» device drivers are part of the kernel
│  • case in point: «printer» drivers
│  • also some «USB devices» (not the USB bus though)
│  • part of the GPU/graphics stack
│    ◦ memory and output management in kernel
│    ◦ most of OpenGL in «user space»

While user-space drivers are par for the course in microkernel systems, there are also certain cases where drivers in operating systems based on monolithic kernels have significant user-space components. The most common example is probably printer drivers: low-level communication with the printer (at the USB level) is mediated by the kernel, but for many printers, document processing comprises a large part of the functionality of the driver. In some cases, this involves format conversion (e.g. for PCL printers), but in others, the input document is rasterised by the driver on the main CPU: instead of sending text and layout information to the printer, the driver sends pixel data, or even a stream of commands for the printing head.

The situation with GPUs is somewhat analogous: low-level access to the hardware is provided by the kernel, but again, a large part of the driver is dedicated to data manipulation: dealing with triangle meshes, textures, lighting and so on. Additionally, modern GPUs are invariably «programmable»: a shader compiler is also part of the driver, translating high-level shader programs into instruction streams that can be executed by the GPU. We will deal with device drivers in more detail in lecture 8.

│ Review Questions
│
│   9. What CPU modes are there and how are they used?
│  10. What is the memory management unit?
│  11. What is a microkernel?
│  12. What is a system call?