# The Kernel

This lecture is about the kernel, the lowest layer of an operating system. It will be in 5 parts:

│ Lecture Overview
│
│  1. privileged mode
│  2. booting
│  3. kernel architecture
│  4. system calls
│  5. kernel-provided services

First, we will look at processor modes and how they mesh with the layering of the operating system. We will move on to the boot process, because it somewhat illustrates the relationship between the kernel and other components of the operating system, and also between the firmware and the kernel.

We will look in more detail at kernel architecture: at things that we already hinted at in previous lectures, and also at exokernels and unikernels, in addition to the architectures we already know (micro and monolithic kernels).

The fourth part will focus on system calls and their binary interface – i.e. how system calls are actually implemented at the machine level. This is closely related to the first part of the lecture about processor modes, and builds on the knowledge we gained last week about how system calls look at the C level.

Finally, we will look at the services that kernels provide to the rest of the operating system and at their responsibilities, and we will also look more closely at how microkernel and hybrid operating systems work.

│ Reminder: Software Layering
│
│  • → the «kernel» ←
│  • system «libraries»
│  • system services / «daemons»
│  • utilities
│  • «application» software

## Privileged Mode

│ CPU Modes
│
│  • CPUs provide a «privileged» (supervisor) and a «user» mode
│  • this is the case with all modern «general-purpose» CPUs
│    ◦ not necessarily with micro-controllers
│  • x86 provides 4 distinct privilege levels
│    ◦ most systems only use «ring 0» and «ring 3»
│    ◦ Xen paravirtualisation uses ring 1 for guest kernels

There are a number of operations that only programs running in supervisor mode can perform. This allows kernels to enforce boundaries between user programs.

Sometimes, there are intermediate privilege levels, which allow finer-grained layering of the operating system. For instance, drivers can run at a less privileged level than the ‘core’ of the kernel, providing a level of protection for the kernel from its own device drivers. You might remember that device drivers are the most problematic part of any kernel. In addition to device drivers, multi-layer privilege systems in CPUs can be used in certain virtualisation systems. More about this towards the end of the semester.

│ Privileged Mode
│
│  • many operations are «restricted» in «user mode»
│    ◦ this is how «user programs» are executed
│    ◦ also «most» of the operating system
│  • software running in privileged mode can do ~anything
│    ◦ most importantly it can program the «MMU»
│    ◦ the «kernel» runs in this mode

The kernel executes in the privileged mode of the processor. In this mode, the software is allowed to do anything that is possible. In particular, it can (re)program the memory management unit (MMU, see next slide). Since the MMU is how program separation is implemented, code executing in privileged mode is allowed to change the memory of any program running on the computer. This explains why we want to reduce the amount of code running in supervisor (privileged) mode to a minimum.

The way most operating systems operate, the kernel is the only piece of software that is allowed to run in this mode. The code in system libraries, daemons and so on, including application software, is restricted to the user mode of the processor. In this mode, the MMU cannot be programmed, and the software can only do what the MMU allows based on the instructions it got from the kernel.
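As a concrete aside (an illustration, not part of the slides): on x86, a program can check which ring it is currently running in, because the lowest two bits of the ‹cs› segment register hold the current privilege level. The sketch below assumes an x86 machine and a GCC-compatible compiler; run as an ordinary program, it prints ring 3.

    #include <stdio.h>

    int main( void )
    {
        unsigned short cs;

        /* read the code segment register: its two lowest bits hold
         * the Current Privilege Level (0 = kernel, 3 = user) */
        __asm__ volatile ( "mov %%cs, %0" : "=r"( cs ) );

        printf( "running in ring %d\n", cs & 3 );
        return 0;
    }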
│ Memory Management Unit
│
│  • is a subsystem of the processor
│  • takes care of «address translation»
│    ◦ user software uses «virtual addresses»
│    ◦ the MMU translates them to «physical addresses»
│  • the mappings can be managed by the OS kernel

Let's have a closer look at the MMU. Its primary role is «address translation». Addresses that programs refer to are «virtual» – they do not correspond to fixed physical locations in memory chips. Whenever you look at, say, a pointer in C code, that pointer's numeric value is an address in some virtual address space. The job of the MMU is to translate that virtual address into a physical one – one which has a fixed relationship with some physical capacitor or other electronic device that remembers information.

How those addresses are mapped is programmable: the kernel can tell the MMU how the translation goes, by providing it with translation tables. We will discuss how page tables work in a short while; what is important now is that it is the job of the kernel to build them and hand them over to the MMU.

│ Paging
│
│  • physical memory is split into «frames»
│  • virtual memory is split into «pages»
│  • pages and frames have the same size (usually 4KiB)
│  • frames are places, pages are the content
│  • «page tables» map between pages and frames

Before we get to virtual addresses, let's have a look at the other major use for the address translation mechanism, and that is «paging». We do so because it perhaps better illustrates how the MMU works.

In this viewpoint, we split physical memory (the physical address space) into «frames», which are «storage areas»: places where we can put data and retrieve it later. Think of them as shelves in a bookcase. The virtual address space is then split into «pages»: actual pieces of data of some fixed size. Pages do not physically exist, they just represent some bits that the program needs stored. You could think of a page as a really big integer. Or you can think of a page as a bunch of books that fits into a single shelf.

The page table, then, is a catalog, or an address book of sorts. Programs attach names to pages – the books – but the hardware can only locate shelves. The job of the MMU is to take the name of a book and find the physical shelf where the book is stored. Clearly, the operating system is free to move the books around, as long as it keeps the page table – the catalog – up to date. The remaining software won't know the difference.

│ Swapping Pages
│
│  • RAM used to be a scarce resource
│  • paging allows the OS to «move pages» out of RAM
│    ◦ a page (content) can be written to disk
│    ◦ and the frame can be used for another page
│  • not as important with contemporary hardware
│  • useful for «memory-mapping files» (cf. next lecture)

If we are short on shelf space, we may want to move some books into storage. Then we can use the shelf we freed up for some other books. However, the hardware can only retrieve information from shelves, and therefore, if a program asks for a book that is currently in storage, the operating system must arrange things so that it is moved from storage to a shelf before the program is allowed to continue. This process is called swapping: the OS, when pressed for memory, will evict pages from RAM onto disk or some other high-capacity (but slow) medium. It will only page them back in when they are required.

In contemporary computers, memory is not very scarce, and this use case is not very important. However, the same mechanism allows another neat trick: instead of opening a file and reading it using ‹open› and ‹read› system calls, we can use so-called memory-mapped files. This basically provides the content of the file as a chunk of memory that can be read or even written to, and the changes are sent back to the filesystem. We will discuss this in more detail next week.
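To illustrate, here is a minimal sketch of reading a file through a memory mapping, using the POSIX ‹mmap› interface (the file name is made up, and error handling is kept to a bare minimum):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main( void )
    {
        int fd = open( "example.txt", O_RDONLY );
        struct stat st;

        if ( fd < 0 || fstat( fd, &st ) < 0 )
            return 1;

        /* ask the kernel to map the file into our address space;
         * there are no read() calls -- pages are loaded on demand */
        char *data = mmap( NULL, st.st_size, PROT_READ,
                           MAP_PRIVATE, fd, 0 );

        if ( data == MAP_FAILED )
            return 1;

        fwrite( data, 1, st.st_size, stdout ); /* use it like memory */

        munmap( data, st.st_size );
        close( fd );
        return 0;
    }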
│ Look Ahead: Processes
│
│  • process is primarily defined by its «address space»
│    ◦ address space meaning the valid «virtual» addresses
│  • this is implemented via the MMU
│  • when changing processes, a different page table is loaded
│    ◦ this is called a «context switch»
│  • the «page table» defines what the process can see

We will deal with processes later in the course, but let me quickly introduce the concept now, so that we can appreciate how important the MMU is for a modern operating system. Each process has its own «address space», which describes which addresses are valid for that process. Barring additional restrictions, the process can write to any of its valid addresses and then read back the stored value from that address.

The fact that the address space of a process is «abstract», not tied to any particular physical layout of memory, is quite important. Another important observation is that the address space does not need to be contiguous, and that not all physical memory has to be visible in that address space.

│ Memory Maps
│
│  • different view of the same principles
│  • the OS «maps» physical memory into the process
│  • multiple processes can have the same RAM area mapped
│    ◦ this is called «shared memory»
│  • often, a piece of RAM is only mapped in a «single process»

We can look at the same thing from another point of view. Physical memory is a resource, and the operating system can ‘hand out’ a piece of physical memory to a process. This is done by «mapping» that piece of memory into the address space of the process. There is nothing that, in principle, prevents the operating system from mapping the same physical piece of RAM into multiple processes. In this case, the data is only stored once, but each of those processes can read it using an address in its virtual address space (possibly a different address in each process).

For ‘working’ memory – memory that is both read and written by the program – it is most common that any given area of physical memory is only mapped into a single process. Instructions are, however, shared much more often: for instance, a shared library is often mapped into multiple different processes. An executable itself may likewise be mapped into a number of processes if they all run the same program.

│ Page Tables
│
│  • the MMU is programmed using «translation tables»
│    ◦ those tables are stored in RAM
│    ◦ they are usually called «page tables»
│  • and they are fully in the management of the kernel
│  • the kernel can ask the MMU to replace the page table
│    ◦ this is how processes are isolated from each other

The actual implementation mechanism of virtual memory is known as «page tables»: those are the translation tables that tell the MMU which virtual address maps to which physical address. Page tables are stored in memory, just like any other data, and can be created and changed by the kernel.
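To make the translation mechanism more concrete, here is the walk the MMU performs, written out in C for a two-level, 32-bit x86-style page table. This is a simplified sketch: real entries carry more flags, the code pretends that page tables can be addressed directly, and on real hardware the walk is of course done by the MMU itself.

    #include <stdint.h>

    #define PTE_PRESENT 0x1u /* the entry maps a frame */

    /* The top 10 bits of a virtual address index the page directory,
     * the next 10 bits index a page table, and the low 12 bits are
     * the offset within the 4 KiB page. */
    uint32_t translate( const uint32_t *page_dir, uint32_t virt, int *ok )
    {
        uint32_t pde = page_dir[ virt >> 22 ];

        if ( !( pde & PTE_PRESENT ) )
        {
            *ok = 0; /* on real hardware, this raises a page fault */
            return 0;
        }

        const uint32_t *page_table =
            ( const uint32_t * )( uintptr_t )( pde & ~0xfffu );
        uint32_t pte = page_table[ ( virt >> 12 ) & 0x3ffu ];

        if ( !( pte & PTE_PRESENT ) )
        {
            *ok = 0; /* page fault: not mapped, or swapped out */
            return 0;
        }

        *ok = 1;
        return ( pte & ~0xfffu ) | ( virt & 0xfffu ); /* frame + offset */
    }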
The kernel usually keeps a separate set of page tables for each process, and when a context switch happens, it asks the MMU to replace the active page table with a new one (the one that belongs to the process being activated). This is usually achieved by storing the physical address of the first level of the new page table (the page directory, in x86 terminology) in a special register.

Often, the writable physical memory referenced by the first set of page tables will be unreachable from the second set, and vice versa. Even if there is overlap, it will be comparatively small (processes can request shared memory for communicating with each other). Therefore, whatever data the previous process wrote into its memory becomes completely invisible to the new process.

│ Kernel Protection
│
│  • kernel memory is usually mapped into «all processes»
│    ◦ this «improves performance» on many CPUs
│    ◦ (until «meltdown» hit us, anyway)
│  • kernel pages have a special 'supervisor' flag set
│    ◦ code executing in user mode «cannot touch them»
│    ◦ else, user code could «tamper» with kernel memory

Replacing the page tables is usually a rather expensive operation, and we want to avoid doing it as much as possible. We especially want to avoid it in the «system call» path (you probably remember system calls from last week; we will talk about them in more detail later today). For this reason, it is a commonly employed trick to map the kernel into «each process», but make that memory inaccessible to user-space code. Unfortunately, there have been some CPU bugs which make this less secure than we would like.
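The ‘supervisor’ protection mentioned on the slide is, on 32-bit x86, just another bit in each page table entry, next to the present and writable bits. A sketch (the helper names are invented, but the bit positions are the architectural ones):

    #include <stdint.h>

    enum
    {
        PTE_PRESENT  = 1u << 0, /* the entry maps a frame */
        PTE_WRITABLE = 1u << 1, /* writes are allowed */
        PTE_USER     = 1u << 2, /* accessible from user mode */
    };

    /* kernel mapping: present and writable, but PTE_USER is clear,
     * so user-mode code touching the page causes a fault */
    uint32_t make_kernel_pte( uint32_t frame )
    {
        return ( frame & ~0xfffu ) | PTE_PRESENT | PTE_WRITABLE;
    }

    /* user mapping: the same, but also accessible from ring 3 */
    uint32_t make_user_pte( uint32_t frame )
    {
        return ( frame & ~0xfffu ) | PTE_PRESENT | PTE_WRITABLE
                                   | PTE_USER;
    }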
## Booting

The boot process is a sequence of steps which starts with the computer powered off and ends when the computer is ready to interact with the user (via the operating system).

│ Starting the OS
│
│  • upon power on, the system is in a «default state»
│    ◦ mainly because «RAM is volatile»
│  • the entire «platform» needs to be «initialised»
│    ◦ this is first and foremost «the CPU»
│    ◦ and the «console» hardware (keyboard, monitor, ...)
│    ◦ then the rest of the devices

Computers can be turned off and on (clearly). When they are turned off, power is no longer available, and dynamic RAM will, without active refresh, quickly forget everything it held. Hence, when we turn the computer on, there is nothing in RAM, the CPU is in some sort of default state, and variations of the same are true of pretty much every sub-device in the computer. Except for the content of persistent storage, the computer is in the same state as when it left the factory. The computer in this state is, to put it bluntly, not very useful.

│ Boot Process
│
│  • the process starts with a built-in hardware init
│  • when ready, the hardware hands off to the «firmware»
│    ◦ this was BIOS on 16 and 32 bit systems
│    ◦ replaced with EFI on current ‹amd64› platforms
│  • the firmware then loads a «bootloader»
│  • the bootloader «loads the kernel»

We will not get into the hardware part of the sequence. The switch is flipped, the hardware powers up and does its thing. At some point, the firmware takes over and does some more things. The hardware and firmware are finally put into a state where they can begin loading the operating system. There is usually a piece of software that the firmware loads from persistent storage, called a «bootloader».

This bootloader is, more or less, a part of the operating system: its purpose is to find and load the kernel (from persistent storage, usually by using firmware services to identify said storage and load data from it). It may or may not understand file systems and similar high-level things. In the simplest case, the bootloader has a list of disk blocks in which the kernel is stored, and requests those from the firmware. In modern systems, both the firmware and the bootloader are quite sophisticated, and understand complicated, high-level things (including e.g. encrypted drives).

│ Boot Process (cont'd)
│
│  • the kernel then initialises «device drivers»
│  • and the «root filesystem»
│  • then it hands off to the ‹init› process
│  • at this point, the «user space» takes over

We are finally getting to familiar ground. The bootloader has loaded the kernel into RAM and jumped to a pre-arranged address inside the kernel image. The instructions stored at that address kickstart the kernel initialization sequence.

The first part is usually still rather low-level: it puts the CPU and some basic peripherals (console, timers and so on) into a state in which the operating system can use them. Then it hands off control to C code, which sets up the basic data structures used by the kernel. Afterwards, the kernel starts initializing individual peripheral devices – this task is performed by the respective device drivers. When the peripherals are initialized, the kernel can start looking for the «root filesystem» – it is usually stored on one of the attached persistent storage devices (which should now be operational and available to the kernel via their device drivers).

After mounting the root filesystem, the kernel can set up an empty process, load the ‹init› program into that process, and hand over control. At this point, the kernel stops behaving like a sequential program with ‹main› in it and fades into the background: all action is driven by user-space processes from now on (or by hardware interrupts, but we will talk about those much later in the course).

│ User-mode Initialisation
│
│  • ‹init› mounts the remaining file systems
│  • the ‹init› process starts up user-mode «system services»
│  • then it starts «application services»
│  • and finally the ‹login› process

We are far from done. The ‹init› process now needs to hunt down all the other file systems and mount them, start a whole bunch of «system services» and perhaps some «application services» (daemons which are not part of the operating system – things like web servers). Once all the essential services are ready, ‹init› starts the ‹login› process, which then presents the familiar login screen, asking the user to type in their name and password. At this point, the boot process is complete, but we will have a quick look at one more step.

│ After Log-In
│
│  • the ‹login› process initiates the «user session»
│  • loads «desktop» modules and «application software»
│  • drops the user in a (text or graphical) «shell»
│  • now you can start using the computer

When the user logs in, another initialization sequence starts: the system needs to set up a «session» for the user. Again, this involves a number of steps, but at the end, it is finally possible to interact with the computer.

│ CPU Init
│
│  • this depends on both «architecture» and «platform»
│  • on ‹x86›, the CPU starts in «16-bit» mode
│  • on legacy systems, BIOS & bootloader stay in this mode
│  • the kernel then switches to «protected mode» during its boot

Let's go back to the start and fill in some additional details.
First of all, what is the state of the CPU at boot, and why does the operating system need to do anything? This has to do with backward compatibility: a CPU usually starts up in its most-compatible mode – in the case of 32-bit x86 processors, this is a 16-bit mode with the MMU disabled. Since the entire platform keeps backward compatibility, the firmware keeps the CPU in this mode, and it is the job of either the bootloader or the kernel itself to fix this. This is not always the case, though: modern 64-bit x86 processors still start up in 16-bit mode, but the firmware puts them into «long mode» – the 64-bit one – before handing off to the bootloader.

│ Bootloader
│
│  • historically limited to tens of «kilobytes» of code
│  • the bootloader locates the kernel «on disk»
│    ◦ may allow the operator to choose different kernels
│    ◦ «limited» understanding of «file systems»
│  • then it «loads the kernel» image into «RAM»
│  • and hands off control to the kernel

A bootloader is a short, platform-specific program which loads the kernel from persistent storage (usually a file system on a disk) and hands off execution to the kernel. The bootloader might do some very basic hardware initialization, but most of that is done by the kernel itself in a later stage.

│ Modern Booting on ‹x86›
│
│  • the bootloader nowadays runs in «protected mode»
│    ◦ or even the long mode on 64-bit CPUs
│  • the firmware understands the ‹FAT› filesystem
│    ◦ it can «load files» from there into memory
│    ◦ this vastly «simplifies» the boot process

The boot process has been considerably simplified on x86 computers in the last decade or so. Much higher-level APIs have been added to the standardized firmware interface, making the boot code considerably simpler.

│ Booting ARM
│
│  • on ARM boards, there is «no unified firmware» interface
│  • U-boot is as close as one gets to unification
│  • the bootloader needs «low-level» hardware knowledge
│  • this makes writing bootloaders for ARM quite «tedious»
│  • current U-boot can use the «EFI protocol» from PCs

Unlike the x86 world, the ARM ecosystem is far less standardized, and each system on a chip needs a slightly different boot process. This is extremely impractical, since there are dozens of SoC models from many different vendors, and new ones come out regularly. Fortunately, U-boot has become a de-facto standard, and while U-boot itself still needs to be adapted to each new SoC or even each board, the operating system is, nowadays, mostly insulated from this complexity.

## Kernel Architecture

In this section, we will look at different architectures (designs) of kernels: the main distinction we will talk about is which services and components are part of the kernel proper, and which are outside of the kernel.

│ Architecture Types
│
│  • «monolithic» kernels (Linux, *BSD)
│  • microkernels (Mach, L4, QNX, NT, ...)
│  • «hybrid» kernels (macOS)
│  • type 1 «hypervisors» (Xen)
│  • exokernels, rump kernels

We have already mentioned the two main kernel types earlier in the course. Those types represent the extremes of mainstream kernel design: microkernels are the smallest (most exclusive) mainstream design, while monolithic kernels are the biggest (most inclusive). Systems with «hybrid» kernels are a natural compromise between those two extreme designs: they have two components, a microkernel and a so-called «superserver», which is essentially a gutted monolithic kernel – that is, one with the functionality covered by the microkernel removed. Besides ‘mainstream’ kernel designs, there are a few more exotic choices.
We could consider type 1 (bare metal) hypervisors to be a special type of operating system kernel, where the «applications» are simply virtual machines – i.e. ‘normal’ operating systems (more on this later in the course). Then there are «exokernel» operating systems, which drastically cut down on the services provided to applications, and «unikernels», which are basically libraries for running entire applications in kernel mode.

│ Microkernel
│
│  • handles «memory protection»
│  • (hardware) interrupts
│  • task / process «scheduling»
│  • «message passing»
│
│  • everything else is «separate»

A microkernel handles only the essential services – those that cannot be reasonably done outside of the kernel (that is, outside of the privileged mode of the CPU). This obviously includes programming the MMU (i.e. management of address spaces and memory protection), handling interrupts (those switch the CPU into privileged mode, so at least the initial interrupt routine needs to be part of the kernel), thread and process switching (and typically also scheduling), and finally some form of inter-process communication mechanism (typically message passing). With those components in the kernel, almost everything else can be realized outside the kernel proper (though device drivers do need some additional low-level services from the kernel not listed here, like DMA programming and delegation of hardware interrupts).

│ Monolithic Kernels
│
│  • all that a microkernel does
│  • plus device drivers
│  • file systems, volume management
│  • a network stack
│  • data encryption, ...

A monolithic kernel needs to include everything that a microkernel does (even though some of the bits typically take a slightly different form: inter-process communication is present, but may be of a different kind, and driver integration looks different). However, there are many additional responsibilities: many device drivers (those that need interrupts or DMA, or are otherwise performance-critical) are integrated into the kernel, as are file systems and volume (disk) management. A complete TCP/IP stack is almost a given. A number of additional bits and pieces might be part of the kernel, like cryptographic services (key management, disk encryption, etc.), packet filtering, a kitchen sink and so on. Of course, all that code runs in privileged mode, and as such has complete power over the operating system and the computer as a whole.

│ Microkernel Redux
│
│  • we need a lot more than a microkernel provides
│  • in a “true” microkernel OS, there are many modules
│  • each «device driver» runs in a «separate process»
│  • the same for «file systems» and networking
│  • those modules / processes are called «servers»

The question that now arises is who is responsible for all the services listed on the previous slide (those that are part of a monolithic kernel, but are missing from a microkernel). In a ‘true’ microkernel operating system, those services are individually covered, each by a separate process (also known as a «server» in this context).

│ Hybrid Kernels
│
│  • based around a microkernel
│  • «and» a gutted monolithic kernel
│
│  • the monolithic kernel is a big server
│    ◦ takes care of stuff not handled by the microkernel
│    ◦ easier to implement than true microkernel OS
│    ◦ strikes middle ground on performance

In a hybrid kernel, most of the services are provided by a single large server, which is somewhat isolated from the hardware. It is often the case that the server is based on a monolithic OS kernel, with the lowest-level layers removed and replaced with calls to the microkernel as appropriate. Hybrid kernels are both cheaper to design and theoretically perform better than ‘true’ (multi-server) microkernel systems.
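Since everything in a multi-server system is built on message passing, it may help to see what the client side could look like. The API below is invented for illustration (real microkernels – L4, QNX, Mach – each have their own, considerably more refined interfaces): a file system request becomes a synchronous send/receive pair.

    #include <stdint.h>
    #include <string.h>

    struct message /* a hypothetical fixed-size message */
    {
        uint32_t type;
        char payload[ 60 ];
    };

    /* hypothetical kernel primitives: both block until the message
     * has been handed over to the other side */
    int send( int channel, const struct message *msg );
    int receive( int channel, struct message *msg );

    enum { FS_OPEN = 1 }; /* a made-up file server protocol */

    int fs_open( int fs_channel, const char *path )
    {
        struct message req = { .type = FS_OPEN }, reply;

        strncpy( req.payload, path, sizeof( req.payload ) - 1 );
        send( fs_channel, &req );      /* ask the file system server */
        receive( fs_channel, &reply ); /* wait for its answer */
        return ( int ) reply.type;     /* say, a handle or an error */
    }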
│ Micro vs Mono
│
│  • microkernels are more «robust»
│  • monolithic kernels are more «efficient»
│    ◦ less context switching
│  • what is easier to implement is debatable
│    ◦ in the short view, monolithic wins
│  • hybrid kernels are a «compromise»

The main advantage of microkernels is their robustness in the face of software bugs. Since the kernel itself is small, the chances of a bug in the kernel proper are much diminished, compared to the relatively huge code base of a monolithic kernel. The impact of bugs outside the kernel (in servers) is considerably smaller, since those are isolated from the rest of the system, and even if they provide vital services, the system can often recover from a failure by restarting the failed server.

On the other hand, monolithic kernels offer better performance, mainly through reduced context switching, which is still fairly expensive even on modern, virtualisation-capable processors. However, as monolithic kernels adopt technologies such as kernel page table isolation to improve their security properties, the performance difference becomes smaller.

Implementation-wise, monolithic kernels offer two advantages: in many cases, code can be written in a direct, synchronous style, and different parts of the kernel can share data structures without additional effort. In contrast, a proper multi-server system often has to use asynchronous communication (message passing) to achieve the same goals, making the code harder to write and harder to understand. In the long term, the improved modularity and isolation of components could outweigh the short-term gains in programming efficiency due to the more direct programming style.

│ Exokernels
│
│  • smaller than a microkernel
│  • far «fewer abstractions»
│    ◦ applications only get «block» storage
│    ◦ networking is much reduced
│  • only «research systems» exist

Operating systems based on microkernels still provide the full suite of services to their applications, including file systems, network stacks and so on. The difference lies in where this functionality is implemented: in the kernel proper, or in a user-mode server. With exokernels, this is no longer true: the services provided by the operating system are severely cut down. The resulting system is somewhere between a paravirtualized computer (we will discuss this concept in more detail near the end of the course) and a ‘standard’ operating system. Unlike with virtual machines (and unikernels), process-based application isolation is still available, and plays an important role. No production systems based on this architecture currently exist.

│ Type 1 Hypervisors
│
│  • also known as «bare metal» or «native» hypervisors
│  • they resemble microkernel operating systems
│    ◦ or exokernels, depending on the viewpoint
│  • “applications” for a hypervisor are «operating systems»
│    ◦ hypervisor can use «coarser abstractions» than an OS
│    ◦ entire storage devices instead of a filesystem

A bare metal hypervisor is similar to an exokernel or a microkernel operating system (depending on the particular hypervisor, and on our point of view).
Typically, a hypervisor provides interfaces and resources that are traditionally implemented in hardware: block devices, network interfaces and a virtual CPU, including a virtual MMU that allows the ‘applications’ (i.e. the guest operating systems) to take advantage of paging.

│ Unikernels
│
│  • kernels for running a «single application»
│    ◦ makes little sense on real hardware
│    ◦ but can be very useful on a «hypervisor»
│  • bundle applications as «virtual machines»
│    ◦ without the overhead of a general-purpose OS

Unikernels constitute a different strand of minimalist operating system design (compared to exokernels). In this case, process-level multitasking and address space isolation are not part of the kernel: instead, the kernel exists to support a single application by providing (a subset of) traditional OS abstractions, like a networking stack, a hierarchical file system and so on. When an application is bundled with a compatible unikernel, the result can be executed directly on a hypervisor (or an exokernel).

│ Exo vs Uni
│
│  • an exokernel runs «multiple applications»
│    ◦ includes process-based isolation
│    ◦ but «abstractions» are very «bare-bones»
│  • unikernel only runs a «single application»
│    ◦ provides more-or-less «standard services»
│    ◦ e.g. standard hierarchical file system
│    ◦ socket-based network stack / API

## System Calls

In the remainder of this lecture, we will focus on monolithic kernels, since the more progressive designs do not use the traditional system call mechanism. In those systems, most ‘system calls’ are implemented through message passing, and only the services provided directly by the microkernel use a mechanism that resembles the system calls described in this section.

│ Reminder: Kernel Protection
│
│  • kernel executes in «privileged» mode of the CPU
│  • kernel memory is protected from user code
│
│ But: Kernel Services
│
│  • user code needs to ask kernel for «services»
│  • how do we «switch the CPU» into privileged mode?
│  • «cannot» be done arbitrarily (security)

The main purpose of the system call interface is to allow secure transfer of control between a user-space application and the kernel. Recall that each executes with a different level of privilege (at the CPU level). A viable system call mechanism must allow the application to switch the CPU into privileged mode (so that the CPU can execute kernel code), but in a way that does not allow the application to execute its own code in this mode.

│ System Calls
│
│  • hand off execution to a «kernel routine»
│  • pass «arguments» into the kernel
│  • obtain «return value» from the kernel
│  • all of this must be done «safely»

We would like system calls to behave more or less like standard subroutines (e.g. those provided by system libraries): this means that we want to pass arguments to the subroutine and obtain its return value. As with the transfer of control flow, we need the argument passing to be safe: the user-space side of the call must not be able to read or modify kernel memory.

│ Trapping into the Kernel
│
│  • there are a few possible mechanisms
│  • details are very «architecture-specific»
│  • in general, the kernel sets a fixed «entry address»
│    ◦ an instruction changes the CPU into privileged mode
│    ◦ while «at the same time» jumping to this address

Security from execution of arbitrary code by the application is achieved by tying the privilege escalation (i.e. the entry into the privileged CPU mode) to a simultaneous transfer of execution to a fixed address, which the application is unable to change. The exact mechanism is highly architecture-dependent, but the principle outlined here is universal.
│ Trap Example: ‹x86›
│
│  • there is an ‹int› instruction on those CPUs
│  • this is called a «software interrupt»
│    ◦ interrupts are normally a «hardware» thing
│    ◦ interrupt «handlers» run in «privileged mode»
│  • it is also synchronous
│  • the handler is set in ‹IDT› (interrupt descriptor table)

On traditional (32-bit) x86 CPUs, the preferred method of implementing the system call trap was through «software interrupts». In this case, the application uses an ‹int› instruction, which causes the CPU to perform a process analogous to a hardware interrupt. The two important aspects are:

 1. the CPU switches into privileged mode to execute the «interrupt handler»,
 2. the address to jump to is read from an «interrupt handler table», which is a data structure stored in RAM, at an address given by a special register.

The kernel sets up the interrupt handler table in such a way that user-level code cannot change it (via standard MMU-based memory protection). The register which holds its address cannot be changed outside of privileged mode.

│ Software Interrupts
│
│  • those are available on a range of CPUs
│  • generally «not very efficient» for system calls
│  • extra level of indirection
│    ◦ the handler address is retrieved from memory
│    ◦ a «lot of CPU state» needs to be saved

A similar mechanism is available on many other processor architectures. There are, however, some downsides to using this approach for system calls, the main one being poor performance. Since the mechanism piggy-backs on the hardware variety of interrupts, the CPU usually saves a lot more computation state than would be required. As an additional inconvenience, there are multiple entry points, which must therefore be stored in RAM (instead of a register), causing additional delays when the CPU needs to read the interrupt table. Finally, arguments must be passed through memory, since registers are reset by the interrupt, again contributing to increased latency.

│ Aside: SW Interrupts on PCs
│
│  • those are used even in «real mode»
│    ◦ legacy 16-bit mode of 80x86 CPUs
│    ◦ BIOS (firmware) routines via ‹int 0x10› & ‹0x13›
│    ◦ MS-DOS API via ‹int 0x21›
│  • and on older CPUs in 32-bit «protected mode»
│    ◦ Windows NT uses ‹int 0x2e›
│    ◦ Linux uses ‹int 0x80›

On the ubiquitous x86 architecture, software interrupts were the preferred mechanism for providing services to application programs until the end of the 32-bit x86 era. Interestingly, x86 CPUs since the 80386 offer a mechanism that was directly intended to implement operating system services (i.e. syscalls), but it was rather complex and largely ignored by operating system programmers.

│ Trap Example: ‹amd64› / ‹x86_64›
│
│  • ‹sysenter› and ‹syscall› instructions
│    ◦ and corresponding ‹sysexit› / ‹sysret›
│  • the entry point is stored in a «machine state register»
│  • there is only «one entry point»
│    ◦ unlike with software interrupts
│  • quite a bit «faster» than interrupts

When x86 switched to a 64-bit address space, many new instructions found their way into the instruction set. Among those was a simple, single-entry-point privilege escalation instruction. This mechanism avoids most of the overhead associated with software interrupts: computation state is managed in software, allowing compilers to only save and restore a small number of registers across the system call (instead of having the CPU automatically save its entire state into memory).
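On Linux/amd64, the user-space side of this mechanism is small enough to write out by hand. The following sketch assumes GCC inline assembly on Linux/amd64 – normally ‹libc› does this for us. Note the number 1 placed in ‹rax›: it selects which system call we want, which brings us to the next question.

    /* Linux/amd64 convention: system call number in rax, arguments
     * in rdi, rsi, rdx (then r10, r8, r9); the result comes back in
     * rax; the syscall instruction clobbers rcx and r11 */
    long raw_write( int fd, const void *buf, unsigned long count )
    {
        long ret;
        __asm__ volatile ( "syscall"
                           : "=a"( ret )
                           : "a"( 1L /* SYS_write */ ),
                             "D"( ( long ) fd ), "S"( buf ), "d"( count )
                           : "rcx", "r11", "memory" );
        return ret;
    }

    int main( void )
    {
        raw_write( 1, "hello, kernel\n", 14 );
        return 0;
    }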
│ Which System Call?
│
│  • often there are «many» system calls
│    ◦ there are more than 300 on 64-bit Linux
│    ◦ about 400 on 32-bit Windows NT
│  • but there is only a «handful of interrupts»
│    ◦ and only one ‹sysenter› address

Usually, there is only a single entry point (address) shared by all system calls. However, the kernel needs to be able to figure out which service the application program requested.

│ Reminder: System Call Numbers
│
│  • each system call is assigned a «number»
│  • available as ‹SYS_write› &c. on POSIX systems
│  • for the “universal” ‹int syscall( int sys, ... )›
│  • this number is passed in a CPU register

This is achieved by simply sending the «syscall number» as an argument in a specific CPU register. The kernel can then decide, based on this number, which kernel routine to execute on behalf of the program.

│ System Call Sequence
│
│  • first, ‹libc› prepares the system call «arguments»
│  • and puts the system call «number» in the correct register
│  • then the CPU is switched into «privileged mode»
│  • this also transfers control to the «syscall handler»

The first stage of a system call is executed in user mode, and is usually implemented in ‹libc›.

│ System Call Handler
│
│  • the handler first picks up the system call «number»
│  • and decides where to continue
│  • you can imagine this as a giant ‹switch› statement
│
│      switch ( sysnum ) /* C */
│      {
│          case SYS_write: return syscall_write();
│          case SYS_read:  return syscall_read();
│          /* many more */
│      }

After the switch to privileged mode, the kernel needs to make sense of the arguments that the user program provided, and most importantly, decide which system call was requested. The code to do this in the kernel might look like the ‹switch› statement above.

│ System Call Arguments
│
│  • each system call has «different arguments»
│  • how they are passed to the kernel is «CPU-dependent»
│  • on 32-bit ‹x86›, most of them are passed «in memory»
│  • on ‹amd64› Linux, all arguments go into «registers»
│    ◦ 6 registers available for arguments

Since different system calls expect different arguments, the specific argument processing is done after the system call is dispatched based on its number. In modern systems, arguments are passed in CPU registers, but this was not possible with protocols based on software interrupts (instead, arguments would be passed through memory, usually at the top of the user-space stack).
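All of the above is directly observable from C on POSIX systems: the “universal” wrapper from the earlier slide really exists, and the following sketch is equivalent to calling ‹write( 1, ... )›:

    #include <sys/syscall.h> /* SYS_write and friends */
    #include <unistd.h>      /* the generic syscall() wrapper */

    int main( void )
    {
        /* libc puts SYS_write into the right register, moves the
         * remaining arguments into place and traps into the kernel */
        syscall( SYS_write, 1, "forty-two\n", 10 );
        return 0;
    }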
## Kernel Services

Finally, we will revisit the services offered by monolithic kernels, and look at how they are realized in microkernel operating systems.

│ What Does a Kernel Do?
│
│  • «memory» & process management
│  • task (thread) «scheduling»
│  • device drivers
│    ◦ SSDs, GPUs, USB, bluetooth, HID, audio, ...
│  • file systems
│  • networking

The first two points are a core responsibility of the kernel: those are rarely ‘outsourced’ into external services. The remaining services are a core part of an «operating system», but not necessarily of a kernel. However, it is hard to imagine a modern, general-purpose operating system which would omit any of them. In traditional (monolithic) designs, they are all part of the kernel.

│ Additional Services
│
│  • inter-process «communication»
│  • timers and time keeping
│  • process tracing, profiling
│  • security, sandboxing
│  • cryptography

A monolithic kernel may provide a number of additional services, with varying importance. Not all systems provide all of the services, and the implementations can look quite different across operating systems. Out of this (incomplete) list, IPC (inter-process communication) is the only item that is quite universally present, in some form, in microkernels. Moreover, while dedicated IPC mechanisms are common in monolithic kernels, they are much more important in microkernels.

│ Reminder: Microkernel Systems
│
│  • the kernel proper is «very small»
│  • it is accompanied by «servers»
│  • in “true” microkernel systems, there are «many servers»
│    ◦ each device, filesystem, etc. is separate
│  • in «hybrid» systems, there is one, or a few
│    ◦ a “superserver” that resembles a monolithic kernel

Recall that a microkernel is small: it only provides services that cannot be reasonably implemented outside of it. Of course, the operating system as a whole still needs to implement those services. Two basic strategies are available:

 1. a single program, running in a single process, implements all the missing functionality: this program is called a superserver, and internally has an architecture that is rather similar to that of a standard monolithic kernel,
 2. each service is provided by a separate, specialized program, running in its own process (and hence, address space) – this is characteristic of so-called ‘true’ microkernel systems.

There are, of course, different trade-offs involved in those two basic designs. A hybrid system (i.e. one with a superserver) is easier to design and implement initially (for instance, the persistent storage drivers, the block layer and the file system all share the same address space, simplifying the implementation) and is often considerably faster, since communication between components does not involve context switches. On the other hand, a true microkernel system, with services and drivers all strictly separated into individual processes, is more robust, and in theory also easier to scale to large SMP systems.

│ Kernel Services
│
│  • we usually don't care «which server» provides what
│    ◦ each system is different
│    ◦ for services, we take a «monolithic» view
│  • the services are used through «system libraries»
│    ◦ they abstract away many of the details
│    ◦ e.g. whether a service is a «system call» or an «IPC call»

From the user-space point of view, the specifics of kernel architecture should not matter. Applications use system libraries to talk to the kernel in either case: it is up to the libraries in question to implement the protocol for locating the relevant servers and interacting with them.
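As a sketch of that abstraction (with entirely hypothetical names – no real system library is quite this simple), the same ‹write›-like function could be backed by either transport:

    long trap( int sysnum, long a, long b, long c ); /* system call */
    long ipc_call( int server, int op,
                   long a, long b, long c );         /* message */

    enum { SYS_WRITE = 1, FS_SERVER = 3, FS_WRITE = 7 }; /* made up */

    long os_write( int fd, const void *buf, long count )
    {
    #ifdef MONOLITHIC /* a single kernel: trap directly */
        return trap( SYS_WRITE, fd, ( long ) buf, count );
    #else             /* microkernel: talk to the file server */
        return ipc_call( FS_SERVER, FS_WRITE, fd, ( long ) buf, count );
    #endif
    }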
│ User-Space Drivers in Monolithic Systems
│
│  • not «all» device drivers are part of the kernel
│  • case in point: «printer» drivers
│  • also some «USB devices» (not the USB bus though)
│  • part of the GPU/graphics stack
│    ◦ memory and output management in kernel
│    ◦ most of OpenGL in «user space»

While user-space drivers are par for the course in microkernel systems, there are also certain cases where drivers in operating systems based on monolithic kernels have significant user-space components. The most common example is probably printer drivers: low-level communication with the printer (at the USB level) is mediated by the kernel, but for many printers, document processing comprises a large part of the functionality of the driver. In some cases, this involves format conversion (e.g. for PCL printers), but in others, the input document is rasterised by the driver on the main CPU: instead of sending text and layout information to the printer, the driver sends pixel data, or even a stream of commands for the printing head.

The situation with GPUs is somewhat analogous: low-level access to the hardware is provided by the kernel, but again, a large part of the driver is dedicated to data manipulation: dealing with triangle meshes, textures, lighting and so on. Additionally, modern GPUs are invariably «programmable»: a shader compiler is also part of the driver, translating high-level shader programs into instruction streams that can be executed by the GPU. We will deal with device drivers in more detail in lecture 8.

│ Review Questions
│
│   9. What CPU modes are there and how are they used?
│  10. What is the memory management unit?
│  11. What is a microkernel?
│  12. What is a system call?