# Processes, Threads & Scheduling

In this lecture, we will look at the two basic resources that the computer offers, and which every program needs: CPU and memory. The main question then will be how the operating system achieves the illusion that every thread seemingly has its own processor and every process has its own memory (this is what we mean by multiplexing – turning a single physical resource into a larger set of virtual instances of the same).

│ Lecture Overview
│
│  1. processes and virtual memory
│  2. thread scheduling
│  3. interrupts and clocks

There will be 3 parts. We will first look at virtual memory, and the unit of memory isolation in the operating system: a process. We will then look at CPU (processor) sharing and the unit of CPU allocation, a «thread». Finally, we will look in more detail at how CPU sharing is implemented in terms of hardware.

## Processes and Virtual Memory

│ Prehistory: Batch Systems
│
│  • first computers ran «one program» at a time
│  • programs were scheduled «ahead of time»
│  • we are talking punch cards &c.
│  • and computers that took an entire room

The first generation of computers was strictly sequential: one program could execute at any given time, and that was that. To execute another program, the first one had to finish.

│ History: Time Sharing
│
│  • “mini” computers could run programs «interactively»
│  • teletype «terminals», screens, keyboards
│  • «multiple users» at the same time
│  • hence, «multiple programs» at the same time

Computers were quite expensive at the time, and programs started to offer more interactivity. That is, a program could interact with the operator by asking questions and then waiting for inputs to be provided, for instance. Together, those two considerations made the sequential, one program at a time paradigm quite impractical.

This is the origin of «time sharing» systems, where a computer could execute multiple programs at seemingly the same time, by quickly switching from one program to another. As soon as this is possible, it makes sense to allow multiple users to interact with the same computer (each user through a different program).

│ Processes: Early View
│
│  • process is an «executing» program
│  • there can be «multiple processes»
│  • various «resources» belong to a process
│  • each process belongs to a particular «user»

Since traditional programs were internally sequential, the original notion of a process encompassed both types of basic resources: memory and processor. In normal operation, a process P gets to execute on the processor for some amount of time, before it is interrupted and the system switches to executing some other process. At some later point, process P will be woken up and will continue execution where it left off.

│ Process Resources
│
│  • «memory» (address space)
│  • «processor» time
│  • open files (descriptors)
│    ◦ also working directory
│    ◦ also network connections

The most basic resource associated with a process is «memory»: this is where the program itself (the instructions) is stored, and also where its data (both static and dynamic) resides. This memory is typically private: one process cannot see the memory of another process.

│ Process Memory Segments
│
│  • program «text»: contains instructions
│  • data: static and dynamic «data»
│    ◦ with a separate read-only section
│  • «stack» memory: execution stack
│    ◦ return addresses
│    ◦ automatic variables

The memory of a program is organized into functionally distinct «segments». Traditionally, those segments were contiguous chunks of address space, but in modern operating systems, this is not quite true (and the whole idea of a segment is losing its appeal). Nonetheless, the traditional split is into program «text», which contains the instructions of the program (i.e. the executable code, typically produced by a compiler), a «stack», which holds the C stack (i.e. the values of local variables and return addresses, along with other bookkeeping), and finally a «data segment», which holds, unsurprisingly, data.

In a modern system, none of those segments needs to be contiguous: the «text» consists of multiple mappings – one for the main program, and one for each shared library it uses. Those mappings do not need to be adjacent in the virtual address space (much less so in physical memory). Likewise, since one process can contain multiple threads (more on that later), there may be multiple stacks, again, not necessarily allocated next to each other. Finally, static (and especially read-only) data comes from executable images too, hence there might be one mapping per executable, just like with text. The dynamic data, then, is entirely unconstrained in its shape: the implementation of ‹malloc› in ‹libc› will request memory from the operating system as needed, with virtual addresses often being assigned randomly to each new region of memory.
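The split into text, data, stack and heap can be observed from inside a running program by printing a few addresses. The following small experiment is not part of the lecture text and is system-dependent: the exact values (and even their relative order) will differ between systems and between runs, due to address space layout randomisation.

```c
/* Print addresses from the text, data, stack and heap regions of the
 * running process.  Purely illustrative – the output depends on the
 * compiler, the OS and address space randomisation. */
#include <stdio.h>
#include <stdlib.h>

static const int constant = 1;   /* read-only static data  */
static int initialised = 2;      /* writable static data   */

int main( void )
{
    int automatic = 3;                        /* stack              */
    int *dynamic = malloc( sizeof( int ) );   /* heap (via malloc)  */

    printf( "text  %p\n", ( void * ) main );  /* function = text    */
    printf( "ro    %p\n", ( void * ) &constant );
    printf( "data  %p\n", ( void * ) &initialised );
    printf( "stack %p\n", ( void * ) &automatic );
    printf( "heap  %p\n", ( void * ) dynamic );

    free( dynamic );
    return 0;
}
```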
│ Process Memory
│
│  • each process has its own «address space»
│  • this means processes are «isolated» from each other
│  • requires that the CPU has an MMU
│  • implemented via «paging» (page tables)

The defining characteristic of a modern process is its «address space»: the virtual addresses allocated to a process are private to that process and as such invisible to any other process. The data is mostly stored in RAM, but each process (typically) only sees a small portion of the physical memory – the majority is simply not visible in its virtual address space. Behind the scenes, this separation of address spaces is built on paging, which is, in turn, provided by the MMU (memory management unit) of the CPU.

│ Process Switching
│
│  • switching processes means switching «page tables»
│  • physical addresses do «not» change
│  • but the «mapping» of virtual addresses does
│  • large part of physical memory is «not mapped»
│    ◦ could be completely unallocated (unused)
│    ◦ or belong to «other processes»

In a typical time-sharing operating system (this covers pretty much every modern general-purpose OS), an illusion of virtually unlimited concurrency (the ability to run an arbitrary number of processes at the same time, regardless of the number of available CPU cores) is achieved by rapidly switching the execution from one process to the next. Or, to be more precise, execution is switched from one «thread» to another (which we will discuss shortly), but often, the threads will be part of different processes. In those cases, the operating system must instruct the processor to switch into the virtual address space of the new process.

Of course, during this switch, the physical memory is not rearranged in any way – that would be way too expensive. Instead, the MMU is instructed to start using a different set of mapping rules (page tables) that map virtual addresses to physical addresses. Usually, the overlap between physical addresses that have been visible in the old virtual address space and those visible in the new one is minimal: as outlined earlier, the virtual address space is almost entirely private to each process.
Thus, most of the physical memory is, from the point of view of any given process, inaccessible: it is either unused, or it belongs to a different process.

│ Paging and TLB
│
│  • address translation is «slow»
│  • recently-used pages are stored in a TLB
│    ◦ short for Translation Look-aside Buffer
│    ◦ very fast «hardware» cache
│  • the TLB needs to be «flushed» on process switch
│    ◦ this is fairly «expensive» (microseconds)

An important consideration is the «cost» associated with switching processes. On the surface, this involves a thread switch (which, as we will see later, consists of storing the content of user-accessible registers in memory and loading a new set of values from a different location in memory) and an additional register write (to load the new page table into the MMU).

Unfortunately, under the surface, that single register write causes a cascade of other effects: in particular, all modern processors use a very fast «cache» for storing recent address translations (i.e. which physical addresses correspond to a small set of recently-used virtual addresses), known as the TLB (translation look-aside buffer). The TLB is required because otherwise, translating a virtual address to a physical one involves multiple round-trips into the main memory, where page tables are stored. In modern computers, each of those round-trips will take hundreds or even thousands of processor cycles – clearly, repeating this process on every address translation would make computation extremely slow.

Nonetheless, upon a process switch, the mapping of virtual addresses to physical addresses changes, and the information stored in the TLB becomes invalid, since it refers to the previous address space. The simplest remedy is simply erasing everything that is stored in the TLB. This is, in itself, a quick operation, since the TLB is implemented using very fast on-die memory. Most of the price comes from the subsequent address translations, which cannot be answered from the (now empty) TLB and must perform multiple round-trips into main memory (or, depending on cache pressure, into one of the general-purpose caches – L1, L2 or L3). Incidentally, modern processors often use tagged TLBs, which make the entire process more efficient, but this is well outside of the scope of this course.

│ Threads
│
│  • the modern unit of CPU scheduling
│  • each thread runs sequentially
│  • one process can have «multiple threads»
│    ◦ such threads share a single address space

While processes are the basic unit of memory management in the operating system, computation is captured by «threads». Each process has at least one thread, but might have more than one. While processor time «accounting» is usually done at process level, «execution» is tied to threads, since threads are what the processor ultimately executes.

│ What is a Thread?
│
│  • thread is a «sequence» of instructions
│    ◦ instructions depend on results of previous instructions
│  • different threads run different instructions
│    ◦ as opposed to SIMD or many-core units (GPUs)
│  • each thread has its own «stack»

A thread can be thought of as an instruction stream: essentially what we imagine a traditional, sequential program to be. Within a thread, an instruction can freely use results of previous instructions (data dependencies) and make decisions (based on those results) as to which instruction to execute next (branching, or control flow). This is how a processor core executes a program (at least from the point of view of the programmer). Hence at any given time, each processor core can execute a single thread.

The threads running concurrently on different cores or different CPUs can be completely unrelated – they don't even need to belong to the same process: each core has its own MMU, each of which can use a different page table, and hence, each core can be executing a thread in a different address space. Of course, this also means that each thread needs its own execution stack (i.e. the memory that stores local variables in a C program, along with return addresses and other execution-related book-keeping).
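As an illustration of the above (using POSIX threads, which are just one possible interface, chosen here as an example), the sketch below starts two threads in a single process: they share the address space of the process – here, a global counter – but each has its own stack, holding its own copy of the local variable. Compile with ‹-pthread›.

```c
/* Two threads of one process: the global 'shared' is visible to both,
 * while 'local' lives on each thread's private stack. */
#include <pthread.h>
#include <stdio.h>

static int shared = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker( void *arg )
{
    int local = 0;                     /* private: on this thread's stack */

    for ( int i = 0; i < 1000; ++i )
    {
        ++local;
        pthread_mutex_lock( &lock );
        ++shared;                      /* shared: must be synchronised */
        pthread_mutex_unlock( &lock );
    }

    printf( "thread %s: local = %d, stack near %p\n",
            ( const char * ) arg, local, ( void * ) &local );
    return NULL;
}

int main( void )
{
    pthread_t a, b;
    pthread_create( &a, NULL, worker, "a" );
    pthread_create( &b, NULL, worker, "b" );
    pthread_join( a, NULL );
    pthread_join( b, NULL );
    printf( "shared = %d\n", shared ); /* 2000 */
    return 0;
}
```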
│ Processor Time Sharing
│
│  • CPU time is sliced into «time shares»
│  • time shares (slices) are like memory «frames»
│  • process «computation» is like memory «pages»
│  • processes are allocated into «time shares»

Besides memory, the other main commodity in a computer is «computation time». Like memory, there is a limited amount, and it needs to be distributed among multiple threads (and processes). Since different instructions take wildly different amounts of time to execute, allocation of computation is based on «time» instead of instruction count. As a bonus, time is easier to measure.

│ Multiple CPUs
│
│  • execution of a thread is «sequential»
│  • one CPU = one «instruction sequence» at a time
│  • physical limits on CPU speed → «multiple cores»
│  • more CPU cores = more throughput

With time sharing, it is possible to execute any number of threads on a single CPU core. However, physical limits in CPU design have made it, in the last decade, much easier to build processors which can execute twice as many threads (using twice as many processor cores), instead of making them execute a given number of threads twice as fast. As long as there are always enough threads ready to execute, the overall computational throughput of the system is doubled either way. Of course, this puts strain on software, which needs to be separated into more and more threads to saturate the computational power of a modern processor.

│ Modern View of a Process
│
│  • in a modern view, process is an «address space»
│  • threads are the right «scheduling abstraction»
│
│  • «process» is a unit of «memory management»
│  • «thread» is a unit of «computation»
│  • old view: one process = one thread

To recap, there are two main abstractions, centred on memory and computation. While historically, operating systems made the assumption that 1 process = 1 thread (and hence older textbooks about operating systems might follow the same design), this no longer makes sense in modern systems. Instead, threads are the abstraction that covers computation itself (the sequence of instructions to be executed), while processes cover memory and most other resources.

│ Fork
│
│  • how do we create «new processes»?
│  • by ‹fork›-ing existing processes
│  • fork creates an «identical copy» of a process
│  • execution continues in both processes
│    ◦ each of them gets a different «return value»

Let's now look at processes from the perspective of programs (and users). The first thing to consider is how processes are created: on POSIX-like systems, this is almost exclusively through the use of the ‹fork› system call, which simply makes an identical copy of the current process. The processes are then very slightly tweaked: in one, ‹fork› returns a different value, and in the process table, one is taken to be a «parent» and the other to be a «child». The return value of ‹fork› tells each process which of the two it is.
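A minimal sketch of what this looks like in C (error handling kept to a bare minimum): both branches of the ‹if› execute, but each in a different process.

```c
/* After fork, two processes run the same program; the return value of
 * fork is the only thing that tells them apart. */
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <unistd.h>

int main( void )
{
    pid_t pid = fork();

    if ( pid < 0 )                     /* fork failed: still one process  */
        return 1;

    if ( pid == 0 )                    /* child: fork returned 0          */
        printf( "child:  pid %d, parent %d\n",
                ( int ) getpid(), ( int ) getppid() );
    else                               /* parent: fork returned child pid */
    {
        printf( "parent: pid %d, child %d\n", ( int ) getpid(), ( int ) pid );
        waitpid( pid, NULL, 0 );       /* wait for the child to finish    */
    }

    return 0;
}
```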
│ Lazy Fork
│
│  • paging can make ‹fork› quite efficient
│  • we start by copying the «page tables»
│  • initially, all pages are marked «read-only»
│  • the processes start out «sharing» memory

Making a copy of an entire process would be a comparatively expensive exercise. However, as we have seen with file systems before, there are tricks (based on the copy-on-write implementation technique) which make it possible to make ‹fork› itself quite efficient. At the time of ‹fork›, the only thing that needs to be copied is the «page table», which is much smaller than the entire address space of the process in question. At the same time, all pages in both copies of the page table are marked as read-only, since they are now shared by two processes, which should not have access to each other's address space. Of course, as long as they are only reading from memory, whether there are two physical copies or a single shared copy does not make a difference.

│ Lazy Fork: Faults
│
│  • the shared memory becomes «copy on write»
│  • «fault» when either process tries to write
│    ◦ remember the memory is marked as read-only
│  • the OS checks if the memory is «supposed» to be writable
│    ◦ if yes, it makes a «copy» and allows the write

As soon as either of the processes attempts to issue a memory write, this will cause a fault: the pages are marked read-only in the page tables. When this happens, the processor will invoke the fault handler (which is part of the kernel) to resolve the situation. The fault handler will then consult its data structures (which may be partially embedded in the page table itself) to distinguish actual faults (i.e. the process tried to write into truly read-only memory) from copy-on-write events. In the latter case, the page is semantically read-write, but is marked read-only because multiple processes are using a single physical copy.

At this point, the fault handler will split the mapping: it will make a new physical copy of the data (i.e. it will allocate a new frame for it) and adjust the page table of the process which attempted to write, so that the offending virtual address is translated to point into this new physical copy. Finally, the CPU is instructed to restart the offending instruction: since the page table entry is now marked as read-write (pointing to the new physical location), the instruction will execute without further impediment.

│ Init
│
│  • on UNIX, ‹fork› is the only way to make a process
│  • but ‹fork› splits existing processes into 2
│  • the «first process» is special
│  • it is directly spawned by the kernel on boot

Earlier, we claimed that ‹fork› is essentially the only way to create a new process. Of course, this cannot be entirely true, since ‹fork› may only be executed by an existing process. For this reason, there is one special process, also called ‹init› or ‹pid 1›, which is directly created by the kernel upon boot. All other processes in a running system, however, descend from this original process by a sequence of forks.

│ Process Identifier
│
│  • processes are assigned «numeric identifiers»
│  • also known as PID (Process ID)
│  • those are used in «process management»
│  • used in calls like ‹kill› or ‹setpriority›

To facilitate process management, each process is assigned a numeric identifier (known as PID, short for process identifier). Process management syscalls, then, take this number as an argument.
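For instance (a small illustration, not part of the lecture text), a parent can use the PID returned by ‹fork› as the argument of the management call ‹kill›, here sending ‹SIGTERM› to the child:

```c
/* Process identifiers in practice: the parent refers to the child by its
 * PID when sending it a signal and when waiting for it. */
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main( void )
{
    pid_t child = fork();

    if ( child < 0 )
        return 1;

    if ( child == 0 )                  /* child: report PID, wait for a signal */
    {
        printf( "child running as pid %d\n", ( int ) getpid() );
        while ( 1 )
            pause();
    }

    sleep( 1 );                        /* give the child time to start      */
    kill( child, SIGTERM );            /* management call takes the PID     */
    waitpid( child, NULL, 0 );
    printf( "child %d terminated\n", ( int ) child );
    return 0;
}
```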
Traditionally, the ‘namespace’ of processes is global: a PID identifies a process globally (unlike, for instance, a file descriptor, which is local to each process). That is, for all users and all processes, the given number always represents the same process (until it is terminated). Please note that in container-based virtualisation, this is no longer strictly true, since different containers will get different PID namespaces even though they share the same kernel.

│ Process vs Executable
│
│  • process is a «dynamic» entity
│  • executable is a «static» file
│  • an executable contains an initial «memory image»
│    ◦ this sets up memory layout
│    ◦ and content of the «text» and «data» segments

In previous lectures, we have touched on the subject of executables: those are «files» which contain programs. It is very important to understand the difference between an executable (a program, so to speak, at rest) and a process (a program in the course of being executed). An executable contains, essentially, an initial image of process memory: when a program starts executing, the kernel populates its virtual address space with data from the executable, most importantly the «text» segment which contains instructions, but also the static parts of the «data» segment. Additionally, the executable contains the (virtual) address of the «entry point»: this is the address of the first instruction of the program that should be executed.

│ Exec
│
│  • on UNIX, processes are created via ‹fork›
│  • how do we «run programs» though?
│  • ‹exec›: load a new «executable» into a process
│    ◦ this completely «overwrites» process memory
│    ◦ execution starts from the «entry point»
│  • running programs: ‹fork› + ‹exec›

However, when a process is created, its memory is «not» populated from an executable: instead, it is a clone of an existing process. To execute a new program, an existing process needs to call an ‹exec› system call, which then simply replaces the entire memory of the current process with the content of the executable. Only environment variables and command-line arguments (which are both stored in process memory) are copied over to the new address space. The common ‘start a new program’ operation (e.g. what happens when the user runs a command in the shell) is then implemented as a ‹fork› followed by an ‹exec› in the child process.
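The ‹fork› + ‹exec› sequence can be sketched as follows (‹/bin/ls› is chosen purely as an example program; a shell would substitute whatever command the user typed):

```c
/* Run a program the UNIX way: fork a child, replace its memory image with
 * a new executable, and let the parent wait for the result. */
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <unistd.h>

int main( void )
{
    pid_t pid = fork();

    if ( pid < 0 )
        return 1;

    if ( pid == 0 )                    /* child */
    {
        char *argv[] = { "ls", "-l", NULL };
        execv( "/bin/ls", argv );      /* replaces the entire address space */
        perror( "execv" );             /* only reached if exec failed       */
        return 127;
    }

    int status;                        /* parent */
    waitpid( pid, &status, 0 );
    printf( "child exited with status %d\n", WEXITSTATUS( status ) );
    return 0;
}
```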
## Thread Scheduling

In this section, we will look in more detail at how operating systems decide which thread to run, when, and for how long.

│ What is a Scheduler?
│
│  • scheduler has two related tasks
│    ◦ «plan» when to run which thread
│    ◦ actually «switch» threads and processes
│  • usually part of the «kernel»
│    ◦ even in micro-kernel operating systems

The «scheduler» is a component of the kernel which fills two basic roles: planning and thread switching (including preemption).

│ Switching Threads
│
│  • threads of the «same process» share an address space
│    ◦ a «partial» context switch is needed
│    ◦ only «register state» has to be saved and restored
│  • no TLB flushing – lower «overhead»

To switch execution from one thread to another, within a single process, it is sufficient to store the current state of user-visible registers into memory, and load the state which corresponds to the other thread (the state of which was saved when it was last interrupted and suspended). Since the address of the (top of the) stack is stored in a register, simply replacing values in registers from a backup also has the effect of switching the active stack. In this case, the TLB does not need to be flushed, making the switch quite efficient, even if not entirely free (there is still a small penalty due to branch prediction and speculative execution).

│ Fixed vs Dynamic Schedule
│
│  • «fixed» schedule = all processes known «in advance»
│    ◦ only useful in special / embedded systems
│    ◦ can «conserve resources»
│    ◦ planning is not part of the OS
│  • most systems use «dynamic scheduling»
│    ◦ what to run next is «decided periodically»

When it comes to the «schedule», that is, the plan of which thread to execute in which time slot(s), there are two basic types:

 1. «static» (fixed) schedules, which are computed ahead of time, for a fixed set of threads with pre-determined priorities and with their relative computational costs also known in advance,
 2. «dynamic» schedules, in which the above information is not known.

Running on a static schedule makes the runtime scheduler particularly simple and robust. However, this approach is unsuitable for all but the simplest use cases, and is usually only found in high-assurance embedded systems.

A dynamic scheduler, on the other hand, allows threads and processes to be created in an ad-hoc manner, during execution. Likewise, the priorities of threads can be determined (and adjusted) at will. The scheduler periodically evaluates the situation and decides what to run either now, or in the very near future.

│ Preemptive Scheduling
│
│  • tasks (threads) just run as if they owned the CPU
│  • the OS forcibly «takes the CPU» away from them
│    ◦ this is called «preemption»
│  • pro: a faulty program «cannot block» the system
│  • somewhat «less efficient» than cooperative

In most modern operating systems, the scheduler is «preemptive», that is, it can suspend threads at its own discretion, without cooperation from the threads themselves. In this approach, the user-space software does not, in principle, need to be aware that it is running on a time-sharing system: when a thread is preempted and later resumed, the execution continues where it left off, without any immediately detectable effect on the executing program. A major advantage of preemptive scheduling is that uncooperative (or faulty) programs cannot endanger the system as a whole: the operating system can, at any time, decide to suspend a process, regardless of what code it executes at the time.

│ Cooperative Scheduling
│
│  • threads (tasks) «cooperate» to share the CPU
│  • each thread has to explicitly «yield»
│  • this can be «very efficient» if designed well
│  • but a «bad program» can easily «block the system»

A different approach is known as «cooperative scheduling». In this case, there is no preemption: the currently running thread has to «yield» (give up) the processor voluntarily. This makes it possible to optimize task switching, but it also means that a program which does not yield the processor will simply continue to run unimpeded.

While general-purpose operating systems no longer use cooperative scheduling globally, some programs implement so-called «green threads» which are scheduled cooperatively – scheduling of those threads is implemented entirely in the user-space. Usually, a large number of such cooperative ‘green’ threads is dispatched onto a small number of ‘real’ OS threads (which are preemptively scheduled by the kernel).
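User-space switching of this kind can be sketched with the POSIX ‹ucontext› interface (obsolescent, but still widely available on common systems): ‹swapcontext› stores the register state of the current context – including the stack pointer – and loads that of another, which is essentially the partial context switch described above, performed without entering the kernel. The ping-pong structure and the stack size below are arbitrary choices made for illustration.

```c
/* A cooperative ('green') thread yielding back to its caller, using
 * getcontext / makecontext / swapcontext. */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;
static char task_stack[ 64 * 1024 ];   /* each green thread needs its own stack */

static void task( void )
{
    for ( int i = 0; i < 3; ++i )
    {
        printf( "task: step %d, yielding\n", i );
        swapcontext( &task_ctx, &main_ctx );   /* voluntarily give up the CPU */
    }
}

int main( void )
{
    getcontext( &task_ctx );
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &main_ctx;              /* where to go if task returns */
    makecontext( &task_ctx, task, 0 );

    for ( int i = 0; i < 3; ++i )
    {
        printf( "main: resuming task\n" );
        swapcontext( &main_ctx, &task_ctx );   /* the 'scheduler' picks the task */
    }

    printf( "main: done\n" );
    return 0;
}
```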
│ Scheduling in Practice
│
│  • cooperative on Windows 3.x for everything
│  • «cooperative» for «threads» on classic Mac OS
│    ◦ but «preemptive» for «processes»
│  • «preemptive» on pretty much every modern OS
│    ◦ including real-time and embedded systems

The last mainstream operating system to use cooperative scheduling at the operating system level was ‘classic’ Mac OS (before OS X), though it used preemptive scheduling for processes. This way, switching threads within a process could take advantage of the improved efficiency, while processes were unable to block each other. The last mainstream system with fully cooperative scheduling was MS Windows 3.11, released in 1993.

│ Waiting and Yielding
│
│  • threads often need to «wait» for resources or «events»
│    ◦ they could also use software timers
│  • a waiting thread should «not consume» CPU time
│  • such a thread will «yield» the CPU
│  • it is put on a list and later «woken up» by the kernel

In most programs, it is common that a thread cannot continue its execution until some event takes place, or a resource becomes available. In this case, it is undesirable that the thread should wait actively, that is, spin in a loop that checks whether it can continue. Instead, kernels provide mechanisms which allow threads to be «suspended» until an event arrives, or a resource becomes available (at which point the thread is resumed by the kernel). This process is not completely free from overhead, but unless the wait is very short (a few dozen CPU cycles), suspending the thread will give a much better overall system throughput.

│ Run Queues
│
│  • «runnable» (non-waiting) threads are «queued»
│  • could be «priority», round-robin or other queue types
│  • scheduler «picks» threads from the run queue
│  • «preempted» threads are put back

There are two reasons why a thread is suspended (that is, no longer running): it is either waiting for an event (see above), or its time slot has ended and it was preempted. In the latter case, the thread is put on a «run queue», which is a list of threads that are ready to run, usually sorted by a «dynamic priority» – a number, computed by the scheduler, which indicates how soon the thread should run. Whenever a thread is suspended, the scheduler picks the next thread to execute from a run queue.

│ Priorities
│
│  • what «share» of the CPU should a thread get?
│  • «priorities» are static and dynamic
│  • «dynamic» priority is adjusted as the thread runs
│    ◦ this is done by the system / scheduler
│  • a «static» priority is assigned by the «user»

Not all threads are equally important, and some should take priority over others whenever they need to run. This is achieved through a combination of priorities: a «static» priority is assigned to each thread by the user. The static priority then feeds into the computation of a «dynamic» priority, which governs scheduling: while a thread is running, its dynamic priority drops, and while it is suspended, its dynamic priority increases. (A toy model of this scheme is sketched below, after the discussion of fairness.)

│ Fairness
│
│  • «equal» (or priority-based) share per «thread»
│  • what if one process has many «more threads»?
│  • what if one user has many «more processes»?
│  • what if one user group has many «more active users»?

In the system that we have outlined above, scheduling is only concerned with individual threads. Such a system is clearly unfair: processes and users with many threads will get a much higher share of the CPU than processes and users with fewer threads. To mitigate this effect, most operating systems implement some form of fair scheduling.
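The per-thread scheme above can be made concrete with a small toy model (purely illustrative – no real kernel works exactly like this): a run queue of three threads, a selection rule based on dynamic priority, and a simple decay/recovery rule. Note that it considers threads in isolation, which is exactly the situation the fairness concerns apply to.

```c
/* Toy run queue: pick the thread with the highest dynamic priority, let
 * running cost priority and waiting recover it (up to a cap). */
#include <stdio.h>

struct thread
{
    const char *name;
    int static_prio;   /* assigned by the user                 */
    int dynamic_prio;  /* adjusted by the scheduler as it runs */
};

static struct thread *pick( struct thread *rq, int n )
{
    struct thread *best = &rq[ 0 ];
    for ( int i = 1; i < n; ++i )
        if ( rq[ i ].dynamic_prio > best->dynamic_prio )
            best = &rq[ i ];
    return best;
}

int main( void )
{
    struct thread rq[] = { { "editor", 10, 10 },
                           { "compiler", 5, 5 },
                           { "backup", 1, 1 } };
    const int n = sizeof rq / sizeof rq[ 0 ];

    for ( int slice = 0; slice < 8; ++slice )
    {
        struct thread *t = pick( rq, n );
        printf( "slice %d: running %s\n", slice, t->name );

        t->dynamic_prio -= 3;                       /* running costs priority  */
        for ( int i = 0; i < n; ++i )               /* waiting threads recover */
            if ( &rq[ i ] != t && rq[ i ].dynamic_prio < rq[ i ].static_prio + 5 )
                ++rq[ i ].dynamic_prio;
    }
    return 0;
}
```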
│ Fair Share Scheduling
│
│  • we can use a «multi-level» scheduling scheme
│  • CPU is sliced fairly first among «user groups»
│  • then among «users»
│  • then among «processes»
│  • and finally among «threads»

There are different levels on which a fair scheduler may operate: almost all systems will consider processes, while many will also consider users. In this scheme, the scheduler strives to achieve the following two goals:

 1. as long as the given entity (process, user, group) has enough runnable threads, those threads get equal time when compared to any other entity with the same property (i.e. if both process A and process B have runnable threads, the total time slice of process A is equal to that of process B, assuming equal priorities),
 2. resources are allocated efficiently: if an entity does not have sufficient runnable threads, the resources it cannot use are distributed among the remaining entities of the same type (i.e. if process A has 1 runnable thread, process B has 3 runnable threads and there are 4 processor cores, process B should not be hampered by process A only being able to use one of them – the system should use all 4 cores in this scenario).

│ Scheduling Strategies
│
│  • first in, first served («batch» systems)
│  • earliest «deadline» first (realtime)
│  • round robin
│  • «fixed priority» preemptive
│  • «fair share» scheduling (multi-user)

When it comes to computing a dynamic schedule, there are a number of approaches. The simplest of all is a batch scheduler, which is basically no scheduler at all: when a program is executed, it runs until it is done; afterwards, the next program is executed until termination, and so on.

In real-time systems, a fairly simple scheduler is known as ‘earliest deadline first’: in such systems, each task comes with a «deadline» which tells the system when the task must be done, at the latest. In a preemptive setting, then, it is an optimal strategy to first execute the task with the earliest deadline.

A naive general-purpose, preemptive scheduler is known as round robin: it runs all available threads in a fixed order, switching from one to the next as they run out of their time slice. In this system, the per-thread throughput can be prioritised (by tweaking the relative sizes of time slices of different threads), but latency cannot. A preemptive scheduler with priorities fixes the latency problem: if a high-priority thread wakes up, it will be able to run very soon (with low latency), since it can ‘jump the queue’, unlike in a round-robin scheduler.
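The ‘earliest deadline first’ selection rule is simple enough to sketch directly. The following is a toy model with made-up tasks and integer time; a real EDF scheduler would also handle admission control and preemption of the running task.

```c
/* Earliest deadline first: among the tasks that still have work left,
 * always run the one whose deadline is nearest. */
#include <stdio.h>

struct task
{
    const char *name;
    int deadline;   /* time by which the task must be finished */
    int remaining;  /* time slices of work left                */
};

static struct task *earliest( struct task *tasks, int n )
{
    struct task *best = NULL;
    for ( int i = 0; i < n; ++i )
        if ( tasks[ i ].remaining > 0 &&
             ( !best || tasks[ i ].deadline < best->deadline ) )
            best = &tasks[ i ];
    return best;
}

int main( void )
{
    struct task tasks[] = { { "sensor read", 4, 1 },
                            { "actuator",    2, 1 },
                            { "logging",     9, 3 } };
    const int n = sizeof tasks / sizeof tasks[ 0 ];

    for ( int now = 0; ; ++now )
    {
        struct task *t = earliest( tasks, n );
        if ( !t )
            break;                      /* everything is done */
        printf( "t = %d: running %s (deadline %d)\n", now, t->name, t->deadline );
        --t->remaining;
    }
    return 0;
}
```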
│ Interactivity
│
│  • «throughput» vs «latency»
│  • «latency» is more important for «interactive» workloads
│    ◦ think phone or desktop systems
│    ◦ but also web servers
│  • «throughput» is more important for «batch» systems
│    ◦ think render farms, compute grids, simulation

A scheduler will often need to make a compromise between maximising total computational «throughput» and minimising «latency». In interactive settings, the dominant concern is latency, while in computational contexts, it's the total throughput that is usually a priority.

│ Reducing Latency
│
│  • «shorter» time slices
│  • more willingness to switch tasks (more «preemption»)
│  • «dynamic» priorities
│  • priority boost for «foreground» processes

When optimising for low latency, the scheduler should use short time slices (so that threads that wake up do not need to wait too long for the processor to free up, even if they are not high-priority threads). Likewise, the scheduler should be willing to switch to a higher-priority thread whenever one wakes up, preempting running threads (even if they have not consumed their current time slice yet). Both decisions of course lead to more context switches, which are expensive and hence negatively affect total throughput (less total useful computation is performed in any given time, since more time is spent on overhead – both in the scheduler, but mainly in switching threads and processes).

│ Maximising Throughput
│
│  • «longer» time slices
│  • «reduce context switches» to minimum
│  • cooperative multitasking

The opposite is true for high-throughput systems: the time slice should be as long as is practical, to minimize the number of context switches the CPU needs to do. Likewise, the scheduler should be unwilling to switch tasks before their current time slice runs out, pushing back threads that woke up in reaction to an event. Whenever possible, preemption should be avoided in favour of cooperation (though this is, nowadays, mainly an application-level concern).

│ Multi-Core Schedulers
│
│  • traditionally one CPU, many threads
│  • nowadays: many threads, many CPUs (cores)
│  • more complicated «algorithms»
│  • more complicated & «concurrent-safe» data structures

In traditional designs, an operating system would only expect a single processor to be available, in which case the scheduler would be comparatively simple. However, essentially all modern computers are multi-core systems, and schedulers need to take this into account in two major ways:

 1. the algorithm which decides which threads run when «and on which core» is much more complicated,
 2. the scheduler must be able to run concurrently with itself: multiple cores must be able to decide which thread to run next in parallel, which means that the data structures used by the scheduler to keep track of threads must be concurrent-safe.

│ Scheduling and Caches
│
│  • threads can «move» between CPU «cores»
│    ◦ important when a different «core is idle»
│    ◦ and a runnable thread is «waiting for CPU»
│  • but there is a «price» to pay
│    ◦ thread / process data is extensively «cached»
│    ◦ caches are typically «not shared» by all cores

It is entirely possible to suspend a thread on one core, and wake it up later on another core. Since processor cores are all identical, this does not make a semantic difference – the thread will not be able to directly detect that it has moved. However, if some of the data associated with that thread was still in the per-core part of the cache (typically at least L1, but also L2 and L3 if the new core is part of a different package), this will cause a performance hit, as the new core has to re-fetch the data that was already cached by the other core.

│ Core Affinity
│
│  • modern schedulers try to «avoid moving threads»
│  • threads are said to have an «affinity» to a core
│  • an extreme case is «pinning»
│    ◦ this prevents the thread from being «migrated» altogether
│  • in practice, this «improves throughput»
│    ◦ even if nominal core «utilisation» may be lower

For the above reason, schedulers will try to avoid moving threads between cores. Of course, there are competing concerns: if a thread is runnable and a core other than its preferred core is idle, inefficiency creeps in: the system keeps a processor core idle while a thread is waiting to run. A realistic scheduler needs to strike a balance between those two concerns: if the preferred core is about to be freed up, it will wait, otherwise it will migrate the thread.

Of course, this will sometimes go wrong: a higher-priority thread with affinity to the same core might wake up and the original waiting thread will have to be migrated anyway. For this reason, stronger affinity will cause a drop in core utilisation, but in a well-tuned scheduler, this should not cause a drop in throughput.
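Pinning is also exposed to user space; for instance, on Linux (a non-portable interface, shown purely as an illustration), a thread can restrict itself to a single core:

```c
/* Pin the calling thread to core 0 using the Linux-specific
 * sched_setaffinity interface; the scheduler will never migrate it. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main( void )
{
    cpu_set_t set;
    CPU_ZERO( &set );
    CPU_SET( 0, &set );                            /* allow only core 0 */

    if ( sched_setaffinity( 0, sizeof set, &set ) != 0 )  /* 0 = this thread */
    {
        perror( "sched_setaffinity" );
        return 1;
    }

    printf( "now pinned, running on core %d\n", sched_getcpu() );
    return 0;
}
```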
│ NUMA Systems
│
│  • «non-uniform memory» architecture
│    ◦ different memory is attached to different CPUs
│    ◦ each SMP block within a NUMA system is called a «node»
│  • «migrating» a process to a «different node» is expensive
│    ◦ thread vs node ping-pong can kill performance
│    ◦ threads of «one process» should live on one node

In a traditional SMP (symmetric multiprocessing) system, each CPU has equal access to all RAM. However, in large computers with many CPUs, it is advantageous (from the point of view of hardware design) to violate this principle, and allow RAM to be attached locally to a single processor package. Cores from another such «node» then need to access this memory indirectly, adding a penalty both in terms of throughput (how many bytes per unit of time can be transferred from memory to the CPU cache) and in terms of latency (how many cycles it takes from the time data is requested to the time it is available to the processor).

A scheduler needs to take this into account: the affinity of threads (and entire processes) to a particular NUMA node is usually much stronger than the affinity of threads to individual cores within an SMP system. Unlike with purely cache-related overhead, the performance hit of running on a different NUMA node is permanent: the data is not automatically migrated into a ‘closer’ block of memory.

## Interrupts and Clocks

In this section, we will look at how preemptive scheduling is actually implemented, through periodic hardware interrupts.

│ Interrupt
│
│  • a way for hardware to «request attention»
│  • CPU «mechanism» to divert execution
│  • partial (CPU state only) «context switch»
│  • switch to «privileged» (kernel) CPU mode

Interrupts allow peripherals to request the attention of the operating system. This is a low-level CPU feature, where asserting an interrupt line (in hardware) causes the processor to stop executing instructions and execute an «interrupt handler». When an interrupt comes in, the CPU will consult a table of interrupt handlers, which is set up by the operating system. Additionally, the CPU will store its state (the values stored in registers) into memory (usually on the active stack).

Usually, only a partial context switch is done by the CPU (i.e. the page table is unaffected), but the interrupt routine will execute in privileged mode (hence it can, if it needs to, adjust page tables). In traditional designs, the kernel memory is mapped (but inaccessible from user mode) into all processes, and hence the interrupt handler can directly and immediately read and write kernel memory. When the interrupt routine is finished, it uses a special instruction (‹iret› on ‹x86›) which instructs the CPU to restore its state from the stack (which it stored upon entering the interrupt handler) and drop back into user (unprivileged) mode.
│ Hardware Interrupts
│
│  • «asynchronous», unlike software interrupts
│  • triggered via «bus signals» to the CPU
│  • IRQ = interrupt request
│    ◦ just a different «name» for hardware interrupts
│  • PIC = programmable «interrupt controller»

We have already talked about software interrupts earlier: interrupt handlers are quite alike in both cases (that is, for hardware and software interrupts), but there are other differences. In particular, a hardware interrupt is «asynchronous»: the CPU will invoke the handler regardless of the instruction that is currently being executed. Hardware-wise, interrupts are realized through bus signalling – a peripheral (or rather an interrupt controller) will send a signal down a wire to the CPU, which will react by starting its interrupt handling sequence.

│ Interrupt Controllers
│
│  • PIC: «simple circuit», typically with 8 input lines
│    ◦ peripherals connect to the PIC with wires
│    ◦ PIC delivers prioritised signals to the CPU
│  • APIC: advanced programmable interrupt controller
│    ◦ split into a shared «IO APIC» and per-core «local APIC»
│    ◦ typically 24 incoming «IRQ lines»
│  • OpenPIC, MPIC: similar to APIC, used by e.g. Freescale

An interrupt controller is essentially a hub which provides a number of interrupt lines to peripherals, and signals the CPU using a single interrupt line whenever any of the peripherals raises an interrupt. The interrupt controller of course informs the CPU which peripheral is the origin of the interrupt.

│ Timekeeping
│
│  • PIT: «programmable interval timer»
│    ◦ crystal oscillator + divider
│    ◦ «IRQ line» to the CPU
│  • local APIC timer: built-in, «per-core clock»
│  • HPET: high-precision event timer
│  • RTC: real-time clock

A (programmable) timer is another traditionally discrete component that has long been integrated into the main CPU die. Its role is to keep time (historically using a quartz crystal), and to raise an interrupt at regular, configurable intervals.

│ Timer Interrupt
│
│  • generated by the PIT or the local APIC
│  • the OS can «set the frequency»
│  • a hardware interrupt happens on each «tick»
│  • this creates an opportunity for bookkeeping
│  • and for «preemptive scheduling»

Every time the configured interval expires, the timer will generate an interrupt. When this happens, the OS kernel has a chance to run, regardless of the currently executing process. Among other things, it is possible to do a context switch by doing an ‹iret› (or equivalent) into a different thread or process from the one that was interrupted. This is how preemptive scheduling is usually implemented.

│ Timer Interrupt and Scheduling
│
│  • measure how much time the current thread took
│  • if it ran out of its slice, «preempt it»
│    ◦ «pick» a new «thread» to execute
│    ◦ perform a context switch
│  • checks are done on each tick
│    ◦ «rescheduling» is usually less frequent

The scheduler part of the interrupt handler for the timer interrupt quickly checks whether anything needs to be done: that is, it looks up how much of its time slice the currently running thread has left. If it ran out, then a «reschedule» is in order: the handler will return into a different thread (that is, perform a context switch – if needed, this includes replacing the active page table). The slice check is done on every tick, since it is cheap. On most ticks, however, no rescheduling will take place, since a typical slice is many ticks long (and a reschedule is much more expensive).
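The per-tick logic can be sketched as follows. This is a toy simulation in ordinary user-space C: in a real kernel, ‹timer_tick› would be called from the timer interrupt handler and ‹reschedule› would perform an actual context switch rather than print a message.

```c
/* Every tick is cheap (decrement a counter); only when the slice runs out
 * do we pay for a reschedule. */
#include <stdio.h>

#define SLICE_TICKS 10      /* e.g. a 10 ms quantum with a 1 kHz tick */

static int remaining = SLICE_TICKS;
static int current = 0;     /* index of the running 'thread' */

static void reschedule( void )
{
    current = ( current + 1 ) % 3;      /* round robin over 3 threads */
    remaining = SLICE_TICKS;
    printf( "context switch -> thread %d\n", current );
}

static void timer_tick( void )
{
    if ( --remaining == 0 )             /* slice used up: preempt */
        reschedule();
}

int main( void )
{
    for ( int tick = 0; tick < 35; ++tick )  /* simulate 35 timer interrupts */
        timer_tick();
    return 0;
}
```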
│ Timer Interrupt Frequency
│
│  • typical is 100 Hz
│  • this means a 10 ms «scheduling tick» (quantum)
│  • 1 kHz is also possible
│    ◦ harms throughput but «improves latency»

The traditional timer frequency on UNIX systems is 100 Hz, or one tick every 10 milliseconds. This means that time slices available to the scheduler are multiples of 10 milliseconds. For interactive systems, the frequency can be increased, with 1 kHz being the highest commonly used setting (yielding a quantum just 1 ms long, cutting down on scheduling latency quite a bit). Of course, the CPU will spend a lot more time in the timer interrupt handler, reducing the amount of useful work it can perform, thus harming throughput.

│ Tickless Kernels
│
│  • the timer interrupt «wakes up» the CPU
│  • this can be «inefficient» if the system is «idle»
│  • alternative: use «one-off» timers
│    ◦ allows the CPU to «sleep longer»
│    ◦ this «improves power efficiency» on light loads

Modern processors implement aggressive power management, shutting down parts of the CPU that are not in use. This technique is of course more efficient if the idle circuits do not need to be woken up very often. If nothing at all is going on, then an entire core can be put to sleep. Unfortunately, the timer interrupt interferes with this somewhat: while the system is completely idle (no threads are ready to run), the processor will still need to process the interrupt, 100 times every second. This has a considerable energy cost, just to figure out that nothing needs to be done.

Instead of using a periodic timer, it's possible to configure a one-off, non-repeating timer, and when it fires, compute when the next scheduling interrupt is needed. This way, during idle times, the processor can sleep undisturbed for much longer periods of time.

│ Tickless Scheduling
│
│  • «quantum length» becomes part of the planning
│  • if a core is idle, wake up on next «software timer»
│    ◦ synchronisation of software timers
│  • other interrupts are «delivered as normal»
│    ◦ network or disk activity
│    ◦ keyboard, mice, ...

This is done by simply doing away with the fixed scheduling quantum. The scheduler can compute when the next reschedule needs to happen, and not waste any time in the timer interrupt unless the slice is over.

Sleeping threads may be blocked for any number of reasons, but most commonly they are waiting for events or resources – and one of the events that are commonly awaited is a timer: the thread has some periodic work to do, and may ask the kernel to be woken up in a second, then do some of its work, and then sleep for another second, and so on. If, at a given moment, all processes are sleeping, then the next timer interrupt needs to happen no later than when the first of the threads sleeping on a timer is due to wake up.

Of course, other interrupts are still delivered as usual, so if a thread is blocked waiting for data to arrive from disk or network, the respective interrupts can cause that thread to be woken up and scheduled for execution.
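Under such a scheme, the next one-shot timer is simply the earliest of the pending obligations: the end of the current slice (if a thread is running at all) and the soonest software timer of a sleeping thread. A toy sketch of that computation (all names and numbers are made up for illustration):

```c
/* Compute when the next timer interrupt should be programmed, for a busy
 * core (a slice is running out) and for an idle one (only software timers
 * matter). */
#include <limits.h>
#include <stdio.h>

struct sleeper { const char *name; long wakeup_at; };   /* absolute time, ms */

static long next_interrupt( long now, long slice_end,
                            const struct sleeper *s, int n )
{
    long next = slice_end;              /* preemption point of the running thread */
    for ( int i = 0; i < n; ++i )
        if ( s[ i ].wakeup_at > now && s[ i ].wakeup_at < next )
            next = s[ i ].wakeup_at;    /* an earlier software timer wins */
    return next;
}

int main( void )
{
    struct sleeper sleepers[] = { { "network poll", 250 },
                                  { "cron",         60000 } };
    long now = 100;

    long busy = next_interrupt( now, now + 10, sleepers, 2 ); /* thread running */
    long idle = next_interrupt( now, LONG_MAX, sleepers, 2 ); /* core is idle   */

    printf( "busy: next interrupt at %ld ms\n", busy );   /* end of slice: 110  */
    printf( "idle: next interrupt at %ld ms\n", idle );   /* first sleeper: 250 */
    return 0;
}
```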
│ Other Interrupts
│
│  • serial port
│    ◦ «data is available» on the port
│  • «network» hardware
│    ◦ data is available in a packet queue
│  • keyboards, mice
│    ◦ «user» pressed a key, moved the mouse
│  • USB devices in general

Besides the timer interrupt, there are many others that come up, and many of them indirectly affect scheduling too (usually because one of the threads was waiting for the event that the interrupt signalled).

│ Interrupt Routing
│
│  • not all CPU cores need to see all interrupts
│  • APIC can be told how to deliver IRQs
│    ◦ the OS can «route IRQs» to CPU cores
│  • multi-core systems: IRQ «load balancing»
│    ◦ useful to «spread out» IRQ overhead
│    ◦ especially useful with «high-speed networks»

Finally, let us consider systems with multiple cores (probably the majority of computers nowadays) and how this affects interrupts: clearly, if a packet comes in from the network, it does not make sense to interrupt all cores – we only need a single core to execute the handler to fetch the data or otherwise deal with the event. What is more, this does not need to be the same core every time the interrupt happens: if there are many interrupts, we would like them spread out somewhat evenly across all the cores. This is what IRQ load balancing does: it will route interrupts to cores in such a way that each core has approximately the same amount of work handling them.

Lastly, since scheduling is done per-core, it is useful to have a separate timer interrupt for each core. Those interrupts are of course not subject to load-balancing or other re-routing – the interrupt is, in fact, generated locally on each core (in the local APIC), and the per-core timer which generates the interrupt is programmed separately.

│ Review Questions
│
│  17. What is a thread and a process?
│  18. What is a (thread, process) scheduler?
│  19. What do ‹fork› and ‹exec› do?
│  20. What is an interrupt?