# Device Drivers

Abstracting hardware is one of the major roles of an operating system. While we have already discussed the basic hardware resources (CPU and memory) in detail in previous lectures, so-called peripherals also play an important role. In this lecture, we will look at the interface between the operating system and the peripheral hardware (network interface cards, persistent storage, removable storage, displays, input devices and so on). │ Lecture Overview │ │ 1. Drivers, IO and Interrupts │ 2. System and Expansion Buses │ 3. Graphics │ 4. Persistent Storage │ 5. Networking and Wireless

## Drivers, IO and Interrupts

In the first part, we will discuss the low-level aspects of hardware interaction: how the data moves between the CPU and the peripheral, how peripherals signal events to the CPU, and how this all relates to the operating system (which is running on the CPU). │ Input and Output │ │ • we will mostly think in terms of IO │ • peripherals produce and consume «data» │ • «input» – reading data produced by a device │ • «output» – sending data to a device │ • «protocol» – valid sequences of IO events While peripherals can be rather complicated, we will think of them in an abstract, simplified way, as devices which produce and consume data. The other crucial component in our understanding of devices will be «events». The valid sequences of events and the inputs and outputs tied to those events are described by a «protocol». Data transfers coupled to events (i.e. when they happen in a specific time pattern) can represent a rather wide variety of behaviours and effects. Consider a keyboard: when the user presses a key, that is an event, and is accompanied by a data transfer, which tells the system which key was pressed (or released). Likewise, when a mouse is moved, a stream of data which describes the relative motion is sent to the computer. Other types of devices receive data instead: take a display, as an archetype of that type of device: the computer (more or less continuously) sends data which represents the pixels to show on the screen and which is in turn shown to the user. Yet other devices accept commands (which are also of course a form of data) and again respond with data (responses to the commands). Consider a disk drive: when the system wishes to store some data, it will send a command (along with the payload, i.e. the data to be stored) and receives a confirmation. Likewise, when it wishes to retrieve some data, it sends a read command and receives a reply, which includes the data which was stored at the requested address. │ What is a Driver? │ │ • piece of «software» that talks to a «device» │ • usually quite specific / «unportable» │ ◦ tied to the particular «device» │ ◦ and also to the «operating system» │ • often part of the «kernel» Clearly, the input data needs to be processed and output generated. It is also rather clear that the form and content of the data will be specific to the particular device. Hence the software needs to be able to construct and understand data in the form understood by the particular device. The software in charge of this communication is known as a «driver», and in the light of the above, it is rather clear that any given driver is paired off with a specific device, or a small class of devices. Or, to be more precise, the driver implements one side of the «protocol» (the other side is implemented by the device itself).
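To make the OS-facing side concrete: a kernel typically describes a driver as a table of entry points that the rest of the system calls through. The following is a minimal sketch; the names and the exact shape of the table are invented for illustration, and every real kernel has its own variant.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: one way a kernel can abstract block-device
 * drivers as a table of operations. Names are made up. */
struct block_driver_ops {
    int (*read)(void *dev, uint64_t sector, void *buf, size_t count);
    int (*write)(void *dev, uint64_t sector, const void *buf, size_t count);
    int (*flush)(void *dev);
};

/* The rest of the kernel calls through the table and never needs to
 * know which device (or which device-side protocol) hides behind it. */
static inline int block_read(struct block_driver_ops *ops, void *dev,
                             uint64_t sector, void *buf, size_t count)
{
    return ops->read(dev, sector, buf, count);
}
```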
At first glance, there does not appear to be a good reason why a driver should be specific to a particular operating system: after all, the protocol used by the device will be the same, regardless of the operating system running on the CPU. But of course, the device-side protocol is only one part of the driver: the other part is communication with the operating system. This communication is performed using a set of interfaces that are usually specific to a given operating system (though portable drivers do exist). The other issue that ties drivers to a specific operating system is that drivers usually need to cooperate with each other: later in the lecture, we will see that devices are connected through other devices, and the peripheral driver needs to talk to the bus driver to talk to the peripheral. │ Kernel-mode Drivers │ │ • they are part of the kernel │ • running with full «kernel privileges» │ ◦ including «unrestricted» hardware access │ • no or «minimal» context switching «overhead» │ ◦ fast but dangerous In some sense, the simplest type of driver is one that is part of the kernel. A driver of this type can use all CPU and hardware facilities necessary to communicate with its device directly, without going through a middleman. Since no processes are involved, it also means that the code of the driver can run without context switches whenever necessary (e.g. in response to a hardware «interrupt»). This makes kernel-mode drivers particularly fast (low-overhead), but their unrestricted access to hardware and memory makes any problems in such a driver very serious. If the driver crashes, for instance, it'll usually take the entire operating system with it. │ Microkernels │ │ • drivers are «excluded» from microkernels │ • but the driver still needs «hardware access» │ ◦ this could be a special «memory region» │ ◦ it may need to «react» to «interrupts» │ • in principle, everything can be done «indirectly» │ ◦ but this may be quite «expensive», too While kernel-mode drivers are ubiquitous in monolithic kernel designs, they are all but banished from microkernels. Instead, each driver is a separate process and executes in user mode of the CPU. However, many drivers require some level of direct hardware access in order to communicate with their device: most often interrupt handlers and reads/writes to a specific area of physical memory. The latter can be arranged easily enough (just map that area of memory into the driver process). The former, however, is a problem: interrupt handlers (we will look at those in more detail shortly) always run in privileged mode and hence the driver cannot install one. Instead, the kernel will relay the interrupt to the driver process using some form of IPC, often precipitating an expensive context switch. │ User-mode Drivers │ │ • many drivers can run completely in «user space» │ • this improves «robustness» and «security» │ ◦ driver bugs can't bring the «entire system» down │ ◦ nor can they compromise system «security» │ • possibly at some «cost» to «performance» Drivers running in user mode are not exclusive to microkernels, and while they have downsides, they also have many desirable properties. Since they are isolated from the kernel, from each other and from other programs running on the system, crashes and other bugs in a driver cannot compromise the rest of the system (at least not directly, though if a peripheral is mis-programmed, it may still crash the system or make it otherwise unusable). Of course, security is also significantly improved. 
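The interrupt-relay arrangement described above can be pictured as a simple receive loop in the driver process. This is a purely illustrative sketch: all the names below are invented, and real microkernels (e.g. seL4 or Minix 3) each have their own IPC primitives.

```c
/* Hypothetical main loop of a user-mode driver in a microkernel.
 * ipc_wait, irq_ack, IRQ_ENDPOINT and struct message are all made up. */
struct message { int type; /* ... payload ... */ };

struct message ipc_wait(int endpoint);      /* blocks the process   */
void irq_ack(int endpoint);                 /* re-enables the IRQ   */
void handle_device_event(struct message *); /* pokes mapped MMIO    */

enum { IRQ_ENDPOINT = 1 };

void driver_main(void)
{
    for (;;) {
        /* the kernel turns the hardware interrupt into an IPC message,
         * which costs a context switch into this process */
        struct message msg = ipc_wait(IRQ_ENDPOINT);
        handle_device_event(&msg);
        irq_ack(IRQ_ENDPOINT);
    }
}
```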
│ Drivers in Processes │ │ • user-mode drivers typically run in their own «process» │ • this means «context switches» │ ◦ every time the device demands attention (interrupt) │ ◦ every time «another process» wants to use the device │ • the driver needs «system calls» to talk to the device │ ◦ this incurs even more overhead Let's look at the model where each driver runs in its own process. As we have already mentioned, a major problem in this model is due to additional context switches, which happen when: 1. an interrupt arrives from the hardware and some other process is executing at the time (which is almost always), 2. another process on the system tries to use the device: the request must go through the driver, which means that it needs to run, and hence its process needs to be scheduled before the request can be served. Neither of those happens with kernel-mode drivers. Finally, to perform any privileged operation, the driver must perform a system call – while it is less expensive than a context switch, it is also significantly more expensive than a normal function call. │ In-Process Drivers │ │ • what if (a large portion of) a driver could be a «library» │ • best of both worlds │ ◦ «no» context switch «overhead» for requests │ ◦ bugs and security problems remain «isolated» │ • often used for GPU-accelerated 3D graphics There is an alternative model, which mitigates some of the downsides of user-mode drivers. In particular, the second source of context switches can be (at least partially) eliminated by running the driver in the same process as the application which uses the device. How would this work? The driver can come as a library and the application links to that library. Of course you would want to link the driver dynamically, so that a different driver (e.g. for a different device of the same general type) can be substituted without recompiling the application. There are some issues that need to be resolved with regard to permissions, but in principle, an in-process (library) driver can use the same system calls that a driver running in its own process could. Effects of possible bugs or misbehaviour in the driver are limited to the particular process in which it runs. It is also common that multiple processes can use the same device, each using its own ‘instance’ of the driver. However, this model is not applicable when the driver needs to be protected from the application or the driver needs to perform multiplexing (i.e. it is not possible to have multiple independent instances of the driver talk to the same device, but the device needs to be nevertheless shared by multiple processes). │ Port-Mapped IO │ │ • early CPUs had very limited «address space» │ ◦ 16-bit addresses mean 64KB of memory │ • peripherals got a «separate» address space │ • «special instructions» for using those addresses │ ◦ e.g. ‹in› and ‹out› on ‹x86› processors Let's now look at how the CPU communicates with peripherals and how this affects drivers. Some old CPUs (most famously Intel 8086) had 2 distinct address spaces, one for memory and another for peripherals. The latter could be accessed using special-purpose instructions, which would move values between CPU registers and peripherals. In later iterations of the ‹x86› family, when memory protection (an MMU and privilege levels) was added, those instructions became privileged. Hence, only the kernel can talk to devices which are attached to the CPU through this mechanism.
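For illustration, this is how the ‹in› and ‹out› instructions are commonly wrapped in C, a minimal sketch using GCC-style inline assembly (being privileged, these only work in kernel mode):

```c
#include <stdint.h>

/* Minimal x86 port-mapped IO wrappers (GCC/Clang inline assembly). */
static inline void outb(uint16_t port, uint8_t value)
{
    __asm__ volatile ("outb %0, %1" : : "a"(value), "Nd"(port));
}

static inline uint8_t inb(uint16_t port)
{
    uint8_t value;
    __asm__ volatile ("inb %1, %0" : "=a"(value) : "Nd"(port));
    return value;
}
```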
But the IO instructions have been largely abandoned, and are only used with legacy devices. │ Memory-mapped IO │ │ • devices «share» address space with memory │ • «more common» in contemporary systems │ • IO uses the same instructions as memory access │ ◦ ‹load› and ‹store› on RISC, ‹mov› on ‹x86› │ • allows «selective» user-level access (via the MMU) The alternative to port-mapped IO is memory-mapped IO (MMIO for short), where the physical address space is shared by RAM and peripherals. Writing to some addresses (using e.g. ‹mov› on ‹x86›) will store the data in RAM, but simply changing the address will result in the data being sent to a peripheral (typically to be stored in an onboard register or memory). A specific example would be the PCIe configuration space: each PCIe device must expose a single page (4KiB) of MMIO address space through which it can be enumerated and configured. Unlike port-mapped IO, access to the physical memory address space is managed by the MMU (including the regions assigned to devices, not just those given to RAM). Hence it is possible to securely allow a certain process to talk to a specific device by mapping the corresponding chunk of the physical address space into the virtual address space of that process. │ Programmed IO │ │ • input or output is «driven» by the «CPU» │ • the CPU must «wait» until the device is ready │ • would usually run at «bus speed» │ ◦ 8 MHz for ISA (and hence ATA-1) │ • PIO would talk to a «buffer» on the device Another way to look at IO is how it is «timed». Peripherals are usually orders of magnitude slower than the main CPU and the CPU must wait a significant number of cycles between, for instance, issuing commands to a given device. Commands are usually realized by writing data to the on-board registers of the device. The device periodically reads those registers and acts accordingly, perhaps writing the response into some other register, which the CPU can then read (both input and output are done using one of the mechanisms described above: port-mapped or memory-mapped IO). The simplest form of timing is called «programmed IO» or PIO. In this mode, the CPU drives the data transfer and it has to actively wait for the device (or rather the bus) to become ready after each transfer. Consider sending data to a disk: there is a RAM-based buffer in the disk controller, one that can hold at least a single physical disk sector worth of data. The CPU can transfer data into this buffer at bus speed, e.g. 8MHz for ISA (admittedly a very old technology). If the CPU core runs at 32MHz, this means that it can only send data every fourth cycle. It has to spend 3 out of every 4 cycles waiting for the bus to become ready. │ Interrupt-driven IO │ │ • peripherals are «much» slower than the CPU │ ◦ «polling» the device is expensive │ • the peripheral can «signal» data availability │ ◦ and also «readiness» to accept more data │ • this «frees up CPU» to do other work in the meantime Some peripherals can only process very small amounts of data at once, and are much slower still than the bus. As an extreme example, consider a serial port configured to send 9600 bits per second. That works out to 1200 characters per second: with an on-board buffer for 60 characters, the CPU needs to fill that buffer at 20Hz, i.e. with a period of 50 milliseconds, which is of course an eternity in CPU time (almost 2 million cycles at 32MHz).
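To make programmed IO concrete, here is a minimal sketch of busy-waiting output to the legacy PC serial port (COM1 at its conventional IO ports), reusing the ‹inb›/‹outb› wrappers from the earlier sketch; port initialisation and error handling are elided.

```c
#include <stddef.h>
#include <stdint.h>

#define COM1_DATA   0x3f8 /* transmit holding register (when writing) */
#define COM1_STATUS 0x3fd /* line status register */
#define THR_EMPTY   0x20  /* bit 5: ready to accept another byte */

/* Programmed IO: the CPU drives the transfer and busy-waits for the
 * device. Uses the inb/outb wrappers defined in the previous sketch. */
void serial_send(const uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        while (!(inb(COM1_STATUS) & THR_EMPTY))
            ; /* actively waiting: nothing else gets done */
        outb(COM1_DATA, buf[i]);
    }
}
```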
So you would perhaps use PIO to fill in that 60 character buffer (at bus speed, so 25 % efficiency, working out to 240 cycles), but actively waiting for the buffer to drain would be madness. Fortunately, the serial port hardware can be configured to cause an «interrupt» when the buffer becomes empty. The CPU can go about doing whatever, but the serial port driver will be woken up to fill in the next 60 bytes when needed, once every 50ms or so. The same mechanism can be used for receiving data: the hardware will cause an interrupt once the receive buffer becomes full and needs to be read by the CPU. │ Interrupt Handlers │ │ • also known as «first-level» interrupt handler │ • they must run in «privileged» mode │ ◦ they are part of the «kernel» by definition │ • the low-level interrupt handler must finish «quickly» │ ◦ it will mask its own interrupt to avoid «re-entering» │ ◦ and «schedule» any long-running jobs for later (SLIH) Upon a hardware interrupt, the CPU will drop whatever it was doing, save its current state into a designated memory area and transfer control to an «interrupt handler». Or rather, one of the CPU cores will. This handler is automatically executed in privileged mode, and hence is by definition part of the kernel. Notice that no context switch occurs: though registers are written into memory, the page table is unaffected – the interrupt handler runs in the context of whatever process was running at the time. This is analogous to how system calls behave. To avoid issues with reentrancy, a first-level handler will usually «mask» its own interrupt (cause the CPU to temporarily ignore it). This is one of a number of reasons why the first-level handler needs to finish quickly (if an interrupt is masked for too long, this can cause data to be lost, e.g. due to buffer overruns). Hence the first-level handler will usually do the minimum required work (e.g. clear time-critical buffers) and schedule any further processing for a later time. │ Second-level Handler │ │ • does any expensive «interrupt-related» processing │ • can be executed by a «kernel thread» │ ◦ but also by a user-mode driver │ • usually not time critical (unlike first-level handler) │ ◦ can use standard «locking» mechanisms The work that was deferred by the first-level handler is picked up by a second-level handler. This handler can run in a kernel thread or even in a user-mode process. The second-level routine is usually not time critical and can synchronize with the rest of the system as needed. A second-level handler of a disk device could, for instance, call into the file system to notify it that a piece of data it has requested has arrived, which in turn could trigger a suspended ‹read› system call to write the data into the address space of a waiting process. The syscall then returns and the process is woken up. │ Direct Memory Access │ │ • allows the device to directly read/write «memory» │ • this is a «huge» improvement over «programmed» IO │ • «interrupts» indicate buffer «full»/«empty» │ • devices can read and write arbitrary «physical» memory │ ◦ opens up «security» / reliability problems The last mode of IO is known as DMA, or Direct Memory Access. While there is some superficial similarity with MMIO (memory-mapped IO), it is important to distinguish them. In MMIO, the CPU (and by extension, the OS) talks to the device using the memory subsystem, mapping the on-board memory or registers of that device into the physical address space of the CPU.
The situation in DMA is flipped: the CPU and the device do not talk to each other at all. Instead, physical memory attached to the CPU (i.e. the main RAM) is made accessible to the peripheral, which can then transfer data to RAM. The CPU still uses memory access instructions to fetch data that came from the device (like in MMIO) but it does not communicate with the device directly. Instead, it reads and writes into its own RAM, which just happens to contain data that the device wrote there, or will later read. To summarise: • both MMIO and DMA use memory access instructions on the CPU to read and write data, • under MMIO, the main memory is «not involved» at all, • under DMA, «both» the device and the CPU «access main memory», • under DMA, there is no direct bulk transfer of data between the CPU and the device. The use of MMIO and DMA is not exclusive, rather to the contrary: devices often use a combination of both. In fact, MMIO can be used to configure DMA (the latter is unsuitable for configuration, but performs better for bulk data transfers). │ IO-MMU │ │ • like the MMU, but for DMA transfers │ • allows the OS to «limit» memory access per device │ • very useful in «virtualisation» │ • only recently found its way into «consumer» computers While DMA is extremely important for devices which transfer a lot of data (HDDs, SSDs, NICs), it has some nasty security and safety implications. Under ‘traditional’ DMA, the device can read and write any physical memory it wants to. For instance, it can overwrite kernel code if it so wishes. A rogue device could then very easily circumvent any software-level security. Perhaps even more importantly, a rogue driver could program the device to overwrite memory with data (and code) of the driver's choosing. This is undesirable, especially if we want to use user-mode drivers, or if the device is not sufficiently secure.¹ The IO-MMU is a device which fixes this problem, by enforcing limits on which memory a particular peripheral can access. The IO-MMU, like the regular MMU, can only be programmed by the OS kernel (or a hypervisor, as it may be… we will learn more about those in Chapter 11). With a correctly programmed IO-MMU, DMA is safe and secure. ¹ Famously, any device attached to a firewire port – an external port, kind of like high-speed USB before USB 3 was a thing – can read and write any and all host memory. It is not impossibly hard to build a rogue firewire device and attach it to someone else's computer. Other connectors which expose high-speed buses may be susceptible.

## System and Expansion Buses

The rest of the lecture will be a tour of peripherals and some of their history. Before we get to the peripherals themselves, though, we will look at buses which are used to connect peripherals to the CPU (or CPUs) and, in some cases, to RAM. While the bus itself is not a peripheral, it is common for a bus to have a driver of its own. There are two reasons for this: 1. all but the simplest buses have additional hardware, which mediates bus access, takes care of device configuration and enumeration, and so on, and which needs to be itself configured, 2. besides the electronics and signalling, a bus also conceptually comes with a set of «protocols», which need to be implemented both by the peripherals and by their drivers; the bus driver implements those protocols: other drivers make simple function calls and the driver translates them to the required MMIO or port-mapped IO operations.
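The ‘function calls translated into MMIO’ idea from point 2 can be sketched in a few lines of C. The names below are invented for illustration; the point is only that a load or store through a mapped pointer becomes a device access.

```c
#include <stdint.h>

/* Hypothetical sketch of the interface a bus driver might hand out to
 * peripheral drivers. The bus driver maps the device registers and
 * wraps the raw MMIO behind ordinary functions. Names are made up. */
struct bus_device {
    volatile uint32_t *regs; /* device registers, mapped by the bus driver */
};

static inline void bus_write32(struct bus_device *dev, unsigned reg,
                               uint32_t val)
{
    dev->regs[reg] = val;    /* a store instruction becomes a device write */
}

static inline uint32_t bus_read32(struct bus_device *dev, unsigned reg)
{
    return dev->regs[reg];   /* a load instruction becomes a device read */
}
```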
With that said, let's look at some of the historical buses which were used in PCs over time and how they evolved to the current state of the art, PCI Express. │ History: ISA (Industry Standard Architecture) │ │ • 16-bit system «expansion» bus on IBM PC/AT │ • «programmed» IO and «interrupts» │ • a fixed number of hardware-configured «interrupt lines» │ ◦ likewise for I/O port ranges │ ◦ the HW settings then need to be «typed back» for SW │ • parallel data and address transmission One of the oldest expansion buses, which made an appearance with IBM PC/AT (a personal computer based on Intel 80286). The ISA bus was hooked to the CPU via IO ports (no MMIO) and provided an interrupt line to each peripheral. A limited number of DMA ‘channels’ were provided by a DMA controller, allowing attached peripherals (mainly storage devices) to move data to and from memory independently from the main CPU.² It was not possible to enumerate the bus, much less to configure the peripherals, in software. The port ranges and IRQ lines were selected by the hardware (either hardcoded, or configurable with jumpers or switches) and had to be given to the driver by the user (i.e. it had to be ‘typed back’ manually). Hardware-wise, the bus was a parallel design, synchronously transferring 16 bits across 16 wires at the same clock tick. Separate data and address lanes were used: an address could be transferred with the same clock tick as a data word. ² In this setup, the DMA controller actually becomes the bus master and performs the transfer. While the effect is essentially the same, the implementation is rather different than with the DMA based on peripherals becoming bus masters that we will encounter later in the lecture. │ MCA, EISA │ │ • MCA: Micro Channel Architecture │ ◦ «proprietary» to IBM, patent-encumbered │ ◦ 32-bit, «software-driven» device configuration │ ◦ expensive and ultimately a market «failure» │ │ • EISA: Enhanced ISA │ ◦ a 32-bit extension of ISA │ ◦ mostly created to avoid MCA licensing costs │ ◦ short-lived and replaced by PCI At 8MHz and 16 bits, ISA eventually started to be a limiting factor, since both CPUs and peripherals – especially graphics adapters, but also storage devices – were getting a lot faster. │ VESA Local Bus │ │ • memory mapped IO & fast «DMA» on otherwise ISA systems │ • «tied» to the 80486 line of Intel (and AMD) CPUs │ • primarily for «graphics cards» │ ◦ but also used with hard drives │ • quickly fell out of use with the arrival of PCI VESA Local Bus, or VLB, was a fairly successful effort to standardize a disparate set of home-grown buses designed to accommodate faster graphics hardware than what was possible with ISA, while avoiding the licensing costs of MCA. The VLB essentially connected the peripheral directly to the 80486 memory bus, using an additional connector (as an extension of standard ISA). Due to incompatible memory bus design in later processors, VLB did not survive the upgrade to Pentium. │ PCI: Peripheral Component Interconnect │ │ • a 32-bit successor to ISA │ ◦ 33 MHz (compared to 8 MHz for ISA) │ ◦ later revisions at 66 MHz, PCI-X at 133 MHz │ ◦ with support for «bus-mastering» and DMA │ • still a «shared», parallel bus │ ◦ all devices share the same set of wires The breakthrough in peripheral interconnects came with PCI, which provided most of the benefits of MCA while avoiding some of its problems. Perhaps the most important update was software-based configuration, but the considerable bandwidth upgrade did not hurt either. 
From a modern perspective, the one downside was the topology: a shared, parallel bus connecting all the devices in the system. Parallel here means that 32 bits are transmitted with each clock cycle, along 32 separate wires. This limits achievable clock speeds due to signal delay differences along traces of different length – modern buses transmit data serially, each data wire on its own clock. │ Bus Mastering │ │ • normally, the CPU is the bus «master» │ ◦ which means it initiates communication │ • it's possible to have multiple masters │ ◦ they need to agree on a conflict resolution protocol │ • usually used for accessing the memory On a shared bus, one of the devices is usually the master and is in charge of the bus and the traffic on it. Normally, this is the CPU. However, for DMA transfers (between memory and a peripheral), the CPU should not be involved, since the entire point is to free up the CPU to do other work while the transfer is going on. To facilitate these transfers, then, the peripherals can temporarily become bus masters, directing the traffic. An arbitration protocol ensures there is at most a single master driving the bus at any given time. │ DMA (Direct Memory Access) │ │ • the most common form of bus mastering │ • the CPU tells the device what and where to write │ • the device then sends data directly to RAM │ ◦ the CPU can work on other things in the meantime │ ◦ completion is signaled via an interrupt In principle, it is possible for peripherals to talk to each other when one of them is the bus master. However, this is not usually done: instead, the (temporary) bus master performs a data transfer to or from the main memory. │ Plug and Play │ │ • the ISA system for IRQ configuration was «messy» │ • MCA pioneered software-configured devices │ • PCI further improved on MCA with “Plug and Play” │ ◦ each PCI device has an ID it can «tell» the system │ ◦ enables «enumeration» and automatic «configuration» An important aspect of PCI (and MCA before it) was software-based configuration and enumeration of connected devices. This allows the firmware and the operating system to discover what devices are connected, load the appropriate drivers and set up the devices without user intervention. │ PCI IDs and Drivers │ │ • PCI allows for device enumeration │ • device «identifiers» can be paired to device «drivers» │ • this allows the OS to load and configure its drivers │ ◦ or even download / install drivers from a vendor Enumeration has two components: one is a system to discover and configure the devices attached to the system. This is done by using a common, device-independent protocol which must be implemented by all PCI devices. The other is a system for assigning a unique identifier to each peripheral, a so-called PCI ID. An operating system can then include a database of known PCI IDs and corresponding drivers for that device. Loading that driver typically makes the device available for use by the rest of the operating system, and hence by the user. │ AGP: Accelerated Graphics Port │ │ • PCI eventually became too «slow» for GPUs │ ◦ AGP is based on PCI and only «improves performance» │ ◦ enumeration and configuration stays the same │ • adds a dedicated «point-to-point» connection │ • multiple transfers per clock (up to 8, for 2 GB/s) Of course, peaking around 4 Gib/s (500 MiB/s), PCI is not the end of the story. In a clear historic pattern, graphics hardware became limited by its connection to the rest of the system (CPU and memory). 
Like with VLB, a dedicated graphics bus has become widespread, this time based on PCI, with essentially two modifications: 1. the bus was point-to-point (dedicated to a single peripheral), i.e. not shared with the main PCI bus in the system, 2. it allowed multiple data transfers per clock cycle – the same technique that DDR RAM uses to increase throughput without driving the clock faster. With a maximum of 8 transfers per clock and the main clock running at 66MHz, the maximum transfer speed comes out as 16Gib/s. │ PCI Express │ │ • the current high-speed peripheral bus for PC │ • builds on / «extends» conventional PCI │ • point-to-point, «serial» data interconnect │ • much improved «throughput» (up to ~30GB/s) We have finally reached the present day. The modern successor to PCI moved away from synchronous parallel data transmission and from a shared bus, allowing for a drastic performance increase. Even though multiple wires are used for data transfer, they are self-clocked (clock is part of the data signal) and hence asynchronous to each other. Each wire is called a ‘lane’ and a single peripheral can use up to 16 lanes. Low-bandwidth devices only need a single lane, saving on power requirements and manufacturing cost. At the time of this writing, devices targeting PCIe 4.0, with 16GT (billion transfers) per second on each lane, are commonly available. This translates to a maximum per-device bandwidth of about 256Gib/s (compare to AGP at 16Gib/s) or 32GiB/s in a 16-lane configuration. The next revision, PCIe 5.0 (final spec released in 2019) doubles the per-lane transfer rate to 32GT/s, for a per-device maximum of 64GiB/s. Software-wise, PCIe is backward-compatible with PCI, using an extended version of the PCI enumeration and configuration protocol. Additionally, PCIe allows the configuration to use MMIO instead of port-mapped IO, exposing a single 4KiB page of configuration data per endpoint (peripheral). │ USB: Universal Serial Bus │ │ • primarily for «external» peripherals │ ◦ keyboards, mice, printers, ... │ ◦ replaced a host of «legacy ports» │ • later revisions allow «high-speed» transfers │ ◦ suitable for storage devices, cameras &c. │ • device enumeration, capability «negotiation» PCI brought software-driven enumeration and configuration to the permanently attached, internal peripherals (graphics hardware, storage, network interfaces, and so on). USB did the same for externally-attached devices, like keyboards, mice, printers, scanners and so on. Earlier systems used comparatively ‘dumb’ buses for the same purpose. The user had to select a driver by hand and configure the driver (tell it which external port the device is attached to). With USB, the devices would identify themselves using a device-neutral protocol, just like with PCI. The host system can then load and configure the correct driver automatically. Moreover, USB supports hotplug, so this can happen whenever the user plugs in a device. Finally, the bandwidth available on USB, even in its first revision, was much higher than the earlier standards (RS-232, PS/2). Later USB revisions considerably increased both data transmission speed and the power available to the attached peripheral. The current highest speed available to USB devices (in USB 3.2 Gen 2 mode with 2 lanes, over USB-C connectors) is 20Gib/s, exceeding the maximal transfer speeds of AGP, the fastest internal bus available in consumer hardware before PCIe.
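This ID-based enumeration is easy to observe from user space. The following sketch uses the libusb-1.0 library to list the vendor and product IDs of attached USB devices – the same IDs the OS uses to pick a driver; error handling is elided.

```c
#include <stdio.h>
#include <libusb-1.0/libusb.h>

int main(void)
{
    libusb_device **devs;
    libusb_init(NULL);                    /* use the default context */
    ssize_t n = libusb_get_device_list(NULL, &devs);
    for (ssize_t i = 0; i < n; ++i) {
        struct libusb_device_descriptor desc;
        libusb_get_device_descriptor(devs[i], &desc);
        printf("%04x:%04x class %02x\n",  /* vendor:product, class code */
               desc.idVendor, desc.idProduct, desc.bDeviceClass);
    }
    libusb_free_device_list(devs, 1);
    libusb_exit(NULL);
    return 0;
}
```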
│ USB Classes │ │ • a set of «vendor-neutral» protocols │ • HID = human-interface device │ • mass storage = disk-like devices │ • audio equipment │ • printing USB comes with additional standardization, with so-called device «classes». Each class constitutes a vendor-neutral protocol for a particular type of devices: • HID (human-interface device), e.g.: ◦ keyboards, ◦ mice, ◦ game controllers, ◦ small character-based displays, ◦ pretty much anything with a button. • mass storage (persistent memory, usually with a file system): ◦ flash ‘pen’ drives, ◦ external hard drives or SSDs, ◦ optical drives, ◦ card readers, … • audio devices, e.g.: ◦ headsets (headphones with a microphone), ◦ sound cards, ◦ active loudspeakers, ◦ standalone microphones, ◦ MIDI devices, • MTP (media transfer protocol), ◦ smartphones, ◦ portable media players. • printers, • video (webcams, digital microscopes). Essentially, none of the devices in the above list need vendor-specific drivers to operate. Instead, a single ‘class’ driver which implements the respective protocol can talk to any peripheral which belongs to that class. A single physical peripheral may provide multiple virtual devices, possibly in different classes (e.g. a portable recorder which can appear both as an audio device – a microphone – and as a storage device). │ Other USB Uses │ │ • scanners │ • ethernet adapters │ • usb-serial adapters │ • wifi adapters (dongles) │ • bluetooth In addition to the standard device classes, there are many USB devices which do not fit one of those categories. These will use a vendor-specific protocol and will require a corresponding device-specific driver. │ Bluetooth │ │ • a «wireless» alternative to USB │ • allows «short-distance» radio links with «peripherals» │ ◦ input (keyboards, mice, game controllers) │ ◦ audio (headsets, speakers) │ ◦ data transmission (e.g. smartphone sync) │ ◦ gadgets (watches, heartrate monitoring, GPS, ...) While bluetooth is not a bus as such (being wireless), it behaves much like USB from the point of view of software (with additional complexity related to device pairing, security and unreliable data transmission). Many device types that can be attached via USB can also be attached with bluetooth (wireless keyboards, mice, headsets, speakers, and so on). │ ARM Buses │ │ • ARM is typically used in System-on-a-Chip designs │ • those use a «proprietary» bus to connect peripherals │ • there is less need for enumeration │ ◦ the entire system is baked into a single chip │ • the peripherals can be «pre-configured» The ARM ecosystem is somewhat different from the PC one. It is common that ARM devices are ‘system on a chip’ designs, where most, if not all, peripherals are part of a single chip together with CPU cores, memory controller, and interconnect (system bus). SoC vendors usually prepare operating system images or kernel builds (typically of Android) that work on their system. Software-driven enumeration and autoconfiguration is much less important and is typically not supported. Peripherals typically included are a graphics core, a USB controller, wifi, ethernet, bluetooth controller, audio controller, NFC, storage controller (eMMC) and perhaps a few others. │ USB and PCIe on ARM │ │ • neither USB nor PCIe are exclusive to the PC platform │ • most ARM SoCs support USB devices │ ◦ for slow and medium-speed off-SoC devices │ ◦ e.g.
used for «ethernet» on RPi 1 │ • some ARM SoCs support PCI Express │ ◦ this allows for «high-speed» off-SoC peripherals However, not all ARM processors are designed for ‘sealed’ devices like smartphones or smart TVs. ARM-based general-purpose hardware includes single-board computers (like Raspberry Pi, Beaglebone, …) but also laptops (new generation of Apple hardware) and servers (Ampere Altra). Those systems often need more connectivity and extensibility and will provide PCI Express for connecting to high-speed peripherals. │ PCMCIA & PC Card │ │ • People Can't Memorize Computer Industry Acronyms │ ◦ PC = Personal Computer, MC = Memory Card │ ◦ IA = International Association │ • «hotplug»-capable notebook «expansion» bus │ • used for memory cards, network adapters, modems │ • comes with its own set of drivers (cardbus) Back to history: until a decade ago, it was common that laptop computers had expansion slots, a bit like traditional desktops. Of course, a standard-size expansion card has no chance of fitting in a laptop, hence special connectors and/or buses. One of the oldest was PCMCIA, with credit-card-sized (but thicker) peripherals that could be hot-plugged into a bay on the side of a laptop (i.e. the device would be hidden inside the laptop body, unlike various USB dongles with a mess of wires). │ ExpressCard │ │ • an «expansion card» standard like PCMCIA / PC Card │ • based on PCIe and USB │ ◦ can mostly «re-use» drivers for those standards │ • not in wide use anymore │ ◦ last update was in 2009, introducing USB 3 support │ ◦ the industry association «disbanded» the same year ExpressCard is a more modern version of the same idea and a similar form factor, with USB and PCIe in the backend. Modern laptops, however, no longer offer this functionality and the association responsible for ExpressCard was disbanded over a decade ago. │ miniPCIe, mSATA, M.2 │ │ • those are «physical interfaces», not special buses │ • they provide some mix of PCIe, SATA and USB │ ◦ also other protocols like I²C, SMBus, ... │ • used mainly for compact SSDs and wireless │ ◦ also GPS, NFC, bluetooth, ... What survives are connectors for «internal» devices in a small form factor: mainly for SSDs, but also for wifi adapters, bluetooth and similar modules. These are common in laptops and mini-ITX (small desktop) systems. Depending on the particular connector standard (and variant), it will provide a variety of bus connections, including PCIe (up to 4 lanes) and USB.

## Graphics and GPUs

Graphics hardware was always a very important part of both home computers and professional workstations. Often, it is also the most demanding peripheral in those applications, and the most complex. │ Graphics Cards │ │ • initially just a device to «drive displays» │ • reads pixels from «memory» and provides «display» signal │ ◦ basically a DAC with a clock │ ◦ the memory can be part of the graphics card │ • evolved «acceleration» capabilities Originally, a graphics card would simply contain some fast static memory (frame buffer), a clock and a digital-to-analog converter (DAC), which would drive a CRT display (cathode ray tube). The displays of the era worked by pointing an electron gun (using electromagnets) at individual pixels in rapid succession while modulating the voltage between the cathode and anode (essentially a conductive coating of the inside of the screen) to attain corresponding brightness on each pixel.
The graphics card would generate the signal driving this modulation, in step with the advancing electron gun. The memory of the graphics card would contain digital information about the brightness of each pixel. Typical refresh rates would be in the 30-120 Hz range for the entire screen. For a VGA screen (640 columns, 480 rows) at 70 Hz, this works out to about 20 MHz (20 million pixels per second). The three component colours are transmitted in parallel. │ Graphics Accelerator │ │ • allows common «operations» to be done in «hardware» │ • like drawing lines or filled «polygons» │ • the pixels are computed directly in video RAM │ • this can «save» considerable «CPU time» Composing a picture to be displayed on screen can take a lot of computation and/or memory traffic. If some of those operations are performed by dedicated hardware instead of the main CPU, this can drastically improve performance, since the CPU is free to do other things while the graphics hardware asynchronously performs the simple, repetitive tasks. There are two main classes of operations that can be easily accelerated using dedicated hardware: • rasterization of geometric shapes such as lines, rectangles, polygons or curves (vector graphics) – those are used in, for instance, graphical user interfaces and in vector drawing programs or 2D computer-aided design systems, • bulk pixel operations, such as flood fill or bit blitting¹ mainly used in raster graphics (e.g. video games). Since essentially each pixel (or at best a small block of pixels) needs at least one memory write, and for a CPU, memory writes are expensive (lots of waiting for slow memory), such operations are especially wasteful on the CPU. Even worse if data (textures, sprites) need to be read from memory and written back elsewhere, perhaps after performing a simple operation on the pixels. ¹ A memory copy with some additional logic: it operates on pixels (instead of bytes) in various formats (e.g. 2 or 8 pixels per byte) and can deal with transparent pixels which are skipped (allows drawing non-rectangular shapes over an existing background). │ 3D Graphics │ │ • rendering 3D scenes is «computationally intensive» │ • CPU-based, «software-only» rendering is possible │ ◦ texture-less in early flight simulators │ ◦ bitmap textures since '95 / '96 (Descent, Quake) │ • CAD workstations had 3D accelerators (OpenGL '92) While 2D graphics takes a lot of resources (at least in terms of the capabilities of older hardware), it is essentially free compared to 3D graphics, where computing each output pixel can take hundreds of operations, some of which are geometric and others which are raster-based. Hence, the potential for hardware acceleration of 3D graphics is considerably higher than with 2D graphics, but the hardware to do so is much more complicated. │ GPU (Graphics Processing Unit) │ │ • a term coined by Sony in '94 (the GPU in PlayStation) │ • originally a purpose-built «hardware renderer» │ ◦ based on polygonal meshes and Z buffering │ • increasingly more «flexible» and «programmable» │ • on-board RAM, high-speed connection to system RAM First GPUs were essentially hardware built for rasterization of 3D geometry, supplied as a polygonal (triangular) mesh with textures attached to the faces. The hardware would then compute visibility and lighting to produce a raster image to be displayed on screen. The CPU would prepare the geometry for each frame which the GPU would then render and display. 
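The ‘pixels in memory’ model is still directly visible in the software-only framebuffer interface of current systems. A Linux-specific sketch which paints the whole screen a single colour follows; it assumes that ‹/dev/fb0› exists and that the display is in a 32-bit-per-pixel mode, and error handling is elided.

```c
#include <fcntl.h>
#include <linux/fb.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/fb0", O_RDWR);
    struct fb_var_screeninfo v;
    struct fb_fix_screeninfo f;
    ioctl(fd, FBIOGET_VSCREENINFO, &v); /* resolution and depth */
    ioctl(fd, FBIOGET_FSCREENINFO, &f); /* line stride in bytes */

    size_t size = (size_t)f.line_length * v.yres;
    uint32_t *px = mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    for (uint32_t y = 0; y < v.yres; ++y)       /* fill every pixel */
        for (uint32_t x = 0; x < v.xres; ++x)
            px[y * (f.line_length / 4) + x] = 0x00336699;
    munmap(px, size);
    close(fd);
    return 0;
}
```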
Each generation of GPUs brings more flexibility and programmability, allowing for acceleration of lots of different effects without hard-coding them in hardware. Contemporary GPUs are essentially fully programmable general-purpose vector processors, with registers, memory, control flow and so on. │ GPU Drivers │ │ • split into a number of components │ • graphics output / frame buffer access │ • «memory management» is often done in kernel │ • geometry, textures &c. are prepared «in-process» │ • front end API: OpenGL, Direct3D, Vulkan, ... A typical GPU driver is split into a number of components, some of which reside in the kernel (frame buffer setup, memory management) while the more complex parts are libraries linked into client applications (geometry and texture processing, shader compilation). │ Shaders │ │ • current GPUs are «computation» devices │ • the GPU has its own machine code for «shaders» │ • the GPU driver contains a «shader compiler» │ ◦ either all the way from a high level language (HLSL) │ ◦ or starting with an intermediate code (SPIR) Since modern GPUs are really just vector processors in disguise, they run programs in their own machine code. The driver then compiles higher-level programs which are part of the software (e.g. a computer game or a 3D game engine) into the hardware-specific machine language. While the output is very device-specific, the input (which is what the application gives to the driver) is mostly standardized, with two main options being HLSL (High-Level Shader Language) and SPIR (Standard Portable Intermediate Representation). │ Mode Setting │ │ • deals with «screen» configuration and «resolution» │ • including support for e.g. «multiple displays» │ • usually also supports primitive (SW-only) «framebuffer» │ • often in-kernel, with minimum user-level support While there are a lot of bells and whistles on a modern GPU, there are some boring tasks which have not really changed in the last 2-3 decades, like display configuration. It's common that current computers can attach multiple displays, and each needs to be given a resolution, color depth, refresh rate &c., together known as a graphics ‘mode’. This is the task of the ‘mode setting’ part of a graphics driver. │ Graphics Servers │ │ • multiple apps cannot all drive the graphics card │ ◦ the graphics hardware needs to be «shared» │ ◦ one option is a «graphics server» │ • provides an IPC-based «drawing» and/or «windowing» API │ • performs «painting» on behalf of the applications While not drivers themselves, graphics servers form an important part of the graphics stack (on systems which use one). The problem here is that only one program can meaningfully draw on any given screen, but we usually want to show the output of more than a single program. One option is a graphics server, which hands out regions (rectangular windows, typically) into which programs can paint using its API. │ Compositors │ │ • a more direct way to share graphics cards │ • each application gets its «own buffer» to paint into │ • painting is mostly done by a (context-switched) GPU │ • the individual buffers are then «composed» onto screen │ ◦ composition is also hardware-accelerated The other common approach is to use a «compositor», which differs crucially from a graphics server in one thing: how the individual applications paint their content. In a graphics server, there is a painting API which the program calls to display shapes and pixmaps on screen.
With a compositor, each program gets an «off-screen buffer» (pixmap) into which it can paint by directly interacting with the driver of the graphics hardware. The compositor then combines those buffers into a single picture which is shown to the user (again by making appropriate calls into the graphics driver). In typical usage, each window corresponds to a single buffer. │ GP-GPU │ │ • general-purpose GPU (CUDA, OpenCL, ...) │ • used for «computation» instead of just graphics │ • basically a return of vector processors │ • close to CPUs but not part of normal OS scheduling As we have mentioned earlier, contemporary GPUs are really general-purpose vector processors and can be used for purely computational tasks that have nothing to do with graphics (machine learning is a popular application, but anything that benefits from massive SIMD is a good candidate).

## Persistent Storage

In this section, we will look at bulk storage devices – those that usually carry file systems and which retain the stored data while offline (disconnected from power). │ Drivers │ │ • split into adapter, bus and device drivers │ • often a single driver per device type │ ◦ at least for disk drives and CD-ROMs │ • bus «enumeration» and «configuration» │ • data addressing and «data transfers» Storage devices have traditionally had their own dedicated, specialized bus. The host side of this bus is implemented by an «adapter» (controller) which is connected to a system bus (PCI, PCIe) on one side and to the storage bus on the other. Individual storage devices are then connected to this dedicated bus. This hardware structure essentially dictates the driver structure: the bus is usually standardized and comes with a set of protocols, just like system buses that we discussed earlier do. However, for any given bus, there might be many different adapter models made by different vendors. In some cases, they use a common protocol, but in other cases, device-specific drivers are required to configure them. Like with USB, on any given storage bus, there is considerable standardization among the storage devices themselves (endpoints), and a single ‘class’ driver is sufficient (a HDD driver, a CD-ROM driver, a tape unit driver, …). │ IDE / ATA │ │ • Integrated Drive Electronics │ ◦ disk controller becomes part of the disk │ ◦ standardised as ATA-1 (AT Attachment ...) │ • based on the ISA bus, but with cables │ • later adapted for non-disk use via ATAPI One of the oldest «standardized» storage buses was IDE (vendor name, later standardized as ATA). This is essentially an ISA bus with cabling, hence the adapter, if connected to the host ISA bus, was especially simple. However, later revisions of the ATA (now known as Parallel ATA) spec diverged from ISA due to much higher speeds that were eventually required. The ATA family of buses did not switch to using PCI internally and the storage bus and system bus evolved separately, even if along similar lines. │ ATA Enumeration │ │ • each ATA «interface» can attach only 2 drives │ ◦ the drives are HW-configured as master/slave │ ◦ this makes enumeration quite simple │ • multiple ATA interfaces were standard │ • no need for specific HDD drivers Since most implementations offer exactly 4 connectors (2 interfaces, each capable of connecting 2 drives), enumeration is not much of an issue. Each interface has a standard set of IO ports (for port-mapped IO). The system uses those ports to send 2 ‹IDENTIFY› commands on each interface, one for the master and the other for the slave device.
This completes the enumeration. │ PIO vs DMA │ │ • original IDE could only use «programmed» IO │ • this eventually became a serious «bottleneck» │ • later ATA revisions include «DMA» modes │ ◦ up to 160MB/s with highest DMA modes │ ◦ compare 1900MB/s for SATA 3.2 │ SATA │ │ • «serial», point-to-point replacement for ATA │ • hardware-level incompatible to (parallel) ATA │ ◦ but SATA inherited the ATA «command set» │ ◦ legacy mode lets PATA drivers talk to SATA drives │ • hot-swap capable – replace drives in a «running system» Like other interfaces, storage systems made a transition to serial data links. For ATA, the result is known as SATA or Serial ATA. The newer standard retains software-level backward compatibility with Parallel ATA: if the controller is in ‘legacy mode’, it will emulate a PATA host controller and work with legacy PATA drivers. However, this PATA-compatible mode necessarily hides new features (ability to connect more drives, hotswap, native command queuing). │ AHCI (Advanced Host Controller Interface) │ │ • «vendor-neutral» interface to SATA controllers │ ◦ in theory only a single ‘AHCI’ driver is needed │ • an alternative to ‘legacy mode’ │ • NCQ = Native Command Queuing │ ◦ allows the drive to re-order requests │ ◦ another layer of IO scheduling Most SATA host controllers implement the AHCI standard and hence don't need device-specific drivers. Running the controller in AHCI mode is required to make use of new SATA technologies, such as NCQ (native command queuing) and hotswap. While attempts were made to add command queuing to PATA, those were ultimately unsuccessful, due to insufficient DMA capabilities of the old ISA-based system (with a 3rd-party DMA controller). Since SATA drives perform DMA themselves, NCQ has much better performance. │ ATA and SATA Drivers │ │ • the host controller (adapter) is mostly vendor-neutral │ • the «bus driver» will expose the ATA command set │ ◦ including support for «command queuing» │ • device driver uses the bus driver to talk to devices │ • partially re-uses SCSI drivers for ATAPI &c. │ SCSI (Small Computer System Interface) │ │ • originated with minicomputers in the 80's │ • more complicated and «capable» than ATA │ ◦ ATAPI basically encapsulates SCSI over ATA │ • device «enumeration», including «aggregates» │ ◦ e.g. entire enclosures with many drives │ • also allows CD-ROM, tapes, scanners (!) A different storage bus, called SCSI, has been in parallel use with ATA, mainly targeting servers and high-end hardware in general. The overall structure is the same as with ATA: there is an adapter (called HBA – host bus adapter – in SCSI jargon), a bus with a set of protocols, and an array of storage devices attached to the storage bus. Unlike Parallel ATA, the SCSI bus can attach many more devices and those devices can have additional internal structure (e.g. it's possible to attach a SATA RAID controller with a dozen disks as a single ‘composite’ SCSI endpoint). For this reason, it has advanced software-based enumeration and configuration capabilities: the HBA will ‘scan’ the storage bus to discover devices and report them to the operating system. SCSI also commonly supports hotplugging devices (i.e. attaching and detaching devices while the system is running). Also unlike ATA, external SCSI connectors and cabling are common. Like ATA (and like system buses) SCSI used a parallel design for a long time, but modern versions use high-speed serial links instead. The technology is known as SAS, Serial-Attached SCSI. 
SAS can optionally use a SATA-compatible connector (and SAS adapters with such connectors will work with SATA drives, but not vice versa). │ SCSI Drivers │ │ • split into: a host bus adapter (HBA) driver │ • a generic SCSI bus and command component │ ◦ often re-used in both ATAPI and USB storage │ • and per-«device» or per-class drivers │ ◦ optical drives, tapes, CD/DVD-ROM │ ◦ standard disk and SSD drives While SCSI «hardware» is somewhat uncommon, the protocols it uses are in widespread use. Both ATAPI devices (attached over PATA or SATA) and USB storage devices use SCSI as their command protocol. Additionally, Fibre Channel (FC, a storage-area network technology) and InfiniBand (IB, a high-speed, low-latency interconnect) offer SCSI implementations. This essentially means that the same ‘class’ driver can be used for storage devices attached to SATA, USB, SAS, FC, IB or ethernet (via iSCSI, see below), with an appropriate glue layer. │ iSCSI │ │ • basically SCSI over TCP/IP │ • entirely «software-based» │ • allows standard computers to serve as «block storage» │ • takes advantage of fast cheap ethernet │ • re-uses most of the «SCSI driver stack» The SCSI protocol can also be encapsulated in TCP/IP and transported using, for instance, ethernet. This approach allows SCSI endpoints to be implemented in software: instead of specialized hardware, a RAID enclosure (a box with many disks combined into one or a few logical drives using RAID) can be implemented as a commodity x86 server with an ethernet connection. This is sufficient for many use cases, while being significantly cheaper than ‘native’ storage-area networks (fibre channel, infiniband), or even standard externally-connected SAS. │ NVMe: Non-Volatile Memory Express │ │ • a fairly simple protocol for PCIe-attached storage │ • optimised for SSD-based devices │ ◦ much bigger and more numerous «command queues» than AHCI │ ◦ better / faster interrupt handling │ • stresses «concurrency» in the kernel block layer A ‘return to the roots’ technology: what ATA was to ISA, NVMe is to PCIe. Essentially a protocol on top of PCIe interconnect, re-using PCIe enumeration and configuration. The protocol calls for rather massive command queues, taking advantage of the correspondingly massive parallelism in the SSD hardware. NVMe storage is usually very fast and the block layer, originally designed for much slower devices, may struggle to keep up. │ USB Mass Storage │ │ • a USB device class (vendor-neutral protocol) │ ◦ one driver for the entire class │ • typically USB «flash drives», but also external «disks» │ • USB 2 is not suitable for high-speed storage │ ◦ USB 3 introduced UAS = USB-Attached SCSI As mentioned earlier, storage devices can also be directly attached to USB. │ Tape Drives │ │ • unlike disk drives, only allow «sequential» access │ • needs support for media «ejection», «rewinding» │ • can be attached with SCSI, SATA, USB │ • parts of the driver will be «bus-neutral» │ • mainly for data «backup», capacities 6-15TB While disk-like devices (HDDs, SSDs, RAID enclosures) are by far the most important, there are other storage devices worth mentioning. Data centers will often use tape drives for backups, since they offer excellent data density, low price per gigabyte stored and good durability. From an OS standpoint, tapes are special since they can only be accessed sequentially, and it doesn't make sense to put a traditional file system on them. Instead, specialized programs are used to prepare data for writing on a tape, e.g. ‹tar› (short for Tape ARchive).
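Returning briefly to the SCSI command set, its ubiquity is easy to demonstrate from user space. The following Linux-specific sketch sends an ‹INQUIRY› command through the generic ‹sg› driver; the same code works whether ‹/dev/sg0› happens to be a SAS disk, a USB stick or an iSCSI volume. Error handling is elided.

```c
#include <fcntl.h>
#include <scsi/sg.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

int main(void)
{
    unsigned char cdb[6] = { 0x12, 0, 0, 0, 96, 0 }; /* INQUIRY, 96 bytes */
    unsigned char buf[96], sense[32];
    struct sg_io_hdr io;

    int fd = open("/dev/sg0", O_RDONLY);   /* first generic SCSI device */
    memset(&io, 0, sizeof io);
    io.interface_id = 'S';
    io.cmd_len = sizeof cdb;
    io.cmdp = cdb;
    io.dxfer_direction = SG_DXFER_FROM_DEV; /* data flows device → host */
    io.dxferp = buf;
    io.dxfer_len = sizeof buf;
    io.sbp = sense;                         /* error details, if any */
    io.mx_sb_len = sizeof sense;
    io.timeout = 5000;                      /* milliseconds */
    ioctl(fd, SG_IO, &io);
    printf("vendor: %.8s model: %.16s\n", buf + 8, buf + 16);
    return 0;
}
```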
│ Optical Drives │ │ • mainly used as a «read-only» distribution medium │ • laser-facilitated reading of a rotating disc │ • can be again attached to SCSI, SATA or USB │ • conceived for «audio playback» → very slow seek Another somewhat special class of storage devices are optical drives: CD-ROM, DVD-ROM, Blu-ray. While random access is possible, it is very slow even compared to HDDs. Optical drives are more suitable for streaming (mainly audio and video) or content distribution. Unlike tapes, (read-only) file systems are commonly used on optical media (ISO 9660 for CD-ROM, UDF for DVD and Blu-ray). │ Optical Disk Writers (Burners) │ │ • behaves more like a «printer» for optical «disks» │ • drivers are often done in «user space» │ • attached by one of the standard «disk buses» │ • «special programs» required to burn disks │ ◦ alternative: packet-writing drivers

## Networking and Wireless

The last category of devices that we will discuss in this lecture are network interface cards. Please note that this is only an overview of network hardware that can be attached to a general-purpose computer – networking in general will be discussed in the next lecture. │ Networking │ │ • networks allow «multiple computers» to exchange «data» │ ◦ this could be files, streams or messages │ • there are «wired» and «wireless» networks │ • we will only deal with the «lowest layers» for now │ • NIC = Network Interface Card Network hardware allows computers to directly communicate with each other, using some sort of interconnect, either wired or wireless. A computer connects to the network using a «network interface card», typically a PCIe device with an external connector (e.g. RJ-45 for metallic ethernet), or an antenna (for wireless tech). A computer network as a whole resembles a bus of the kind we have discussed in the first part of the lecture, though with some crucial differences. │ Ethernet │ │ • specifies the «physical» medium │ • «on-wire» format and «collision» resolution │ • in modern setups, mostly «point-to-point» links │ ◦ using active «packet switching» devices │ • transmits data in «frames» (low-level packets) Like with system buses, networks have evolved away from shared media (token ring, coaxial 10Mbit ethernet, twisted-pair ethernet with passive hubs). Modern networks use dedicated point-to-point links, with packet-switching hardware at hubs where a number of point-to-point links meet. Ethernet ‘packets’ are called frames and are transmitted as a single unit. Each frame has some metadata (sender, recipient, size) and of course carries some data (payload). │ Addressing │ │ • at this level, only «local» addressing │ ◦ at most a single LAN segment │ • uses baked-in MAC addresses │ ◦ MAC = Media Access Control │ • addresses belong to «interfaces», not computers Lowest-level addressing only works within a single ethernet segment (broadcast domain). All computers know the MAC addresses of all other computers that they wish to talk to (or rather of their network interface cards). In old shared-medium networks, the frame would be transmitted on the shared medium and picked up by the intended recipient based on the target address. In a packet-switched network, the switch will keep a mapping of MAC addresses to physical ports, and only retransmit frames on the port to which the intended recipient is attached.
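For reference, the frame header layout can be written down as a C structure (a sketch; the preamble and the trailing frame check sequence are handled by the hardware and never appear in memory):

```c
#include <stdint.h>

/* Layout of an ethernet frame header: 6-byte destination and source
 * MAC addresses, then a 16-bit (big-endian) type field; the payload
 * follows directly after. */
struct eth_header {
    uint8_t  dst[6];    /* destination MAC address */
    uint8_t  src[6];    /* source MAC address */
    uint16_t ethertype; /* payload type, e.g. 0x0800 = IPv4 */
} __attribute__((packed));
```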
│ Transmit Queue │ │ • «packets» are picked up from «memory» │ • the OS «prepares» packets into the transmit «queue» │ • the device picks them up «asynchronously» │ • similar to how SATA queues commands and data

When the OS wants to send packets (frames) over the network, they are appended to a «transmit queue» (also known as a Tx queue), where the hardware picks them up and transmits them over its physical connection. The queue works approximately like this:

1. each queue (there can be more than one) has a pair of «registers» accessible through MMIO, one for the «head pointer» and another for the «tail pointer»,
2. the pointers hold addresses into a «ring buffer» of a fixed size, stored in the main memory and accessed through DMA; each item in the ring is, again, a pointer, along with a size, and describes a memory buffer holding a single frame (packet),
3. the head and tail pointer split the ring into two parts, one that belongs to the NIC and one that belongs to the software,
4. the operating system (via the driver) controls the «tail pointer» in the device register:
   a. to send a packet, it will create a buffer and store the packet data in that buffer,
   b. it will fill in the first cell in the OS-controlled part of the ring with the address and size of this buffer,
   c. it will shift the tail pointer, handing over the newly filled-in cell to the NIC,
5. the network card controls the «head pointer»: whenever it processes a packet, it'll shift the head pointer so that the processed buffer is now in the OS-controlled part of the ring.

As outlined in the first part of the lecture, events related to the transmit ring can be signalled via interrupts. (A code sketch of this ring discipline follows the discussion of multi-queue adapters below.)

│ Receive Queue │ │ • data is also «queued» in the other direction │ • the NIC copies packets into a «receive queue» │ • it invokes an «interrupt» to tell the OS about new items │ ◦ the NIC may batch multiple packets per interrupt │ • if the queue is not cleared quickly → «packet loss»

The receive (Rx) queue works analogously. Interrupts signal newly appended items. The OS is in charge of allocating buffers for packets: handing off a buffer to the NIC on the Rx queue means that the NIC is free to overwrite the buffer with packet data. After it does so, the Rx ring cell is handed back to the OS. In the common case, all frame (packet) buffers must be large enough to hold the biggest possible frame (whose size is given by the MTU = Maximum Transmission Unit), though at least some NICs can split incoming packets over multiple Rx cells if they don't fit in a single buffer. If an Rx ring fills up while packets continue to arrive on the interface, packets will be lost (hence the OS must clear the Rx ring sufficiently quickly). The packets don't need to be processed immediately: the OS is free to allocate new buffers and put those on the ring, instead of re-using the existing buffers. The filled buffers can then be processed and reclaimed later.

│ Multi-Queue Adapters │ │ • fast adapters can «saturate» a CPU │ ◦ e.g. 10GbE cards, or multi-port GbE │ • these NICs can manage «multiple» Rx and Tx queues │ ◦ each queue gets its own interrupt │ ◦ different queues → possibly different «CPU cores»

Contemporary network adapters can send and receive packets so quickly that a single CPU core cannot keep up (since there is typically a lot of work to be done for each packet as it bubbles up through the network stack and into user space). Those same adapters can be configured to use multiple Tx and Rx queues (rings), each with their own head/tail registers and interrupt.
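As promised, a sketch of the Tx side of the descriptor ring, under assumed register and descriptor layouts (real NICs differ in the details, and a real driver would additionally need memory barriers, buffer-lifetime management and locking):

```c
#include <stdint.h>

/* One slot of a hypothetical Tx descriptor ring: the DMA address of a
 * buffer holding one frame, plus the frame's length. */
struct tx_desc {
    uint64_t buf_addr;           /* physical address of the frame buffer */
    uint16_t len;                /* frame length in bytes */
};

#define RING_SIZE 256            /* must match what the NIC was told */

struct tx_ring {
    struct tx_desc desc[RING_SIZE];  /* the ring itself, in main memory */
    volatile uint32_t *head_reg;     /* MMIO register, advanced by the NIC */
    volatile uint32_t *tail_reg;     /* MMIO register, advanced by the driver */
    uint32_t tail;                   /* driver's cached copy of the tail */
};

/* Hand one frame over to the NIC: fill the first free cell, then
 * publish it by advancing the tail register (step 4 above). */
static int tx_enqueue(struct tx_ring *r, uint64_t frame_pa, uint16_t len)
{
    uint32_t next = (r->tail + 1) % RING_SIZE;
    if (next == *r->head_reg)    /* ring full: the NIC hasn't caught up */
        return -1;
    r->desc[r->tail].buf_addr = frame_pa;
    r->desc[r->tail].len = len;
    /* a real driver would put a write barrier here, so the descriptor
     * is visible in memory before the tail update reaches the device */
    r->tail = next;
    *r->tail_reg = next;         /* the MMIO write hands the cell over */
    return 0;
}
```

The NIC, for its part, walks the ring from the head, DMAs each buffer out onto the wire and advances the head register (step 5), eventually raising an interrupt so that the driver can reclaim the transmitted buffers.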
Configuring these multiple queues is up to the OS – a typical setup would use a single Tx/Rx pair per CPU core. For transmission, the NIC simply interleaves packets from all the queues, since the OS decides which queue to use for sending a particular packet. It'll typically just use the one associated with the CPU core performing the operation. For reception, the story is slightly more complicated, since the NIC has to decide which queue to use. The NIC can be configured to filter or hash (parts of) incoming packets and select an Rx queue based on the result. The goal is to keep related packets in the same queue (which improves locality), but also to keep all queues busy (which improves load balancing).

│ Checksum and TCP Offloading │ │ • more advanced adapters can «offload» certain features │ • e.g. computation of mandatory packet «checksums» │ • but also TCP-related features │ • needs both «driver» support and «TCP/IP stack» support

To speed up packet processing, some per-packet tasks can be performed in hardware. Computing and verifying checksums is the most commonly offloaded task: packet headers often contain a checksum to detect data corruption. Those checksums can usually be computed in hardware very quickly, and it's a waste of CPU cycles to do it in software. Hence, when a packet is stored in the Tx queue, the checksum fields are left blank and the hardware fills them in before transmitting the packet (this applies to higher-level protocol checksums, e.g. TCP; ethernet frame checksums are always computed in hardware; a software sketch of such a checksum appears after the review questions). While by far the simplest, checksum offloading is not the only task that can be done in hardware; some others include:

• cryptography (IPsec) offloading: authentication headers, payload encryption and decryption,
• large send offload and receive segment coalescing: segmentation and reassembly of large TCP packets (i.e. those that do not fit into the link MTU),
• UDP segmentation (splitting up UDP packets which do not fit into the MTU of the NIC).

│ WiFi │ │ • «wireless» network interface – ‘wireless ethernet’ │ • «shared» medium – electromagnetic waves in air │ • (almost) mandatory «encryption» │ ◦ otherwise easy to «eavesdrop» or even actively «attack» │ • a very «complex» protocol (relative to hardware standards) │ ◦ assisted by «firmware» running on the adapter

Compared to the relative simplicity of wired networks, WiFi is extremely complicated, due to the nature of its medium, which is shared, noisy, easily eavesdropped and generally unreliable. Devices which connect to WiFi networks are often portable and need to maintain connectivity as they move about, switching between access points or even networks. Due to pervasive encryption, clients and access points need to authenticate each other and establish session keys. Authentication is required because otherwise an active attacker could trick a client into connecting to their device and become a ‘man in the middle’, rendering the encryption ineffective. Since authentication is required anyway, it often doubles as an access control measure. Aspects of WiFi-related protocols are implemented in hardware, firmware (software running on the adapter) and software (running on the main CPU).

│ Review Questions │ │ 25. What is memory-mapped IO and DMA? │ 26. What is a system bus? │ 27. What is a graphics accelerator? │ 28. What is a NIC receive queue?
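Finally, the checksum sketch promised in the offloading discussion above: the Internet checksum (RFC 1071) used by IPv4, TCP and UDP is a 16-bit one's-complement sum over the data, and is exactly the kind of per-packet computation that checksum offloading moves from the CPU into the NIC.

```c
#include <stddef.h>
#include <stdint.h>

/* Internet checksum (RFC 1071): sum the data as big-endian 16-bit
 * words in one's-complement arithmetic, then return the complement. */
static uint16_t inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {            /* sum whole 16-bit words */
        sum += (uint32_t)p[0] << 8 | p[1];
        p += 2;
        len -= 2;
    }
    if (len)                     /* an odd trailing byte is zero-padded */
        sum += (uint32_t)p[0] << 8;
    while (sum >> 16)            /* fold the carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

With offloading enabled, the driver leaves the checksum field blank (and tells the NIC where it is); the hardware then performs the equivalent of this loop at line rate.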