# Device Drivers

Abstracting hardware is one of the major roles of an operating system. While we have already discussed the basic hardware resources (CPU and memory) in detail in previous lectures, so-called peripherals also play an important role. In this lecture, we will look at the interface between the operating system and the peripheral hardware (network interface cards, persistent storage, removable storage, displays, input devices and so on). │ Lecture Overview │ │ 1. Drivers, IO and Interrupts │ 2. System and Expansion Buses │ 3. Graphics │ 4. Persistent Storage │ 5. Networking and Wireless

## Drivers, IO and Interrupts

In the first part, we will discuss the low-level aspects of hardware interaction: how the data moves between the CPU and the peripheral, how peripherals signal events to the CPU, and how this all relates to the operating system (which is running on the CPU). │ Input and Output │ │ • we will mostly think in terms of IO │ • peripherals produce and consume «data» │ • «input» – reading data produced by a device │ • «output» – sending data to a device │ • «protocol» – valid sequences of IO events While peripherals can be rather complicated, we will think of them in an abstract, simplified way, as devices which produce and consume data. The other crucial component in our understanding of devices will be «events». The valid sequences of events and the inputs and outputs tied to those events are described by a «protocol». Data transfers coupled to events (i.e. when they happen in a specific time pattern) can represent a rather wide variety of behaviours and effects. Consider a keyboard: when the user presses a key, that is an event, and is accompanied by a data transfer, which tells the system which key was pressed (or released). Likewise, when a mouse is moved, a stream of data which describes the relative motion is sent to the computer. Other types of devices receive data instead: take a display, as an archetype of that type of device: the computer (more or less continuously) sends data which represents the pixels to show on the screen and which is in turn shown to the user. Yet other devices accept commands (which are also of course a form of data) and again respond with data (responses to the commands). Consider a disk drive: when the system wishes to store some data, it will send a command (along with the payload, i.e. the data to be stored) and receives a confirmation. Likewise, when it wishes to retrieve some data, it sends a read command and receives a reply, which includes the data which was stored at the requested address. │ What is a Driver? │ │ • piece of «software» that talks to a «device» │ • usually quite specific / «unportable» │ ◦ tied to the particular «device» │ ◦ and also to the «operating system» │ • often part of the «kernel» Clearly, the input data needs to be processed and output generated. It is also rather clear that the form and content of the data will be specific to the particular device. Hence the software needs to be able to construct and understand data in the form understood by the particular device. The software in charge of this communication is known as a «driver», and in the light of the above, it is rather clear that any given driver is paired off with a specific device, or a small class of devices. Or, to be more precise, the driver implements one side of the «protocol» (the other side is implemented by the device itself).
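To make the OS-facing side concrete: a kernel typically describes a driver as a table of entry points that the rest of the system calls through. The following is a minimal sketch; the names and the exact shape of the table are invented for illustration, and every real kernel has its own variant.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: one way a kernel can abstract block-device
 * drivers as a table of operations. Names are made up. */
struct block_driver_ops {
    int (*read)(void *dev, uint64_t sector, void *buf, size_t count);
    int (*write)(void *dev, uint64_t sector, const void *buf, size_t count);
    int (*flush)(void *dev);
};

/* The rest of the kernel calls through the table and never needs to
 * know which device (or which device-side protocol) hides behind it. */
static inline int block_read(struct block_driver_ops *ops, void *dev,
                             uint64_t sector, void *buf, size_t count)
{
    return ops->read(dev, sector, buf, count);
}
```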
At first glance, there does not appear to be a good reason why a driver should be specific to a particular operating system: after all, the protocol used by the device will be the same, regardless of the operating system running on the CPU. But of course, the device-side protocol is only one part of the driver: the other part is communication with the operating system. This communication is performed using a set of interfaces that are usually specific to a given operating system (though portable drivers do exist). The other issue that ties drivers to a specific operating system is that drivers usually need to cooperate with each other: later in the lecture, we will see that devices are connected through other devices, and the peripheral driver needs to talk to the bus driver to talk to the peripheral. │ Kernel-mode Drivers │ │ • they are part of the kernel │ • running with full «kernel privileges» │ ◦ including «unrestricted» hardware access │ • no or «minimal» context switching «overhead» │ ◦ fast but dangerous In some sense, the simplest type of driver is one that is part of the kernel. A driver of this type can use all CPU and hardware facilities necessary to communicate with its device directly, without going through a middleman. Since no processes are involved, it also means that the code of the driver can run without context switches whenever necessary (e.g. in response to a hardware «interrupt»). This makes kernel-mode drivers particularly fast (low-overhead), but their unrestricted access to hardware and memory makes any problems in such a driver very serious. If the driver crashes, for instance, it'll usually take the entire operating system with it. │ Microkernels │ │ • drivers are «excluded» from microkernels │ • but the driver still needs «hardware access» │ ◦ this could be a special «memory region» │ ◦ it may need to «react» to «interrupts» │ • in principle, everything can be done «indirectly» │ ◦ but this may be quite «expensive», too While kernel-mode drivers are ubiquitous in monolithic kernel designs, they are all but banished from microkernels. Instead, each driver is a separate process and executes in user mode of the CPU. However, many drivers require some level of direct hardware access in order to communicate with their device: most often interrupt handlers and reads/writes to a specific area of physical memory. The latter can be arranged easily enough (just map that area of memory into the driver process). The former, however, is a problem: interrupt handlers (we will look at those in more detail shortly) always run in privileged mode and hence the driver cannot install one. Instead, the kernel will relay the interrupt to the driver process using some form of IPC, often precipitating an expensive context switch. │ User-mode Drivers │ │ • many drivers can run completely in «user space» │ • this improves «robustness» and «security» │ ◦ driver bugs can't bring the «entire system» down │ ◦ nor can they compromise system «security» │ • possibly at some «cost» to «performance» Drivers running in user mode are not exclusive to microkernels, and while they have downsides, they also have many desirable properties. Since they are isolated from the kernel, from each other and from other programs running on the system, crashes and other bugs in a driver cannot compromise the rest of the system (at least not directly, though if a peripheral is mis-programmed, it may still crash the system or make it otherwise unusable). Of course, security is also significantly improved. 
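The interrupt-relay arrangement described above can be pictured as a simple receive loop in the driver process. This is a purely illustrative sketch: all the names below are invented, and real microkernels (e.g. seL4 or Minix 3) each have their own IPC primitives.

```c
/* Hypothetical main loop of a user-mode driver in a microkernel.
 * ipc_wait, irq_ack, IRQ_ENDPOINT and struct message are all made up. */
struct message { int type; /* ... payload ... */ };

struct message ipc_wait(int endpoint);      /* blocks the process   */
void irq_ack(int endpoint);                 /* re-enables the IRQ   */
void handle_device_event(struct message *); /* pokes mapped MMIO    */

enum { IRQ_ENDPOINT = 1 };

void driver_main(void)
{
    for (;;) {
        /* the kernel turns the hardware interrupt into an IPC message,
         * which costs a context switch into this process */
        struct message msg = ipc_wait(IRQ_ENDPOINT);
        handle_device_event(&msg);
        irq_ack(IRQ_ENDPOINT);
    }
}
```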
│ Drivers in Processes │ │ • user-mode drivers typically run in their own «process» │ • this means «context switches» │ ◦ every time the device demands attention (interrupt) │ ◦ every time «another process» wants to use the device │ • the driver needs «system calls» to talk to the device │ ◦ this incurs even more overhead Let's look at the model where each driver runs in its own process. As we have already mentioned, a major problem in this model is due to additional context switches, which happen when: 1. an interrupt arrives from the hardware and some other process is executing at the time (which is almost always), 2. another process on the system tries to use the device: the request must go through the driver, which means that it needs to run, and hence its process needs to be scheduled before the request can be served. Neither of those happens with kernel-mode drivers. Finally, to perform any privileged operation, the driver must perform a system call – while it is less expensive than a context switch, it is also significantly more expensive than a normal function call. │ In-Process Drivers │ │ • what if (a large portion of) a driver could be a «library» │ • best of both worlds │ ◦ «no» context switch «overhead» for requests │ ◦ bugs and security problems remain «isolated» │ • often used for GPU-accelerated 3D graphics There is an alternative model, which mitigates some of the downsides of user-mode drivers. In particular, the second source of context switches can be (at least partially) eliminated by running the driver in the same process as the application which uses the device. How would this work? The driver can come as a library and the application links to that library. Of course you would want to link the driver dynamically, so that a different driver (e.g. for a different device of the same general type) can be substituted without recompiling the application. There are some issues that need to be resolved with regard to permissions, but in principle, an in-process (library) driver can use the same system calls that a driver running in its own process could. Effects of possible bugs or misbehaviour in the driver are limited to the particular process in which it runs. It is also common that multiple processes can use the same device, each using its own ‘instance’ of the driver. However, this model is not applicable when the driver needs to be protected from the application or the driver needs to perform multiplexing (i.e. it is not possible to have multiple independent instances of the driver talk to the same device, but the device needs to be nevertheless shared by multiple processes). │ Port-Mapped IO │ │ • early CPUs had very limited «address space» │ ◦ 16-bit addresses mean 64KB of memory │ • peripherals got a «separate» address space │ • «special instructions» for using those addresses │ ◦ e.g. ‹in› and ‹out› on ‹x86› processors Let's now look at how the CPU communicates with peripherals and how this affects drivers. Some old CPUs (most famously Intel 8086) had 2 distinct address spaces, one for memory and another for peripherals. The latter could be accessed using special-purpose instructions, which would move values between CPU registers and peripherals. In later iterations of the ‹x86› family, when memory protection (an MMU and privilege levels) was added, those instructions became privileged. Hence, only the kernel can talk to devices which are attached to the CPU through this mechanism.
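For illustration, this is how the ‹in› and ‹out› instructions are commonly wrapped in C, a minimal sketch using GCC-style inline assembly (being privileged, these only work in kernel mode):

```c
#include <stdint.h>

/* Minimal x86 port-mapped IO wrappers (GCC/Clang inline assembly). */
static inline void outb(uint16_t port, uint8_t value)
{
    __asm__ volatile ("outb %0, %1" : : "a"(value), "Nd"(port));
}

static inline uint8_t inb(uint16_t port)
{
    uint8_t value;
    __asm__ volatile ("inb %1, %0" : "=a"(value) : "Nd"(port));
    return value;
}
```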
But the IO instructions have been largely abandoned, and are only used with legacy devices. │ Memory-mapped IO │ │ • devices «share» address space with memory │ • «more common» in contemporary systems │ • IO uses the same instructions as memory access │ ◦ ‹load› and ‹store› on RISC, ‹mov› on ‹x86› │ • allows «selective» user-level access (via the MMU) The alternative to port-mapped IO is memory-mapped IO (MMIO for short), where the physical address space is shared by RAM and peripherals. Writing to some addresses (using e.g. ‹mov› on ‹x86›) will store the data in RAM, but simply changing the address will result in the data being sent to a peripheral (typically to be stored in an onboard register or memory). A specific example would be the PCIe configuration space: each PCIe device must expose a single page (4KiB) of MMIO address space through which it can be enumerated and configured. Unlike port-mapped IO, access to the physical memory address space is managed by the MMU (including the regions assigned to devices, not just those given to RAM). Hence it is possible to securely allow a certain process to talk to a specific device by mapping the corresponding chunk of the physical address space into the virtual address space of that process. │ Programmed IO │ │ • input or output is «driven» by the «CPU» │ • the CPU must «wait» until the device is ready │ • would usually run at «bus speed» │ ◦ 8 MHz for ISA (and hence ATA-1) │ • PIO would talk to a «buffer» on the device Another way to look at IO is how it is «timed». Peripherals are usually orders of magnitude slower than the main CPU and the CPU must wait a significant number of cycles between, for instance, issuing commands to a given device. Commands are usually realized by writing data to the on-board registers of the device. The device periodically reads those registers and acts accordingly, perhaps writing the response into some other register, which the CPU can then read (both input and output are done using one of the mechanisms described above: port-mapped or memory-mapped IO). The simplest form of timing is called «programmed IO» or PIO. In this mode, the CPU drives the data transfer and it has to actively wait for the device (or rather the bus) to become ready after each transfer. Consider sending data to a disk: there is a RAM-based buffer in the disk controller, one that can hold at least a single physical disk sector worth of data. The CPU can transfer data into this buffer at bus speed, e.g. 8MHz for ISA (admittedly a very old technology). If the CPU core runs at 32MHz, this means that it can only send data every fourth cycle. It has to spend 3 out of every 4 cycles waiting for the bus to become ready. │ Interrupt-driven IO │ │ • peripherals are «much» slower than the CPU │ ◦ «polling» the device is expensive │ • the peripheral can «signal» data availability │ ◦ and also «readiness» to accept more data │ • this «frees up CPU» to do other work in the meantime Some peripherals can only process very small amounts of data at once, and are much slower still than the bus. As an extreme example, consider a serial port configured to send 9600 bits per second. That works out to 1200 characters per second: with an on-board buffer for 60 characters, the CPU needs to fill that buffer at 20Hz, i.e. with a period of 50 milliseconds, which is of course an eternity in CPU time (almost 2 million cycles at 32MHz).
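To make programmed IO concrete, here is a minimal sketch of busy-waiting output to the legacy PC serial port (COM1 at its conventional IO ports), reusing the ‹inb›/‹outb› wrappers from the earlier sketch; port initialisation and error handling are elided.

```c
#include <stddef.h>
#include <stdint.h>

#define COM1_DATA   0x3f8 /* transmit holding register (when writing) */
#define COM1_STATUS 0x3fd /* line status register */
#define THR_EMPTY   0x20  /* bit 5: ready to accept another byte */

/* Programmed IO: the CPU drives the transfer and busy-waits for the
 * device. Uses the inb/outb wrappers defined in the previous sketch. */
void serial_send(const uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        while (!(inb(COM1_STATUS) & THR_EMPTY))
            ; /* actively waiting: nothing else gets done */
        outb(COM1_DATA, buf[i]);
    }
}
```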
So you would perhaps use PIO to fill in that 60 character buffer (at bus speed, so 25 % efficiency, working out to 240 cycles), but actively waiting for the buffer to drain would be madness. Fortunately, the serial port hardware can be configured to cause an «interrupt» when the buffer becomes empty. The CPU can go about doing whatever, but the serial port driver will be woken up to fill in the next 60 bytes when needed, once every 50ms or so. The same mechanism can be used for receiving data: the hardware will cause an interrupt once the receive buffer becomes full and needs to be read by the CPU. │ Interrupt Handlers │ │ • also known as «first-level» interrupt handler │ • they must run in «privileged» mode │ ◦ they are part of the «kernel» by definition │ • the low-level interrupt handler must finish «quickly» │ ◦ it will mask its own interrupt to avoid «re-entering» │ ◦ and «schedule» any long-running jobs for later (SLIH) Upon a hardware interrupt, the CPU will drop whatever it was doing, save its current state into a designated memory area and transfer control to an «interrupt handler». Or rather, one of the CPU cores will. This handler is automatically executed in privileged mode, and hence is by definition part of the kernel. Notice that no context switch occurs: though registers are written into memory, the page table is unaffected – the interrupt handler runs in the context of whatever process was running at the time. This is analogous to how system calls behave. To avoid issues with reentrancy, a first-level handler will usually «mask» its own interrupt (cause the CPU to temporarily ignore it). This is one of a number of reasons why the first-level handler needs to finish quickly (if an interrupt is masked for too long, this can cause data to be lost, e.g. due to buffer overruns). Hence the first-level handler will usually do the minimum required work (e.g. clear time-critical buffers) and schedule any further processing for a later time. │ Second-level Handler │ │ • does any expensive «interrupt-related» processing │ • can be executed by a «kernel thread» │ ◦ but also by a user-mode driver │ • usually not time critical (unlike first-level handler) │ ◦ can use standard «locking» mechanisms The work that was deferred by the first-level handler is picked up by a second-level handler. This handler can run in a kernel thread or even in a user-mode process. The second-level routine is usually not time critical and can synchronize with the rest of the system as needed. A second-level handler of a disk device could, for instance, call into the file system to notify it that a piece of data it has requested has arrived, which in turn could trigger a suspended ‹read› system call to write the data into the address space of a waiting process. The syscall then returns and the process is woken up. │ Direct Memory Access │ │ • allows the device to directly read/write «memory» │ • this is a «huge» improvement over «programmed» IO │ • «interrupts» indicate buffer «full»/«empty» │ • devices can read and write arbitrary «physical» memory │ ◦ opens up «security» / reliability problems The last mode of IO is known as DMA, or Direct Memory Access. While there is some superficial similarity with MMIO (memory-mapped IO), it is important to distinguish them. In MMIO, the CPU (and by extension, the OS) talks to the device using the memory subsystem, mapping the on-board memory or registers of that device into the physical address space of the CPU.
The situation in DMA is flipped: the CPU and the device do not talk to each other at all. Instead, physical memory attached to the CPU (i.e. the main RAM) is made accessible to the peripheral, which can then transfer data to RAM. The CPU still uses memory access instructions to fetch data that came from the device (like in MMIO) but it does not communicate with the device directly. Instead, it reads and writes into its own RAM, which just happens to contain data that the device wrote there, or will later read. To summarise: • both MMIO and DMA use memory access instructions on the CPU to read and write data, • under MMIO, the main memory is «not involved» at all, • under DMA, «both» the device and the CPU «access main memory», • under DMA, there is no direct bulk transfer of data between the CPU and the device. The use of MMIO and DMA is not exclusive, rather to the contrary: devices often use a combination of both. In fact, MMIO can be used to configure DMA (the latter is unsuitable for configuration, but performs better for bulk data transfers). │ IO-MMU │ │ • like the MMU, but for DMA transfers │ • allows the OS to «limit» memory access per device │ • very useful in «virtualisation» │ • only recently found its way into «consumer» computers While DMA is extremely important for devices which transfer a lot of data (HDDs, SSDs, NICs), it has some nasty security and safety implications. Under ‘traditional’ DMA, the device can read and write any physical memory it wants to. For instance, it can overwrite kernel code if it so wishes. A rogue device could then very easily circumvent any software-level security. Perhaps even more importantly, a rogue driver could program the device to overwrite memory with data (and code) of the driver's choosing. This is undesirable, especially if we want to use user-mode drivers, or if the device is not sufficiently secure.¹ The IO-MMU is a device which fixes this problem, by enforcing limits on which memory a particular peripheral can access. The IO-MMU, like the regular MMU, can only be programmed by the OS kernel (or a hypervisor, as it may be… we will learn more about those in Chapter 11). With a correctly programmed IO-MMU, DMA is safe and secure. ¹ Famously, any device attached to a firewire port – an external port, kind of like high-speed USB before USB 3 was a thing – can read and write any and all host memory. It is not impossibly hard to build a rogue firewire device and attach it to someone else's computer. Other connectors which expose high-speed buses may be susceptible.

## System and Expansion Buses

The rest of the lecture will be a tour of peripherals and some of their history. Before we get to the peripherals themselves, though, we will look at buses which are used to connect peripherals to the CPU (or CPUs) and, in some cases, to RAM. While the bus itself is not a peripheral, it is common for a bus to have a driver of its own. There are two reasons for this: 1. all but the simplest buses have additional hardware, which mediates bus access, takes care of device configuration and enumeration, and so on, and which needs to be itself configured, 2. besides the electronics and signalling, a bus also conceptually comes with a set of «protocols», which need to be implemented both by the peripherals and by their drivers; the bus driver implements those protocols: other drivers make simple function calls and the driver translates them to the required MMIO or port-mapped IO operations.
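The ‘function calls translated into MMIO’ idea from point 2 can be sketched in a few lines of C. The names below are invented for illustration; the point is only that a load or store through a mapped pointer becomes a device access.

```c
#include <stdint.h>

/* Hypothetical sketch of the interface a bus driver might hand out to
 * peripheral drivers. The bus driver maps the device registers and
 * wraps the raw MMIO behind ordinary functions. Names are made up. */
struct bus_device {
    volatile uint32_t *regs; /* device registers, mapped by the bus driver */
};

static inline void bus_write32(struct bus_device *dev, unsigned reg,
                               uint32_t val)
{
    dev->regs[reg] = val;    /* a store instruction becomes a device write */
}

static inline uint32_t bus_read32(struct bus_device *dev, unsigned reg)
{
    return dev->regs[reg];   /* a load instruction becomes a device read */
}
```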
With that said, let's look at some of the historical buses which were used in PCs over time and how they evolved to the current state of the art, PCI Express. │ History: ISA (Industry Standard Architecture) │ │ • 16-bit system «expansion» bus on IBM PC/AT │ • «programmed» IO and «interrupts» │ • a fixed number of hardware-configured «interrupt lines» │ ◦ likewise for I/O port ranges │ ◦ the HW settings then need to be «typed back» for SW │ • parallel data and address transmission One of the oldest expansion buses, which made an appearance with IBM PC/AT (a personal computer based on Intel 80286). The ISA bus was hooked to the CPU via IO ports (no MMIO) and provided an interrupt line to each peripheral. A limited number of DMA ‘channels’ were provided by a DMA controller, allowing attached peripherals (mainly storage devices) to move data to and from memory independently from the main CPU.² It was not possible to enumerate the bus, much less to configure the peripherals, in software. The port ranges and IRQ lines were selected by the hardware (either hardcoded, or configurable with jumpers or switches) and had to be given to the driver by the user (i.e. it had to be ‘typed back’ manually). Hardware-wise, the bus was a parallel design, synchronously transferring 16 bits across 16 wires at the same clock tick. Separate data and address lanes were used: an address could be transferred with the same clock tick as a data word. ² In this setup, the DMA controller actually becomes the bus master and performs the transfer. While the effect is essentially the same, the implementation is rather different than with the DMA based on peripherals becoming bus masters that we will encounter later in the lecture. │ MCA, EISA │ │ • MCA: Micro Channel Architecture │ ◦ «proprietary» to IBM, patent-encumbered │ ◦ 32-bit, «software-driven» device configuration │ ◦ expensive and ultimately a market «failure» │ │ • EISA: Enhanced ISA │ ◦ a 32-bit extension of ISA │ ◦ mostly created to avoid MCA licensing costs │ ◦ short-lived and replaced by PCI At 8MHz and 16 bits, ISA eventually started to be a limiting factor, since both CPUs and peripherals – especially graphics adapters, but also storage devices – were getting a lot faster. │ VESA Local Bus │ │ • memory mapped IO & fast «DMA» on otherwise ISA systems │ • «tied» to the 80486 line of Intel (and AMD) CPUs │ • primarily for «graphics cards» │ ◦ but also used with hard drives │ • quickly fell out of use with the arrival of PCI VESA Local Bus, or VLB, was a fairly successful effort to standardize a disparate set of home-grown buses designed to accommodate faster graphics hardware than what was possible with ISA, while avoiding the licensing costs of MCA. The VLB essentially connected the peripheral directly to the 80486 memory bus, using an additional connector (as an extension of standard ISA). Due to incompatible memory bus design in later processors, VLB did not survive the upgrade to Pentium. │ PCI: Peripheral Component Interconnect │ │ • a 32-bit successor to ISA │ ◦ 33 MHz (compared to 8 MHz for ISA) │ ◦ later revisions at 66 MHz, PCI-X at 133 MHz │ ◦ with support for «bus-mastering» and DMA │ • still a «shared», parallel bus │ ◦ all devices share the same set of wires The breakthrough in peripheral interconnects came with PCI, which provided most of the benefits of MCA while avoiding some of its problems. Perhaps the most important update was software-based configuration, but the considerable bandwidth upgrade did not hurt either. 
From a modern perspective, the one downside was the topology: a shared, parallel bus connecting all the devices in the system. Parallel here means that 32 bits are transmitted with each clock cycle, along 32 separate wires. This limits achievable clock speeds due to signal delay differences along traces of different length – modern buses transmit data serially, each data wire on its own clock. │ Bus Mastering │ │ • normally, the CPU is the bus «master» │ ◦ which means it initiates communication │ • it's possible to have multiple masters │ ◦ they need to agree on a conflict resolution protocol │ • usually used for accessing the memory On a shared bus, one of the devices is usually the master and is in charge of the bus and the traffic on it. Normally, this is the CPU. However, for DMA transfers (between memory and a peripheral), the CPU should not be involved, since the entire point is to free up the CPU to do other work while the transfer is going on. To facilitate these transfers, then, the peripherals can temporarily become bus masters, directing the traffic. An arbitration protocol ensures there is at most a single master driving the bus at any given time. │ DMA (Direct Memory Access) │ │ • the most common form of bus mastering │ • the CPU tells the device what and where to write │ • the device then sends data directly to RAM │ ◦ the CPU can work on other things in the meantime │ ◦ completion is signaled via an interrupt In principle, it is possible for peripherals to talk to each other when one of them is the bus master. However, this is not usually done: instead, the (temporary) bus master performs a data transfer to or from the main memory. │ Plug and Play │ │ • the ISA system for IRQ configuration was «messy» │ • MCA pioneered software-configured devices │ • PCI further improved on MCA with “Plug and Play” │ ◦ each PCI device has an ID it can «tell» the system │ ◦ enables «enumeration» and automatic «configuration» An important aspect of PCI (and MCA before it) was software-based configuration and enumeration of connected devices. This allows the firmware and the operating system to discover what devices are connected, load the appropriate drivers and set up the devices without user intervention. │ PCI IDs and Drivers │ │ • PCI allows for device enumeration │ • device «identifiers» can be paired to device «drivers» │ • this allows the OS to load and configure its drivers │ ◦ or even download / install drivers from a vendor Enumeration has two components: one is a system to discover and configure the devices attached to the system. This is done by using a common, device-independent protocol which must be implemented by all PCI devices. The other is a system for assigning a unique identifier to each peripheral, a so-called PCI ID. An operating system can then include a database of known PCI IDs and corresponding drivers for that device. Loading that driver typically makes the device available for use by the rest of the operating system, and hence by the user. │ AGP: Accelerated Graphics Port │ │ • PCI eventually became too «slow» for GPUs │ ◦ AGP is based on PCI and only «improves performance» │ ◦ enumeration and configuration stays the same │ • adds a dedicated «point-to-point» connection │ • multiple transfers per clock (up to 8, for 2 GB/s) Of course, peaking around 4 Gib/s (500 MiB/s), PCI is not the end of the story. In a clear historic pattern, graphics hardware became limited by its connection to the rest of the system (CPU and memory). 
Like with VLB, a dedicated graphics bus has become widespread, this time based on PCI, with essentially two modifications: 1. the bus was point-to-point (dedicated to a single peripheral), i.e. not shared with the main PCI bus in the system, 2. it allowed multiple data transfers per clock cycle – the same technique that DDR RAM uses to increase throughput without driving the clock faster. With a maximum of 8 transfers per clock and the main clock running at 66MHz, the maximum transfer speed comes out as 16Gib/s. │ PCI Express │ │ • the current high-speed peripheral bus for PC │ • builds on / «extends» conventional PCI │ • point-to-point, «serial» data interconnect │ • much improved «throughput» (up to ~30GB/s) We have finally reached the present day. The modern successor to PCI moved away from synchronous parallel data transmission and from a shared bus, allowing for a drastic performance increase. Even though multiple wires are used for data transfer, they are self-clocked (clock is part of the data signal) and hence asynchronous to each other. Each wire is called a ‘lane’ and a single peripheral can use up to 16 lanes. Low-bandwidth devices only need a single lane, saving on power requirements and manufacturing cost. At the time of this writing, devices targeting PCIe 4.0, with 16GT (billion transfers) per second on each lane, are commonly available. This translates to a maximum per-device bandwidth of about 256Gib/s (compare to AGP at 16Gib/s) or 32GiB/s in a 16-lane configuration. The next revision, PCIe 5.0 (final spec released in 2019) doubles the per-lane transfer rate to 32GT/s, for a per-device maximum of 64GiB/s. Software-wise, PCIe is backward-compatible with PCI, using an extended version of the PCI enumeration and configuration protocol. Additionally, PCIe allows the configuration to use MMIO instead of port-mapped IO, exposing a single 4KiB page of configuration data per endpoint (peripheral). │ USB: Universal Serial Bus │ │ • primarily for «external» peripherals │ ◦ keyboards, mice, printers, ... │ ◦ replaced a host of «legacy ports» │ • later revisions allow «high-speed» transfers │ ◦ suitable for storage devices, cameras &c. │ • device enumeration, capability «negotiation» PCI brought software-driven enumeration and configuration to the permanently attached, internal peripherals (graphics hardware, storage, network interfaces, and so on). USB did the same for externally-attached devices, like keyboards, mice, printers, scanners and so on. Earlier systems used comparatively ‘dumb’ buses for the same purpose. The user had to select a driver by hand and configure the driver (tell it which external port the device is attached to). With USB, the devices would identify themselves using a device-neutral protocol, just like with PCI. The host system can then load and configure the correct driver automatically. Moreover, USB supports hotplug, so this can happen whenever the user plugs in a device. Finally, the bandwidth available on USB, even in its first revision, was much higher than the earlier standards (RS-232, PS/2). Later USB revisions considerably increased both data transmission speed and the power available to the attached peripheral. The current highest speed available to USB devices (in USB 3.2 Gen 2 mode with 2 lanes, over USB-C connectors) is 20Gib/s, exceeding the maximal transfer speeds of AGP, the fastest internal bus available in consumer hardware before PCIe.
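This ID-based enumeration is easy to observe from user space. The following sketch uses the libusb-1.0 library to list the vendor and product IDs of attached USB devices – the same IDs the OS uses to pick a driver; error handling is elided.

```c
#include <stdio.h>
#include <libusb-1.0/libusb.h>

int main(void)
{
    libusb_device **devs;
    libusb_init(NULL);                    /* use the default context */
    ssize_t n = libusb_get_device_list(NULL, &devs);
    for (ssize_t i = 0; i < n; ++i) {
        struct libusb_device_descriptor desc;
        libusb_get_device_descriptor(devs[i], &desc);
        printf("%04x:%04x class %02x\n",  /* vendor:product, class code */
               desc.idVendor, desc.idProduct, desc.bDeviceClass);
    }
    libusb_free_device_list(devs, 1);
    libusb_exit(NULL);
    return 0;
}
```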
│ USB Classes │ │ • a set of «vendor-neutral» protocols │ • HID = human-interface device │ • mass storage = disk-like devices │ • audio equipment │ • printing USB comes with additional standardization, with so-called device «classes». Each class constitutes a vendor-neutral protocol for a particular type of devices: • HID (human-interface device), e.g.: ◦ keyboards, ◦ mice, ◦ game controllers, ◦ small character-based displays, ◦ pretty much anything with a button. • mass storage (persistent memory, usually with a file system): ◦ flash ‘pen’ drives, ◦ external hard drives or SSDs, ◦ optical drives, ◦ card readers, … • audio devices, e.g.: ◦ headsets (headphones with a microphone), ◦ sound cards, ◦ active loudspeakers, ◦ standalone microphones, ◦ MIDI devices, • MTP (media transfer protocol), ◦ smartphones, ◦ portable media players. • printers, • video (webcams, digital microscopes). Essentially, none of the devices in the above list need vendor-specific drivers to operate. Instead, a single ‘class’ driver which implements the respective protocol can talk to any peripheral which belongs to that class. A single physical peripheral may provide multiple virtual devices, possibly in different classes (e.g. a portable recorder which can appear both as an audio device – a microphone – and as a storage device). │ Other USB Uses │ │ • scanners │ • ethernet adapters │ • usb-serial adapters │ • wifi adapters (dongles) │ • bluetooth In addition to the standard device classes, there are many USB devices which do not fit one of those categories. These will use a vendor-specific protocol and will require a corresponding device-specific driver. │ Bluetooth │ │ • a «wireless» alternative to USB │ • allows «short-distance» radio links with «peripherals» │ ◦ input (keyboards, mice, game controllers) │ ◦ audio (headsets, speakers) │ ◦ data transmission (e.g. smartphone sync) │ ◦ gadgets (watches, heartrate monitoring, GPS, ...) While bluetooth is not a bus as such (being wireless), it behaves much like USB from the point of view of software (with additional complexity related to device pairing, security and unreliable data transmission). Many device types that can be attached via USB can also be attached with bluetooth (wireless keyboards, mice, headsets, speakers, and so on). │ ARM Buses │ │ • ARM is typically used in System-on-a-Chip designs │ • those use a «proprietary» bus to connect peripherals │ • there is less need for enumeration │ ◦ the entire system is baked into a single chip │ • the peripherals can be «pre-configured» The ARM ecosystem is somewhat different from the PC one. It is common that ARM devices are ‘system on a chip’ designs, where most, if not all, peripherals are part of a single chip together with CPU cores, memory controller, and interconnect (system bus). SoC vendors usually prepare operating system images or kernel builds (typically of Android) that work on their system. Software-driven enumeration and autoconfiguration is much less important and is typically not supported. Peripherals typically included are a graphics core, a USB controller, wifi, ethernet, bluetooth controller, audio controller, NFC, storage controller (eMMC) and perhaps a few others. │ USB and PCIe on ARM │ │ • neither USB nor PCIe are exclusive to the PC platform │ • most ARM SoCs support USB devices │ ◦ for slow and medium-speed off-SoC devices │ ◦ e.g.
used for «ethernet» on RPi 1 │ • some ARM SoCs support PCI Express │ ◦ this allows for «high-speed» off-SoC peripherals However, not all ARM processors are designed for ‘sealed’ devices like smartphones or smart TVs. ARM-based general-purpose hardware includes single-board computers (like Raspberry Pi, Beaglebone, …) but also laptops (new generation of Apple hardware) and servers (Ampere Altra). Those systems often need more connectivity and extensibility and will provide PCI Express for connecting to high-speed peripherals. │ PCMCIA & PC Card │ │ • People Can't Memorize Computer Industry Acronyms │ ◦ PC = Personal Computer, MC = Memory Card │ ◦ IA = International Association │ • «hotplug»-capable notebook «expansion» bus │ • used for memory cards, network adapters, modems │ • comes with its own set of drivers (cardbus) Back to history: until a decade ago, it was common that laptop computers had expansion slots, a bit like traditional desktops. Of course, a standard-size expansion card has no chance of fitting in a laptop, hence special connectors and/or buses. One of the oldest was PCMCIA, with credit-card-sized (but thicker) peripherals that could be hot-plugged into a bay on the side of a laptop (i.e. the device would be hidden inside the laptop body, unlike various USB dongles with a mess of wires). │ ExpressCard │ │ • an «expansion card» standard like PCMCIA / PC Card │ • based on PCIe and USB │ ◦ can mostly «re-use» drivers for those standards │ • not in wide use anymore │ ◦ last update was in 2009, introducing USB 3 support │ ◦ the industry association «disbanded» the same year ExpressCard is a more modern version of the same idea and a similar form factor, with USB and PCIe in the backend. Modern laptops, however, no longer offer this functionality and the association responsible for ExpressCard was disbanded over a decade ago. │ miniPCIe, mSATA, M.2 │ │ • those are «physical interfaces», not special buses │ • they provide some mix of PCIe, SATA and USB │ ◦ also other protocols like I²C, SMBus, ... │ • used mainly for compact SSDs and wireless │ ◦ also GPS, NFC, bluetooth, ... What survives are connectors for «internal» devices in a small form factor: mainly for SSDs, but also for wifi adapters, bluetooth and similar modules. These are common in laptops and mini-ITX (small desktop) systems. Depending on the particular connector standard (and variant), it will provide a variety of bus connections, including PCIe (up to 4 lanes) and USB.

## Graphics and GPUs

Graphics hardware was always a very important part of both home computers and professional workstations. Often, it is also the most demanding peripheral in those applications, and the most complex. │ Graphics Cards │ │ • initially just a device to «drive displays» │ • reads pixels from «memory» and provides «display» signal │ ◦ basically a DAC with a clock │ ◦ the memory can be part of the graphics card │ • evolved «acceleration» capabilities Originally, a graphics card would simply contain some fast static memory (frame buffer), a clock and a digital-to-analog converter (DAC), which would drive a CRT display (cathode ray tube). The displays of the era worked by pointing an electron gun (using electromagnets) at individual pixels in rapid succession while modulating the voltage between the cathode and anode (essentially a conductive coating of the inside of the screen) to attain corresponding brightness on each pixel.
The graphics card would generate the signal driving this modulation, in step with the advancing electron gun. The memory of the graphics card would contain digital information about the brightness of each pixel. Typical refresh rates would be in the 30-120 Hz range for the entire screen. For a VGA screen (640 columns, 480 rows) at 70 Hz, this works out to about 20 MHz (20 million pixels per second). The three component colours are transmitted in parallel. │ Graphics Accelerator │ │ • allows common «operations» to be done in «hardware» │ • like drawing lines or filled «polygons» │ • the pixels are computed directly in video RAM │ • this can «save» considerable «CPU time» Composing a picture to be displayed on screen can take a lot of computation and/or memory traffic. If some of those operations are performed by dedicated hardware instead of the main CPU, this can drastically improve performance, since the CPU is free to do other things while the graphics hardware asynchronously performs the simple, repetitive tasks. There are two main classes of operations that can be easily accelerated using dedicated hardware: • rasterization of geometric shapes such as lines, rectangles, polygons or curves (vector graphics) – those are used in, for instance, graphical user interfaces and in vector drawing programs or 2D computer-aided design systems, • bulk pixel operations, such as flood fill or bit blitting¹ mainly used in raster graphics (e.g. video games). Since essentially each pixel (or at best a small block of pixels) needs at least one memory write, and for a CPU, memory writes are expensive (lots of waiting for slow memory), such operations are especially wasteful on the CPU. Even worse if data (textures, sprites) need to be read from memory and written back elsewhere, perhaps after performing a simple operation on the pixels. ¹ A memory copy with some additional logic: it operates on pixels (instead of bytes) in various formats (e.g. 2 or 8 pixels per byte) and can deal with transparent pixels which are skipped (allows drawing non-rectangular shapes over an existing background). │ 3D Graphics │ │ • rendering 3D scenes is «computationally intensive» │ • CPU-based, «software-only» rendering is possible │ ◦ texture-less in early flight simulators │ ◦ bitmap textures since '95 / '96 (Descent, Quake) │ • CAD workstations had 3D accelerators (OpenGL '92) While 2D graphics takes a lot of resources (at least in terms of the capabilities of older hardware), it is essentially free compared to 3D graphics, where computing each output pixel can take hundreds of operations, some of which are geometric and others which are raster-based. Hence, the potential for hardware acceleration of 3D graphics is considerably higher than with 2D graphics, but the hardware to do so is much more complicated. │ GPU (Graphics Processing Unit) │ │ • a term coined by Sony in '94 (the GPU in PlayStation) │ • originally a purpose-built «hardware renderer» │ ◦ based on polygonal meshes and Z buffering │ • increasingly more «flexible» and «programmable» │ • on-board RAM, high-speed connection to system RAM First GPUs were essentially hardware built for rasterization of 3D geometry, supplied as a polygonal (triangular) mesh with textures attached to the faces. The hardware would then compute visibility and lighting to produce a raster image to be displayed on screen. The CPU would prepare the geometry for each frame which the GPU would then render and display. 
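The ‘pixels in memory’ model is still directly visible in the software-only framebuffer interface of current systems. A Linux-specific sketch which paints the whole screen a single colour follows; it assumes that ‹/dev/fb0› exists and that the display is in a 32-bit-per-pixel mode, and error handling is elided.

```c
#include <fcntl.h>
#include <linux/fb.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/fb0", O_RDWR);
    struct fb_var_screeninfo v;
    struct fb_fix_screeninfo f;
    ioctl(fd, FBIOGET_VSCREENINFO, &v); /* resolution and depth */
    ioctl(fd, FBIOGET_FSCREENINFO, &f); /* line stride in bytes */

    size_t size = (size_t)f.line_length * v.yres;
    uint32_t *px = mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    for (uint32_t y = 0; y < v.yres; ++y)       /* fill every pixel */
        for (uint32_t x = 0; x < v.xres; ++x)
            px[y * (f.line_length / 4) + x] = 0x00336699;
    munmap(px, size);
    close(fd);
    return 0;
}
```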
Each generation of GPUs brings more flexibility and programmability, allowing for acceleration of lots of different effects without hard-coding them in hardware. Contemporary GPUs are essentially fully programmable general-purpose vector processors, with registers, memory, control flow and so on. │ GPU Drivers │ │ • split into a number of components │ • graphics output / frame buffer access │ • «memory management» is often done in kernel │ • geometry, textures &c. are prepared «in-process» │ • front end API: OpenGL, Direct3D, Vulkan, ... A typical GPU driver is split into a number of components, some of which reside in the kernel (frame buffer setup, memory management) while the more complex parts are libraries linked into client applications (geometry and texture processing, shader compilation). │ Shaders │ │ • current GPUs are «computation» devices │ • the GPU has its own machine code for «shaders» │ • the GPU driver contains a «shader compiler» │ ◦ either all the way from a high level language (HLSL) │ ◦ or starting with an intermediate code (SPIR) Since modern GPUs are really just vector processors in disguise, they run programs in their own machine code. The driver then compiles higher-level programs which are part of the software (e.g. a computer game or a 3D game engine) into the hardware-specific machine language. While the output is very device-specific, the input (which is what the application gives to the driver) is mostly standardized, with two main options being HLSL (High-Level Shader Language) and SPIR (Standard Portable Intermediate Representation). │ Mode Setting │ │ • deals with «screen» configuration and «resolution» │ • including support for e.g. «multiple displays» │ • usually also supports primitive (SW-only) «framebuffer» │ • often in-kernel, with minimum user-level support While there are a lot of bells and whistles on a modern GPU, there are some boring tasks which have not really changed in the last 2-3 decades, like display configuration. It's common that current computers can attach multiple displays, and each needs to be given a resolution, color depth, refresh rate &c., together known as a graphics ‘mode’. This is the task of the ‘mode setting’ part of a graphics driver. │ Graphics Servers │ │ • multiple apps cannot all drive the graphics card │ ◦ the graphics hardware needs to be «shared» │ ◦ one option is a «graphics server» │ • provides an IPC-based «drawing» and/or «windowing» API │ • performs «painting» on behalf of the applications While not drivers themselves, graphics servers form an important part of the graphics stack (on systems which use one). The problem here is that only one program can meaningfully draw on any given screen, but we usually want to show the output of more than a single program. One option is a graphics server, which hands out regions (rectangular windows, typically) into which programs can paint using its API. │ Compositors │ │ • a more direct way to share graphics cards │ • each application gets its «own buffer» to paint into │ • painting is mostly done by a (context-switched) GPU │ • the individual buffers are then «composed» onto screen │ ◦ composition is also hardware-accelerated The other common approach is to use a «compositor», which differs crucially from a graphics server in one thing: how the individual applications paint their content. In a graphics server, there is a painting API which the program calls to display shapes and pixmaps on screen.
With a compositor, each program gets an «off-screen buffer» (pixmap) into which it can paint by directly interacting with the driver of the graphics hardware. The compositor then combines those buffers into a single picture which is shown to the user (again by making appropriate calls into the graphics driver). In typical usage, each window corresponds to a single buffer. │ GP-GPU │ │ • general-purpose GPU (CUDA, OpenCL, ...) │ • used for «computation» instead of just graphics │ • basically a return of vector processors │ • close to CPUs but not part of normal OS scheduling As we have mentioned earlier, contemporary GPUs are really general-purpose vector processors and can be used for purely computational tasks that have nothing to do with graphics (machine learning is a popular application, but anything that benefits from massive SIMD is a good candidate).

## Persistent Storage

In this section, we will look at bulk storage devices – those that usually carry file systems and which retain the stored data while offline (disconnected from power). │ Drivers │ │ • split into adapter, bus and device drivers │ • often a single driver per device type │ ◦ at least for disk drives and CD-ROMs │ • bus «enumeration» and «configuration» │ • data addressing and «data transfers» Storage devices have traditionally had their own dedicated, specialized bus. The host side of this bus is implemented by an «adapter» (controller) which is connected to a system bus (PCI, PCIe) on one side and to the storage bus on the other. Individual storage devices are then connected to this dedicated bus. This hardware structure essentially dictates the driver structure: the bus is usually standardized and comes with a set of protocols, just like system buses that we discussed earlier do. However, for any given bus, there might be many different adapter models made by different vendors. In some cases, they use a common protocol, but in other cases, device-specific drivers are required to configure them. Like with USB, on any given storage bus, there is considerable standardization among the storage devices themselves (endpoints), and a single ‘class’ driver is sufficient (a HDD driver, a CD-ROM driver, a tape unit driver, …). │ IDE / ATA │ │ • Integrated Drive Electronics │ ◦ disk controller becomes part of the disk │ ◦ standardised as ATA-1 (AT Attachment ...) │ • based on the ISA bus, but with cables │ • later adapted for non-disk use via ATAPI One of the oldest «standardized» storage buses was IDE (vendor name, later standardized as ATA). This is essentially an ISA bus with cabling, hence the adapter, if connected to the host ISA bus, was especially simple. However, later revisions of the ATA (now known as Parallel ATA) spec diverged from ISA due to much higher speeds that were eventually required. The ATA family of buses did not switch to using PCI internally and the storage bus and system bus evolved separately, even if along similar lines. │ ATA Enumeration │ │ • each ATA «interface» can attach only 2 drives │ ◦ the drives are HW-configured as master/slave │ ◦ this makes enumeration quite simple │ • multiple ATA interfaces were standard │ • no need for specific HDD drivers Since most implementations offer exactly 4 connectors (2 interfaces, each capable of connecting 2 drives), enumeration is not much of an issue. Each interface has a standard set of IO ports (for port-mapped IO). The system uses those ports to send 2 ‹IDENTIFY› commands on each interface, one for the master and the other for the slave device.
This completes the enumeration. │ PIO vs DMA │ │ • original IDE could only use «programmed» IO │ • this eventually became a serious «bottleneck» │ • later ATA revisions include «DMA» modes │ ◦ up to 160MB/s with highest DMA modes │ ◦ compare 1900MB/s for SATA 3.2 │ SATA │ │ • «serial», point-to-point replacement for ATA │ • hardware-level incompatible to (parallel) ATA │ ◦ but SATA inherited the ATA «command set» │ ◦ legacy mode lets PATA drivers talk to SATA drives │ • hot-swap capable – replace drives in a «running system» Like other interfaces, storage systems made a transition to serial data links. For ATA, the result is known as SATA or Serial ATA. The newer standard retains software-level backward compatibility with Parallel ATA: if the controller is in ‘legacy mode’, it will emulate a PATA host controller and work with legacy PATA drivers. However, this PATA-compatible mode necessarily hides new features (ability to connect more drives, hotswap, native command queuing). │ AHCI (Advanced Host Controller Interface) │ │ • «vendor-neutral» interface to SATA controllers │ ◦ in theory only a single ‘AHCI’ driver is needed │ • an alternative to ‘legacy mode’ │ • NCQ = Native Command Queuing │ ◦ allows the drive to re-order requests │ ◦ another layer of IO scheduling Most SATA host controllers implement the AHCI standard and hence don't need device-specific drivers. Running the controller in AHCI mode is required to make use of new SATA technologies, such as NCQ (native command queuing) and hotswap. While attempts were made to add command queuing to PATA, those were ultimately unsuccessful, due to insufficient DMA capabilities of the old ISA-based system (with a 3rd-party DMA controller). Since SATA drives perform DMA themselves, NCQ has much better performance. │ ATA and SATA Drivers │ │ • the host controller (adapter) is mostly vendor-neutral │ • the «bus driver» will expose the ATA command set │ ◦ including support for «command queuing» │ • device driver uses the bus driver to talk to devices │ • partially re-uses SCSI drivers for ATAPI &c. │ SCSI (Small Computer System Interface) │ │ • originated with minicomputers in the 80's │ • more complicated and «capable» than ATA │ ◦ ATAPI basically encapsulates SCSI over ATA │ • device «enumeration», including «aggregates» │ ◦ e.g. entire enclosures with many drives │ • also allows CD-ROM, tapes, scanners (!) A different storage bus, called SCSI, has been in parallel use with ATA, mainly targeting servers and high-end hardware in general. The overall structure is the same as with ATA: there is an adapter (called HBA – host bus adapter – in SCSI jargon), a bus with a set of protocols, and an array of storage devices attached to the storage bus. Unlike Parallel ATA, the SCSI bus can attach many more devices and those devices can have additional internal structure (e.g. it's possible to attach a SATA RAID controller with a dozen disks as a single ‘composite’ SCSI endpoint). For this reason, it has advanced software-based enumeration and configuration capabilities: the HBA will ‘scan’ the storage bus to discover devices and report them to the operating system. SCSI also commonly supports hotplugging devices (i.e. attaching and detaching devices while the system is running). Also unlike ATA, external SCSI connectors and cabling are common. Like ATA (and like system buses) SCSI used a parallel design for a long time, but modern versions use high-speed serial links instead. The technology is known as SAS, Serial-Attached SCSI. 
SAS can optionally use a SATA-compatible connector (and SAS adapters with such connectors will work with SATA drives, but not vice versa). │ SCSI Drivers │ │ • split into: a host bus adapter (HBA) driver │ • a generic SCSI bus and command component │ ◦ often re-used in both ATAPI and USB storage │ • and per-«device» or per-class drivers │ ◦ optical drives, tapes, CD/DVD-ROM │ ◦ standard disk and SSD drives While SCSI «hardware» is somewhat uncommon, the protocols it uses are in widespread use. Both ATAPI devices (attached over PATA or SATA) and USB storage devices use SCSI as their command protocol. Additionally, Fibre Channel (FC, a storage-area network technology) and InfiniBand (IB, a high-speed, low-latency interconnect) offer SCSI implementations. This essentially means that the same ‘class’ driver can be used for storage devices attached to SATA, USB, SAS, FC, IB or ethernet (via iSCSI, see below), with an appropriate glue layer. │ iSCSI │ │ • basically SCSI over TCP/IP │ • entirely «software-based» │ • allows standard computers to serve as «block storage» │ • takes advantage of fast cheap ethernet │ • re-uses most of the «SCSI driver stack» The SCSI protocol can also be encapsulated in TCP/IP and transported using, for instance, ethernet. This approach allows SCSI endpoints to be implemented in software: instead of specialized hardware, a RAID enclosure (a box with many disks combined into one or a few logical drives using RAID) can be implemented as a commodity x86 server with an ethernet connection. This is sufficient for many use cases, while being significantly cheaper than ‘native’ storage-area networks (fibre channel, infiniband), or even standard externally-connected SAS. │ NVMe: Non-Volatile Memory Express │ │ • a fairly simple protocol for PCIe-attached storage │ • optimised for SSD-based devices │ ◦ much bigger and more numerous «command queues» than AHCI │ ◦ better / faster interrupt handling │ • stresses «concurrency» in the kernel block layer A ‘return to the roots’ technology: what ATA was to ISA, NVMe is to PCIe. Essentially a protocol on top of PCIe interconnect, re-using PCIe enumeration and configuration. The protocol calls for rather massive command queues, taking advantage of the correspondingly massive parallelism in the SSD hardware. NVMe storage is usually very fast and the block layer, originally designed for much slower devices, may struggle to keep up. │ USB Mass Storage │ │ • a USB device class (vendor-neutral protocol) │ ◦ one driver for the entire class │ • typically USB «flash drives», but also external «disks» │ • USB 2 is not suitable for high-speed storage │ ◦ USB 3 introduced UAS = USB-Attached SCSI As mentioned earlier, storage devices can also be directly attached to USB. │ Tape Drives │ │ • unlike disk drives, only allow «sequential» access │ • needs support for media «ejection», «rewinding» │ • can be attached with SCSI, SATA, USB │ • parts of the driver will be «bus-neutral» │ • mainly for data «backup», capacities 6-15TB While disk-like devices (HDDs, SSDs, RAID enclosures) are by far the most important, there are other storage devices worth mentioning. Data centers will often use tape drives for backups, since they offer excellent data density, low price per gigabyte stored and good durability. From an OS standpoint, tapes are special since they can only be accessed sequentially, and it doesn't make sense to put a traditional file system on them. Instead, specialized programs are used to prepare data for writing on a tape, e.g. ‹tar› (short for Tape ARchive).
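Returning briefly to the SCSI command set, its ubiquity is easy to demonstrate from user space. The following Linux-specific sketch sends an ‹INQUIRY› command through the generic ‹sg› driver; the same code works whether ‹/dev/sg0› happens to be a SAS disk, a USB stick or an iSCSI volume. Error handling is elided.

```c
#include <fcntl.h>
#include <scsi/sg.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

int main(void)
{
    unsigned char cdb[6] = { 0x12, 0, 0, 0, 96, 0 }; /* INQUIRY, 96 bytes */
    unsigned char buf[96], sense[32];
    struct sg_io_hdr io;

    int fd = open("/dev/sg0", O_RDONLY);   /* first generic SCSI device */
    memset(&io, 0, sizeof io);
    io.interface_id = 'S';
    io.cmd_len = sizeof cdb;
    io.cmdp = cdb;
    io.dxfer_direction = SG_DXFER_FROM_DEV; /* data flows device → host */
    io.dxferp = buf;
    io.dxfer_len = sizeof buf;
    io.sbp = sense;                         /* error details, if any */
    io.mx_sb_len = sizeof sense;
    io.timeout = 5000;                      /* milliseconds */
    ioctl(fd, SG_IO, &io);
    printf("vendor: %.8s model: %.16s\n", buf + 8, buf + 16);
    return 0;
}
```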
│ Optical Drives │ │ • mainly used as a «read-only» distribution medium │ • laser-facilitated reading of a rotating disc │ • can be again attached to SCSI, SATA or USB │ • conceived for «audio playback» → very slow seek Another somewhat special class of storage devices are optical drives: CD-ROM, DVD-ROM, Blu-ray. While random access is possible, it is very slow even compared to HDDs. Optical drives are more suitable for streaming (mainly audio and video) or content distribution. Unlike tapes, (read-only) file systems are commonly used on optical media (ISO 9660 for CD-ROM, UDF for DVD and Blu-ray). │ Optical Disk Writers (Burners) │ │ • behaves more like a «printer» for optical «disks» │ • drivers are often done in «user space» │ • attached by one of the standard «disk buses» │ • «special programs» required to burn disks │ ◦ alternative: packet-writing drivers

## Networking and Wireless

The last category of devices that we will discuss in this lecture are network interface cards. Please note that this is only an overview of network hardware that can be attached to a general-purpose computer – networking in general will be discussed in the next lecture. │ Networking │ │ • networks allow «multiple computers» to exchange «data» │ ◦ this could be files, streams or messages │ • there are «wired» and «wireless» networks │ • we will only deal with the «lowest layers» for now │ • NIC = Network Interface Card Network hardware allows computers to directly communicate with each other, using some sort of interconnect, either wired or wireless. A computer connects to the network using a «network interface card», typically a PCIe device with an external connector (e.g. RJ-45 for metallic ethernet), or an antenna (for wireless tech). A computer network as a whole resembles a bus of the kind we have discussed in the first part of the lecture, though with some crucial differences. │ Ethernet │ │ • specifies the «physical» medium │ • «on-wire» format and «collision» resolution │ • in modern setups, mostly «point-to-point» links │ ◦ using active «packet switching» devices │ • transmits data in «frames» (low-level packets) Like with system buses, networks have evolved away from shared media (token ring, coaxial 10Mbit ethernet, twisted-pair ethernet with passive hubs). Modern networks use dedicated point-to-point links, with packet-switching hardware at hubs where a number of point-to-point links meet. Ethernet ‘packets’ are called frames and are transmitted as a single unit. Each frame has some metadata (sender, recipient, size) and of course carries some data (payload). │ Addressing │ │ • at this level, only «local» addressing │ ◦ at most a single LAN segment │ • uses baked-in MAC addresses │ ◦ MAC = Media Access Control │ • addresses belong to «interfaces», not computers Lowest-level addressing only works within a single ethernet segment (broadcast domain). All computers know the MAC addresses of all other computers that they wish to talk to (or rather of their network interface cards). In old shared-medium networks, the frame would be transmitted on the shared medium and picked up by the intended recipient based on the target address. In a packet-switched network, the switch will keep a mapping of MAC addresses to physical ports, and only retransmit frames on the port to which the intended recipient is attached.
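For reference, the frame header layout can be written down as a C structure (a sketch; the preamble and the trailing frame check sequence are handled by the hardware and never appear in memory):

```c
#include <stdint.h>

/* Layout of an ethernet frame header: 6-byte destination and source
 * MAC addresses, then a 16-bit (big-endian) type field; the payload
 * follows directly after. */
struct eth_header {
    uint8_t  dst[6];    /* destination MAC address */
    uint8_t  src[6];    /* source MAC address */
    uint16_t ethertype; /* payload type, e.g. 0x0800 = IPv4 */
} __attribute__((packed));
```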
│ Transmit Queue │ │ • «packets» are picked up from «memory» │ • the OS «prepares» packets into the transmit «queue» │ • the device picks them up «asynchronously» │ • similar to how SATA queues commands and data

When the OS wants to send packets (frames) over the network, they are appended to a «transmit queue» (also known as a Tx queue), where the hardware picks them up and transmits them over its physical connection. The queue works approximately like this:

1. each queue (there can be more than one) has a pair of «registers» accessible through MMIO, one for the «head pointer» and another for the «tail pointer»,
2. the pointers hold addresses into a «ring buffer» of a fixed size, stored in the main memory and accessed through DMA; each item in the ring is, again, a pointer, along with a size, and describes a memory buffer holding a single frame (packet),
3. the head and tail pointer split the ring into two parts, one that belongs to the NIC and one that belongs to the software,
4. the operating system (via the driver) controls the «tail pointer» in the device register:
   a. to send a packet, it will create a buffer and store the packet data in that buffer,
   b. it will fill in the first cell in the OS-controlled part of the ring with the address and size of this buffer,
   c. it will shift the tail pointer, handing over the newly filled-in cell to the NIC,
5. the network card controls the «head pointer»: whenever it processes a packet, it'll shift the head pointer so that the processed buffer is now in the OS-controlled part of the ring.

As outlined in the first part of the lecture, events related to the transmit ring can be signalled via interrupts. (A code sketch of this ring discipline follows the discussion of multi-queue adapters below.)

│ Receive Queue │ │ • data is also «queued» in the other direction │ • the NIC copies packets into a «receive queue» │ • it invokes an «interrupt» to tell the OS about new items │ ◦ the NIC may batch multiple packets per interrupt │ • if the queue is not cleared quickly → «packet loss»

The receive (Rx) queue works analogously. Interrupts signal newly appended items. The OS is in charge of allocating buffers for packets: handing off a buffer to the NIC on the Rx queue means that the NIC is free to overwrite the buffer with packet data. After it does so, the Rx ring cell is handed back to the OS. In the common case, all frame (packet) buffers must be large enough to hold the biggest possible frame (whose size is given by the MTU = Maximum Transmission Unit), though at least some NICs can split incoming packets over multiple Rx cells if they don't fit in a single buffer. If an Rx ring fills up while packets continue to arrive on the interface, packets will be lost (hence the OS must clear the Rx ring sufficiently quickly). The packets don't need to be processed immediately: the OS is free to allocate new buffers and put those on the ring, instead of re-using the existing buffers. The filled buffers can then be processed and reclaimed later.

│ Multi-Queue Adapters │ │ • fast adapters can «saturate» a CPU │ ◦ e.g. 10GbE cards, or multi-port GbE │ • these NICs can manage «multiple» Rx and Tx queues │ ◦ each queue gets its own interrupt │ ◦ different queues → possibly different «CPU cores»

Contemporary network adapters can send and receive packets so quickly that a single CPU core cannot keep up (since there is typically a lot of work to be done for each packet as it bubbles up through the network stack and into user space). Those same adapters can be configured to use multiple Tx and Rx queues (rings), each with their own head/tail registers and interrupt.
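As promised, a sketch of the Tx side of the descriptor ring, under assumed register and descriptor layouts (real NICs differ in the details, and a real driver would additionally need memory barriers, buffer-lifetime management and locking):

```c
#include <stdint.h>

/* One slot of a hypothetical Tx descriptor ring: the DMA address of a
 * buffer holding one frame, plus the frame's length. */
struct tx_desc {
    uint64_t buf_addr;           /* physical address of the frame buffer */
    uint16_t len;                /* frame length in bytes */
};

#define RING_SIZE 256            /* must match what the NIC was told */

struct tx_ring {
    struct tx_desc desc[RING_SIZE];  /* the ring itself, in main memory */
    volatile uint32_t *head_reg;     /* MMIO register, advanced by the NIC */
    volatile uint32_t *tail_reg;     /* MMIO register, advanced by the driver */
    uint32_t tail;                   /* driver's cached copy of the tail */
};

/* Hand one frame over to the NIC: fill the first free cell, then
 * publish it by advancing the tail register (step 4 above). */
static int tx_enqueue(struct tx_ring *r, uint64_t frame_pa, uint16_t len)
{
    uint32_t next = (r->tail + 1) % RING_SIZE;
    if (next == *r->head_reg)    /* ring full: the NIC hasn't caught up */
        return -1;
    r->desc[r->tail].buf_addr = frame_pa;
    r->desc[r->tail].len = len;
    /* a real driver would put a write barrier here, so the descriptor
     * is visible in memory before the tail update reaches the device */
    r->tail = next;
    *r->tail_reg = next;         /* the MMIO write hands the cell over */
    return 0;
}
```

The NIC, for its part, walks the ring from the head, DMAs each buffer out onto the wire and advances the head register (step 5), eventually raising an interrupt so that the driver can reclaim the transmitted buffers.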
Configuring these multiple queues is up to the OS – a typical setup would use a single Tx/Rx pair per CPU core. For transmission, the NIC simply interleaves packets from all the queues, since the OS decides which queue to use for sending a particular packet. It'll typically just use the one associated with the CPU core performing the operation. For reception, the story is slightly more complicated, since the NIC has to decide which queue to use. The NIC can be configured to filter or hash (parts of) incoming packets and select an Rx queue based on the result. The goal is to keep related packets in the same queue (which improves locality), but also to keep all queues busy (which improves load balancing).

│ Checksum and TCP Offloading │ │ • more advanced adapters can «offload» certain features │ • e.g. computation of mandatory packet «checksums» │ • but also TCP-related features │ • needs both «driver» support and «TCP/IP stack» support

To speed up packet processing, some per-packet tasks can be performed in hardware. Computing and verifying checksums is the most commonly offloaded task: packet headers often contain a checksum to detect data corruption. Those checksums can usually be computed in hardware very quickly, and it's a waste of CPU cycles to do it in software. Hence, when a packet is stored in the Tx queue, the checksum fields are left blank and the hardware fills them in before transmitting the packet (this applies to higher-level protocol checksums, e.g. TCP; ethernet frame checksums are always computed in hardware; a software sketch of such a checksum appears after the review questions). While by far the simplest, checksum offloading is not the only task that can be done in hardware; some others include:

• cryptography (IPsec) offloading: authentication headers, payload encryption and decryption,
• large send offload and receive segment coalescing: segmentation and reassembly of large TCP packets (i.e. those that do not fit into the link MTU),
• UDP segmentation (splitting up UDP packets which do not fit into the MTU of the NIC).

│ WiFi │ │ • «wireless» network interface – ‘wireless ethernet’ │ • «shared» medium – electromagnetic waves in air │ • (almost) mandatory «encryption» │ ◦ otherwise easy to «eavesdrop» or even actively «attack» │ • a very «complex» protocol (relative to hardware standards) │ ◦ assisted by «firmware» running on the adapter

Compared to the relative simplicity of wired networks, WiFi is extremely complicated, due to the nature of its medium, which is shared, noisy, easily eavesdropped and generally unreliable. Devices which connect to WiFi networks are often portable and need to maintain connectivity as they move about, switching between access points or even networks. Due to pervasive encryption, clients and access points need to authenticate each other and establish session keys. Authentication is required because otherwise an active attacker could trick a client into connecting to their device and become a ‘man in the middle’, rendering the encryption ineffective. Since authentication is required anyway, it often doubles as an access control measure. Aspects of WiFi-related protocols are implemented in hardware, firmware (software running on the adapter) and software (running on the main CPU).

│ Review Questions │ │ 25. What is memory-mapped IO and DMA? │ 26. What is a system bus? │ 27. What is a graphics accelerator? │ 28. What is a NIC receive queue?
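Finally, the checksum sketch promised in the offloading discussion above: the Internet checksum (RFC 1071) used by IPv4, TCP and UDP is a 16-bit one's-complement sum over the data, and is exactly the kind of per-packet computation that checksum offloading moves from the CPU into the NIC.

```c
#include <stddef.h>
#include <stdint.h>

/* Internet checksum (RFC 1071): sum the data as big-endian 16-bit
 * words in one's-complement arithmetic, then return the complement. */
static uint16_t inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {            /* sum whole 16-bit words */
        sum += (uint32_t)p[0] << 8 | p[1];
        p += 2;
        len -= 2;
    }
    if (len)                     /* an odd trailing byte is zero-padded */
        sum += (uint32_t)p[0] << 8;
    while (sum >> 16)            /* fold the carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

With offloading enabled, the driver leaves the checksum field blank (and tells the NIC where it is); the hardware then performs the equivalent of this loop at line rate.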