# File Systems File systems are an integral part of general-purpose operating systems. Besides storing user data, the programs and other components that make up an operating system also reside in the file system. In this lecture, we will look at some of the inner workings of a typical POSIX-compatible file system layer. │ Lecture Overview │ │ 1. Filesystem Basics │ 2. The Block Layer │ 3. Virtual Filesystem Switch │ 4. The UNIX Filesystem │ 5. Advanced Features We will first revisit some of the basic concepts that have already been introduced in previous lectures. We will then look at the «block layer», which exists below the file system and provides a simpler data storage abstraction. Next, we will have a look at «VFS», a system that kernels typically use to allow multiple implementations of the on-disk structures while providing a uniform system call interface on the outside. Afterwards, we will look at both the semantics and the on-disk organization of a typical UNIX file system. Finally, we will take a quick look at some of the more advanced features present in modern file systems (and below, in the block layer). ## Filesystem Basics In this section, we will consider the basic elements of a file system, as understood by POSIX – some of the material will also repeat concepts that we have seen earlier. │ What is a File System? │ │ • a collection of «files» and «directories» │ • (mostly) «hierarchical» │ • usually exposed to the user │ • usually «persistent» (across reboots) │ • file managers, command line, etc. The first concern is the definition of a file system as a whole. The most abstract view is that a file system is a collection of «files» and «directories», where files are essentially byte sequences and directories give «names» to files and other directories. Since directories can contain other directories, the whole structure forms a tree (with some exceptions). File systems are usually user-visible: the user can directly access the directory hierarchy and explore both directory and file content, using tools provided by the operating system. Another typical property of a file system is that it is «persistent»: rebooting the computer will not erase or alter the data stored in the file system. (Note that both these properties have exceptions: file systems can be stored on virtual devices that use RAM for storage, and those are not persistent; likewise, some operating systems do not allow the user to directly access the file system). │ What is a (Regular) File? │ │ • a sequence of «bytes» │ • and some basic «metadata» │ ◦ owner, group, timestamp │ • the OS does «not» care about the content │ ◦ text, images, video, source code are all the same │ ◦ executables are somewhat special Since we have defined a file system as a collection of files and directories, we will need to also give a definition of a «file». Again, we can give a fairly abstract definition that works: a file is an object that consists of a sequence of bytes (the data) and some additional «metadata» – information about the file. The content (i.e. the byte sequence) is typically not interpreted by the operating system (apart from special cases, like executable files). The «metadata» contains things like the identifier of the «owner» of the file, a few timestamps (e.g. the time of last modification) and access permissions. Please note that the file metadata does «not» include a name: the file name is not a property of the file itself in POSIX systems. │ What is a Directory?
│ │ • a list of «name» → file «mappings» │ • an associative container if you will │ ◦ semantically, the value types are not homogeneous │ ◦ syntactically, they are just «i-nodes» │ • one directory = one component of a path │ ◦ ‹/usr/«local»/bin› The last part of the definition of a file system that we did not explain yet is a «directory». Directories are maps (associative arrays, dictionaries), where the keys are «names» and the values are «files» and other directories. In fact, both files and directories are represented by the same data structure in the file system: an «i-node». We also remember that directories form a tree, and we use «paths» to identify individual files and directories in the file system. Each component (a section delimited by ‹/› on either side) of such a path is used as a key in a single directory. │ What is an i-node? │ │ • an «anonymous», file-like object │ • could be a «regular file» │ ◦ or a directory │ ◦ or a special file │ ◦ or a symlink The «i-node» is how files and directories (and any other file system objects) are represented by the POSIX file system. The «i-node» stores references to data blocks (more on that later) and the metadata of the file. │ Files are Anonymous │ │ • this is the case with UNIX │ ◦ not all file systems work like this │ • there are pros and cons to this approach │ ◦ e.g. open files can be «unlinked» │ • names are assigned via «directory entries» As we have already mentioned, files do not carry names – i.e. the «i-node» does not have a field which would store the name of the object. Instead, a name is given to the file by a «directory entry», which ties a string (anything goes as long as there are no ‹/› characters) to an «i-node». Among other things, this means that it is possible to «unlink» files (remove their directory entries) without erasing their data: this happens when the file is currently open. When the last process closes its last file descriptor tied to that «i-node», the data is finally erased from the disk. │ What Else is a Byte Sequence? │ │ • characters coming from a «keyboard» │ • bytes stored on a magnetic «tape» │ • audio data coming from a «microphone» │ • pixels coming from a «webcam» │ • data coming on a «TCP connection» We have previously mentioned that a «byte sequence» is a broadly applicable abstraction. There are many things that can be treated (with a bit of generosity) as byte sequences. Most of those are not persistent in the way files are, but this is not really a problem as far as programming interfaces go. │ Writing Byte Sequences │ │ • sending data to a «printer» │ • playing back «audio» │ • writing text to a «terminal» (emulator) │ • sending data over a «TCP stream» In addition to reading bytes, files (and many other things) support «writing» bytes, i.e. storing or otherwise processing a sequence of bytes that is sent (written) to the object in question. │ Special Files │ │ • many things look somewhat «like files» │ • let's exploit that and unify them with files │ • recall part 2 on APIs: “everything is a file” │ ◦ the «API» is the same for special and regular files │ ◦ «not» the implementation though The generality of the abstraction makes it possible (and desirable) to represent all such objects uniformly, using the file system API. However, besides using the same API, many of those objects literally appear in the file system as «i-nodes» with special properties.
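As a concrete (if contrived) illustration of the shared API, the following sketch reads a few bytes through the very same ‹open› and ‹read› calls, regardless of whether the path refers to a regular file or to a special file; the particular paths are only examples:

    #include <fcntl.h>    /* open */
    #include <unistd.h>   /* read, close */
    #include <stdio.h>

    /* Read a few bytes from 'path' using the same calls that work for any
     * file-like object. */
    static void read_some( const char *path )
    {
        char buf[ 8 ];
        int fd = open( path, O_RDONLY );             /* same call for both */

        if ( fd < 0 )
            return;

        ssize_t nread = read( fd, buf, sizeof buf ); /* same call for both */
        printf( "%s: got %zd bytes\n", path, nread );
        close( fd );
    }

    int main( void )
    {
        read_some( "/etc/hostname" ); /* a regular file (example path) */
        read_some( "/dev/urandom" );  /* a special (device) file */
        return 0;
    }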
Of course, when such a file is opened, reads and writes to the file descriptor are not handled by the same kernel routines that deal with regular files. More on this in a later section, on «VFS». │ File System Types │ │ • fat16, fat32, vfat, exfat (DOS, flash media) │ • ISO 9660 (CD-ROMs) │ • UDF (DVD-ROM) │ • NTFS (Windows NT) │ • HFS+ (macOS) │ • ext2, ext3, ext4 (Linux) │ • ufs, ffs (BSD) Of course, there are many different implementations of the abstract idea of a file system. While many have identical or similar semantics (in most cases those described in this lecture and codified by POSIX), this is not always the case: file systems in the ‹FAT› family, for instance, do not have the concept of an i-node and file names are an intrinsic property of the file. While the semantics are often similar, the underlying disk format can vary rather widely: while the ‹ext› family and ‹ufs› family use a fairly traditional approach, many modern file systems, such as «ZFS», «btrfs» or «hammer» are internally built on B-trees, extendible hashing or other scalable data structures, which can handle much bigger volumes with many more files than the traditional, rather naive approaches. At the same time, modern file systems provide better resilience against corruption of their data structures in the event of errors or unexpected power loss. │ Multi-User Systems │ │ • file ownership │ • file permissions │ • disk quotas Multi-user systems come with an additional set of challenges when it comes to file system implementation. It is usually desirable that files of each user are inaccessible to any other user, unless the owner of the file desires to share that particular file. Likewise, we want to be able to limit how much space each user can take up, which is usually done via disk quotas. │ Ownership & Permissions │ │ • we assume a «discretionary» model │ • whoever creates a file is its «owner» │ • ownership can be transferred │ • the owner decides about «permissions» │ ◦ basically read, write, execute For these reasons, the system must be able to track: 1. file system ownership, i.e. remember which file belongs to which user, 2. file permissions, i.e. what the owner of the file deems appropriate use of their file – whether other users (and which users) can read the content of the file, or even write new data or replace existing data in the file. Under a discretionary access model, the owner can freely decide about the permissions. There are other models, where this ability is limited (usually to improve security). │ Disk Quotas │ │ • disks are big but «not infinite» │ • bad things happen when the file system fills up │ ◦ denial of service │ ◦ programs may fail and even corrupt data │ • «quotas» limit the amount of space per user Besides limiting what can be done with files, the system must also be able to manage «free space» in the file system: since file systems are stored on some physical medium, the amount of data that can be stored is limited. Disk quotas are a mechanism by which the operating system ensures fair allocation of the space among users (or at least prevents a single user from hogging all the space in the file system). ## The Block Layer The block layer (of an operating system) takes care of low-level access to persistent storage devices, such as hard drives, solid-state drives and other similar devices. The file system itself then does not need to understand the details of the underlying device or the protocol by which to communicate with that device.
Instead, it uses a uniform API which allows the file system to store and retrieve blocks of data. Or, in other words, a «block device» is an abstraction which makes file system implementation both easier (fewer details to take care of) and more universal (does not depend on the particulars of a given storage device). │ Disk-Like Devices │ │ • disk drives provide «block-level» access │ • read and write data in 512-byte chunks │ ◦ or also 4K on big modern drives │ • a big numbered «array of blocks» A block device (i.e. a disk-like device) has the following basic operations: it can read or write a «block», which is a chunk of data of a fixed size. The specific size depends on the type of device, but the usual block sizes are either 512 bytes (in older devices) or 4096 bytes in more modern hardware. The blocks are numbered and the entire device essentially ‘looks’ like a big array of such blocks. Reading and writing blocks means, in this context, transferring the data stored in that block to or from main memory (RAM). │ Aside: Disk Addressing Schemes │ │ • CHS: Cylinder, Head, Sector │ ◦ «structured» addressing used in (very) old drives │ ◦ exposes information about relative seek times │ ◦ useless with variable-length cylinders │ ◦ 10:4:6 CHS = 1024 cylinders, 16 heads, 63 sectors │ • LBA: Logical Block Addressing │ ◦ «linear», unstructured address space │ ◦ started as 22, later 28, ... now 48 bit Old rotational drives used an addressing scheme which reflected their physical geometry. The address consisted of 3 numbers: the cylinder (distance from the center of the platter), the head (which platter or rather which side of which platter stores the sector) and the sector number (the angle at which the sector is stored). This allowed the driver and operating system to predict operation latency: reading sectors from the same cylinder would usually be fast, reading sectors from a different cylinder would require expensive head movements (known as ‘seeking’). The CHS scheme was abandoned as disk capacities increased and the address space began to run out. The replacement, known as LBA or Logical Block Addressing, uses an unstructured, flat address space (the same as main memory, but the unit of addressing is a block, not a byte). │ Block-Level Access │ │ • disk drivers only expose «linear addressing» │ • one block (sector) is the «minimum read/write size» │ • many sectors can be written ‘at once’ │ ◦ «sequential» access is faster than random │ ◦ maximum «throughput» vs «IOPS» Higher layers of an operating system (starting with the block layer and file system implementation) do not care about the interface between the disk driver and the disk drive itself and always use linear addressing. Within the block layer, transfers are often re-organized to maximize sequential writes (we will discuss this shortly), because most disk drives can read or write a contiguous sequence of blocks much faster than they can read them one at a time. The sequential access regime is important when we ask about the «throughput» of the system (how many bytes the device can supply or store per second), while the number of blocks that it can read or write when they are randomly distributed across the entire device is known as «IOPS» – input/output «operations» per second.
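To illustrate linear (LBA-style) addressing, here is a hedged sketch of how a single block might be fetched from a disk-like device using ‹pread›; the device path and the 512-byte block size are assumptions, and reading a raw disk device normally requires administrator privileges:

    #include <fcntl.h>
    #include <unistd.h>   /* pread, write, close */
    #include <stdint.h>

    enum { block_size = 512 };

    /* Fetch block number 'lba' into 'buf': with linear addressing, the block
     * number simply scales into a byte offset on the device. */
    static int read_block( int fd, uint64_t lba, void *buf )
    {
        off_t offset = ( off_t ) lba * block_size;
        return pread( fd, buf, block_size, offset ) == block_size ? 0 : -1;
    }

    int main( void )
    {
        char buf[ block_size ];
        int fd = open( "/dev/sda", O_RDONLY ); /* hypothetical device path */

        if ( fd >= 0 && read_block( fd, 0, buf ) == 0 )
            write( 1, buf, block_size );       /* dump the first sector */

        if ( fd >= 0 )
            close( fd );

        return 0;
    }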
│ Aside: Access Times │ │ • block devices are «slow» (compared to RAM) │ ◦ RAM is «slow» (compared to CPU) │ • we cannot treat drives as an extension of RAM │ ◦ not even fastest modern flash storage │ ◦ latency: HDD 3–12 ms, SSD 0.1 ms, RAM 70 ns The entire design of the storage stack, starting with hardware buses, all the way to file systems and even application-level software, most often database systems, is strongly influenced by the «slowness» of block devices. Even the latest flash-based drives are many times slower than RAM (which is many times slower than CPU caches, which are still slower than the computational units in the CPU). │ Block Access Cache │ │ • «caching» is used to hide latency │ ◦ same principle as between CPU and RAM │ • files recently accessed are kept in RAM │ ◦ many «cache management» policies exist │ • implemented entirely in the OS │ ◦ many devices implement their own caching │ ◦ but the amount of fast memory is usually limited You already know that the CPU relies on «cache memory» to «hide» the latency of RAM: data that has been recently used, or that the system suspects will be required shortly, are stored in, or prefetched into, the much faster (but much smaller) «cache». This means that for data that is most often used, the CPU does not need to wait out the entire RAM latency time. The same principle is used by operating systems to hide the latency of block devices, but in this case, the ‘fast’ and ‘small’ memory is the main memory (RAM) while the big and slow memory is the block device. Unlike the CPU cache, this so-called «disk cache» is implemented entirely in software. However, there is usually another layer (or two) of caching, which is done in hardware or in drive firmware and which uses dedicated hardware buffers (usually implemented as dynamic RAM or even flash storage) on the device side of the bus. This is distinct from the OS-level cache and we will ignore it from now on. │ Write Buffers │ │ • the «write» equivalent of the block cache │ • data is kept in RAM until it can be processed │ • must synchronise with caching │ ◦ other users may be reading the file There is one major difference between the CPU cache and the block device cache maintained by the OS: there is a persistence mismatch between the cache (RAM) and the main storage (disk). When the system powers down, the content of the RAM is lost and along with it, any modifications which were not yet replicated from the cache onto the persistent storage device. This creates a lot of complexity, since application software (and users) expect data written to disk to stay there. This is especially important (and problematic) in systems which need to be resistant to unexpected power loss, though operating system crashes can cause the same effect even if the system is hooked up to reliable power. For this reason, many operating systems use a split cache design: data to be written is stored in write buffers and flushed to disk as bandwidth and other resources permit. Of course, data stored in write buffers must be replicated into the read cache (or otherwise shared with it), since other processes may be reading the same blocks (usually via the file system) and should see the new content.
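The effect of write buffering is visible in the standard API: a successful ‹write› does not mean the data is on the disk. The sketch below (the file name is made up) shows the usual remedy, an explicit ‹fsync›, which comes up again later in the context of the VFS ‹sync› operation:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main( void )
    {
        const char msg[] = "important data\n";
        int fd = open( "durable.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644 );

        /* write() typically just copies the data into an in-kernel write
         * buffer and returns; nothing may have reached the disk yet */
        write( fd, msg, strlen( msg ) );

        /* fsync() blocks until the buffered data (and metadata) of this
         * file has been handed over to the storage device */
        fsync( fd );

        close( fd );
        return 0;
    }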
│ I/O Scheduler (Elevator) │ │ • reads and writes are requested by users │ • access ordering is crucial on a mechanical drive │ ◦ not as important on an SSD │ ◦ but sequential access is still «much preferred» │ • requests are «queued» (recall, disks are slow) │ ◦ but they are «not» processed in «FIFO» order We have already mentioned that sequential access is much faster than randomly distributed access, and that latency in rotational drives depends on the physical distance between the locations where individual pieces of data are stored. Since there is usually a lot of parallelism going on in the IO subsystems (different processes reading and writing different files at the same time), it often pays to re-order the requests for IO operations. The desired effect is that a completely random sequence of IO operations coming in from multiple processes, say 16 requests where each request is in a different physical location from the previous, is re-ordered into 4 sets of 4 operations happening in physical proximity (or even in a sequence). This reordering can easily improve overall throughput of the system (in this example situation) 3-4x, since a single batch of 4 sequential operations takes barely any extra time on top of the random-access latency incurred on the first operation in that sequence. The reordering is more easily done with buffered writes: in this case, the block layer is free to shuffle the write queue (subject to ordering constraints imposed by the file system or perhaps by the application). However, reading is more complicated: a process typically waits for each read to finish before issuing another, so the OS can only speculate about which read operations will be requested next (e.g. by reading ahead). In systems with true asynchronous IO (where the application does not block until the IO is finished), the read queue may be longer, but it is still more limited than the write queue. │ RAID │ │ • hard drives are also «unreliable» │ ◦ backups help, but take a long «time to restore» │ • RAID = Redundant Array of Inexpensive Disks │ ◦ «live»-replicate same data across multiple drives │ ◦ many different configurations │ • the system «stays online» despite disk failures While «speed» of persistent storage is a major problem, its (lack of) «reliability» is also rather important. While the problem of speed is tackled with caching, reliability is usually improved through «redundancy»: in case a component fails, other components can take over. In case of storage, that means that the data must be replicated across multiple devices, in such a way that if one of them fails, the others can still put together a complete copy of the data. RAID is a low-level realization of this principle. RAID can be implemented in hardware or in software, the latter being more common in present-day systems. Software RAID is part of the block layer of an operating system, and is usually presented to upper layers as a single virtual device. Reading and writing to this virtual device will cause the RAID subsystem of the block layer to distribute the data across multiple physical devices. Most RAID configurations can then continue operating without data loss (and without interruption) if one of the physical devices fails.
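As an aside, the extra computation involved in redundant writes can be illustrated with the XOR parity used by RAID 5-style configurations (the lecture does not single out a particular RAID level, so treat this as a sketch): the parity block of a stripe is the XOR of its data blocks, and any single lost block can be reconstructed by XOR-ing the remaining ones.

    #include <stddef.h>
    #include <stdint.h>

    enum { block_size = 4096 };

    /* Compute the parity block of one stripe: parity[i] is the XOR of the
     * i-th byte of every data block. Recovering a lost block uses the same
     * loop over the surviving blocks (parity included). */
    static void compute_parity( const uint8_t *const data[], size_t ndisks,
                                uint8_t parity[] )
    {
        for ( size_t i = 0; i < block_size; ++i )
        {
            uint8_t p = 0;

            for ( size_t d = 0; d < ndisks; ++d )
                p ^= data[ d ][ i ];

            parity[ i ] = p;
        }
    }

This also hints at why writes to such an array cost extra: updating even a single data block means the corresponding parity block has to be recomputed and rewritten as well, which is part of the write-time overhead discussed next.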
│ RAID Performance │ │ • RAID affects the performance of the block layer │ • often «improved reading» throughput │ ◦ data is recombined from multiple channels │ • write performance is more «mixed» │ ◦ may require a fair amount of computation │ ◦ «more data» needs to be written for «redundancy» In a fully functioning RAID array, read performance is usually significantly enhanced: the data is collected from multiple devices in parallel, each contributing part of the overall bandwidth. Writing is less clear-cut: depending on the RAID configuration, writes may be faster or slower than a single disk. Writing to a software RAID array also consumes additional CPU cycles, since the operating system must slice the data (and often also compute checksums). │ Block-Level Encryption │ │ • «symmetric» & length-preserving │ • encryption key is derived from a «passphrase» │ • also known as “full disk encryption” │ • incurs a small «performance penalty» │ • very important for «security» / privacy Another feature commonly present in modern operating systems is block-level «encryption» of data. This hardens the system against offline attacks: e.g. if the computer is stolen, the data remains inaccessible without the encryption key (which is usually derived from a passphrase). Disk encryption uses «symmetric crypto» (usually hardware-accelerated AES) and in most implementations, the encryption is «length-preserving»: a single block of data results in a single block of ciphertext which is then directly stored in the corresponding physical block of the persistent storage device. For most of the software stack, this type of encryption is completely transparent. Not even the file system implementation in the operating system needs to be aware that it is stored on an encrypted device. │ Storing Data in Blocks │ │ • splitting data into fixed-size chunks is unnatural │ • there is no permission system for individual blocks │ ◦ this is unlike «virtual (paged) memory» │ ◦ it'd be really inconvenient for users │ • processes are not persistent, but «block storage» is We are used to working with arbitrary-sized files, but this expectation does not map very neatly onto the block device abstraction, where data is stored in fairly big chunks. Additionally, the block layer does not offer a permission system or any kind of sharing or multiplexing: there are just blocks and a process can either read or write any of them, or none at all. While main memory is also a rather dumb array of bytes, organized into pages (which are, in fact, rather block-like), the content of this main memory is much more dynamic and much less important to the user. Part of the distinction comes down to «persistence», but other than that, it is really mostly just a historical accident. │ Filesystem as Resource Sharing │ │ • usually only 1 or few disks per computer │ • many programs want to store «persistent» data │ • file system «allocates space» for the data │ ◦ which blocks belong to which file │ • different programs can write to different files │ ◦ no risk of trying to «use the same block» Nonetheless, there needs to be «some» mechanism for sharing persistent storage among different processes (and different users). This mechanism is, of course, the «file system». Among other things, it manages allocation of free space into files, and maintains persistent identities of such files. 
│ Filesystem as Abstraction │ │ • allows the data to be «organised» into files │ • enables the user to «manage» and review data │ • files have arbitrary & «dynamic size» │ ◦ blocks are «transparently» allocated & recycled │ • «structured» data instead of a flat block array Block storage is not very convenient, and in almost all cases requires additional abstractions. We can draw an analogy with main memory: the flat array of bytes exposed by the CPU is not what programmers directly use: instead, this memory is managed by a memory allocator like ‹malloc›, which slices it up into logical objects of varying size which can be created and destroyed as needed. The file system plays the same role for persistent storage. However, since the file system is «shared» among many processes, and even directly exposed to users, it needs to be organized more rigidly and intuitively. Instead of storing numeric pointers at arbitrary locations within files (as is the case with in-memory data structures), objects within a file system are strictly separated into objects which carry pointers (structure), i.e. directories, and data-carrying objects, i.e. regular files. Since each pointer in a directory is given a name, this organizational principle gives rise to the directory hierarchy as was discussed in lecture 2. ## Virtual Filesystem Switch A typical kernel must be able to accommodate a fairly large number of different file system implementations: the native file system of the operating system, of course, but at minimum also file systems that are typically used on portable media, such as USB flash drives (‹FAT› or ‹exFAT›) or optical media (ISO 9660, UDF). It is also often desirable that non-native file systems can be mounted, such as accessing ‹NTFS› volumes on Unix systems. Finally, there are file systems which are ported to a particular operating system because of their desirable properties, such as high performance, reliability or scalability and not for compatibility reasons alone (e.g. the SGI ‹xfs› ported from IRIX to Linux, or Sun/Oracle ‹ZFS› ported from Solaris to FreeBSD). Additionally, a number of operating systems employ various virtual file systems (such as ‹proc› on many UNIX-like systems, or ‹sys› on Linux), which also present a unified interface with the other file systems. │ Virtual File System Layer │ │ • many different filesystems │ • the OS wants to treat them «all alike» │ • VFS provides an internal, «in-kernel API» │ • filesystem «syscalls» are hooked up to VFS Since all the different file systems provide essentially the same semantics, and need to be available via a single unified API to the outside world, the kernel would ideally also treat them all alike. This is the role of the virtual file system switch or virtual file system layer. │ VFS in OOP terms │ │ • VFS provides an abstract class, ‹filesystem› │ • each filesystem implementation derives ‹filesystem› │ ◦ e.g. ‹class iso9660 : public filesystem› │ • each actual file system gets an instance │ ◦ ‹/home›, ‹/usr›, ‹/mnt/usbflash› each one │ ◦ the kernel uses the abstract interface to talk to them If you have encountered object-oriented programming, this paradigm should be quite familiar: there is an «interface» which describes a file system, from the point of view of the kernel, in the abstract. Each specific file system implements this abstract interface, but from the point of view of the API, it doesn't really matter which file system it is, since they all provide a uniform API. 
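The following toy sketch (not the actual in-kernel interface, which is shown next in abridged form) captures the idea: the kernel calls through the abstract base class and never needs to know which concrete file system it is talking to.

    #include <cstdio>

    // The abstract interface the kernel programs against.
    struct filesystem
    {
        virtual int open( const char *path ) = 0;
        virtual ~filesystem() = default;
    };

    // One concrete implementation; ext4, vfat, ... would be others.
    struct iso9660 : filesystem
    {
        int open( const char *path ) override
        {
            std::printf( "iso9660: looking up %s\n", path );
            return 3; // a made-up handle
        }
    };

    // Generic code: works for any file system implementation.
    static int vfs_open( filesystem &fs, const char *path )
    {
        return fs.open( path );
    }

    int main()
    {
        iso9660 cdrom;
        return vfs_open( cdrom, "/mnt/cdrom/readme.txt" ) < 0;
    }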
Unlike the block layer, this abstraction is typically not visible from outside of the kernel. Nonetheless, it is still an important abstraction. │ The ‹filesystem› Class │ │ struct handle { /* ... */ }; /* C */ │ struct filesystem │ { │ virtual int open( const char *path ) = 0; │ virtual int read( handle file, ... ) = 0; │ /* ... */ │ }; If we consider the VFS interface as a C++ class, this is approximately what it would look like. In particular, the handle is a possibly complex data structure: the file descriptor abstraction exists between the kernel and user space. Within the kernel, it is much less useful, since it is, among other things, process-specific. When a file system implementation needs to work with a file, what it needs is a reference to the particular i-node which represents the file (or rather to its in-memory version), and perhaps an iterator into the file: ideally one that is more efficient than a mere offset, i.e. a data structure which can, without scanning through block lists in the i-node, pinpoint the specific block (or blocks) which need to be fetched from disk. │ Filesystem-Specific Operations │ │ • ‹open›: «look up» the file for access │ • ‹read›, ‹write› – self-explanatory │ • ‹seek›: move the read/write pointer │ • ‹sync›: flush data to disk │ • ‹mmap›: memory-mapped IO │ • ‹select›: IO readiness notification The VFS operations partially reflect the user-level file access API, with the above-mentioned caveats. The above list is an abridged version of the VFS interface as implemented in the Linux kernel. Since VFS is internal to the kernel, it is not standardized in any way, and different kernels approach the problem differently. │ Standard IO │ │ • the «usual» way to use files │ • open the file │ ◦ operations to read and write «bytes» │ • data has to be «buffered» in user space │ ◦ and then «copied» to/from kernel space │ • not very efficient The standard API for input and output is based around the ‹read› and ‹write› operations. Unfortunately, these are not always efficient, since the data must be copied between the file system cache and the (private) memory of the process requesting the operation. Efficiency of read-heavy workloads can be significantly improved by using memory-mapped IO instead. This is the case whether the program needs random access to the content of a file (i.e. seeks from place to place) or whether it simply wants to process a single byte at a time. │ Memory-mapped IO │ │ • uses «virtual memory» (cf. last lecture) │ • treat a file «as if» it was swap space │ • the file is «mapped» into process memory │ ◦ «page faults» indicate that data needs to be read │ ◦ «dirty» pages cause writes │ • available as the ‹mmap› system call Memory-mapped IO uses the virtual memory subsystem to reduce the amount of copying and context switching involved in IO operations. Under this scheme, file data is (at least initially) «shared» between the block access cache and the process. Modification of data can be handled in two ways: 1. modifications are «private», e.g. when the process simply needs to adjust the data as part of processing, but does not wish to write the changes into the original file, 2. the changes are «shared», i.e. the process intends to make changes to the data and write that data into the file afterwards. In the first case, the mapping is done in a copy-on-write regime: the pages from cache are marked read-only for the process and write attempts cause the operating system to make a new copy of the data in physical memory.
Future reads and writes are then redirected into this new copy. In the second case, the pages are write-through – modifications by the process affect both the cache and, eventually, also the on-disk file (when the dirty pages are flushed to disk). │ Sync-ing Data │ │ • recall that the disk is very «slow» │ • waiting for each write to hit disk is inefficient │ • but if data is held in RAM, what if «power is cut»? │ ◦ the ‹sync› operation ensures the data has hit disk │ ◦ often used in database implementations A write operation on a file is, usually, «asynchronous» with regards to the physical medium. That is, control returns from a ‹write› system call long before the data is durably stored. This means that in case of a crash or a power loss, an application may be unable to recover data that it has already written. If it is important that a write is complete before computation continues, operating systems offer a ‹sync› system call (or a variation, like ‹fsync› and ‹fdatasync›) which ensures that all outstanding writes are sent to the device. │ Filesystem-Agnostic Operations │ │ • handling «executables» │ • ‹fcntl› handling │ • special files │ • management of «file descriptors» │ • file «locks» Within the kernel, part of file-system-related functionality is independent of a particular file system implementation, and is instead implemented in other kernel modules. One example would be the code which deals with executable files: let's assume that the file system implements memory mapping of files (which itself is partially implemented in the block layer and virtual memory subsystems). When a program (stored in a regular file on some file system) is executed, the responsible module of the operating system will request a memory mapping of the file through VFS, but the remainder of the code dealing with the ‹exec› system call does not need any file-system-specific knowledge. │ Executables │ │ • «memory mapped» (like ‹mmap›) │ • may be paged in «lazily» │ • executables must be «immutable while running» │ • but can still be unlinked from the directory There are additional provisions regarding executable files. One of them is that they are often ‘paged in’ lazily: this is a concept that we will explore in more detail in lecture 5. This is, however, a compromise: it means that if a given executable is ‘running’ – that is, if a process exists which is currently executing the program stored in said executable – the operating system must disallow any writes to the file. Otherwise, the image, which may partially reside in memory and partially on disk, may be corrupted as pages that are already in memory are combined with pages that are loaded lazily from disk. However, since the «name» of the file is external to the file itself, it is not a problem for running executables to be «unlinked». When the last process which uses the executable terminates, the reference count on the i-node drops to zero and the i-node is reclaimed (including any data blocks it was using). │ File Locking │ │ • multiple programs writing the same file is bad │ ◦ operations will come in «randomly» │ ◦ the resulting file will be a mess │ • file locks fix this problem │ ◦ multiple APIs: ‹fcntl› vs ‹flock› │ ◦ differences on networked filesystems The kernel offers a mechanism for «locking» files, making it possible for multiple processes to read and write a shared file safely (i.e. without corrupting it by issuing conflicting ‹write› calls concurrently).
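A minimal sketch of whole-file locking, using the ‹flock› interface (one of the two APIs discussed next); the file name is made up and error handling is omitted:

    #include <sys/file.h>   /* flock */
    #include <fcntl.h>
    #include <unistd.h>

    int main( void )
    {
        int fd = open( "shared.log", O_WRONLY | O_CREAT | O_APPEND, 0644 );

        flock( fd, LOCK_EX );                  /* exclusive, whole-file lock */
        write( fd, "one atomic record\n", 18 );
        flock( fd, LOCK_UN );                  /* let other processes in */

        close( fd );
        return 0;
    }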
In POSIX, there are historically two separate mechanisms for locking files: the ‹flock› call, which locks the entire file at once, and the ‹fcntl› call, which can lock a range of bytes within a file. There are, unfortunately, subtle problems with these locking APIs which make them easy to misuse, related both to the ‹close› system call and to network file systems. We will not go into details, but if you intend to use file locking, it is a good idea to do some research on this. │ The ‹fcntl› Syscall │ │ • mostly operations relating to «file descriptors» │ ◦ «synchronous» vs «asynchronous» access │ ◦ blocking vs non-blocking │ ◦ close on exec: more on this in a later lecture │ • one of the several «locking» APIs This is another bit of the file system interface that does not normally need to interact with the VFS or with a specific file system implementation. In addition to exposing an API for locking files, the ‹fcntl› system call mainly interacts with the file descriptor table and various file descriptor flags: for instance, it can be used to enable synchronous IO, meaning that every ‹write› call will only return to the caller after the data has been sent to the storage device. Likewise, ‹fcntl› can be used to set blocking (default) or non-blocking mode on a file descriptor: in non-blocking mode, if a write buffer is full, the system call will not wait for space to become available: instead, it will indicate to the program that the operation cannot be completed and that it should be retried at a later point. Likewise, if there is no data available (e.g. in a pipe or in a socket), a ‹read› operation will return immediately indicating that this is the case, instead of waiting for more data to become available (as it would in blocking mode). The close-on-exec flag instructs the kernel to internally perform a ‹close› operation on this file descriptor in the event that an ‹exec› call is performed while the file descriptor is still open. │ Special Files │ │ • device nodes, pipes, sockets, ... │ • only «metadata» for special files lives on disk │ ◦ this includes «permissions» & ownership │ ◦ type and «properties» of the special file │ • they are just a different kind of «i-node» │ • ‹open›, ‹read›, ‹write›, etc. bypass the filesystem The special files that we have considered in lecture 2 are, likewise, mostly independent of the particular file system. The mechanics of reading and writing such files are not tied to any particular file system. The variants of those file-like objects which are linked into the file system hierarchy (named pipes, UNIX domain sockets, devices) are, for the purpose of the file system, simply a special type of i-node, and the file system only needs to store their metadata. When such a file is opened, the file system only obtains the metadata and hands off to a different part of the kernel. │ Mount Points │ │ • recall that there is only a «single directory tree» │ • but there are «multiple disks» and filesystems │ • file systems can be «joined» at directories │ • «root» of one becomes a «subdirectory» of another Finally, mount points are another VFS-layer feature that does not extend into individual file system implementations. In fact, arguably, the main reason for VFS to exist is that multiple different on-disk file systems can be seamlessly integrated into a single tree. The system calls which perform the joining (and un-joining, as it were) are ‹mount› and ‹unmount›, respectively.
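On Linux, the joining can be requested programmatically roughly as follows (a sketch: the device, directory and file system type are examples, the call is Linux-specific and requires appropriate privileges; in practice the ‹mount› command is used instead):

    #include <sys/mount.h>   /* mount, umount -- Linux-specific */
    #include <stdio.h>

    int main( void )
    {
        /* Attach the vfat file system stored on /dev/sdb1 so that its root
         * directory appears as /mnt/usbflash in the single directory tree. */
        if ( mount( "/dev/sdb1", "/mnt/usbflash", "vfat", MS_RDONLY, NULL ) != 0 )
        {
            perror( "mount" );
            return 1;
        }

        umount( "/mnt/usbflash" ); /* detach it again */
        return 0;
    }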
## The UNIX Filesystem In this section, we will describe, in relatively low-level terms, how the traditional UNIX file system is organized, including how things are stored on disk. We will also discuss some of the problems which are tied to organisation of data on a medium with severe penalties for non-sequential data access, and with rather inflexible read and write operations (i.e. the problems that arise because the device only provides whole-block operations). │ Superblock │ │ • holds «toplevel» information about the filesystem │ • locations of i-node tables │ • locations of i-node and «free space» bitmaps │ • «block size», filesystem size There is a considerable amount of metadata that the file system must store on disk. It is usually impractical to allocate fixed-address regions for all this metadata, hence the locations of these «metadata blocks» must be stored somewhere in the file system itself. Of course, at least some metadata must have a fixed, well-known location, to bootstrap the in-memory data structures at ‹mount› time. This fixed part of the metadata is known as the «superblock». Since it is quite essential, and changes only rarely, it is usually stored in multiple fixed locations throughout the disk, in case the primary copy is damaged. │ I-Nodes │ │ • recall that i-node is an «anonymous file» │ ◦ or a directory, or a special file │ • i-nodes only have «numbers» │ • «directories» tie names to i-nodes Until now, we have treated i-nodes as an abstract concept: they simply represent an object in the file system, be it a regular file, a directory, or some kind of special file. In traditional UNIX file systems, the i-node is also an actual, physical data structure stored on disk. Since i-nodes have a fixed size, they can be stored in large arrays and indexed using numbers: these indices form the basis of the i-node numbering system. Since i-node tables need to be consulted quite often, and the latency of random access to a traditional hard drive depends on the physical distance of the different pieces of data, the i-node array is often split into multiple chunks, each stored in a different physical location. The system then attempts to keep both references to the i-node, and data that the i-node refers to, close together. │ I-Node Allocation │ │ • often a «fixed number» of i-nodes │ • i-nodes are either «used or free» │ • free i-nodes may be stored in a «bitmap» │ • alternatives: B-trees In case i-nodes are stored in an array, this array usually has a fixed size, which means some of the entries are unused. Since creating (and erasing) files are fairly common operations, the system must be able to quickly locate an unused i-node, and also quickly mark a previously used i-node as unused. Scanning the entire table of i-nodes to find an unused one would be very slow, since only a few i-nodes fit into a single disk sector. A common method of speeding this up is through bitmaps: in addition to i-nodes themselves, the file system stores an array of bits, one bit per i-node. This way, even though the system still has to do a linear scan for a free i-node in the bitmap, it will do so at a rate of 512 or 4096 i-nodes per sector. Consider this example (the sector size is 512 bytes): a file system is organized into groups, each with around 200MiB of data, and each equipped with an array of 26000 i-nodes.
This works out to around 8KiB of data per i-node, which is the average expected size of a file in this file system (if it were less than that, the file system might run out of i-nodes before it runs out of data space). Each i-node is 128 bytes, hence 4 i-nodes are stored per sector and the entire i-node array thus takes up 6500 sectors (or 3.25MiB). The i-node bitmap, on the other hand, takes up only about 7 sectors (a little over 3 KiB); even a less compact map with one byte per i-node would need only 51 sectors (or about 25KiB). │ I-Node Content │ │ • exact content of an i-node depends on its type │ • regular file i-nodes contain a list of «data blocks» │ ◦ both direct and indirect (via a data block) │ • symbolic «links» contain the target «path» │ • special files «describe» what «device» they represent The arguably most important part of an i-node is the «data pointers» that it contains: i.e. the addresses of data blocks that hold the actual data (whether the bytes of a regular file, or the data structure which maps names to i-node numbers in case of directories). │ Attaching Data to I-Nodes │ │ • a few «direct» block addresses in the i-node │ ◦ e.g. 10 refs, 4K blocks, max. 40 kilobytes │ • «indirect» data blocks │ ◦ a block full of addresses of other blocks │ ◦ one indirect block approx. 2 MiB of data │ • «extents»: a contiguous range of blocks A traditional approach lists each data block separately. A common optimization makes the observation that most files are stored in a small number of contiguous spans of blocks. Those spans are called «extents», and more modern file systems store, instead of a simple array of block addresses, a list of extents (i.e. in addition to a block address, they also keep a number that tells the file system how many consecutive blocks are taken up by the file). This is, essentially, a form of run-length encoding. The obvious downside is that finding the address of the block which contains any given offset is now a linear operation (instead of constant-time), but in a data structure that is usually much smaller (most files will only have a single-digit number of extents, as opposed to perhaps hundreds of individual data blocks). │ Fragmentation │ │ • «internal» – not all blocks are fully used │ ◦ files are of variable size, blocks are fixed │ ◦ a 4100 byte file needs 2 blocks of 4 KiB each │ ◦ this leads to «waste» of disk «space» │ • «external» – free space is non-contiguous │ ◦ happens when many files try to grow at once │ ◦ this means new «files are also fragmented» Every time structured data is stored in an unstructured array of bytes, certain trade-offs come up. One of them has to do with storage efficiency: packing data more tightly often makes many operations slower, and the required metadata more complicated. One of the well-known problems of this type is «fragmentation», of which there are two basic types. Internal fragmentation is caused by alignment: it is much more efficient to start each file at a block boundary, and hence to allocate a whole number of blocks to each file. But because files have arbitrary sizes, there is often some unused space at the end of each file. This space is overhead – it does not store any useful data. Doing things this way, though, makes file access faster. In other words, at the end of most files, there is a small fragment of disk space that cannot be used (because it is smaller than the minimum size that can be allocated for a file, i.e. one block).
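Returning to extents for a moment, the following sketch (with made-up data structures) shows the linear offset-to-block translation mentioned above, implemented as a short scan over the file's extent list:

    #include <stddef.h>
    #include <stdint.h>

    enum { block_size = 4096 };

    /* One extent: 'length' consecutive disk blocks starting at 'start'. */
    struct extent { uint64_t start; uint32_t length; };

    /* Translate a byte offset within a file into a disk block number by
     * scanning the (usually very short) extent list; returns 0 when the
     * offset lies past the end of the last extent. */
    static uint64_t block_for_offset( const struct extent *list, size_t count,
                                      uint64_t offset )
    {
        uint64_t block = offset / block_size;    /* logical block index */

        for ( size_t i = 0; i < count; ++i )
        {
            if ( block < list[ i ].length )
                return list[ i ].start + block;  /* it is in this extent */
            block -= list[ i ].length;           /* skip this extent */
        }

        return 0;
    }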
│ External Fragmentation Problems │ │ • «performance»: can't use fast sequential IO │ ◦ programs often read files sequentially │ ◦ fragmentation → random IO on the device │ • metadata size: can't use long extents External fragmentation, on the other hand, does not directly waste space. In this case, as files are created and destroyed, the free space is eventually distributed across the entire file system, instead of being concentrated in a single contiguous chunk. When creating new files, finding space for them is more work, since it has to be ‘glued’ from many smaller fragments. Metadata grows bigger, since average extent length goes down and hence files need more extents. Finally, and perhaps most importantly, when free space is fragmented, so are the files that eventually make use of it. Access to these files is then less efficient, because every time there is a discontinuity in the data, the system experiences additional latency, due to non-sequential access to the underlying hard drive. │ Directories │ │ • uses «data blocks» (like regular files) │ • but the blocks hold «name → i-node maps» │ • modern file systems use «hashes» or «trees» │ • the format of directory data is «filesystem-specific» Like regular files, directories use data blocks to store their content. However, while the content of regular files is completely arbitrary (i.e. user- or application-defined), directories are interpreted by the file system itself. The simplest format one could adopt to store directories is to simply concatenate all the (null-terminated name, i-node number) tuples after each other. Of course, this would be very inefficient. Most file systems use a more sophisticated data structure (an on-disk hash table, or a balanced tree). │ File Name Lookup │ │ • we often need to find a file based on a path │ • each component means a directory search │ • directories can have many thousands of entries One of the most common operations in a file system is that of «file name lookup». This operation is performed many times for each path that needs to be resolved (once for each component of the path). It is therefore very important that this can be done quickly. While it is important that the on-disk format allows for fast lookups, even the fastest on-disk structure would be too slow: almost all operating systems employ an in-memory cache to speed up ‘common’ lookups (i.e. they keep a cache of directories or directory entries which have been recently used in lookups). │ Old-Style Directories │ │ • unsorted sequential list of entries │ • new entries are simply appended at the end │ • unlinking can create holes │ • lookup in large directories is very inefficient As we have mentioned earlier, the simplest (but very inefficient) method of storing directories is to simply keep a linear list of directory entries (i.e. tuples which map a name to an i-node number). This causes a whole lot of problems and no serious contemporary file system uses this approach. │ Hash-Based Directories │ │ • only need one block read on «average» │ • often the most efficient option │ • «extendible» hashing │ ◦ directories can grow over time │ ◦ gradually allocates more blocks A common alternative is to use hash tables, which usually make name lookup very fast: the expected outcome is that, based on the hash of the name, the first sector that is fetched from the disk will contain the requested name/i-node pair, even for moderately large directories.
Of course, there are downsides, too: traversing the directory in, say, alphabetical order, will cause a random-access pattern in the underlying disk IO, since the directory is sorted on the hash, which is essentially random. Additionally, pathological directories can cause very bad behaviour, in case all (or most) directory entries hash to the same bucket. │ Tree-Based Directories │ │ • self-balancing search trees │ • optimised for block-level access │ • «B trees», B+ trees, B* trees │ • logarithmic number of reads │ ◦ this is worst case, unlike hashing Another fairly common strategy is to use balanced trees, which have slightly worse average performance, but much better worst-case guarantees, compared to hash tables. Additionally, the entries are stored in sorted order, making certain access patterns more efficient. │ Hard Links │ │ • «multiple names» can refer to the «same i-node» │ ◦ names are given by «directory entries» │ ◦ we call such multiple-named files «hard links» │ ◦ it's usually forbidden to hard-link directories │ • hard links cannot cross device boundaries │ ◦ i-node numbers are only unique within a filesystem An immediate consequence of how files and directories are stored in the file system is the existence of «hard links». Please note that those are «not» special entities: they are merely a name for a situation where multiple directory entries refer to the same i-node. In this case, each of the ‘hard links’ is indistinguishable from all the others, and the same file simply appears in multiple places in the directory hierarchy. Since i-nodes keep a reference count, it is usually possible to tell that the file is available under multiple different paths. Since files are only destroyed when their reference count drops to zero, removing a file from a directory (called «unlinking») may or may not cause the file to be actually erased. │ Soft Links (Symlinks) │ │ • they exist to lift the one-device limitation │ • soft links to directories are allowed │ ◦ this can cause «loops» in the filesystem │ • the soft link i-node contains a «path» │ ◦ the meaning can change when paths change │ • «dangling» link: points to a non-existent path Sometimes, it is useful to refer to a file not by i-node number, but by a path. This can be done by employing «soft links»: unlike hard links, those are actual objects, stored in the file system as a special type of i-node. When such a file is opened, the operating system will instead look up a file by the path stored in the soft link. Of course, this leads to additional problems: for instance, the target path may not exist (this is by far the most common failure mode). If it does, an i-node of the file corresponding to the path is obtained the usual way. Unlike in standard directory entries, this new i-node may reside on a different file system. │ Free Space │ │ • similar problem to i-node allocation │ ◦ but regards data blocks │ • goal: «quickly» locate data blocks to use │ ◦ also: keep data of a single file «close together» │ ◦ also: «minimise» external «fragmentation» │ • usually bitmaps or B-trees As with i-nodes, the file system needs to be able to quickly locate an empty data block, either when files and directories are created, or when existing ones are extended. The task is slightly more complex for data blocks: for i-nodes, it is sufficient to allocate a single i-node, and placement of i-nodes relative to each other is not very important (though of course it is useful to keep related i-nodes close together).
However, many files require more than a single data block, and it is rather important that those blocks reside next to each other, if possible. To make matters worse, it is not always clear how big the file is going to be, and the file system should probably reserve some additional blocks on top of those required for the initial set of writes. Armed with a size estimate, the system then needs to find free space of that size, ideally without impinging on headroom for existing files, and ideally as a single contiguous run of blocks. Of course this may not be possible, in which case it will try to split the file into multiple chunks of free space. Compared to the i-node case, the on-disk data structures are pretty much the same: either free block bitmaps, or balanced search trees. Most of the difference is in the inputs and algorithms. │ File System Consistency │ │ • what happens if «power is cut»? │ • data buffered in RAM is «lost» │ • the IO scheduler can «re-order» disk writes │ • the file system can become «corrupt» File systems, and the data structures they employ, face a somewhat unique challenge: the changes the file system makes to its metadata can be cut off at an arbitrary point, even in the middle of an operation, most often by a power loss. Even worse, the individual writes the system has issued can be re-ordered (to improve performance), which means that there isn't a sharp cut-off point. As an example, consider creation of a regular file: an i-node must be allocated (this means a write to the i-node bitmap), filled in (write to the i-node itself) and linked into a directory (write, or multiple writes, into the data structure of the directory). If operations are performed in this sequence and power is cut off, two things could happen: 1. the bitmap is updated, but the remaining operations are lost: in this case, the i-node is lost, unless a consistency check is performed and discovers that an unused i-node is marked as allocated in the bitmap, 2. the bitmap and the i-node itself are updated, but it is not linked to the directory hierarchy, giving essentially the same outcome, but in a form that is harder to detect during a consistency check. In both cases, some resources are lost, which isn't good, but isn't terrible either. However, consider the case when the writes are re-ordered, and the directory update comes first. In this case, the file system has a directory entry that points to an uninitialized i-node, making a garbage file accessible to users. Depending on the content of the uninitialized i-node, bad things could happen if the file is accessed. Fortunately, the file system can impose a partial order on the writes (i.e. guarantee that all outstanding writes are finished before it makes further changes). Using this mechanism carefully, the amount of damage to data structures can be limited, without significantly impacting performance. The last line of defence is a «dirty flag» in the superblock: when a file system is mounted, the dirty flag is written to disk, and when it is unmounted, after all outstanding changes are written, it is cleared. The file system will refuse to mount if the dirty flag is already set, enforcing a consistency check. 
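The same ordering principle shows up at the application level: the well-known ‘write a temporary file, ‹fsync› it, then ‹rename› it over the original’ pattern relies on the rename only becoming visible after the new data is safely on disk. A sketch, with made-up file names:

    #include <fcntl.h>
    #include <stdio.h>    /* rename */
    #include <unistd.h>

    int main( void )
    {
        int fd = open( "config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644 );

        write( fd, "new content\n", 12 );
        fsync( fd );  /* 1. make sure the new version is on disk ... */
        close( fd );

        /* 2. ... and only then switch the name, which is atomic; after a
         * crash, "config" holds either the old or the new content in full
         * (a truly careful program would also fsync the directory) */
        rename( "config.tmp", "config" );
        return 0;
    }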
│ Journalling │ │ • also known as an «intent log» │ • write down what was going to happen «synchronously» │ • fix the actual metadata based on the journal │ • has a «performance penalty» at run-time │ ◦ «reduces downtime» due to faster consistency checks │ ◦ may also «prevent data loss» A particular technique (that relies on imposing a partial order on writes) is the so-called intent log, perhaps better known from relational database systems. In this approach, the file system maintains a dedicated area on disk and before commencing any non-atomic data structure updates, writes the description of the operation (atomically) into the intent log. In this case, if the composite operation is later interrupted, the log captures the «intent» and can be used to undo the effects of the incomplete operation quickly. This means that in most scenarios, a full consistency check (which is usually quite expensive) is not required. ## Advanced Features This section briefly introduces some additional concepts that often appear in the context of file systems. We won't have time to delve into too many details in this course, but it is important to be aware of those features. │ What Else Can Filesystems Do? │ │ • transparent file compression │ • file encryption │ • block de-duplication │ • snapshots │ • checksums │ • redundant storage There are three rough categories of features in the above list: storage efficiency (compression, de-duplication), security (encryption) and reliability (checksums, redundant storage). The ability to make snapshots falls somewhere between efficiency and reliability, depending on the use case and implementation. │ File Compression │ │ • use one of the standard compression algorithms │ ◦ must be fairly «general-purpose» (i.e. «not» JPEG) │ ◦ and of course «lossless» │ ◦ e.g. ‹LZ77›, ‹LZW›, Huffman Coding, ... │ • quite «challenging to implement» │ ◦ the length of the file changes (unpredictably) │ ◦ efficient «random access» inside the file Of course, it's always possible to compress files at the application level, by simply using an appropriate compression program. However, the user then needs to manually decompress the file before using it and then re-compress it afterwards. This is quite inconvenient. Of course, this could be automated, and some programs can do the (de)compression automatically when they open a file, though this is neither very common, nor very efficient. An alternative, then, is for the file system to implement transparent file compression, that is, compress the data when it is stored on disk, but present user-space programs with uncompressed data when reading, and transparently compress newly written data before storing it. This is not without challenges, of course. The biggest problems stem from the fact that data is never uniformly compressible, and hence, different sections of a file will compress at different ratios. This means that seeking to a specific «uncompressed» offset will be hard to implement. Likewise, writes in the middle of a file (which normally preserve file length) will cause the file to shrink or lengthen, making the operation much more complicated. │ File Encryption │ │ • use «symmetric» encryption for individual files │ ◦ must be «transparent» to upper layers (applications) │ ◦ symmetric crypto is length-preserving │ ◦ «encrypted directories», inheritance, &c.
│ • a new set of challenges │ ◦ «key» and passphrase «management» Unlike compression, most encryption algorithms are length-preserving, making the whole affair much simpler in some sense. Nonetheless, there is an important matter of dealing with secrets which makes the code complicated in a different way. Additionally, unlike with compression, depending on the use case, the system might have to encrypt metadata as well (i.e. not just file content). This would, most importantly, cover directories: not only file names, but also the overall directory structure. Additionally, a failure to compress a file that the user intended to be compressed is not a big problem. With encryption, such a mistake would be quite fatal. Since block-level encryption is considerably simpler, and hence less likely to contain fatal flaws, most modern systems use it instead of file-system-level encryption as described here. │ Block De-duplication │ │ • sometimes the same «data block» appears «many times» │ ◦ virtual machine images are a common example │ ◦ also «containers» and so on │ • some file systems will identify those cases │ ◦ internally point many files to the «same block» │ ◦ «copy on write» to preserve illusion of separate files There are numerous use-cases where either entire files or fragments of files are stored multiple times on a given file system. In those cases, locating duplicated blocks can lead to significant space savings. In modern file systems, it is usually not a problem to make the same data block part of multiple files. Like with other copy-on-write implementations, such shared blocks must be specifically marked as such, so that any writes that would affect them can un-share them (i.e. make a private copy). De-duplication is somewhat expensive, since it is not easy to find identical blocks: a naive scanning comparison would take O(n²) time to find duplicated blocks (even though it only uses constant amount of memory). Of course, that is impractical, considering how ⟦n = 10⁹⟧ is only about 4TB of storage and ⟦n² = 10¹⁸⟧ is a «very» large number. Hash tables can make the operation essentially linear, though it requires on the order of 4GiB of RAM per 1TB of storage. Probabilistic algorithms and data structures can further reduce the constant factors on both time and memory. In most cases, de-duplication is an offline process, i.e. one that is scheduled to run at some intervals, preferably during light load, since it is too resource-intensive to be performed continuously with each write (that is, online). │ Snapshots │ │ • it is convenient to be able to «copy» entire filesystems │ ◦ but this is also «expensive» │ ◦ snapshots provide an «efficient» means for this │ • snapshot is a «frozen image» of the filesystem │ ◦ cheap, because snapshots share storage │ ◦ easier than de-duplication │ ◦ again implemented as «copy-on-write» If file system metadata is organized in a suitable data structure (usually a variation of B trees), it is possible to implement copy-on-write not just for data blocks, but also for this metadata. In that case, it is not very hard to provide efficient «snapshots». Taking a snapshot is, semantically, equivalent to making a copy of the entire file system «while it is not mounted»: that is, the copy is made atomically with regards to any concurrent writes. 
In yet other words, if a program overwrites two files, say A to A' and later B to B', and a copy is taken the usual (non-atomic) way, it could easily be the case that the copy contains files A and B' (it is an easy exercise to work out how this could happen). The other major advantage of a snapshot is that initially (i.e. when it does not differ very much from the live file system) it takes up very little space. Of course, as the live file system begins to diverge, more and more data blocks need to be copied upon modification, increasing the storage demands, in the worst case approaching the space requirements of a standard, full copy. Nonetheless, the time costs associated with this are amortized over the life of the snapshot. Finally, in case (read-only) snapshots are made regularly, there are permanent savings to be had from copy-on-write, since in those cases, at least some of the data will be shared between the snapshots themselves. │ Checksums │ │ • hardware is «unreliable» │ ◦ individual bytes or sectors may get «corrupted» │ ◦ this may happen without the hardware noticing │ • «checksums» may be stored along with metadata │ ◦ and possibly also «file content» │ ◦ this protects the integrity of the filesystem │ • beware: «not» cryptographically secure Unfortunately, storage devices are not 100% reliable, and can sometimes quietly corrupt data. That is, reading a block may yield different data than what was written to that block previously. A file system may fight this by storing checksums (e.g. CRC) along with metadata (or even along with data). This will not prevent corruption as such, but at least allows the file system to detect it early. When detected, the corrupt objects can be restored from backup: since making a backup involves reading the file system, under this scheme, such corruption should always be detected before a good copy of the affected data is lost. │ Redundant Storage │ │ • like filesystem-level RAID │ • data and metadata blocks are «replicated» │ ◦ may be between multiple local block devices │ ◦ but also across a «cluster» / many computers │ • drastically improves «fault tolerance» Finally, file systems (especially distributed file systems) may employ redundant storage for both data and metadata. Essentially this means that each data block and each metadata object is stored in multiple copies which are all kept synchronized by the file system, each copy stored on a different physical storage device. In some cases, multiple computers may be involved, improving resilience in face of failures in non-disk components (which usually knock out the entire computer). │ Review Questions │ │ 13. What is a block device? │ 14. What is an IO scheduler? │ 15. What does memory-mapped IO mean? │ 16. What is an i-node?