# File Systems File systems are an integral part of general-purpose operating systems. Besides storing user data, the programs and other components that make up an operating system also reside in the file system. In this lecture, we will look at some of the inner workings of a typical POSIX-compatible file system layer. │ Lecture Overview │ │ 1. Filesystem Basics │ 2. The Block Layer │ 3. Virtual Filesystem Switch │ 4. The UNIX Filesystem │ 5. Advanced Features We will first revisit some of the basic concepts that have already been introduced in previous lectures. We will then look at the «block layer», which exists below the file system and provides a simpler data storage abstraction. Next, we will have a look at «VFS», a system that kernels typically use to allow multiple implementations of the on-disk structures while providing a uniform system call interface on the outside. Afterwards, we will look at both the semantics and the on-disk organization of a typical UNIX file system. Finally, we will take a quick look at some of the more advanced features present in modern file systems (and below, in the block layer). ## Filesystem Basics In this section, we will consider the basic elements of a file system, as understood by POSIX – some of the material will also repeat concepts that we have seen earlier. │ What is a File System? │ │ • a collection of «files» and «directories» │ • (mostly) «hierarchical» │ • usually exposed to the user │ • usually «persistent» (across reboots) │ • file managers, command line, etc. The first concern is the definition of a file system as a whole. The most abstract view is that a file system is a collection of «files» and «directories», where files are essentially byte sequences and directories give «names» to files and other directories. Since directories can contain other directories, the whole structure forms a tree (with some exceptions). File systems are usually user-visible: the user can directly access the directory hierarchy and explore both directory and file content, using tools provided by the operating system. Another typical property of a file system is that it is «persistent»: rebooting the computer will not erase or alter the data stored in the file system. (Note that both these properties have exceptions: file systems can be stored on virtual devices that use RAM for storage, and those are not persistent; likewise, some operating systems do not allow the user to directly access the file system). │ What is a (Regular) File? │ │ • a sequence of «bytes» │ • and some basic «metadata» │ ◦ owner, group, timestamp │ • the OS does «not» care about the content │ ◦ text, images, video, source code are all the same │ ◦ executables are somewhat special Since we have defined a file system as a collection of files and directories, we will need to also give a definition of a «file». Again, we can give a fairly abstract definition that works: a file is an object that consists of a sequence of bytes (the data) and some additional «metadata» – information about the file. The content (i.e. the byte sequence) is typically not interpreted by the operating system (apart from special cases, like executable files). The «metadata» contains things like the identifier of the «owner» of the file, a few timestamps (e.g. the time of last modification) and access permissions. Please note that the file metadata does «not» include a name: the file name is not a property of the file itself in POSIX systems. │ What is a Directory?
│ │ • a list of «name» → file «mappings» │ • an associative container if you will │ ◦ semantically, the value types are not homogeneous │ ◦ syntactically, they are just «i-nodes» │ • one directory = one component of a path │ ◦ ‹/usr/«local»/bin› The last part of the definition of a file system that we did not explain yet is a «directory». Directories are maps (associative arrays, dictionaries), where the keys are «names» and the values are «files» and other directories. In fact, both files and directories are represented by the same data structure in the file system: an «i-node». We also remember that directories form a tree, and we use «paths» to identify individual files and directories in the file system. Each component (a section delimited by ‹/› on either side) of such a path is used as a key in a single directory. │ What is an i-node? │ │ • an «anonymous», file-like object │ • could be a «regular file» │ ◦ or a directory │ ◦ or a special file │ ◦ or a symlink The «i-node» is how files and directories (and any other file system objects) are represented by the POSIX file system. The «i-node» stores references to data blocks (more on that later) and the metadata of the file. │ Files are Anonymous │ │ • this is the case with UNIX │ ◦ not all file systems work like this │ • there are pros and cons to this approach │ ◦ e.g. open files can be «unlinked» │ • names are assigned via «directory entries» As we have already mentioned, files do not carry names – i.e. the «i-node» does not have a field which would store the name of the object. Instead, a name is given to the file by a «directory entry», which ties a string (anything goes as long as there are no ‹/› characters) to an «i-node». Among other things, this means that it is possible to «unlink» files (remove their directory entries) without erasing their data: this happens when the file is currently open. When the last process closes its last file descriptor tied to that «i-node», the data is finally erased from the disk. │ What Else is a Byte Sequence? │ │ • characters coming from a «keyboard» │ • bytes stored on a magnetic «tape» │ • audio data coming from a «microphone» │ • pixels coming from a «webcam» │ • data coming on a «TCP connection» We have previously mentioned that a «byte sequence» is a broadly applicable abstraction. There are many things that can be treated (with a bit of generosity) as byte sequences. Most of those are not persistent in the way files are, but this is not really a problem as far as programming interfaces go. │ Writing Byte Sequences │ │ • sending data to a «printer» │ • playing back «audio» │ • writing text to a «terminal» (emulator) │ • sending data over a «TCP stream» In addition to reading bytes, files (and many other things) support «writing» bytes, i.e. storing or otherwise processing a sequence of bytes that is sent (written) to the object in question. │ Special Files │ │ • many things look somewhat «like files» │ • let's exploit that and unify them with files │ • recall part 2 on APIs: “everything is a file” │ ◦ the «API» is the same for special and regular files │ ◦ «not» the implementation though The generality of the abstraction makes it possible (and desirable) to represent all such objects uniformly, using the file system API. However, besides using the same API, many of those objects literally appear in the file system as «i-nodes» with special properties.
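As a concrete (if contrived) illustration of the shared API, the following sketch reads a few bytes through the very same ‹open› and ‹read› calls, regardless of whether the path refers to a regular file or to a special file; the particular paths are only examples:

    #include <fcntl.h>    /* open */
    #include <unistd.h>   /* read, close */
    #include <stdio.h>

    /* Read a few bytes from 'path' using the same calls that work for any
     * file-like object. */
    static void read_some( const char *path )
    {
        char buf[ 8 ];
        int fd = open( path, O_RDONLY );             /* same call for both */

        if ( fd < 0 )
            return;

        ssize_t nread = read( fd, buf, sizeof buf ); /* same call for both */
        printf( "%s: got %zd bytes\n", path, nread );
        close( fd );
    }

    int main( void )
    {
        read_some( "/etc/hostname" ); /* a regular file (example path) */
        read_some( "/dev/urandom" );  /* a special (device) file */
        return 0;
    }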
Of course, when such a file is opened, reads and writes to the file descriptor are not handled by the same kernel routines that deal with regular files. More on this in a later section, on «VFS». │ File System Types │ │ • fat16, fat32, vfat, exfat (DOS, flash media) │ • ISO 9660 (CD-ROMs) │ • UDF (DVD-ROM) │ • NTFS (Windows NT) │ • HFS+ (macOS) │ • ext2, ext3, ext4 (Linux) │ • ufs, ffs (BSD) Of course, there are many different implementations of the abstract idea of a file system. While many have identical or similar semantics (in most cases those described in this lecture and codified by POSIX), this is not always the case: file systems in the ‹FAT› family, for instance, do not have the concept of an i-node and file names are an intrinsic property of the file. While the semantics are often similar, the underlying disk format can vary rather widely: while the ‹ext› family and ‹ufs› family use a fairly traditional approach, many modern file systems, such as «ZFS», «btrfs» or «hammer» are internally built on B-trees, extendible hashing or other scalable data structures, which can handle much bigger volumes with many more files than the traditional, rather naive approaches. At the same time, modern file systems provide better resilience against corruption of their data structures in the event of errors or unexpected power loss. │ Multi-User Systems │ │ • file ownership │ • file permissions │ • disk quotas Multi-user systems come with an additional set of challenges when it comes to file system implementation. It is usually desirable that files of each user are inaccessible to any other user, unless the owner of the file desires to share that particular file. Likewise, we want to be able to limit how much space each user can take up, which is usually done via disk quotas. │ Ownership & Permissions │ │ • we assume a «discretionary» model │ • whoever creates a file is its «owner» │ • ownership can be transferred │ • the owner decides about «permissions» │ ◦ basically read, write, execute For these reasons, the system must be able to track: 1. file system ownership, i.e. remember which file belongs to which user, 2. file permissions, i.e. what the owner of the file deems appropriate use of their file – whether other users (and which users) can read the content of the file, or even write new data or replace existing data in the file. Under a discretionary access model, the owner can freely decide about the permissions. There are other models, where this ability is limited (usually to improve security). │ Disk Quotas │ │ • disks are big but «not infinite» │ • bad things happen when the file system fills up │ ◦ denial of service │ ◦ programs may fail and even corrupt data │ • «quotas» limit the amount of space per user Besides limiting what can be done with files, the system must also be able to manage «free space» in the file system: since file systems are stored on some physical medium, the amount of data that can be stored is limited. Disk quotas are a mechanism by which the operating system ensures fair allocation of the space among users (or at least prevents a single user from hogging all the space in the file system). ## The Block Layer The block layer (of an operating system) takes care of low-level access to persistent storage devices, such as hard drives, solid-state drives and other similar devices. The file system itself then does not need to understand the details of the underlying device or the protocol by which to communicate with that device.
Instead, it uses a uniform API which allows the file system to store and retrieve blocks of data. Or, in other words, a «block device» is an abstraction which makes file system implementation both easier (fewer details to take care of) and more universal (does not depend on the particulars of a given storage device). │ Disk-Like Devices │ │ • disk drives provide «block-level» access │ • read and write data in 512-byte chunks │ ◦ or also 4K on big modern drives │ • a big numbered «array of blocks» A block device (i.e. a disk-like device) has the following basic operations: it can read or write a «block», which is a chunk of data of a fixed size. The specific size depends on the type of device, but the usual block sizes are either 512 bytes (in older devices) or 4096 bytes in more modern hardware. The blocks are numbered and the entire device essentially ‘looks’ like a big array of such blocks. Reading and writing blocks means, in this context, transferring the data stored in that block to or from main memory (RAM). │ Aside: Disk Addressing Schemes │ │ • CHS: Cylinder, Head, Sector │ ◦ «structured» addressing used in (very) old drives │ ◦ exposes information about relative seek times │ ◦ useless with variable-length cylinders │ ◦ 10:4:6 CHS = 1024 cylinders, 16 heads, 63 sectors │ • LBA: Logical Block Addressing │ ◦ «linear», unstructured address space │ ◦ started as 22, later 28, ... now 48 bit Old rotational drives used an addressing scheme which reflected their physical geometry. The address consisted of 3 numbers: the cylinder (distance from the center of the platter), the head (which platter or rather which side of which platter stores the sector) and the sector number (the angle at which the sector is stored). This allowed the driver and operating system to predict operation latency: reading sectors from the same cylinder would usually be fast, reading sectors from a different cylinder would require expensive head movements (known as ‘seeking’). The CHS scheme was abandoned as disk capacities increased and the address space began to run out. The replacement, known as LBA or Logical Block Addressing, uses an unstructured, flat address space (the same as main memory, but the unit of addressing is a block, not a byte). │ Block-Level Access │ │ • disk drivers only expose «linear addressing» │ • one block (sector) is the «minimum read/write size» │ • many sectors can be written ‘at once’ │ ◦ «sequential» access is faster than random │ ◦ maximum «throughput» vs «IOPS» Higher layers of an operating system (starting with the block layer and file system implementation) do not care about the interface between the disk driver and the disk drive itself and always use linear addressing. Within the block layer, transfers are often re-organized to maximize sequential writes (we will discuss this shortly), because most disk drives can read or write a contiguous sequence of blocks much faster than they can read them one at a time. The sequential access regime is important when we ask about the «throughput» of the system (how many bytes the device can supply or store per second), while the number of blocks that it can read or write when they are randomly distributed across the entire device is known as «IOPS» – input/output «operations» per second.
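To illustrate linear (LBA-style) addressing, here is a hedged sketch of how a single block might be fetched from a disk-like device using ‹pread›; the device path and the 512-byte block size are assumptions, and reading a raw disk device normally requires administrator privileges:

    #include <fcntl.h>
    #include <unistd.h>   /* pread, write, close */
    #include <stdint.h>

    enum { block_size = 512 };

    /* Fetch block number 'lba' into 'buf': with linear addressing, the block
     * number simply scales into a byte offset on the device. */
    static int read_block( int fd, uint64_t lba, void *buf )
    {
        off_t offset = ( off_t ) lba * block_size;
        return pread( fd, buf, block_size, offset ) == block_size ? 0 : -1;
    }

    int main( void )
    {
        char buf[ block_size ];
        int fd = open( "/dev/sda", O_RDONLY ); /* hypothetical device path */

        if ( fd >= 0 && read_block( fd, 0, buf ) == 0 )
            write( 1, buf, block_size );       /* dump the first sector */

        if ( fd >= 0 )
            close( fd );

        return 0;
    }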
│ Aside: Access Times │ │ • block devices are «slow» (compared to RAM) │ ◦ RAM is «slow» (compared to CPU) │ • we cannot treat drives as an extension of RAM │ ◦ not even fastest modern flash storage │ ◦ latency: HDD 3–12 ms, SSD 0.1 ms, RAM 70 ns The entire design of the storage stack, starting with hardware buses, all the way to file systems and even application-level software, most often database systems, is strongly influenced by the «slowness» of block devices. Even the latest flash-based drives are many times slower than RAM (which is many times slower than CPU caches, which are still slower than the computational units in the CPU). │ Block Access Cache │ │ • «caching» is used to hide latency │ ◦ same principle as between CPU and RAM │ • files recently accessed are kept in RAM │ ◦ many «cache management» policies exist │ • implemented entirely in the OS │ ◦ many devices implement their own caching │ ◦ but the amount of fast memory is usually limited You already know that the CPU relies on «cache memory» to «hide» the latency of RAM: data that has been recently used, or that the system suspects will be required shortly, are stored in, or prefetched into, the much faster (but much smaller) «cache». This means that for data that is most often used, the CPU does not need to wait out the entire RAM latency time. The same principle is used by operating systems to hide the latency of block devices, but in this case, the ‘fast’ and ‘small’ memory is the main memory (RAM) while the big and slow memory is the block device. Unlike the CPU cache, this so-called «disk cache» is implemented entirely in software. However, there is usually another layer (or two) of caching, which is done in hardware or in drive firmware and which uses dedicated hardware buffers (usually implemented as dynamic RAM or even flash storage) on the device side of the bus. This is distinct from the OS-level cache and we will ignore it from now on. │ Write Buffers │ │ • the «write» equivalent of the block cache │ • data is kept in RAM until it can be processed │ • must synchronise with caching │ ◦ other users may be reading the file There is one major difference between the CPU cache and the block device cache maintained by the OS: there is a persistence mismatch between the cache (RAM) and the main storage (disk). When the system powers down, the content of the RAM is lost and along with it, any modifications which were not yet replicated from the cache onto the persistent storage device. This creates a lot of complexity, since application software (and users) expect data written to disk to stay there. This is especially important (and problematic) in systems which need to be resistant to unexpected power loss, though operating system crashes can cause the same effect even if the system is hooked up to reliable power. For this reason, many operating systems use a split cache design: data to be written is stored in write buffers and flushed to disk as bandwidth and other resources permit. Of course, data stored in write buffers must be replicated into the read cache (or otherwise shared with it), since other processes may be reading the same blocks (usually via the file system) and should see the new content.
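The effect of write buffering is visible in the standard API: a successful ‹write› does not mean the data is on the disk. The sketch below (the file name is made up) shows the usual remedy, an explicit ‹fsync›, which comes up again later in the context of the VFS ‹sync› operation:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main( void )
    {
        const char msg[] = "important data\n";
        int fd = open( "durable.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644 );

        /* write() typically just copies the data into an in-kernel write
         * buffer and returns; nothing may have reached the disk yet */
        write( fd, msg, strlen( msg ) );

        /* fsync() blocks until the buffered data (and metadata) of this
         * file has been handed over to the storage device */
        fsync( fd );

        close( fd );
        return 0;
    }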
│ I/O Scheduler (Elevator) │ │ • reads and writes are requested by users │ • access ordering is crucial on a mechanical drive │ ◦ not as important on an SSD │ ◦ but sequential access is still «much preferred» │ • requests are «queued» (recall, disks are slow) │ ◦ but they are «not» processed in «FIFO» order We have already mentioned that sequential access is much faster than randomly distributed access, and that latency in rotational drives depends on the physical distance between the locations where individual pieces of data are stored. Since there is usually a lot of parallelism going on in the IO subsystems (different processes reading and writing different files at the same time), it often pays to re-order the requests for IO operations. The desired effect is that a completely random sequence of IO operations coming in from multiple processes, say 16 requests where each request is in a different physical location from the previous, is re-ordered into 4 sets of 4 operations happening in physical proximity (or even in a sequence). This reordering can easily improve overall throughput of the system (in this example situation) 3-4x, since a single batch of 4 sequential operations takes barely any extra time on top of the random-access latency incurred on the first operation in that sequence. The reordering is more easily done with buffered writes: in this case, the block layer is free to shuffle the write queue (subject to ordering constraints imposed by the file system or perhaps by the application). However, reading is more complicated: a process typically waits for each read to finish before issuing another, so the OS can only speculate about which read operations will be requested next (e.g. by reading ahead). In systems with true asynchronous IO (where the application does not block until the IO is finished), the read queue may be longer, but it is still more limited than the write queue. │ RAID │ │ • hard drives are also «unreliable» │ ◦ backups help, but take a long «time to restore» │ • RAID = Redundant Array of Inexpensive Disks │ ◦ «live»-replicate same data across multiple drives │ ◦ many different configurations │ • the system «stays online» despite disk failures While «speed» of persistent storage is a major problem, its (lack of) «reliability» is also rather important. While the problem of speed is tackled with caching, reliability is usually improved through «redundancy»: in case a component fails, other components can take over. In case of storage, that means that the data must be replicated across multiple devices, in such a way that if one of them fails, the others can still put together a complete copy of the data. RAID is a low-level realization of this principle. RAID can be implemented in hardware or in software, the latter being more common in present-day systems. Software RAID is part of the block layer of an operating system, and is usually presented to upper layers as a single virtual device. Reading and writing to this virtual device will cause the RAID subsystem of the block layer to distribute the data across multiple physical devices. Most RAID configurations can then continue operating without data loss (and without interruption) if one of the physical devices fails.
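As an aside, the extra computation involved in redundant writes can be illustrated with the XOR parity used by RAID 5-style configurations (the lecture does not single out a particular RAID level, so treat this as a sketch): the parity block of a stripe is the XOR of its data blocks, and any single lost block can be reconstructed by XOR-ing the remaining ones.

    #include <stddef.h>
    #include <stdint.h>

    enum { block_size = 4096 };

    /* Compute the parity block of one stripe: parity[i] is the XOR of the
     * i-th byte of every data block. Recovering a lost block uses the same
     * loop over the surviving blocks (parity included). */
    static void compute_parity( const uint8_t *const data[], size_t ndisks,
                                uint8_t parity[] )
    {
        for ( size_t i = 0; i < block_size; ++i )
        {
            uint8_t p = 0;

            for ( size_t d = 0; d < ndisks; ++d )
                p ^= data[ d ][ i ];

            parity[ i ] = p;
        }
    }

This also hints at why writes to such an array cost extra: updating even a single data block means the corresponding parity block has to be recomputed and rewritten as well, which is part of the write-time overhead discussed next.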
│ RAID Performance │ │ • RAID affects the performance of the block layer │ • often «improved reading» throughput │ ◦ data is recombined from multiple channels │ • write performance is more «mixed» │ ◦ may require a fair amount of computation │ ◦ «more data» needs to be written for «redundancy» In a fully functioning RAID array, read performance is usually significantly enhanced: the data is collected from multiple devices in parallel, each contributing part of the overall bandwidth. Writing is less clear-cut: depending on the RAID configuration, writes may be faster or slower than a single disk. Writing to a software RAID array also consumes additional CPU cycles, since the operating system must slice the data (and often also compute checksums). │ Block-Level Encryption │ │ • «symmetric» & length-preserving │ • encryption key is derived from a «passphrase» │ • also known as “full disk encryption” │ • incurs a small «performance penalty» │ • very important for «security» / privacy Another feature commonly present in modern operating systems is block-level «encryption» of data. This hardens the system against offline attacks: e.g. if the computer is stolen, the data remains inaccessible without the encryption key (which is usually derived from a passphrase). Disk encryption uses «symmetric crypto» (usually hardware-accelerated AES) and in most implementations, the encryption is «length-preserving»: a single block of data results in a single block of ciphertext which is then directly stored in the corresponding physical block of the persistent storage device. For most of the software stack, this type of encryption is completely transparent. Not even the file system implementation in the operating system needs to be aware that it is stored on an encrypted device. │ Storing Data in Blocks │ │ • splitting data into fixed-size chunks is unnatural │ • there is no permission system for individual blocks │ ◦ this is unlike «virtual (paged) memory» │ ◦ it'd be really inconvenient for users │ • processes are not persistent, but «block storage» is We are used to working with arbitrary-sized files, but this expectation does not map very neatly onto the block device abstraction, where data is stored in fairly big chunks. Additionally, the block layer does not offer a permission system or any kind of sharing or multiplexing: there are just blocks and a process can either read or write any of them, or none at all. While main memory is also a rather dumb array of bytes, organized into pages (which are, in fact, rather block-like), the content of this main memory is much more dynamic and much less important to the user. Part of the distinction comes down to «persistence», but other than that, it is really mostly just a historical accident. │ Filesystem as Resource Sharing │ │ • usually only 1 or few disks per computer │ • many programs want to store «persistent» data │ • file system «allocates space» for the data │ ◦ which blocks belong to which file │ • different programs can write to different files │ ◦ no risk of trying to «use the same block» Nonetheless, there needs to be «some» mechanism for sharing persistent storage among different processes (and different users). This mechanism is, of course, the «file system». Among other things, it manages allocation of free space into files, and maintains persistent identities of such files. 
│ Filesystem as Abstraction │ │ • allows the data to be «organised» into files │ • enables the user to «manage» and review data │ • files have arbitrary & «dynamic size» │ ◦ blocks are «transparently» allocated & recycled │ • «structured» data instead of a flat block array Block storage is not very convenient, and in almost all cases requires additional abstractions. We can draw an analogy with main memory: the flat array of bytes exposed by the CPU is not what programmers directly use: instead, this memory is managed by a memory allocator like ‹malloc›, which slices it up into logical objects of varying size which can be created and destroyed as needed. The file system plays the same role for persistent storage. However, since the file system is «shared» among many processes, and even directly exposed to users, it needs to be organized more rigidly and intuitively. Instead of storing numeric pointers at arbitrary locations within files (as is the case with in-memory data structures), objects within a file system are strictly separated into objects which carry pointers (structure), i.e. directories, and data-carrying objects, i.e. regular files. Since each pointer in a directory is given a name, this organizational principle gives rise to the directory hierarchy as was discussed in lecture 2. ## Virtual Filesystem Switch A typical kernel must be able to accommodate a fairly large number of different file system implementations: the native file system of the operating system, of course, but at minimum also file systems that are typically used on portable media, such as USB flash drives (‹FAT› or ‹exFAT›) or optical media (ISO 9660, UDF). It is also often desirable that non-native file systems can be mounted, such as accessing ‹NTFS› volumes on Unix systems. Finally, there are file systems which are ported to a particular operating system because of their desirable properties, such as high performance, reliability or scalability and not for compatibility reasons alone (e.g. the SGI ‹xfs› ported from IRIX to Linux, or Sun/Oracle ‹ZFS› ported from Solaris to FreeBSD). Additionally, a number of operating systems employ various virtual file systems (such as ‹proc› on many UNIX-like systems, or ‹sys› on Linux), which also present a unified interface with the other file systems. │ Virtual File System Layer │ │ • many different filesystems │ • the OS wants to treat them «all alike» │ • VFS provides an internal, «in-kernel API» │ • filesystem «syscalls» are hooked up to VFS Since all the different file systems provide essentially the same semantics, and need to be available via a single unified API to the outside world, the kernel would ideally also treat them all alike. This is the role of the virtual file system switch or virtual file system layer. │ VFS in OOP terms │ │ • VFS provides an abstract class, ‹filesystem› │ • each filesystem implementation derives ‹filesystem› │ ◦ e.g. ‹class iso9660 : public filesystem› │ • each actual file system gets an instance │ ◦ ‹/home›, ‹/usr›, ‹/mnt/usbflash› each one │ ◦ the kernel uses the abstract interface to talk to them If you have encountered object-oriented programming, this paradigm should be quite familiar: there is an «interface» which describes a file system, from the point of view of the kernel, in the abstract. Each specific file system implements this abstract interface, but from the point of view of the API, it doesn't really matter which file system it is, since they all provide a uniform API. 
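The following toy sketch (not the actual in-kernel interface, which is shown next in abridged form) captures the idea: the kernel calls through the abstract base class and never needs to know which concrete file system it is talking to.

    #include <cstdio>

    // The abstract interface the kernel programs against.
    struct filesystem
    {
        virtual int open( const char *path ) = 0;
        virtual ~filesystem() = default;
    };

    // One concrete implementation; ext4, vfat, ... would be others.
    struct iso9660 : filesystem
    {
        int open( const char *path ) override
        {
            std::printf( "iso9660: looking up %s\n", path );
            return 3; // a made-up handle
        }
    };

    // Generic code: works for any file system implementation.
    static int vfs_open( filesystem &fs, const char *path )
    {
        return fs.open( path );
    }

    int main()
    {
        iso9660 cdrom;
        return vfs_open( cdrom, "/mnt/cdrom/readme.txt" ) < 0;
    }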
Unlike the block layer, this abstraction is typically not visible from outside of the kernel. Nonetheless, it is still an important abstraction. │ The ‹filesystem› Class │ │ struct handle { /* ... */ }; /* C */ │ struct filesystem │ { │ virtual int open( const char *path ) = 0; │ virtual int read( handle file, ... ) = 0; │ /* ... */ │ }; If we consider the VFS interface as a C++ class, this is approximately what it would look like. In particular, the handle is a possibly complex data structure: the file descriptor abstraction exists between the kernel and user space. Within the kernel, it is much less useful, since it is, among other things, process-specific. When a file system implementation needs to work with a file, what it needs is a reference to the particular i-node which represents the file (or rather to its in-memory version), and perhaps an iterator into the file: ideally one that is more efficient than a mere offset, i.e. a data structure which can, without scanning through block lists in the i-node, pinpoint the specific block (or blocks) which need to be fetched from disk. │ Filesystem-Specific Operations │ │ • ‹open›: «look up» the file for access │ • ‹read›, ‹write› – self-explanatory │ • ‹seek›: move the read/write pointer │ • ‹sync›: flush data to disk │ • ‹mmap›: memory-mapped IO │ • ‹select›: IO readiness notification The VFS operations partially reflect the user-level file access API, with the above-mentioned caveats. The above list is an abridged version of the VFS interface as implemented in the Linux kernel. Since VFS is internal to the kernel, it is not standardized in any way, and different kernels approach the problem differently. │ Standard IO │ │ • the «usual» way to use files │ • open the file │ ◦ operations to read and write «bytes» │ • data has to be «buffered» in user space │ ◦ and then «copied» to/from kernel space │ • not very efficient The standard API for input and output is based around the ‹read› and ‹write› operations. Unfortunately, these are not always efficient, since the data must be copied between the file system cache and the (private) memory of the process requesting the operation. Efficiency of read-heavy workloads can be significantly improved by using memory-mapped IO instead. This is the case whether the program needs random access to the content of a file (i.e. seeks from place to place) or whether it simply wants to process a single byte at a time. │ Memory-mapped IO │ │ • uses «virtual memory» (cf. last lecture) │ • treat a file «as if» it was swap space │ • the file is «mapped» into process memory │ ◦ «page faults» indicate that data needs to be read │ ◦ «dirty» pages cause writes │ • available as the ‹mmap› system call Memory-mapped IO uses the virtual memory subsystem to reduce the amount of copying and context switching involved in IO operations. Under this scheme, file data is (at least initially) «shared» between the block access cache and the process. Modification of data can be handled in two ways: 1. modifications are «private», e.g. when the process simply needs to adjust the data as part of processing, but does not wish to write the changes into the original file, 2. the changes are «shared», i.e. the process intends to make changes to the data and write that data into the file afterwards. In the first case, the mapping is done in a copy-on-write regime: the pages from cache are marked read-only for the process and write attempts cause the operating system to make a new copy of the data in physical memory.
Future reads and writes are then redirected into this new copy. In the second case, the pages are write-through – modifications by the process affect both the cache and, eventually, also the on-disk file (when the dirty pages are flushed to disk). │ Sync-ing Data │ │ • recall that the disk is very «slow» │ • waiting for each write to hit disk is inefficient │ • but if data is held in RAM, what if «power is cut»? │ ◦ the ‹sync› operation ensures the data has hit disk │ ◦ often used in database implementations A write operation on a file is, usually, «asynchronous» with regards to the physical medium. That is, control returns from a ‹write› system call long before the data is durably stored. This means that in case of a crash or a power loss, an application may be unable to recover data that it has already written. If it is important that a write is complete before computation continues, operating systems offer a ‹sync› system call (or a variation, like ‹fsync› and ‹fdatasync›) which ensures that all outstanding writes are sent to the device. │ Filesystem-Agnostic Operations │ │ • handling «executables» │ • ‹fcntl› handling │ • special files │ • management of «file descriptors» │ • file «locks» Within the kernel, part of file-system-related functionality is independent of a particular file system implementation, and is instead implemented in other kernel modules. One example would be the code which deals with executable files: let's assume that the file system implements memory mapping of files (which itself is partially implemented in the block layer and virtual memory subsystems). When a program (stored in a regular file on some file system) is executed, the responsible module of the operating system will request a memory mapping of the file through VFS, but the remainder of the code dealing with the ‹exec› system call does not need any file-system-specific knowledge. │ Executables │ │ • «memory mapped» (like ‹mmap›) │ • may be paged in «lazily» │ • executables must be «immutable while running» │ • but can still be unlinked from the directory There are additional provisions regarding executable files. One of them is that they are often ‘paged in’ lazily: this is a concept that we will explore in more detail in lecture 5. This is, however, a compromise: it means that if a given executable is ‘running’ – that is, if a process exists which is currently executing the program stored in said executable – the operating system must disallow any writes to the file. Otherwise, the image, which may partially reside in memory and partially on disk, may be corrupted as pages that are already in memory are combined with pages that are loaded lazily from disk. However, since the «name» of the file is external to the file itself, it is not a problem for running executables to be «unlinked». When the last process which uses the executable terminates, the reference count on the i-node drops to zero and the i-node is reclaimed (including any data blocks it was using). │ File Locking │ │ • multiple programs writing the same file is bad │ ◦ operations will come in «randomly» │ ◦ the resulting file will be a mess │ • file locks fix this problem │ ◦ multiple APIs: ‹fcntl› vs ‹flock› │ ◦ differences on networked filesystems The kernel offers a mechanism for «locking» files, making it possible for multiple processes to read and write a shared file safely (i.e. without corrupting it by issuing conflicting ‹write› calls concurrently).
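A minimal sketch of whole-file locking, using the ‹flock› interface (one of the two APIs discussed next); the file name is made up and error handling is omitted:

    #include <sys/file.h>   /* flock */
    #include <fcntl.h>
    #include <unistd.h>

    int main( void )
    {
        int fd = open( "shared.log", O_WRONLY | O_CREAT | O_APPEND, 0644 );

        flock( fd, LOCK_EX );                  /* exclusive, whole-file lock */
        write( fd, "one atomic record\n", 18 );
        flock( fd, LOCK_UN );                  /* let other processes in */

        close( fd );
        return 0;
    }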
In POSIX, there are historically two separate mechanisms for locking files: the ‹flock› call, which locks the entire file at once, and the ‹fcntl› call, which can lock a range of bytes within a file. There are, unfortunately, subtle problems with these locking APIs which make them easy to misuse, related both to the ‹close› system call and to network file systems. We will not go into details, but if you intend to use file locking, it is a good idea to do some research on this. │ The ‹fcntl› Syscall │ │ • mostly operations relating to «file descriptors» │ ◦ «synchronous» vs «asynchronous» access │ ◦ blocking vs non-blocking │ ◦ close on exec: more on this in a later lecture │ • one of the several «locking» APIs This is another bit of the file system interface that does not normally need to interact with the VFS or with a specific file system implementation. In addition to exposing an API for locking files, the ‹fcntl› system call mainly interacts with the file descriptor table and various file descriptor flags: for instance, it can be used to enable synchronous IO, meaning that every ‹write› call will only return to the caller after the data has been sent to the storage device. Likewise, ‹fcntl› can be used to set blocking (default) or non-blocking mode on a file descriptor: in non-blocking mode, if a write buffer is full, the system call will not wait for space to become available: instead, it will indicate to the program that the operation cannot be completed and that it should be retried at a later point. Likewise, if there is no data available (e.g. in a pipe or in a socket), a ‹read› operation will return immediately indicating that this is the case, instead of waiting for more data to become available (as it would in blocking mode). The close-on-exec flag instructs the kernel to internally perform a ‹close› operation on this file descriptor in the event that an ‹exec› call is performed while the file descriptor is still open. │ Special Files │ │ • device nodes, pipes, sockets, ... │ • only «metadata» for special files lives on disk │ ◦ this includes «permissions» & ownership │ ◦ type and «properties» of the special file │ • they are just a different kind of «i-node» │ • ‹open›, ‹read›, ‹write›, etc. bypass the filesystem The special files that we have considered in lecture 2 are, likewise, mostly independent of the particular file system. The mechanics of reading and writing such files are not tied to any particular file system. The variants of those file-like objects which are linked into the file system hierarchy (named pipes, UNIX domain sockets, devices) are, for the purpose of the file system, simply a special type of i-node, and the file system only needs to store their metadata. When such a file is opened, the file system only obtains the metadata and hands off to a different part of the kernel. │ Mount Points │ │ • recall that there is only a «single directory tree» │ • but there are «multiple disks» and filesystems │ • file systems can be «joined» at directories │ • «root» of one becomes a «subdirectory» of another Finally, mount points are another VFS-layer feature that does not extend into individual file system implementations. In fact, arguably, the main reason for VFS to exist is that multiple different on-disk file systems can be seamlessly integrated into a single tree. The system calls which perform the joining (and un-joining, as it were) are ‹mount› and ‹unmount›, respectively.
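On Linux, the joining can be requested programmatically roughly as follows (a sketch: the device, directory and file system type are examples, the call is Linux-specific and requires appropriate privileges; in practice the ‹mount› command is used instead):

    #include <sys/mount.h>   /* mount, umount -- Linux-specific */
    #include <stdio.h>

    int main( void )
    {
        /* Attach the vfat file system stored on /dev/sdb1 so that its root
         * directory appears as /mnt/usbflash in the single directory tree. */
        if ( mount( "/dev/sdb1", "/mnt/usbflash", "vfat", MS_RDONLY, NULL ) != 0 )
        {
            perror( "mount" );
            return 1;
        }

        umount( "/mnt/usbflash" ); /* detach it again */
        return 0;
    }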
## The UNIX Filesystem In this section, we will describe, in relatively low-level terms, how the traditional UNIX file system is organized, including how things are stored on disk. We will also discuss some of the problems which are tied to organisation of data on a medium with severe penalties for non-sequential data access, and with rather inflexible read and write operations (i.e. the problems that arise because the device only provides whole-block operations). │ Superblock │ │ • holds «toplevel» information about the filesystem │ • locations of i-node tables │ • locations of i-node and «free space» bitmaps │ • «block size», filesystem size There is a considerable amount of metadata that the file system must store on disk. It is usually impractical to allocate fixed-address regions for all this metadata, hence the locations of these «metadata blocks» must be stored somewhere in the file system itself. Of course, at least some metadata must have a fixed, well-known location, to bootstrap the in-memory data structures at ‹mount› time. This fixed part of the metadata is known as the «superblock». Since it is quite essential, and changes only rarely, it is usually stored in multiple fixed locations throughout the disk, in case the primary copy is damaged. │ I-Nodes │ │ • recall that i-node is an «anonymous file» │ ◦ or a directory, or a special file │ • i-nodes only have «numbers» │ • «directories» tie names to i-nodes Until now, we have treated i-nodes as an abstract concept: they simply represent an object in the file system, be it a regular file, a directory, or some kind of special file. In traditional UNIX file systems, the i-node is also an actual, physical data structure stored on disk. Since i-nodes have a fixed size, they can be stored in large arrays and indexed using numbers: these indices form the basis of the i-node numbering system. Since i-node tables need to be consulted quite often, and the latency of random access to a traditional hard drive depends on the physical distance of the different pieces of data, the i-node array is often split into multiple chunks, each stored in a different physical location. The system then attempts to keep both references to the i-node, and data that the i-node refers to, close together. │ I-Node Allocation │ │ • often a «fixed number» of i-nodes │ • i-nodes are either «used or free» │ • free i-nodes may be stored in a «bitmap» │ • alternatives: B-trees In case i-nodes are stored in an array, this array usually has a fixed size, which means some of the entries are unused. Since creating (and erasing) files are fairly common operations, the system must be able to quickly locate an unused i-node, and also quickly mark a previously used i-node as unused. Scanning the entire table of i-nodes to find an unused one would be very slow, since only a few i-nodes fit into a single disk sector. A common method of speeding this up is through bitmaps: in addition to i-nodes themselves, the file system stores an array of bits, one bit per i-node. This way, even though the system still has to do a linear scan for a free i-node in the bitmap, it will do so at a rate of 512 or 4096 i-nodes per sector. Consider this example (the sector size is 512 bytes): a file system is organized into groups, each with around 200MiB of data, and each equipped with an array of 26000 i-nodes.
This works out to around 8KiB of data per i-node, which is the average expected size of a file in this file system (if it were less than that, the file system might run out of i-nodes before it runs out of data space). Each i-node is 128 bytes, hence 4 i-nodes are stored per sector and the entire i-node array thus takes up 6500 sectors (or 3.25MiB). The i-node bitmap, on the other hand, takes up only about 7 sectors (a little over 3 KiB); even a less compact map with one byte per i-node would need only 51 sectors (or about 25KiB). │ I-Node Content │ │ • exact content of an i-node depends on its type │ • regular file i-nodes contain a list of «data blocks» │ ◦ both direct and indirect (via a data block) │ • symbolic «links» contain the target «path» │ • special files «describe» what «device» they represent The arguably most important part of an i-node is the «data pointers» that it contains: i.e. the addresses of data blocks that hold the actual data (whether the bytes of a regular file, or the data structure which maps names to i-node numbers in case of directories). │ Attaching Data to I-Nodes │ │ • a few «direct» block addresses in the i-node │ ◦ e.g. 10 refs, 4K blocks, max. 40 kilobytes │ • «indirect» data blocks │ ◦ a block full of addresses of other blocks │ ◦ one indirect block approx. 2 MiB of data │ • «extents»: a contiguous range of blocks A traditional approach lists each data block separately. A common optimization makes the observation that most files are stored in a small number of contiguous spans of blocks. Those spans are called «extents», and more modern file systems store, instead of a simple array of block addresses, a list of extents (i.e. in addition to a block address, they also keep a number that tells the file system how many consecutive blocks are taken up by the file). This is, essentially, a form of run-length encoding. The obvious downside is that finding the address of the block which contains any given offset is now a linear operation (instead of constant-time), but in a data structure that is usually much smaller (most files will only have a single-digit number of extents, as opposed to perhaps hundreds of individual data blocks). │ Fragmentation │ │ • «internal» – not all blocks are fully used │ ◦ files are of variable size, blocks are fixed │ ◦ a 4100 byte file needs 2 blocks of 4 KiB each │ ◦ this leads to «waste» of disk «space» │ • «external» – free space is non-contiguous │ ◦ happens when many files try to grow at once │ ◦ this means new «files are also fragmented» Every time structured data is stored in an unstructured array of bytes, certain trade-offs come up. One of them has to do with storage efficiency: packing data more tightly often makes many operations slower, and the required metadata more complicated. One of the well-known problems of this type is «fragmentation», of which there are two basic types. Internal fragmentation is caused by alignment: it is much more efficient to start each file at a block boundary, and hence to allocate a whole number of blocks to each file. But because files have arbitrary sizes, there is often some unused space at the end of each file. This space is overhead – it does not store any useful data. Doing things this way, though, makes file access faster. In other words, at the end of most files, there is a small fragment of disk space that cannot be used (because it is smaller than the minimum size that can be allocated for a file, i.e. one block).
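Returning to extents for a moment, the following sketch (with made-up data structures) shows the linear offset-to-block translation mentioned above, implemented as a short scan over the file's extent list:

    #include <stddef.h>
    #include <stdint.h>

    enum { block_size = 4096 };

    /* One extent: 'length' consecutive disk blocks starting at 'start'. */
    struct extent { uint64_t start; uint32_t length; };

    /* Translate a byte offset within a file into a disk block number by
     * scanning the (usually very short) extent list; returns 0 when the
     * offset lies past the end of the last extent. */
    static uint64_t block_for_offset( const struct extent *list, size_t count,
                                      uint64_t offset )
    {
        uint64_t block = offset / block_size;    /* logical block index */

        for ( size_t i = 0; i < count; ++i )
        {
            if ( block < list[ i ].length )
                return list[ i ].start + block;  /* it is in this extent */
            block -= list[ i ].length;           /* skip this extent */
        }

        return 0;
    }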
│ External Fragmentation Problems │ │ • «performance»: can't use fast sequential IO │ ◦ programs often read files sequentially │ ◦ fragmentation → random IO on the device │ • metadata size: can't use long extents External fragmentation, on the other hand, does not directly waste space. In this case, as files are created and destroyed, the free space is eventually distributed across the entire file system, instead of being concentrated in a single contiguous chunk. When creating new files, finding space for them is more work, since it has to be ‘glued’ from many smaller fragments. Metadata grows bigger, since average extent length goes down and hence files need more extents. Finally, and perhaps most importantly, when free space is fragmented, so are the files that eventually make use of it. Access to these files is then less efficient, because every time there is a discontinuity in the data, the system experiences additional latency, due to non-sequential access to the underlying hard drive. │ Directories │ │ • uses «data blocks» (like regular files) │ • but the blocks hold «name → i-node maps» │ • modern file systems use «hashes» or «trees» │ • the format of directory data is «filesystem-specific» Like regular files, directories use data blocks to store their content. However, while the content of regular files is completely arbitrary (i.e. user- or application-defined), directories are interpreted by the file system itself. The simplest format one could adopt to store directories is to simply concatenate all the (null-terminated name, i-node number) tuples after each other. Of course, this would be very inefficient. Most file systems use a more sophisticated data structure (an on-disk hash table, or a balanced tree). │ File Name Lookup │ │ • we often need to find a file based on a path │ • each component means a directory search │ • directories can have many thousands of entries One of the most common operations in a file system is that of «file name lookup». This operation is performed many times for each path that needs to be resolved (once for each component of the path). It is therefore very important that this can be done quickly. While it is important that the on-disk format allows for fast lookups, even the fastest on-disk structure would be too slow: almost all operating systems employ an in-memory cache to speed up ‘common’ lookups (i.e. they keep a cache of directories or directory entries which have been recently used in lookups). │ Old-Style Directories │ │ • unsorted sequential list of entries │ • new entries are simply appended at the end │ • unlinking can create holes │ • lookup in large directories is very inefficient As we have mentioned earlier, the simplest (but very inefficient) method of storing directories is to simply keep a linear list of directory entries (i.e. tuples which map a name to an i-node number). This causes a whole lot of problems and no serious contemporary file system uses this approach. │ Hash-Based Directories │ │ • only need one block read on «average» │ • often the most efficient option │ • «extendible» hashing │ ◦ directories can grow over time │ ◦ gradually allocates more blocks A common alternative is to use hash tables, which usually make name lookup very fast: the expected outcome is that, based on the hash of the name, the first sector that is fetched from the disk will contain the requested name/i-node pair, even for moderately large directories.
Of course, there are downsides, too: traversing the directory in, say, alphabetical order, will cause a random-access pattern in the underlying disk IO, since the directory is sorted on the hash, which is essentially random. Additionally, pathological directories can cause very bad behaviour, in case all (or most) directory entries hash to the same bucket. │ Tree-Based Directories │ │ • self-balancing search trees │ • optimised for block-level access │ • «B trees», B+ trees, B* trees │ • logarithmic number of reads │ ◦ this is worst case, unlike hashing Another fairly common strategy is to use balanced trees, which have slightly worse average performance, but much better worst-case guarantees, compared to hash tables. Additionally, the entries are stored in sorted order, making certain access patterns more efficient. │ Hard Links │ │ • «multiple names» can refer to the «same i-node» │ ◦ names are given by «directory entries» │ ◦ we call such multiple-named files «hard links» │ ◦ it's usually forbidden to hard-link directories │ • hard links cannot cross device boundaries │ ◦ i-node numbers are only unique within a filesystem An immediate consequence of how files and directories are stored in the file system is the existence of «hard links». Please note that those are «not» special entities: they are merely a name for a situation where multiple directory entries refer to the same i-node. In this case, each of the ‘hard links’ is indistinguishable from all the others, and the same file simply appears in multiple places in the directory hierarchy. Since i-nodes keep a reference count, it is usually possible to tell that the file is available under multiple different paths. Since files are only destroyed when their reference count drops to zero, removing a file from a directory (called «unlinking») may or may not cause the file to be actually erased. │ Soft Links (Symlinks) │ │ • they exist to lift the one-device limitation │ • soft links to directories are allowed │ ◦ this can cause «loops» in the filesystem │ • the soft link i-node contains a «path» │ ◦ the meaning can change when paths change │ • «dangling» link: points to a non-existent path Sometimes, it is useful to refer to a file not by i-node number, but by a path. This can be done by employing «soft links»: unlike hard links, those are actual objects, stored in the file system as a special type of i-node. When such a file is opened, the operating system will instead look up a file by the path stored in the soft link. Of course, this leads to additional problems: for instance, the target path may not exist (this is by far the most common failure mode). If it does, an i-node of the file corresponding to the path is obtained the usual way. Unlike in standard directory entries, this new i-node may reside on a different file system. │ Free Space │ │ • similar problem to i-node allocation │ ◦ but regards data blocks │ • goal: «quickly» locate data blocks to use │ ◦ also: keep data of a single file «close together» │ ◦ also: «minimise» external «fragmentation» │ • usually bitmaps or B-trees As with i-nodes, the file system needs to be able to quickly locate an empty data block, either when files and directories are created, or when existing ones are extended. The task is slightly more complex for data blocks: for i-nodes, it is sufficient to allocate a single i-node, and placement of i-nodes relative to each other is not very important (though of course it is useful to keep related i-nodes close together).
However, many files require more than a single data block, and it is rather important that those blocks reside next to each other, if possible. To make matters worse, it is not always clear how big the file is going to be, and the file system should probably reserve some additional blocks on top of those required for the initial set of writes. Armed with a size estimate, the system then needs to find free space of that size, ideally without impinging on headroom for existing files, and ideally as a single contiguous run of blocks. Of course this may not be possible, in which case it will try to split the file into multiple chunks of free space. Compared to the i-node case, the on-disk data structures are pretty much the same: either free block bitmaps, or balanced search trees. Most of the difference is in the inputs and algorithms. │ File System Consistency │ │ • what happens if «power is cut»? │ • data buffered in RAM is «lost» │ • the IO scheduler can «re-order» disk writes │ • the file system can become «corrupt» File systems, and the data structures they employ, face a somewhat unique challenge: the changes the file system makes to its metadata can be cut off at an arbitrary point, even in the middle of an operation, most often by a power loss. Even worse, the individual writes the system has issued can be re-ordered (to improve performance), which means that there isn't a sharp cut-off point. As an example, consider creation of a regular file: an i-node must be allocated (this means a write to the i-node bitmap), filled in (write to the i-node itself) and linked into a directory (write, or multiple writes, into the data structure of the directory). If operations are performed in this sequence and power is cut off, two things could happen: 1. the bitmap is updated, but the remaining operations are lost: in this case, the i-node is lost, unless a consistency check is performed and discovers that an unused i-node is marked as allocated in the bitmap, 2. the bitmap and the i-node itself are updated, but it is not linked to the directory hierarchy, giving essentially the same outcome, but in a form that is harder to detect during a consistency check. In both cases, some resources are lost, which isn't good, but isn't terrible either. However, consider the case when the writes are re-ordered, and the directory update comes first. In this case, the file system has a directory entry that points to an uninitialized i-node, making a garbage file accessible to users. Depending on the content of the uninitialized i-node, bad things could happen if the file is accessed. Fortunately, the file system can impose a partial order on the writes (i.e. guarantee that all outstanding writes are finished before it makes further changes). Using this mechanism carefully, the amount of damage to data structures can be limited, without significantly impacting performance. The last line of defence is a «dirty flag» in the superblock: when a file system is mounted, the dirty flag is written to disk, and when it is unmounted, after all outstanding changes are written, it is cleared. The file system will refuse to mount if the dirty flag is already set, enforcing a consistency check. 
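The same ordering principle shows up at the application level: the well-known ‘write a temporary file, ‹fsync› it, then ‹rename› it over the original’ pattern relies on the rename only becoming visible after the new data is safely on disk. A sketch, with made-up file names:

    #include <fcntl.h>
    #include <stdio.h>    /* rename */
    #include <unistd.h>

    int main( void )
    {
        int fd = open( "config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644 );

        write( fd, "new content\n", 12 );
        fsync( fd );  /* 1. make sure the new version is on disk ... */
        close( fd );

        /* 2. ... and only then switch the name, which is atomic; after a
         * crash, "config" holds either the old or the new content in full
         * (a truly careful program would also fsync the directory) */
        rename( "config.tmp", "config" );
        return 0;
    }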
│ Journalling │ │ • also known as an «intent log» │ • write down what was going to happen «synchronously» │ • fix the actual metadata based on the journal │ • has a «performance penalty» at run-time │ ◦ «reduces downtime» due to faster consistency checks │ ◦ may also «prevent data loss» A particular technique (that relies on imposing a partial order on writes) is the so-called intent log, perhaps better known from relational database systems. In this approach, the file system maintains a dedicated area on disk and before commencing any non-atomic data structure updates, writes the description of the operation (atomically) into the intent log. In this case, if the composite operation is later interrupted, the log captures the «intent» and can be used to undo the effects of the incomplete operation quickly. This means that in most scenarios, a full consistency check (which is usually quite expensive) is not required. ## Advanced Features This section briefly introduces some additional concepts that often appear in the context of file systems. We won't have time to delve into too many details in this course, but it is important to be aware of those features. │ What Else Can Filesystems Do? │ │ • transparent file compression │ • file encryption │ • block de-duplication │ • snapshots │ • checksums │ • redundant storage There are three rough categories of features in the above list: storage efficiency (compression, de-duplication), security (encryption) and reliability (checksums, redundant storage). The ability to make snapshots falls somewhere between efficiency and reliability, depending on the use case and implementation. │ File Compression │ │ • use one of the standard compression algorithms │ ◦ must be fairly «general-purpose» (i.e. «not» JPEG) │ ◦ and of course «lossless» │ ◦ e.g. ‹LZ77›, ‹LZW›, Huffman Coding, ... │ • quite «challenging to implement» │ ◦ the length of the file changes (unpredictably) │ ◦ efficient «random access» inside the file Of course, it's always possible to compress files at the application level, by simply using an appropriate compression program. However, the user then needs to manually decompress the file before using it and then re-compress it afterwards. This is quite inconvenient. Of course, this could be automated, and some programs can do the (de)compression automatically when they open a file, though this is neither very common, nor very efficient. An alternative, then, is for the file system to implement transparent file compression, that is, compress the data when it is stored on disk, but present user-space programs with uncompressed data when reading, and transparently compress newly written data before storing it. This is not without challenges, of course. The biggest problems stem from the fact that data is never uniformly compressible, and hence, different sections of a file will compress at different ratios. This means that seeking to a specific «uncompressed» offset will be hard to implement. Likewise, writes in the middle of a file (which normally preserve file length) will cause the file to shrink or lengthen, making the operation much more complicated. │ File Encryption │ │ • use «symmetric» encryption for individual files │ ◦ must be «transparent» to upper layers (applications) │ ◦ symmetric crypto is length-preserving │ ◦ «encrypted directories», inheritance, &c.
│ • a new set of challenges │ ◦ «key» and passphrase «management» Unlike compression, most encryption algorithms are length-preserving, making the whole affair much simpler in some sense. Nonetheless, there is an important matter of dealing with secrets which makes the code complicated in a different way. Additionally, unlike with compression, depending on the use case, the system might have to encrypt metadata as well (i.e. not just file content). This would, most importantly, cover directories: not only file names, but also the overall directory structure. Additionally, a failure to compress a file that the user intended to be compressed is not a big problem. With encryption, such a mistake would be quite fatal. Since block-level encryption is considerably simpler, and hence less likely to contain fatal flaws, most modern systems use it instead of file-system-level encryption as described here. │ Block De-duplication │ │ • sometimes the same «data block» appears «many times» │ ◦ virtual machine images are a common example │ ◦ also «containers» and so on │ • some file systems will identify those cases │ ◦ internally point many files to the «same block» │ ◦ «copy on write» to preserve illusion of separate files There are numerous use-cases where either entire files or fragments of files are stored multiple times on a given file system. In those cases, locating duplicated blocks can lead to significant space savings. In modern file systems, it is usually not a problem to make the same data block part of multiple files. Like with other copy-on-write implementations, such shared blocks must be specifically marked as such, so that any writes that would affect them can un-share them (i.e. make a private copy). De-duplication is somewhat expensive, since it is not easy to find identical blocks: a naive scanning comparison would take O(n²) time to find duplicated blocks (even though it only uses constant amount of memory). Of course, that is impractical, considering how ⟦n = 10⁹⟧ is only about 4TB of storage and ⟦n² = 10¹⁸⟧ is a «very» large number. Hash tables can make the operation essentially linear, though it requires on the order of 4GiB of RAM per 1TB of storage. Probabilistic algorithms and data structures can further reduce the constant factors on both time and memory. In most cases, de-duplication is an offline process, i.e. one that is scheduled to run at some intervals, preferably during light load, since it is too resource-intensive to be performed continuously with each write (that is, online). │ Snapshots │ │ • it is convenient to be able to «copy» entire filesystems │ ◦ but this is also «expensive» │ ◦ snapshots provide an «efficient» means for this │ • snapshot is a «frozen image» of the filesystem │ ◦ cheap, because snapshots share storage │ ◦ easier than de-duplication │ ◦ again implemented as «copy-on-write» If file system metadata is organized in a suitable data structure (usually a variation of B trees), it is possible to implement copy-on-write not just for data blocks, but also for this metadata. In that case, it is not very hard to provide efficient «snapshots». Taking a snapshot is, semantically, equivalent to making a copy of the entire file system «while it is not mounted»: that is, the copy is made atomically with regards to any concurrent writes. 
In yet other words, if a program overwrites two files, say A to A' and later B to B', and a copy is taken the usual (non-atomic) way, it could easily be the case that the copy contains files A and B' (it is an easy exercise to work out how this could happen). The other major advantage of a snapshot is that initially (i.e. when it does not differ very much from the live file system) it takes up very little space. Of course, as the live file system begins to diverge, more and more data blocks need to be copied upon modification, increasing the storage demands, in the worst case approaching the space requirements of a standard, full copy. Nonetheless, the time costs associated with this are amortized over the life of the snapshot. Finally, in case (read-only) snapshots are made regularly, there are permanent savings to be had from copy-on-write, since in those cases, at least some of the data will be shared between the snapshots themselves. │ Checksums │ │ • hardware is «unreliable» │ ◦ individual bytes or sectors may get «corrupted» │ ◦ this may happen without the hardware noticing │ • «checksums» may be stored along with metadata │ ◦ and possibly also «file content» │ ◦ this protects the integrity of the filesystem │ • beware: «not» cryptographically secure Unfortunately, storage devices are not 100% reliable, and can sometimes quietly corrupt data. That is, reading a block may yield different data than what was written to that block previously. A file system may fight this by storing checksums (e.g. CRC) along with metadata (or even along with data). This will not prevent corruption as such, but at least allows the file system to detect it early. When detected, the corrupt objects can be restored from backup: since making a backup involves reading the file system, under this scheme, such corruption should always be detected before a good copy of the affected data is lost. │ Redundant Storage │ │ • like filesystem-level RAID │ • data and metadata blocks are «replicated» │ ◦ may be between multiple local block devices │ ◦ but also across a «cluster» / many computers │ • drastically improves «fault tolerance» Finally, file systems (especially distributed file systems) may employ redundant storage for both data and metadata. Essentially this means that each data block and each metadata object is stored in multiple copies which are all kept synchronized by the file system, each copy stored on a different physical storage device. In some cases, multiple computers may be involved, improving resilience in face of failures in non-disk components (which usually knock out the entire computer). │ Review Questions │ │ 13. What is a block device? │ 14. What is an IO scheduler? │ 15. What does memory-mapped IO mean? │ 16. What is an i-node?