Anatomy of Linux process management
Creation, management, scheduling, and destruction
Skill Level: Intermediate
M. Tim Jones (mtj@mtjones.com)
Independent Author
20 Dec 2008
The creation and management of user-space processes in Linux® have many
principles in common with UNIX® but also include several unique optimizations
specific to Linux. Here, review the life cycle of Linux processes and explore the
kernel internals for user process creation, memory management, scheduling, and
death.
Linux is a very dynamic system with constantly changing computing needs. The
representation of the computational needs of Linux centers around the common
abstraction of the process. Processes can be short-lived (a command executed from
the command line) or long-lived (a network service). For this reason, the general
management of processes and their scheduling is very important.
From user-space, processes are represented by process identifiers (PIDs). From the
user's perspective, a PID is a numeric value that uniquely identifies the process. A
PID doesn't change during the life of a process, but PIDs can be reused after a
process dies, so it's not always ideal to cache them.
In user-space, you can create processes in any of several ways. You can execute a
program (which results in the creation of a new process) or, within a program, you
can invoke a fork or exec system call. The fork call results in the creation of a
child process, while an exec call replaces the current process context with the new
program. I discuss each of these methods to understand how they work.
Read more by Tim Jones on developerWorks
• Tim's Anatomy of... articles
Anatomy of Linux process management
© Copyright IBM Corporation 2008. All rights reserved. Page 1 of 11
• All of Tim's articles on developerWorks
For this article, I build the description of processes by first showing the kernel
representation of processes and how they're managed in the kernel, then review the
various means by which processes are created and scheduled on one or more
processors, and finally, what happens if they die.
Process representation
Within the Linux kernel, a process is represented by a rather large structure called
task_struct. This structure contains all of the necessary data to represent the
process, along with a plethora of other data for accounting and to maintain
relationships with other processes (parents and children). A full description of the
task_struct is beyond the scope of this article, but a portion of task_struct is
shown in Listing 1. This code contains the specific elements this article explores.
Note that task_struct resides in ./linux/include/linux/sched.h.
Listing 1. A small portion of task_struct
struct task_struct {
volatile long state;
void *stack;
unsigned int flags;
int prio, static_prio;
struct list_head tasks;
struct mm_struct *mm, *active_mm;
pid_t pid;
pid_t tgid;
struct task_struct *real_parent;
char comm[TASK_COMM_LEN];
struct thread_struct thread;
struct files_struct *files;
...
};
In Listing 1, you can see several items that you'd expect, such as the state of
execution, a stack, a set of flags, the parent process, the thread of execution (of
which there can be many), and open files. I explore these later in the article but will
introduce a few here. The state variable is a set of bits that indicate the state of the
task. The most common states indicate that the process is running or in a run queue
developerWorks® ibm.com/developerWorks
Anatomy of Linux process management
Page 2 of 11 © Copyright IBM Corporation 2008. All rights reserved.
about to be running (TASK_RUNNING), sleeping (TASK_INTERRUPTIBLE), sleeping
but unable to be woken up (TASK_UNINTERRUPTIBLE), stopped (TASK_STOPPED),
or a few others. A complete list of these flags is available in
./linux/include/linux/sched.h.
The flags word defines a large number of indicators, indicating everything from
whether the process is being created (PF_STARTING) or exiting (PF_EXITING), or
even if the process is currently allocating memory (PF_MEMALLOC). The name of the
executable (excluding the path) occupies the comm (command) field.
Each process is also given a priority (called static_prio), but the actual priority of
the process is determined dynamically based on loading and other factors. The
lower the priority value, the higher its actual priority.
The tasks field provides the linked-list capability. It contains a prev pointer
(pointing to the previous task) and a next pointer (pointing to the next task).
The process's address space is represented by the mm and active_mm fields. The
mm represents the process's memory descriptors, while the active_mm is the
previous process's memory descriptors (an optimization to improve context switch
times).
Finally, the thread_struct identifies the stored state of the process. This element
depends on the particular architecture on which Linux is running, but you can see an
example of this in ./linux/include/asm-i386/processor.h. In this structure, you'll find
the storage for the process when it is switched from the executing context (hardware
registers, program counter, and so on).
Process management
Maximum processes
Although processes are dynamically allocated within Linux, certain
maximums are observed. The maximum is represented in the kernel
by a symbol called max_threads, which can be found in
./linux/kernel/fork.c). You can change this value from user-space
through the proc file system at /proc/sys/kernel/threads-max.
Now, let's explore how you manage processes within Linux. In most cases,
processes are dynamically created and represented by a dynamically allocated
task_struct. One exception is the init process itself, which always exists and is
represented by a statically allocated task_struct. You can see an example of this
in ./linux/arch/i386/kernel/init_task.c.
All processes in Linux are collected in two different ways. The first is a hash table,
which is hashed by the PID value; the second is a circular doubly linked list. The
ibm.com/developerWorks developerWorks®
Anatomy of Linux process management
© Copyright IBM Corporation 2008. All rights reserved. Page 3 of 11
circular list is ideal for iterating through the task list. As the list is circular, there's no
head or tail; but as the init_task always exists, you can use it as an anchor point
to iterate further. Let's look at an example of this to walk through the current set of
tasks.
The task list is not accessible from user-space, but you can easily solve that problem
by inserting code into the kernel in the form of a module. A very simple program is
shown in Listing 2 that iterates the task list and provides a small amount of
information about each task (name, pid, and parent name). Note here that the
module uses printk to emit the output. To view the output, you need to view the
/var/log/messages file with the cat utility (or tail -f /var/log/messages in
real time). The next_task function is a macro in sched.h that simplifies the iteration
of the task list (returns a task_struct reference of the next task).
Listing 2. Simple kernel module to emit task information (procsview.c)
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/sched.h>
int init_module( void )
{
/* Set up the anchor point */
struct task_struct *task = &init_task;
/* Walk through the task list, until we hit the
init_task again */
do {
printk( KERN_INFO "*** %s [%d] parent %s\n",
task->comm, task->pid, task->parent->comm
);
} while ( (task = next_task(task)) != &init_task );
return 0;
}
void cleanup_module( void )
{
return;
}
You can compile this module with the Makefile shown in Listing 3. When compiled,
you can insert the kernel object with insmod procsview.ko and remove it with
rmmod procsview.
Listing 3. Makefile to build the kernel module
obj-m += procsview.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
developerWorks® ibm.com/developerWorks
Anatomy of Linux process management
Page 4 of 11 © Copyright IBM Corporation 2008. All rights reserved.
default:
$(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules
After insertion, /var/log/messages displays output as shown below. You can see
here the idle task (called swapper) and the init task (pid 1).
Nov 12 22:19:51 mtj-desktop kernel: [8503.873310] ***
swapper [0] parent swapper
Nov 12 22:19:51 mtj-desktop kernel: [8503.904182] *** init
[1] parent swapper
Nov 12 22:19:51 mtj-desktop kernel: [8503.904215] ***
kthreadd [2] parent swapper
Nov 12 22:19:51 mtj-desktop kernel: [8503.904233] ***
migration/0 [3] parent kthreadd
...
Note that it's also possible to identify the currently running task. Linux maintains a
symbol called current that is the currently running process (of type
task_struct). If at the end of init_module you add the line:
printk( KERN_INFO, "Current task is %s [%d],
current->comm, current->pid );
you would see:
Nov 12 22:48:45 mtj-desktop kernel: [10233.323662] Current
task is insmod [6538]
Note that the current task is insmod, because the init_module function executes
within the context of the execution of the insmod command. The current symbol
actually refers to a function (get_current) and can be found in an arch-specific
header (for example, ./linux/include/asm-i386/current.h).
Process creation
System call functions
You've probably seen a pattern with the system calls. In many
cases, system calls are named sys_* and provide some of the
initial functionality to implement the call (such as error checking or
user-space activities). The real work is often delegated to another
function called do_*.
So, let's walk through the creation of a process from user-space. The underlying
mechanism is the same for user-space tasks and kernel tasks, as both eventually
rely on a function called do_fork to create the new process. In the case of creating
ibm.com/developerWorks developerWorks®
Anatomy of Linux process management
© Copyright IBM Corporation 2008. All rights reserved. Page 5 of 11
a kernel thread, the kernel calls a function called kernel_thread (see
./linux/arch/i386/kernel/process.c), which performs some initialization, then calls
do_fork.
A similar action occurs for user-space process creation. In user-space, a program
calls fork, which results in a system call to the kernel function called sys_fork
(see ./linux/arch/i386/kernel/process.c). The function relationships are shown
graphically in Figure 1.
Figure 1. Function hierarchy for process creation
From Figure 1, you can see that do_fork provides the basis for process creation.
You can find the do_fork function in ./linux/kernel/fork.c (along with the partner
function, copy_process).
The do_fork function begins with a call to alloc_pidmap, which allocates a new
PID. Next, do_fork checks to see whether the debugger is tracing the parent
process. If it is, the CLONE_PTRACE flag is set in the clone_flags in preparation
for forking. The do_fork function then continues with a call to copy_process,
passing the flags, stack, registers, parent process, and newly allocated PID.
The copy_process function is where the new process is created as a copy of the
parent. This function performs all actions except for starting the process, which is
handled later. The first step in copy_process is validation of the CLONE flags to
ensure that they're consistent. If they're not, an EINVAL error is returned. Next, the
developerWorks® ibm.com/developerWorks
Anatomy of Linux process management
Page 6 of 11 © Copyright IBM Corporation 2008. All rights reserved.
Linux Security Module (LSM) is consulted to see whether the current task may
create a new task. To learn more about LSMs in the context of Security-Enhanced
Linux (SELinux), check out the Resources section.
Next, the dup_task_struct function (found in ./linux/kernel/fork.c) is called, which
allocates a new task_struct and copies the current process's descriptors into it.
After a new thread stack is set up, some state information is initialized and control
returns to copy_process. Back in copy_process, some housekeeping is
performed in addition to several other limit and security checks, including a variety of
initialization on your new task_struct. A sequence of copy functions is then
invoked that copy individual aspects of the process, from copying open file
descriptors (copy_files), copying signal information (copy_sighand and
copy_signal), copying process memory (copy_mm), and finally copying the thread
(copy_thread).
The new task is then assigned to a processor, with some additional checking based
on the processors on which the process is allowed to execute (cpus_allowed).
After the priority of the new process inherits the priority of the parent, a small amount
additional housekeeping is performed, and control returns to do_fork. At this point,
your new process exists but is not yet running. The do_fork function fixes this with
a call to wake_up_new_task. This function, which you can find in
./linux/kernel/sched.c), initializes some of the scheduler housekeeping information,
places the new process in a run queue, then wakes it up for execution. Finally, upon
returning to do_fork, the PID value is returned to the caller and the process is
complete.
Process scheduling
While a process exists in Linux, it can potentially be scheduled through the Linux
scheduler. Although outside of the scope of this article, the Linux scheduler
maintains a set of lists for each priority level on which task_struct references
reside. Tasks are invoked through the schedule function (available in
./linux/kernel/sched.c), which determines the best process to run based on loading
and prior process execution history. You can learn more about the Linux version 2.6
scheduler in Resources.
Process destruction
Process destruction can be driven by several events—from normal process
termination, through a signal, or through a call to the exit function. However
process exit is driven, the process ends through a call to the kernel function
do_exit (available in ./linux/kernel/exit.c). This process is shown graphically in
Figure 2.
ibm.com/developerWorks developerWorks®
Anatomy of Linux process management
© Copyright IBM Corporation 2008. All rights reserved. Page 7 of 11
Figure 2. Function hierarchy for process destruction
The purpose behind do_exit is to remove all references to the current process
from the operating system (for all resources that are not shared). The destruction
process first indicates that the process is exiting by setting the PF_EXITING flag.
Other aspects of the kernel use this indication to avoid manipulating this process
while it's being removed. The cycle of detaching the process from the various
resources that it attained during its life is performed through a series of calls,
including exit_mm (to remove memory pages) to exit_keys (which disposes of
per-thread session and process security keys). The do_exit function performs
various accountings for the disposal of the process, then a series of notifications (for
example, to signal the parent that the child is exiting) is performed through a call to
exit_notify. Finally, the process state is changed to PF_DEAD, and the
schedule function is called to select a new process to execute. Note that if
signalling is required to the parent (or the process is being traced), the task will not
completely disappear. If no signalling is necessary, a call to release_task will
actually reclaim the memory that the process used.
Going further
Linux continues to evolve, and one area that will see further innovation and
optimization is process management. While keeping true to UNIX principles, Linux
continues to push the boundaries. New processor architectures, symmetrical
multiprocessing (SMP), and virtualization will drive new advances in this area of the
kernel. One example is the new O(1) scheduler introduced in Linux version 2.6,
which provides scalability for systems with large numbers of tasks. Another is the
updated threading model using the Native POSIX Thread Library (NPTL), which
enables efficient threading beyond the prior LinuxThreads model. You can learn
developerWorks® ibm.com/developerWorks
Anatomy of Linux process management
Page 8 of 11 © Copyright IBM Corporation 2008. All rights reserved.
more about these innovations and what's ahead in Resources.
ibm.com/developerWorks developerWorks®
Anatomy of Linux process management
© Copyright IBM Corporation 2008. All rights reserved. Page 9 of 11
Resources
Learn
• One of the most innovative aspects of the 2.6 kernel is its O(1) scheduler. It
allows Linux to scale to very large numbers of processes without the typical
overhead. You can learn more about the 2.6 kernel schedule in "Inside the
Linux Scheduler" (developerWorks, June 2006).
• For a great look at memory management in Linux, check out Mel Gorman's
Understanding the Linux Virtual Memory Manager (Prentice Hall, 2004), which
is available in PDF form. This book provides a detailed but accessible
presentation of memory management in Linux, including a chapter on process
address spaces.
• For a nice introduction to process management, see Performance Tuning for
Linux: An Introduction to Kernels (Prentice Hall, 2005). A sample chapter is
available from IBM Press.
• Linux provides an interesting approach to system calls that involves
transitioning between user-space and the kernel (separate address spaces).
You can read more about this in "Kernel command using Linux system calls"
(developerWorks, March 2007).
• In this article, you saw cases in which the kernel checked the security
capabilities of the caller. The basic interface between the kernel and the
security framework is called the Linux Security Module. To explore this module
in the context of SELinux, read "Anatomy of Security-Enhanced Linux
(SELinux)" (developerWorks, April 2008).
• The Portable Operating System Interface (POSIX) standard for threads defines
a standard application programming interface (API) for creating and managing
threads. You can find implementations for POSIX on Linux, Sun Solaris, and
even non-UNIX-based operating systems.
• The Native POSIX Thread Library is a threading implementation in the Linux
kernel for efficiently executing POSIX threads. This technology was introduced
into the 2.6 kernel, where the prior implementation was called LinuxThreads.
• Read "TASK_KILLABLE: New process state in Linux" (developerWorks,
September 2008) for an introduction to a useful alternative to the
TASK_UNINTERRUPTIBLE and TASK_INTERRUPTIBLE process states.
• Read more of Tim's articles on developerWorks.
• In the developerWorks Linux zone, find more resources for Linux developers
(including developers who are new to Linux), and scan our most popular articles
and tutorials.
• See all Linux tips and Linux tutorials on developerWorks.
developerWorks® ibm.com/developerWorks
Anatomy of Linux process management
Page 10 of 11 © Copyright IBM Corporation 2008. All rights reserved.
• Stay current with developerWorks technical events and Webcasts.
Get products and technologies
• With IBM trial software, available for download directly from developerWorks,
build your next development project on Linux.
Discuss
• Get involved in the developerWorks community through blogs, forums,
podcasts, and spaces.
About the author
M. Tim Jones
M. Tim Jones is an embedded firmware architect and the author of Artificial
Intelligence: A Systems Approach, GNU/Linux Application Programming (now in its
second edition), AI Application Programming (in its second edition), and BSD
Sockets Programming from a Multilanguage Perspective. His engineering
background ranges from the development of kernels for geosynchronous spacecraft
to embedded systems architecture and networking protocols development. Tim is a
Senior Architect for Emulex Corp. in Longmont, Colorado.
Trademarks
IBM, the IBM logo, ibm.com, DB2, developerWorks, Lotus, Rational, Tivoli, and
WebSphere are trademarks or registered trademarks of International Business
Machines Corporation in the United States, other countries, or both. These and other
IBM trademarked terms are marked on their first occurrence in this information with
the appropriate symbol (® or ™), indicating US registered or common law
trademarks owned by IBM at the time this information was published. Such
trademarks may also be registered or common law trademarks in other countries.
See the current list of IBM trademarks.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other
countries.
ibm.com/developerWorks developerWorks®
Anatomy of Linux process management
© Copyright IBM Corporation 2008. All rights reserved. Page 11 of 11