Chapter 100: Kernel Subsystems: Process Management and Scheduling
Chapter Objectives
Upon completing this chapter, you will be able to:
- Explain the fundamental concepts of processes, threads, and their lifecycle within the Linux kernel.
- Differentiate between the Completely Fair Scheduler (CFS) for standard tasks and real-time scheduling policies like `SCHED_FIFO` and `SCHED_RR`.
- Implement C programs that create and manage processes using the `fork()`, `exec()`, and `wait()` system calls.
- Analyze the scheduling behavior of running processes using command-line tools like `ps`, `top`, and `chrt`.
- Develop multi-threaded applications using the POSIX threads (pthreads) library.
- Debug common scheduling and process management issues, such as priority inversion and resource contention in an embedded context.
Introduction
At the very heart of any modern operating system lies a fundamental challenge: how to share a limited number of CPU cores among a potentially large number of competing tasks. This challenge is magnified in the world of embedded systems, where resources are often constrained and performance requirements can be incredibly strict. A misbehaving process in a desktop environment might lead to a slow user interface; in an embedded system controlling a factory robot or a medical device, it could have catastrophic consequences. This chapter delves into the kernel’s most critical role: process management and scheduling. We will explore the elegant machinery the Linux kernel uses to create the illusion of concurrency, allowing countless programs to run “simultaneously” on a platform like the Raspberry Pi 5. You will learn not just what a process is, but how it is born, how it lives its life cycle managed by the scheduler, and how it eventually terminates. Understanding this subsystem is not merely an academic exercise; it is the key to unlocking the full potential of Embedded Linux, enabling you to build responsive, reliable, and deterministic systems that meet the demanding needs of the real world.
Technical Background
The Essence of a Process
To begin our journey, we must first draw a clear distinction between a program and a process. A program is a passive entity; it is a collection of instructions and data stored in an executable file on your storage device, such as `/bin/bash`. It does nothing on its own. A process, in contrast, is the active, living instance of a program when it is loaded into memory and executed. If you open three separate terminal windows, you have one program (`/bin/bash`) but three distinct processes, each with its own unique identity and state.
The kernel’s central data structure for managing a process is the `task_struct`, often referred to as the process descriptor. This massive C structure is the kernel’s complete knowledge base for a given process. It contains everything the kernel needs to know to manage and schedule the task, including its unique Process ID (PID), its current state (running, sleeping, etc.), pointers to its memory space (the page tables that define its virtual address space), security credentials (the user and group it belongs to), and a record of open files. When the kernel needs to switch from running one process to another—an event called a context switch—it saves the current CPU state (registers, program counter) into the current process’s `task_struct` and loads the state from the next process’s `task_struct`. This mechanism is what allows the system to multitask effectively.
A process travels through several states during its lifetime. When first created via the `fork()` system call, it enters a runnable state, meaning it is ready to execute and is waiting for the scheduler to allocate a CPU core to it. Once the scheduler selects it, the process transitions to the running state. A process might voluntarily give up the CPU by entering a sleeping or waiting state, for instance, when it needs to read data from a file or wait for a network packet. This is an efficient state, as the process consumes no CPU cycles while waiting. Once the event it was waiting for occurs (e.g., the data arrives), the kernel moves it back to the runnable state. Finally, when a process completes its task, it enters a zombie state. It no longer runs, but its `task_struct` is kept in memory so its parent process can retrieve its exit status. Once the parent has collected this information, the zombie process is “reaped,” and all its resources are fully released.
---
title: Linux Process State Lifecycle
---
graph TD
    subgraph Process Lifecycle
        direction TB
        New -- "fork()" --> Runnable;
        Runnable -- "Scheduler dispatches" --> Running;
        Running -- "Scheduler preempts" --> Runnable;
        Running -- "I/O or Event Wait<br>(e.g., read())" --> Sleeping;
        Sleeping -- "Event Occurs<br>(e.g., data ready)" --> Runnable;
        Running -- "exit()" --> Zombie;
        Zombie -- "Parent calls wait()" --> Terminated;
    end
    %% Styling
    classDef start fill:#1e3a8a,stroke:#1e3a8a,stroke-width:2px,color:#ffffff;
    classDef process fill:#0d9488,stroke:#0d9488,stroke-width:1px,color:#ffffff;
    classDef check fill:#ef4444,stroke:#ef4444,stroke-width:1px,color:#ffffff;
    classDef endo fill:#10b981,stroke:#10b981,stroke-width:2px,color:#ffffff;
    classDef system fill:#8b5cf6,stroke:#8b5cf6,stroke-width:1px,color:#ffffff;
    classDef warning fill:#eab308,stroke:#eab308,stroke-width:1px,color:#1f2937;
    class New,Runnable,Running,Sleeping,Zombie start;
    class Terminated check;
    linkStyle 0 stroke:#0d9488,stroke-width:2px;
    linkStyle 1 stroke:#0d9488,stroke-width:2px;
    linkStyle 2 stroke:#0d9488,stroke-width:2px;
    linkStyle 3 stroke:#f59e0b,stroke-width:2px,stroke-dasharray: 5 5;
    linkStyle 4 stroke:#10b981,stroke-width:2px;
    linkStyle 5 stroke:#ef4444,stroke-width:2px;
    linkStyle 6 stroke:#64748b,stroke-width:2px;
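The zombie state is easy to observe for yourself. The short program below is an illustrative sketch (the file name zombie_demo.c and the 30-second delay are arbitrary choices): the child exits immediately, but the parent deliberately postpones wait(), so for half a minute the child lingers as a zombie.

/* zombie_demo.c — illustrative sketch: the child exits at once, but the
 * parent delays wait(), so `ps aux | grep defunct` in another terminal
 * shows the child with STAT 'Z' until the parent finally reaps it. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }
    if (pid == 0)
        _exit(0);                      /* child terminates immediately -> zombie */

    printf("Child %d is now a zombie; inspect it with ps in another terminal.\n", pid);
    sleep(30);                         /* parent does not reap it yet            */
    wait(NULL);                        /* reaping removes the zombie entry       */
    return 0;
}

While the parent sleeps, run ps aux | grep defunct in a second terminal to see the zombie appear, then vanish once wait() is called.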
The Kernel Scheduler: The Ultimate Arbitrator
The scheduler is the component of the kernel responsible for deciding which runnable process gets to use a CPU core at any given time. Its decisions are governed by a scheduling policy, an algorithm designed to meet specific system goals. These goals often conflict. For example, a scheduler might aim for high throughput (completing the maximum number of jobs in a given time) or low latency (minimizing the response time for a specific event). In a general-purpose desktop system, fairness is a primary concern, ensuring that no single process can monopolize the CPU and starve others. In an embedded system, however, determinism—the ability to guarantee that a task will run within a predictable timeframe—is often the most critical requirement.
To accommodate these diverse needs, the Linux kernel employs a modular, class-based scheduler architecture. Each scheduling class implements a different policy. When the kernel needs to pick a new process to run, it queries the classes in order of priority, starting with the highest. The first class that has a runnable process gets to choose which of its processes runs next.
The Completely Fair Scheduler (CFS)
For the vast majority of tasks on a typical Linux system, scheduling decisions are handled by the Completely Fair Scheduler (CFS). This is the default scheduler for normal, non-real-time processes (`SCHED_NORMAL`). The genius of CFS is that it abandons the traditional fixed-timeslice model. Instead of giving each process a predetermined “slice” of time, CFS strives to provide an idealized, perfectly fair multitasking CPU. In this ideal world, every runnable process would get an equal fraction of the processor’s power.
To approximate this ideal, CFS maintains a special variable for each process called `vruntime` (virtual runtime). A process’s `vruntime` advances only when it is actually running on a CPU. The scheduler always picks the process with the lowest `vruntime` to run next. This is elegantly simple: the process that has had the least amount of time on the CPU is the most deserving of it now. To efficiently find the process with the minimum `vruntime`, CFS organizes all runnable processes in a red-black tree, a self-balancing binary search tree, ordered by their `vruntime`. The scheduler simply picks the leftmost node of the tree to run.

The concept of process priority is implemented by adjusting how quickly a process’s `vruntime` accumulates. This is controlled by the nice value, an integer ranging from -20 (highest priority) to +19 (lowest priority). A process with a lower nice value (higher priority) accumulates `vruntime` more slowly than a process with a higher nice value. Consequently, it will be considered more “deserving” by the scheduler more often and will receive a larger share of the CPU over time. This system ensures that while higher-priority tasks get more CPU time, no process is ever truly starved of it, hence the name “Completely Fair.”
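The kernel performs this weighting with an integer table (`sched_prio_to_weight` in kernel/sched/core.c), where nice 0 corresponds to a weight of 1024 and each nice step changes the weight by roughly 1.25x. The user-space program below is only a back-of-the-envelope illustration of that arithmetic, not kernel code; it shows why one millisecond of CPU time "costs" a low-nice process less vruntime than it costs a high-nice one.

/* Conceptual illustration of how CFS scales vruntime by nice weight.
 * This is NOT kernel code; it only mimics the arithmetic:
 *     vruntime += delta_exec * (weight of nice 0) / (task's weight)
 * A heavier (lower-nice) task accumulates vruntime more slowly and is
 * therefore picked as the "leftmost" task more often. */
#include <stdio.h>

#define NICE_0_WEIGHT 1024.0

/* Approximate CFS weight: each nice step changes the weight by ~1.25x. */
static double weight_for_nice(int nice) {
    double w = NICE_0_WEIGHT;
    for (int i = 0; i < (nice > 0 ? nice : -nice); i++)
        w = (nice > 0) ? w / 1.25 : w * 1.25;
    return w;
}

int main(void) {
    double delta_exec_ns = 1000000.0;            /* each task ran for 1 ms */
    int nices[] = { -5, 0, 5 };
    for (int i = 0; i < 3; i++) {
        double dv = delta_exec_ns * NICE_0_WEIGHT / weight_for_nice(nices[i]);
        printf("nice %+3d: 1 ms of CPU advances vruntime by about %.0f ns\n",
               nices[i], dv);
    }
    return 0;
}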
---
title: Conceptual CFS Red-Black Tree
---
graph TD
    subgraph "Runnable Processes Ordered by vruntime"
        direction LR
        A["<br><b>Process A</b><br>vruntime: 1024<br><i>(Next to run)</i>"]
        B["<br><b>Process B</b><br>vruntime: 1050"]
        C["<br><b>Process C</b><br>vruntime: 1090"]
        D["<br><b>Process D</b><br>vruntime: 1120"]
        E["<br><b>Process E</b><br>vruntime: 1150"]
        B -- "left" --> A;
        B -- "right" --> D;
        D -- "left" --> C;
        D -- "right" --> E;
    end
    subgraph Scheduler Logic
        Scheduler["<b>CFS Scheduler</b><br>Picks leftmost node"] --> A;
    end
    %% Styling
    style A fill:#10b981,stroke:#10b981,stroke-width:2px,color:#ffffff;
    style B fill:#0d9488,stroke:#0d9488,stroke-width:1px,color:#ffffff;
    style C fill:#0d9488,stroke:#0d9488,stroke-width:1px,color:#ffffff;
    style D fill:#0d9488,stroke:#0d9488,stroke-width:1px,color:#ffffff;
    style E fill:#0d9488,stroke:#0d9488,stroke-width:1px,color:#ffffff;
    style Scheduler fill:#8b5cf6,stroke:#8b5cf6,stroke-width:1px,color:#ffffff;
    linkStyle 0 stroke-width:2px,stroke:black,fill:none;
    linkStyle 1 stroke-width:2px,stroke:black,fill:none;
    linkStyle 2 stroke-width:2px,stroke:black,fill:none;
    linkStyle 3 stroke-width:2px,stroke:black,fill:none;
    linkStyle 4 stroke-width:2px,stroke:#1e3a8a,stroke-dasharray: 5 5,fill:none;
Real-Time Scheduling
While CFS is excellent for fairness and overall system throughput, it is unsuitable for tasks with strict timing deadlines. An embedded system controlling a motor, for instance, cannot afford to have its control task delayed just because other processes are running. For these scenarios, Linux provides real-time scheduling policies, which are managed by a separate, higher-priority scheduling class.
The two primary real-time policies are `SCHED_FIFO` (First-In, First-Out) and `SCHED_RR` (Round-Robin). Both policies use a static priority scheme, with values ranging from 1 (lowest) to 99 (highest). Unlike CFS nice values, these are absolute priorities. The scheduler will always run a runnable real-time process with priority 50 over any real-time process with priority 49, and it will run any runnable real-time process over any CFS process, regardless of its nice value.

Under `SCHED_FIFO`, a process runs until it either completes, blocks (e.g., waits for I/O), or is preempted by a higher-priority real-time process. It has no time slice. If a `SCHED_FIFO` process at priority 80 starts an infinite loop, it will block all other processes at priority 80 or lower from ever running. `SCHED_RR` is a simple modification of this. It behaves identically to `SCHED_FIFO`, except that it associates a time slice with each process. If a `SCHED_RR` process runs for its entire time slice, it is moved to the back of the queue for its priority level, allowing other `SCHED_RR` processes at the same priority to run. This prevents a single `SCHED_RR` process from monopolizing the CPU among its peers of equal priority.
Warning: With great power comes great responsibility. Misusing real-time schedulers can easily “brick” a system by starving essential system services, making it unresponsive to network requests or even keyboard input. Real-time priorities should only be used for tasks with proven, critical timing requirements.
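The practical examples later in this chapter select a policy from the command line with `chrt`, but a program can also request one for itself through the POSIX `sched_setscheduler()` call. The following minimal sketch assumes it is run as root (or with the CAP_SYS_NICE capability) and uses an arbitrary priority of 50; without that privilege the call fails with EPERM.

/* rt_self.c — minimal sketch: switch the calling process to SCHED_FIFO
 * at priority 50 using the POSIX scheduling API. */
#include <stdio.h>
#include <sched.h>

int main(void) {
    struct sched_param param = { .sched_priority = 50 };

    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {  /* 0 = this process */
        perror("sched_setscheduler");
        return 1;
    }
    printf("Now running under SCHED_FIFO, priority %d\n", param.sched_priority);
    /* Real-time sections should be short and must block or yield regularly,
     * or they will starve every lower-priority task on this CPU. */
    return 0;
}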
The Lifecycle in Action: System Calls
Processes are not created out of thin air. They are brought to life and managed through a specific set of system calls, which are the fundamental interface between a user application and the kernel. The classic mechanism for creating a new process in Linux is a two-step dance involving `fork()` and `exec()`.

The `fork()` system call creates a new process, called the child, which is nearly an identical copy of the calling process, the parent. The child inherits copies of the parent’s memory, file descriptors, and credentials. The only initial difference is the PID. To make this operation efficient, the kernel uses a technique called copy-on-write (COW). Instead of immediately duplicating the parent’s entire memory space, the kernel initially lets the parent and child share the same memory pages. Only when one of them attempts to write to a shared page does the kernel step in, create a private copy of that page for the writing process, and update its memory map.
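Copy-on-write is easy to demonstrate from user space. In the illustrative sketch below (the 64 MiB buffer size and file name cow_demo.c are arbitrary), fork() returns almost instantly even though the parent has touched a large buffer, and the child's later writes land in private copies, leaving the parent's data untouched.

/* cow_demo.c — illustrative sketch of copy-on-write: parent and child start
 * out sharing the same physical pages; only when the child writes does the
 * kernel give it private copies, so the parent's data is unaffected. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t size = 64 * 1024 * 1024;         /* 64 MiB                     */
    char *buf = malloc(size);
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 'P', size);                 /* parent's pattern           */

    pid_t pid = fork();                     /* fast: pages are shared COW */
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {                         /* child                      */
        memset(buf, 'C', size);             /* triggers private copies    */
        printf("CHILD : buf[0] = %c\n", buf[0]);
        _exit(0);
    }
    wait(NULL);
    printf("PARENT: buf[0] = %c (unchanged by the child's writes)\n", buf[0]);
    free(buf);
    return 0;
}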
After a `fork()`, the child process often needs to run a different program. This is accomplished with one of the `exec()` family of system calls (e.g., `execlp()`, `execve()`). The `exec()` call replaces the current process’s memory image with that of a new program. The PID remains the same, but the code, data, and stack are completely overwritten. This `fork`-`exec` combination is the standard way new applications are launched on Linux.
The parent process can monitor its children using the `wait()` family of system calls. A call to `wait()` will block the parent until one of its children terminates. This is the mechanism for reaping zombie processes and retrieving their exit codes, allowing a parent to know if its child succeeded or failed.
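Besides the blocking `wait()` used in the example later in this chapter, the family includes `waitpid()`, whose `WNOHANG` flag lets a parent poll for finished children without blocking. A minimal sketch, with an arbitrary two-second child "workload":

/* Hedged sketch: non-blocking reaping with waitpid() and WNOHANG, an
 * alternative to the blocking wait() shown in the full example below. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }
    if (pid == 0) { sleep(2); _exit(42); }          /* child works for 2 s    */

    int status = 0;
    while (waitpid(pid, &status, WNOHANG) == 0) {   /* 0 = child still running */
        printf("PARENT: child %d not done yet, doing other work...\n", pid);
        sleep(1);
    }
    if (WIFEXITED(status))
        printf("PARENT: child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}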
---
title: The fork() - exec() - wait() Lifecycle
---
sequenceDiagram
    participant P as Parent Process
    participant C as Child Process
    participant K as Linux Kernel
    P ->> K: fork()
    K -->> P: return child_pid
    K -->> C: return 0
    Note over P,C: Both processes now running.<br>Child is a near-identical copy.
    C ->> K: exec("ls", "-l")
    Note over C: Child's memory is replaced<br>by the 'ls' program. PID is unchanged.
    K -->> C: (ls program starts)
    P ->> K: wait(&status)
    Note over P: Parent blocks, waiting for<br>child to terminate.
    C -->> K: (ls program finishes) exit(0)
    K ->> P: Unblock parent, return status
    Note over P: Parent reaps child zombie<br>and can now continue execution.
    P ->> P: Continue execution...
Practical Examples
This section provides hands-on examples that you can run on your Raspberry Pi 5. You will need a standard Raspberry Pi OS installation and access to the command line.
Inspecting and Controlling Processes
The most fundamental tools for process inspection are `ps`, `top`, and `htop`. Let’s explore them.
Open a terminal and run the following command to see all processes running on the system in a detailed format:
ps aux
The output shows the USER, PID, %CPU, %MEM, STAT (state, e.g., ‘R’ for running, ‘S’ for sleeping), and the COMMAND that was executed.
For a more dynamic, real-time view, use `top`:

top

`top` provides a continuously updated list of the most resource-intensive processes. You can see the PID, user, priority (PR), and nice value (NI). To change a process’s priority, you can use the `renice` command. First, find the PID of a non-critical process (e.g., your shell, `bash`). Then, give it a lower priority (higher nice value):

# Find the PID of your bash shell
pgrep bash
# Let's say the PID is 1234
# Raising the nice value of a process you own needs no special privileges;
# lowering it (raising priority) or renicing another user's process requires sudo.
renice 10 1234

Running `top` again will show that the `NI` value for your `bash` process is now 10.
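The same adjustment can be made from inside a program with the POSIX `getpriority()`/`setpriority()` calls, which is what `renice` uses under the hood. The sketch below (the file name nice_demo.c is arbitrary) raises the calling process’s own nice value to 10; as noted above, raising your own nice value needs no special privileges, while lowering it does.

/* nice_demo.c — minimal sketch: lower the calling process's own priority by
 * setting its nice value to 10, the programmatic equivalent of `renice`. */
#include <errno.h>
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    errno = 0;
    int before = getpriority(PRIO_PROCESS, 0);   /* 0 = calling process */
    if (before == -1 && errno != 0) { perror("getpriority"); return 1; }

    if (setpriority(PRIO_PROCESS, 0, 10) != 0) { /* raise nice value to 10 */
        perror("setpriority");
        return 1;
    }
    int after = getpriority(PRIO_PROCESS, 0);
    printf("nice value changed: %d -> %d\n", before, after);
    return 0;
}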
Real-Time Scheduling in Practice
Now, let’s demonstrate the tangible difference between CFS and a real-time policy. We will write a simple C program that performs a task and then sleeps for a short period, and we’ll observe how consistently it can meet its timing.
Code Snippet: Latency Test Program
Create a file named `latency_test.c` with the following content. This program takes a duration in microseconds as an argument and loops, performing a busy-wait for that duration and then sleeping for 1 millisecond.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sched.h>

// A simple function to burn CPU cycles for a specified number of microseconds
void busy_wait(long microseconds) {
    struct timespec start, current;
    clock_gettime(CLOCK_MONOTONIC, &start);
    while (1) {
        clock_gettime(CLOCK_MONOTONIC, &current);
        long elapsed_ns = (current.tv_sec - start.tv_sec) * 1000000000L + (current.tv_nsec - start.tv_nsec);
        if (elapsed_ns / 1000 >= microseconds) {
            break;
        }
    }
}

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <busy_wait_microseconds>\n", argv[0]);
        return 1;
    }

    long busy_wait_us = atol(argv[1]);
    struct timespec sleep_interval = {0, 1000000}; // 1 millisecond

    printf("Starting latency test. Press Ctrl+C to stop.\n");

    // Print current scheduling policy
    int policy = sched_getscheduler(0); // 0 for current process
    printf("Current policy: ");
    switch (policy) {
        case SCHED_FIFO:  printf("SCHED_FIFO\n"); break;
        case SCHED_RR:    printf("SCHED_RR\n"); break;
        case SCHED_OTHER: printf("SCHED_OTHER (CFS)\n"); break;
        default:          printf("Unknown\n"); break;
    }

    while (1) {
        busy_wait(busy_wait_us);
        nanosleep(&sleep_interval, NULL);
    }

    return 0;
}
Build and Execution Steps
1. Compile the code:
gcc -o latency_test latency_test.c -lrt
On older toolchains, the `-lrt` flag linked the separate real-time library that provided `clock_gettime`; with the glibc shipped on current Raspberry Pi OS the function lives in the main C library, so the flag is not strictly required, but it is harmless to include.
2. Run under the default CFS scheduler:

./latency_test 500

While the program is running, generate some system load. For example, open another terminal and run a command like find / -name "*" > /dev/null. This will cause disk I/O and CPU usage. You will notice the program runs, but its execution might be choppy, especially under heavy system load. It’s competing for CPU time with everything else.
3. Run under a real-time `SCHED_FIFO` policy:

Now, use the `chrt` (change real-time attributes) utility to run the same program with a high-priority `SCHED_FIFO` policy.
Warning: Running untrusted or buggy code at high real-time priority can lock up your system. Ensure your test program is simple and can be terminated (Ctrl+C).
# -f specifies SCHED_FIFO, 80 is the priority
sudo chrt -f 80 ./latency_test 500
Observe the difference. Even with heavy background load, the program’s busy-wait loop will execute far more consistently. It is now prioritized by the kernel above almost all other system tasks, demonstrating the power of real-time scheduling for deterministic execution.
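To turn that observation into numbers, you can time how late each wake-up actually is. The program below is a separate sketch (the file name wakeup_jitter.c, the 5000-iteration count, and the 1 ms period are arbitrary choices): it requests a 1 ms sleep in a loop and reports the worst overshoot. Compile it like latency_test.c, run it once normally and once under sudo chrt -f 80, and compare the two worst-case figures.

/* wakeup_jitter.c — hedged sketch for quantifying scheduling latency:
 * request a 1 ms sleep repeatedly and record how late each wake-up is. */
#include <stdio.h>
#include <time.h>

static long diff_ns(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
}

int main(void) {
    struct timespec req = { 0, 1000000 };   /* 1 ms requested sleep */
    long worst_ns = 0;

    for (int i = 0; i < 5000; i++) {
        struct timespec before, after;
        clock_gettime(CLOCK_MONOTONIC, &before);
        nanosleep(&req, NULL);
        clock_gettime(CLOCK_MONOTONIC, &after);

        long overshoot = diff_ns(before, after) - 1000000L;  /* lateness */
        if (overshoot > worst_ns)
            worst_ns = overshoot;
    }
    printf("Worst wake-up overshoot over 5000 iterations: %ld us\n",
           worst_ns / 1000);
    return 0;
}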
The `fork()`–`exec()`–`wait()` Pattern
Let’s see process creation in action. The following C program demonstrates the full lifecycle.
Code Snippet: Process Creation
Create a file named `fork_exec.c`.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main() {
    pid_t child_pid = fork();

    // fork() returns -1 on error
    if (child_pid < 0) {
        perror("fork failed");
        exit(1);
    }

    // This code is executed by the child process because fork() returns 0 to it.
    if (child_pid == 0) {
        printf("CHILD: I am the child process, my PID is %d\n", getpid());
        printf("CHILD: My parent's PID is %d\n", getppid());
        printf("CHILD: I will now execute 'ls -l'\n");

        // Replace this process image with 'ls -l'
        execlp("ls", "ls", "-l", NULL);

        // This line will only be reached if execlp fails
        perror("execlp failed");
        exit(1);
    }
    // This code is executed by the parent process.
    else {
        printf("PARENT: I am the parent process, my PID is %d\n", getpid());
        printf("PARENT: I created a child with PID %d\n", child_pid);

        int status;
        // Wait for the child process to terminate
        wait(&status);

        printf("PARENT: My child process has terminated.\n");
        if (WIFEXITED(status)) {
            printf("PARENT: Child exited with status %d\n", WEXITSTATUS(status));
        }
    }

    printf("This line is printed only by the parent: %d\n", getpid());
    return 0;
}
Build and Execution Steps
1. Compile the program on your Raspberry Pi:
gcc -o fork_exec fork_exec.c
2. Execute it:
./fork_exec
3. Expected Output:
The output order may vary slightly due to scheduling, but the content will be similar to this:
PARENT: I am the parent process, my PID is 2345
PARENT: I created a child with PID 2346
CHILD: I am the child process, my PID is 2346
CHILD: My parent's PID is 2345
CHILD: I will now execute 'ls -l'
total 24
-rwxr-xr-x 1 pi pi 9168 Oct 14 15:30 fork_exec
-rw-r--r-- 1 pi pi 980 Oct 14 15:29 fork_exec.c
... (output of ls -l) ...
PARENT: My child process has terminated.
PARENT: Child exited with status 0
This line is printed only by the parent: 2345
This example cleanly demonstrates the separation between parent and child, the replacement of the child’s program with `ls` via `execlp`, and the parent waiting for the child to complete before exiting itself.
Common Mistakes & Troubleshooting
- Forgetting to call `wait()`: a parent that never reaps its children leaves zombie processes behind, which clutter the process table until the parent itself exits.
- Not checking the return value of `fork()`: it returns -1 on failure, and code that assumes success can silently misbehave under resource pressure.
- Assuming `exec()` returns on success: the `exec*()` calls only return on failure, so any code after them should be treated as an error path.
- Running a CPU-bound loop under `SCHED_FIFO` or `SCHED_RR`: a high-priority real-time task that never blocks can starve the rest of the system, including your shell (see the starvation exercise below).
- Confusing nice values with real-time priorities: `renice` only shifts CFS fairness weights, while `chrt` selects an entirely different, strictly higher-priority scheduling class.
Exercises
- Process Priority Analysis: Write a shell script that uses `ps`, `sort`, and `head` to identify the top 5 processes currently consuming the most CPU time on your Raspberry Pi. For each of these processes, print its PID, current nice value, and the command name.
- Creating a Process Chain: Modify the `fork_exec.c` example to create a chain of three processes. The main process should fork a child, which in turn forks a grandchild. Each process should print its own PID and its parent’s PID. The parent must wait for its child, and the child must wait for the grandchild, ensuring an orderly shutdown.
- Race Condition Demonstration: Write a C program using pthreads that creates two threads. Both threads should execute a loop 1,000,000 times, incrementing a single global integer variable in each iteration. Run the program and print the final value of the integer. It will likely be less than 2,000,000. Then, modify the program to use a `pthread_mutex_t` to protect the increment operation and verify that the final result is correct.
- Real-Time Starvation Experiment: Create a simple C program that does nothing but an infinite loop (`while(1);`). Compile it. First, run it in the background as a standard CFS process (`./my_loop &`). Observe system responsiveness with `top`; it should be fine. Next, terminate that process and run it with a high-priority `SCHED_FIFO` policy (`sudo chrt -f 90 ./my_loop &`). Observe how difficult it becomes to use the terminal or run other commands. (On a multi-core board like the Raspberry Pi 5, you may need to pin both the loop and your shell to the same core, for example with `taskset -c 0`, to see the effect clearly.) Analyze and explain why this happens.
Summary
- Processes vs. Programs: A program is a static file on disk, while a process is a running instance of that program with its own memory, state, and resources managed by the kernel via the `task_struct`.
- The Scheduler’s Role: The kernel scheduler decides which runnable process gets to use the CPU, balancing the competing goals of fairness, throughput, and responsiveness.
- CFS for Fairness: The default Completely Fair Scheduler (CFS) uses a `vruntime` concept to give every process a fair share of the CPU, influenced by `nice` values.
- Real-Time Policies for Determinism: `SCHED_FIFO` and `SCHED_RR` provide static, high-priority scheduling for tasks with strict timing requirements, preempting all CFS tasks.
- Process Lifecycle: Processes are created using the `fork()`–`exec()` pattern and reaped by their parents using `wait()`, transitioning through states like running, sleeping, and zombie.
- Practical Management: Tools like `ps`, `top`, and `chrt` are essential for observing and controlling process execution and scheduling policies on a live system.
Further Reading
- Linux Kernel Documentation on Scheduling: The official source for understanding the intricacies of the Linux scheduler. (https://www.kernel.org/doc/html/latest/scheduler/index.html)
- “Understanding the Linux Kernel, 3rd Edition” by Daniel P. Bovet & Marco Cesati: A classic, deep dive into the kernel’s internals. While older, the core concepts of process management remain highly relevant.
- “Linux System Programming, 2nd Edition” by Robert Love: An excellent, practical guide to using Linux system calls, including detailed chapters on process and thread management.
- The `sched(7)` Linux Manual Page: An essential reference. On your Raspberry Pi, run `man 7 sched` for a detailed technical description of the different scheduling policies and their APIs.
- Raspberry Pi Foundation Documentation: Official hardware and software documentation for the Raspberry Pi platform. (https://www.raspberrypi.com/documentation/)
- LWN.net Kernel Section: A premier source for high-quality articles and discussions on Linux kernel development, including many deep dives into the scheduler’s evolution. (https://lwn.net/Kernel/)