Chapter 101: Kernel Subsystems: Memory Management
Chapter Objectives
Upon completing this chapter, you will be able to:
- Understand the fundamental differences between physical and virtual memory and the role of the Memory Management Unit (MMU).
- Explain the mechanisms of paging, demand paging, and the structure of page tables in a modern ARMv8 architecture like the one in the Raspberry Pi 5.
- Analyze the kernel’s physical memory allocation strategies, including the buddy system and the slab allocator, and their impact on performance.
- Implement and monitor swap space on an embedded device, understanding the trade-offs involved, especially with flash-based storage.
- Debug common memory-related issues in embedded Linux, such as memory leaks and out-of-memory (OOM) conditions.
- Configure kernel memory parameters and use standard Linux utilities to profile and interpret memory usage on a target system.
Introduction
Memory is the lifeblood of any computing system, but in the resource-constrained world of embedded devices, its management is not just a matter of performance—it is a critical factor for stability, reliability, and power efficiency. This chapter delves into the sophisticated world of memory management within the Linux kernel, a subsystem responsible for orchestrating how every process, and the kernel itself, accesses RAM. While a desktop user might take gigabytes of available memory for granted, an embedded engineer must account for every megabyte. The techniques discussed here are fundamental to building robust systems, from industrial controllers that must run flawlessly for years to battery-powered IoT devices where every milliampere counts.
We will explore the elegant abstraction of virtual memory, which gives each process a clean, private address space, and the hardware that makes it possible: the Memory Management Unit (MMU). You will learn how the kernel juggles physical memory pages, swapping less-used ones to secondary storage, and how it efficiently handles countless small allocations. For developers working on the Raspberry Pi 5, understanding these mechanisms is the key to unlocking the full potential of its powerful 64-bit processor and avoiding the pitfalls that can lead to system crashes or unpredictable behavior.
Technical Background
The Illusion of Virtual Memory and the Role of the MMU
At the heart of modern operating systems lies a powerful illusion: every process believes it has the entire system’s memory to itself. This is the magic of virtual memory. A process doesn’t operate on physical RAM addresses directly. Instead, it works with a set of addresses in its own private virtual address space. On a 64-bit system like the Raspberry Pi 5, this address space is immense—2^64 bytes, a theoretical limit far beyond any physical memory available. This abstraction provides two crucial benefits: process isolation and memory flexibility. Isolation ensures that a misbehaving process cannot corrupt the memory of another process or, more importantly, the kernel itself. Flexibility allows the kernel to manage physical memory far more efficiently, placing data wherever it sees fit without the application’s knowledge.
This entire mechanism is enabled by a piece of hardware called the Memory Management Unit (MMU). The MMU is a component of the CPU that sits between the processor core and the physical memory bus. Its primary job is to translate the virtual addresses generated by the CPU into the physical addresses that correspond to actual locations in the RAM chips. If a process requests access to memory location 0x4000, the MMU intercepts this request and, using a set of translation tables maintained by the kernel, converts it to a real physical address, say 0x1A3B8000. This translation is transparent to the application.
To speed up this translation process, the MMU includes a high-speed cache called the Translation Lookaside Buffer (TLB). The TLB stores recently used virtual-to-physical address mappings. When the CPU requests a memory access, the MMU first checks the TLB. If a valid mapping is found (a TLB hit), the translation is nearly instantaneous. If not (a TLB miss), a more complex process is triggered: the MMU’s page table walker hardware must traverse the page tables residing in main memory to find the correct translation. Once found, this mapping is cached in the TLB for future use. A TLB miss is significantly slower than a hit, so system performance relies heavily on maintaining a high TLB hit rate.
Paging and Page Tables: The Blueprint of Memory
Modern memory management is built upon the concept of paging. Both virtual and physical memory are divided into fixed-size blocks. The blocks of virtual memory are called pages, while the blocks of physical memory are called frames. The Linux kernel, along with the underlying hardware, manages memory by mapping pages to frames. On the ARMv8 architecture used in the Raspberry Pi 5, a standard page size is 4KB, though larger page sizes (like 2MB or 1GB, known as huge pages) are also supported to improve performance for applications that use large, contiguous memory regions, as this reduces the number of TLB entries needed.
The blueprint for this mapping is stored in the page tables. A page table is a data structure, typically a multi-level tree, that stores the translations between virtual and physical addresses. When the MMU needs to translate an address, it uses parts of the virtual address as indices to “walk” this tree. For a 64-bit ARM architecture, this typically involves a four-level page table structure (though this can vary). The levels are commonly named:
- Page Global Directory (PGD)
- Page Upper Directory (PUD)
- Page Middle Directory (PMD)
- Page Table Entry (PTE)
Each entry in these tables contains not just the physical address of the next level table or the final frame, but also a set of permission and status bits. These bits control access rights (read, write, execute), whether the page is present in memory, if it has been accessed or modified (the “dirty” bit), and caching policies. This granular control is what allows the kernel to enforce memory protection. If a process tries to write to a read-only page, the MMU will detect the permission violation and trigger a hardware exception, known as a page fault, which transfers control to the kernel. The kernel’s page fault handler then examines the cause. If it’s an illegal operation, it will terminate the process with a “Segmentation fault” error.
However, page faults are not just for errors. They are a cornerstone of efficient memory management, enabling a technique called demand paging. When an application is launched, the kernel doesn’t load the entire executable into memory at once. Instead, it sets up the page table entries to indicate that the pages are not present. The first time the application tries to access code or data in a given page, a page fault occurs. The kernel’s handler then steps in, finds the required data in the executable file on the storage device, loads it into a free physical frame, updates the page table entry to map the virtual page to that frame, and resumes the application. The application is completely unaware of this interruption. This “lazy loading” approach speeds up application startup and reduces memory consumption, as only the parts of a program that are actually used are loaded into RAM.
Swapping: Extending Memory Beyond Physical Limits
Demand paging also enables another powerful feature: swapping. If the system is running low on physical memory, the kernel needs a way to free up some frames for new requests. It can do this by taking a page that is currently in memory but hasn’t been used recently and writing its contents out to a designated storage area, known as swap space or a backing store. This is typically a dedicated partition on an SD card or eMMC, or a special file. Once the page is safely stored, the physical frame it occupied can be repurposed. The page table entry for the swapped-out page is updated to indicate that it is no longer in RAM but is now located in the swap area.
If a process later tries to access this page, a page fault is generated. The kernel’s page fault handler checks the PTE, sees that the page has been swapped out, allocates a new physical frame, reads the page’s content back from the swap space into the frame, updates the page table, and resumes the process. This entire process is transparent to the application, except for the noticeable delay, as accessing an SD card is orders of magnitude slower than accessing RAM.
In the context of embedded systems, swapping is a double-edged sword. While it provides a safety net against out-of-memory conditions, the high latency of flash storage can bring a system to a crawl if it starts swapping heavily (a condition known as thrashing). Furthermore, flash memory has a limited number of write cycles. Aggressive swapping can significantly reduce the lifespan of the storage medium. For these reasons, many embedded systems disable swapping entirely, while others use it cautiously, sometimes with tweaked kernel parameters (like swappiness) to control how aggressively the kernel chooses to swap.
Physical Memory Management: The Buddy System and Slab Allocator
While the MMU and page tables manage the virtual-to-physical mapping, the kernel needs a robust strategy for managing the pool of physical frames itself. The primary challenge is fragmentation. Over time, as memory is allocated and freed in various-sized chunks, the available free memory can become broken up into many small, non-contiguous blocks. This is known as external fragmentation. A request for a large, contiguous block of memory might fail even if the total amount of free memory is sufficient.
To combat this, the Linux kernel uses the buddy memory allocation algorithm. The buddy system manages all physical memory by dividing it into power-of-2 sized blocks. For example, if a process needs a 9KB block of memory, the buddy system will allocate a 16KB block (the next power of 2). The entire pool of memory is initially treated as one large block. When a request comes in, the allocator finds a block of a suitable size. If an exact match isn’t available, it takes a larger block, splits it in half, and checks again. It continues splitting blocks in half—creating “buddies”—until a block of the desired size is created. The unused half (the buddy) is placed on a free list for its size. When a block is freed, the allocator checks to see if its buddy is also free. If it is, the two are merged back into a single, larger block, effectively reversing the splitting process. This continuous merging process is what makes the buddy system highly effective at combating external fragmentation.
```mermaid
flowchart TD
    subgraph Freeing Process
        direction TB
        F1(16KB Block Freed) --> F2{Is its 16KB buddy free?}
        F2 -- Yes --> F3[Merge Buddies]
        F3 --> F4(New 32KB Block)
        F4 --> F5{Is its 32KB buddy free?}
        F5 -- Yes --> F6[... Continue merging ...]
        F5 -- No --> F7([Return 32KB Block to Free List])
        F2 -- No --> F8([Return 16KB Block to Free List])
    end
    subgraph Allocation Process
        direction TB
        A1(Start: 1MB Block Available) --> A2{Request for 9KB received}
        A2 --> A3["Kernel needs 16KB block (next power of 2)"]
        A3 --> A4{16KB block available?}
        A4 -- No --> A5[Split 1MB Block]
        A5 --> A6[512KB Block] & A7[Buddy: 512KB Block]
        A7 --> A8((Free List))
        A6 --> A9[Split 512KB Block]
        A9 --> A10[256KB Block] & A11[Buddy: 256KB Block]
        A11 --> A8
        A10 --> A12[... Continue splitting ...]
        A12 --> A13[Split 32KB Block]
        A13 --> A14[16KB Block] & A15[Buddy: 16KB Block]
        A15 --> A8
        A14 --> A16([Success: 16KB Block Allocated])
    end
    classDef start fill:#1e3a8a,stroke:#1e3a8a,stroke-width:2px,color:#ffffff;
    classDef success fill:#10b981,stroke:#10b981,stroke-width:2px,color:#ffffff;
    classDef decision fill:#f59e0b,stroke:#f59e0b,stroke-width:1px,color:#ffffff;
    classDef process fill:#0d9488,stroke:#0d9488,stroke-width:1px,color:#ffffff;
    classDef data fill:#8b5cf6,stroke:#8b5cf6,stroke-width:1px,color:#ffffff;
    class A1,F1 start;
    class A16 success;
    class A2,A4,F2,F5 decision;
    class A3,A5,A9,A12,A13,F3,F6 process;
    class A6,A7,A10,A11,A14,A15,F4,F7,F8,A8 data;
```
The buddy system is excellent for allocating page-sized chunks of memory, but it’s inefficient for the kernel’s own internal needs. The kernel frequently needs to allocate small objects, often just a few bytes in size, to store data structures like inodes, directory entries, or process descriptors. Allocating a full 4KB page for a 128-byte object would be incredibly wasteful, leading to severe internal fragmentation.
To solve this, the kernel employs a second layer of memory management on top of the buddy system: the slab allocator. The slab allocator is designed for the efficient allocation of small, fixed-size objects. It operates by creating caches for specific types of objects (e.g., a cache for inode objects). When a cache is created, it requests one or more contiguous pages from the buddy system. These pages are called slabs. The slab is then sliced up into a number of the fixed-size objects it is meant to manage. When the kernel needs to allocate an object of that type, the slab allocator can provide one from a slab in the cache almost instantaneously, without needing to go through the more complex buddy system logic. When an object is freed, it is simply marked as available within its slab, ready for immediate reuse. This not only avoids internal fragmentation but also improves performance by keeping frequently used objects in a “hot” state (likely to be in the CPU cache) and avoiding the overhead of initializing and destroying objects repeatedly.
Practical Examples
Analyzing Memory Usage on the Raspberry Pi 5
Before we change anything, let’s establish a baseline of the system’s memory. The primary tools for this are found in the /proc virtual filesystem, which provides a window into the kernel’s state. The most important file is /proc/meminfo.
Step-by-step procedure:
- Boot your Raspberry Pi 5 with Raspberry Pi OS or a custom-built Linux image.
- Open a terminal or connect via SSH.
- Use the cat command to view the contents of /proc/meminfo. You can pipe it to head to see the most important lines.
cat /proc/meminfo | head -n 15
Expected Output:
MemTotal: 8104260 kB
MemFree: 7530484 kB
MemAvailable: 7761040 kB
Buffers: 6240 kB
Cached: 264880 kB
SwapCached: 0 kB
Active: 194888 kB
Inactive: 238080 kB
Active(anon): 89024 kB
Inactive(anon): 752 kB
Active(file): 105864 kB
Inactive(file): 237328 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 102396 kB
Explanation of Key Fields:
- MemTotal: The total amount of usable physical RAM.
- MemFree: Memory that is completely unused. This number can often be deceptively low on a running Linux system.
- MemAvailable: An estimate of how much memory is available for starting new applications, without swapping. This is often the most useful number to look at. It accounts for free memory plus memory that can be easily reclaimed (like page cache).
- Buffers: Memory used by kernel buffers, primarily for block device I/O.
- Cached: Memory used for the page cache, which stores file data read from disk. This is a key reason MemFree can be low; Linux aggressively uses free RAM for caching to speed up file access. This memory can be quickly relinquished if an application needs it.
- SwapTotal / SwapFree: The total and available swap space.
A more user-friendly command is free, which parses /proc/meminfo for you.
free -h
The -h flag provides “human-readable” output (e.g., in MiB or GiB).
Expected Output:
total used free shared buff/cache available
Mem: 7.7Gi 311Mi 7.1Gi 1.0Mi 315Mi 7.4Gi
Swap: 99Mi 0B 99Mi
Creating and Enabling a Swap File
Many embedded distributions don’t enable swap by default. Let’s create a 512MB swap file on the SD card.
Warning: Performing frequent writes to a swap file can reduce the lifespan of an SD card. This is for demonstration; for a production device, careful consideration is needed.
Step-by-step procedure:
- Use the dd command to create an empty 512MB file. dd is a low-level utility for copying and converting data.
sudo dd if=/dev/zero of=/swapfile bs=1M count=512
  - if=/dev/zero: use the special file that outputs null bytes as the input.
  - of=/swapfile: the output file path.
  - bs=1M count=512: write 512 blocks of 1MB each.
- Set the correct permissions on the swap file. It should not be readable by regular users.
sudo chmod 600 /swapfile
- Format the file as swap space.
sudo mkswap /swapfile
- Enable the swap file for the current session.
sudo swapon /swapfile
- Verify that the swap space is active using free or swapon --show.
free -h
You should now see the new swap space reflected in the output.
- To make the swap file permanent across reboots, add an entry to /etc/fstab. You can use a text editor like nano: sudo nano /etc/fstab. Add this line to the end of the file:
/swapfile none swap sw 0 0
Observing Memory Allocation with a C Program
Let’s write a simple C program that allocates memory and see how it affects the system. This program will allocate memory in chunks and pause, allowing us to observe the system state in another terminal.
File: allocator.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
// Allocate 10 MB
#define CHUNK_SIZE (10 * 1024 * 1024)
int main() {
char *ptr = NULL;
long total_allocated = 0;
int step = 1;
printf("Starting memory allocation. PID: %d\n", getpid());
printf("Press Enter to allocate 10MB chunks. Type 'q' to quit.\n");
while (1) {
// Wait for user input
if (getchar() == 'q') {
break;
}
ptr = (char *)malloc(CHUNK_SIZE);
if (ptr == NULL) {
perror("Failed to allocate memory");
break;
}
// IMPORTANT: We must write to the memory to ensure physical pages
// are actually allocated by the kernel (due to demand paging).
memset(ptr, 0, CHUNK_SIZE);
total_allocated += CHUNK_SIZE;
printf("Step %d: Allocated %ld MB total.\n", step++, total_allocated / (1024 * 1024));
}
printf("Program finished. Memory will be freed on exit.\n");
// Memory is automatically freed when the process exits, but for long-running
// applications, you would need to call free().
return 0;
}
Build and Run Procedure:
- Save the code above as allocator.c.
- Compile it using GCC.
gcc allocator.c -o allocator
- Open two SSH sessions to your Raspberry Pi.
- In the first terminal, run the compiled program.
./allocator
- In the second terminal, use the top command, or watch together with free, to monitor memory usage. Let’s use watch to see live updates every second.
watch -n 1 free -h
- Go back to the first terminal. Each time you press Enter, the program allocates another 10MB of memory. You will see the “used” memory in the second terminal increase accordingly.
- The key insight here is that malloc only allocates virtual address space. It’s the memset call, which writes to that memory, that triggers page faults and forces the kernel to map those virtual pages to physical frames. Without memset, the “used” memory reported by free would not change significantly.
Tuning Kernel Memory Behavior: Swappiness
The kernel has a tunable parameter called vm.swappiness that controls how aggressively it swaps memory pages versus dropping file-backed pages from the page cache. It takes a value from 0 to 100.
- vm.swappiness = 100: Very aggressive swapping. The kernel will swap out application memory frequently.
- vm.swappiness = 0: The kernel will avoid swapping out application memory unless absolutely necessary (i.e., to avoid an out-of-memory error).
- vm.swappiness = 60: The default on most desktop systems.
For an embedded device with slow flash storage, a lower swappiness value is often desirable to prioritize responsiveness over freeing up memory.
Procedure to Tune Swappiness:
- Check the current value.
cat /proc/sys/vm/swappiness
- Change the value temporarily. Let’s set it to 10.
sudo sysctl vm.swappiness=10
- To make the change permanent, edit the sysctl configuration file.
sudo nano /etc/sysctl.conf
Add the following line to the file:
vm.swappiness=10
Save the file and reboot, or run sudo sysctl -p to apply the changes from the file.
Tip: On systems without swap, the swappiness setting has no effect. It only influences the kernel’s choice between reclaiming page cache and swapping anonymous pages.
Common Mistakes & Troubleshooting
Exercises
- Memory Footprint Analysis: Choose a standard utility on your Raspberry Pi (e.g., ssh or nano). Use the ps command (ps aux | grep <process_name>) to find its Process ID (PID). Then, examine the process’s memory usage in detail using the /proc filesystem. Specifically, look at /proc/<PID>/status and /proc/<PID>/smaps. Identify the VmRSS (Resident Set Size) and VmSize (Virtual Memory Size) in the status file. Explain in your own words why these two values are different.
- Triggering the OOM Killer: Take the allocator.c program and modify it to run without waiting for user input. Run it as a background process (./allocator &). Monitor the system using dmesg -w in another terminal. Observe how memory usage climbs and what happens when the system runs out of memory. Capture the OOM Killer’s output from dmesg. Note which process it decided to kill (it should be your allocator process) and why its oom_score_adj might have made it a target.
- Configuring zram for Compressed RAM Swap: Disable the file-based swap you created earlier (sudo swapoff /swapfile). Research and install the zram-tools package (sudo apt install zram-tools). Configure it to create a compressed swap space in RAM. A typical configuration uses an algorithm like lz4 and sets the disk size to 50% of your total RAM. Verify its operation using swapon --show and zramctl. Discuss the advantages and disadvantages of using zram over a file-based swap on an SD card.
- Creating a Simple Kernel Module for Memory Allocation: Write a basic Linux kernel module that, when loaded, uses kmalloc() to allocate a small chunk of memory (e.g., 256 bytes) and kfree() to release it when the module is unloaded. Use printk() to log messages to the kernel buffer indicating the success or failure of the allocation and the virtual address of the allocated memory. Compile this module using a proper Makefile and the Raspberry Pi kernel headers. Use insmod to load it and rmmod to unload it, and check the output with dmesg. This exercise introduces you to the kernel’s internal memory allocation APIs.
Summary
- Virtual Memory is a powerful abstraction that provides each process with a private, linear address space, enabling process isolation and efficient memory management.
- The Memory Management Unit (MMU) is the hardware that translates virtual addresses to physical addresses using page tables. The TLB is a critical cache that speeds up this translation.
- Linux uses demand paging to load application code and data from storage into memory only when it is first accessed, which improves startup times and reduces RAM usage.
- Swapping allows the system to use secondary storage (like an SD card) as a temporary holding area for less-used memory pages, but it comes with a significant performance penalty and can cause wear on flash media.
- The kernel manages physical memory frames using the buddy system, which is efficient at allocating page-sized blocks and mitigating external fragmentation.
- For small, frequent allocations, the kernel uses the slab allocator on top of the buddy system to avoid internal fragmentation and improve performance.
- Standard Linux utilities like free and top, together with the /proc/meminfo file, are essential tools for monitoring and analyzing memory usage on an embedded system.
Further Reading
- Linux Kernel Documentation – Memory Management: The official source of truth. The documentation within the kernel source tree is invaluable. (https://www.kernel.org/doc/html/latest/mm/index.html)
- Understanding the Linux Kernel, 3rd Edition by Daniel P. Bovet & Marco Cesati: While slightly older, this book provides an exceptionally detailed and clear explanation of the core memory management subsystems.
- Linux Device Drivers, 3rd Edition by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman: Chapter 8, “Allocation of Memory,” is an excellent, practical guide to how kernel drivers should handle memory. (https://lwn.net/Kernel/LDD3/)
- “What every programmer should know about memory” by Ulrich Drepper: A deep and comprehensive paper covering everything from RAM hardware architecture to how it impacts software performance. A must-read for any serious systems programmer. (https://people.freebsd.org/~lstewart/articles/cpumemory.pdf)
- ARMv8-A Architecture Reference Manual: For those wanting to understand the low-level details of the MMU, page table formats, and memory model on the Raspberry Pi 5’s processor. (Available from the ARM Developer website).
- LWN.net – Kernel Index: An ongoing source of articles and discussions about the latest developments in the Linux kernel, including frequent deep dives into memory management changes. (https://lwn.net/Kernel/Index/)