Virtual Memory in Linux
Last modified : 3 August, 2017
The Virtual Memory subsystem in Linux is extremely interesting. Here’s my understanding of it. I’m sure there are lots of things I am missing still, so if you could please leave me a comment, I’d greatly appreciate it.
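Let’s start with the view of the memory address space that every process has. On a 64-bit OS, a process nominally sees addresses from 0x0000000000000000 to 0xFFFFFFFFFFFFFFFF. In practice, current x86-64 hardware implements only 48 of those bits (57 with five-level paging), and Linux gives user space the lower half of that range. That is why the user-space addresses in the pmap output further down all sit below 0x00007FFFFFFFFFFF.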
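To make the gap between the nominal 64-bit range and what is actually addressable concrete, here is a small sketch. It assumes x86-64 with 48-bit virtual addresses (four-level paging); the function names are my own and other architectures or paging modes will give different numbers.

```python
# Sketch: which 64-bit addresses are usable, ASSUMING x86-64 with
# 48 implemented virtual address bits (4-level paging).

VA_BITS = 48  # assumption, not read from the running machine


def is_canonical(addr: int, va_bits: int = VA_BITS) -> bool:
    """True if bits 63..(va_bits - 1) of addr are all 0 or all 1."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def user_space_bytes(va_bits: int = VA_BITS) -> int:
    """Size of the lower (user) half of the canonical range."""
    return 1 << (va_bits - 1)


if __name__ == "__main__":
    # A stack address, the first non-canonical address, and [vsyscall].
    for addr in (0x7FFD2004D000, 0x0000800000000000, 0xFFFFFFFFFF600000):
        print(f"{addr:#018x} canonical: {is_canonical(addr)}")
    print(user_space_bytes() // 2**40, "TiB of user address space")
```

The first and last addresses are taken from the [stack] and [vsyscall] rows of the pmap output below; the middle one is simply the first address past the user half.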
Let’s not worry about what the process does with this address space, except briefly. Some applications will choose to “grow the stack downwards and heap upwards.” (Interestingly, I recently became aware of the Stack Clash vulnerability.) I suspect (although I’m not sure) that different file formats (e.g. ELF) would cause the program to be loaded differently. In the end the operating system points the instruction pointer at the program’s entry point and hands off control. What the instructions do after that point is totally up to the application. So long as it interfaces with the OS system calls properly and doesn’t try to run privileged instructions, it can go on its merry way. A JVM process, for example, may partition the address space into different regions for the permanent generation, heap, stack, etc.
So what happened when Linux so generously allocated this 64-bit address space to the process? Very little actually. The kernel created a new process and assigned it a new page table. You can see which regions of the address space are mapped using pmap. For example, the following is the pmap output of a top process on my machine. The -XX option tabulates all the information present in the /proc/<pid>/smaps file.
$ pmap -XX 49910
49910: top
Address Perm Offset Device Inode Size Rss Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty Referenced Anonymous AnonHugePages ShmemPmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss KernelPageSize MMUPageSize Locked VmFlags Mapping
55cde8d1a000 r-xp 00000000 fd:01 5899811 96 96 96 0 0 96 0 96 0 0 0 0 0 0 0 4 4 0 rd ex mr mw me dw sd top
55cde8f32000 r--p 00018000 fd:01 5899811 4 4 4 0 0 0 4 4 4 0 0 0 0 0 0 4 4 0 rd mr mw me dw ac sd top
55cde8f33000 rw-p 00019000 fd:01 5899811 4 4 4 0 0 0 4 4 4 0 0 0 0 0 0 4 4 0 rd wr mr mw me dw ac sd top
55cde8f34000 rw-p 00000000 00:00 0 160 36 36 0 0 0 36 36 36 0 0 0 0 0 0 4 4 0 rd wr mr mw me ac sd
55cdeae5e000 rw-p 00000000 00:00 0 1096 984 984 0 0 0 984 984 984 0 0 0 0 0 0 4 4 0 rd wr mr mw me ac sd [heap]
7fd7b2efd000 r-xp 00000000 fd:01 5904102 44 44 0 44 0 0 0 44 0 0 0 0 0 0 0 4 4 0 rd ex mr mw me sd libnss_files-2.25.so
7fd7b2f08000 ---p 0000b000 fd:01 5904102 2044 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 4 0 mr mw me sd libnss_files-2.25.so
7fd7b3107000 r--p 0000a000 fd:01 5904102 4 4 4 0 0 0 4 4 4 0 0 0 0 0 0 4 4 0 rd mr mw me ac sd libnss_files-2.25.so
7fd7baf82000 ---p 00012000 fd:01 5906405 2048 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 4 0 mr mw me sd libgpg-error.so.0.20.0
7fd7bb183000 rw-p 00013000 fd:01 5906405 4 4 4 0 0 0 4 4 4 0 0 0 0 0 0 4 4 0 rd wr mr mw me ac sd libgpg-error.so.0.20.0
7fd7bb48b000 rw-p 00107000 fd:01 5905989 24 24 24 0 0 0 24 24 24 0 0 0 0 0 0 4 4 0 rd wr mr mw me ac sd libgcrypt.so.20.1.8
7fd7bb6a4000 r--p 00012000 fd:01 5905318 4 4 4 0 0 0 4 4 4 0 0 0 0 0 0 4 4 0 rd mr mw me ac sd liblz4.so.1.7.5
7fd7bbf01000 r-xp 00000000 fd:01 5904106 88 60 0 60 0 0 0 60 0 0 0 0 0 0 0 4 4 0 rd ex mr mw me sd libresolv-2.25.so
7fd7bbf17000 ---p 00016000 fd:01 5904106 2044 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 4 0 mr mw me sd libresolv-2.25.so
....
.....
.......
........
7fd7bcd6a000 r-xp 00000000 fd:01 5904081 156 152 0 152 0 0 0 152 0 0 0 0 0 0 0 4 4 0 rd ex mr mw me dw sd ld-2.25.so
7fd7bce78000 rw-p 00000000 00:00 0 448 188 188 0 0 0 188 188 188 0 0 0 0 0 0 4 4 0 rd wr mr mw me ac sd
7fd7bcee8000 r-xp 00000000 fd:01 5902146 532 64 0 64 0 0 0 64 0 0 0 0 0 0 0 4 4 0 rd ex mr mw me sd libsystemd.so.0.18.0
7fd7bcf6d000 ---p 00085000 fd:01 5902146 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 4 0 mr mw me sd libsystemd.so.0.18.0
7fd7bcf72000 rw-p 00000000 00:00 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 4 0 rd wr mr mw me ac sd
7fd7bcf8e000 rw-p 00000000 00:00 0 8 8 8 0 0 0 8 8 8 0 0 0 0 0 0 4 4 0 rd wr mr mw me ac sd
7fd7bcf90000 r--p 00026000 fd:01 5904081 4 4 4 0 0 0 4 4 4 0 0 0 0 0 0 4 4 0 rd mr mw me dw ac sd ld-2.25.so
7fd7bcf91000 rw-p 00027000 fd:01 5904081 8 8 8 0 0 0 8 8 8 0 0 0 0 0 0 4 4 0 rd wr mr mw me dw ac sd ld-2.25.so
7ffd2004d000 rw-p 00000000 00:00 0 136 28 28 0 0 0 28 28 28 0 0 0 0 0 0 4 4 0 rd wr mr mw me gd ac [stack]
7ffd20134000 r--p 00000000 00:00 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 4 0 rd mr pf io de dd sd [vvar]
7ffd20136000 r-xp 00000000 00:00 0 8 4 0 4 0 0 0 4 0 0 0 0 0 0 0 4 4 0 rd ex mr mw me de sd [vdso]
ffffffffff600000 r-xp 00000000 00:00 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 4 0 rd ex [vsyscall]
====== ==== ==== ============ ============ ============= ============= ========== ========= ============= ============== ============== =============== ==== ======= ============== =========== ======
164924 5340 1782 3656 0 160 1524 5340 1524 0 0 0 0 0 0 416 416 0 KB
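Here is my understanding of the different columns (all sizes are in kilobytes):
- Address : The virtual address at which the mapping starts.
- Perm : Whether this process is allowed to read (r), write (w) or execute (x) the pages in this mapping. The last letter is p if changes to the pages stay private to the process and s if they are shared.
- Offset : For a file-backed mapping, the offset into the file at which the mapping starts.
- Device : The major:minor number of the device that holds the mapped file (00:00 when there is no file).
- Inode : The inode number of the mapped file on that device (0 when there is no file).
- Size : The size of the mapping in Kilobytes (no I am not going to call 1024 bytes Kibibytes). The page size itself is in the KernelPageSize and MMUPageSize columns.
- Rss : Resident set size. This is the amount of the mapping that is actually in RAM.
- Pss : Proportional set size. Like Rss, except that every shared page is divided by the number of processes sharing it.
- Shared / Private : How much of the resident memory is also mapped by other processes (shared) or only by this one (private).
- Clean / Dirty : Whether those pages have been modified.
- HugePages : https://wiki.debian.org/Hugepages . Pages much bigger than 4 KB (to reduce the size of the page table).
- Swap : How much of the mapping is currently swapped out to disk.
- Anonymous : How much of the resident memory does not belong to a file.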
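Since pmap -XX is tabulating /proc/<pid>/smaps, the totals row at the bottom of the table can be approximated by summing the `Key: N kB` lines of that file. The sketch below is my own illustration, not how pmap is implemented. The sample text reuses the numbers of two rows from the table above; the end addresses are derived from the Size column and the file paths are hypothetical.

```python
import os
import re
from collections import Counter

# Matches lines such as "Rss:                  96 kB".
FIELD = re.compile(r"^(\w+):\s+(\d+) kB$")

# Two mappings with the Size/Rss/Pss values from the table above.
# The paths are hypothetical; pmap only printed the basenames.
SAMPLE = """\
55cde8d1a000-55cde8d32000 r-xp 00000000 fd:01 5899811 /usr/bin/top
Size:                 96 kB
Rss:                  96 kB
Pss:                  96 kB
VmFlags: rd ex mr mw me dw sd
7fd7b2efd000-7fd7b2f08000 r-xp 00000000 fd:01 5904102 /usr/lib/libnss_files-2.25.so
Size:                 44 kB
Rss:                  44 kB
Pss:                   0 kB
VmFlags: rd ex mr mw me sd
"""


def sum_smaps(text: str) -> Counter:
    """Add up every 'Key: N kB' field across all mappings."""
    totals = Counter()
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if match:
            totals[match.group(1)] += int(match.group(2))
    return totals


if __name__ == "__main__":
    path = "/proc/self/smaps"
    if os.path.exists(path):  # Linux only
        with open(path) as f:
            totals = sum_smaps(f.read())
    else:
        totals = sum_smaps(SAMPLE)
    for key in ("Size", "Rss", "Pss", "Swap"):
        print(f"{key:5} {totals[key]:>10} kB")
```

Header lines and the VmFlags line do not match the pattern, so only the numeric fields are counted.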
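I’m not sure I understand how to interpret the table fully, but thankfully I’ve never needed to. As far as I can tell, each row describes one mapping: a contiguous range of pages that share the same permissions and the same backing file (or no file at all). The [stack] and [heap] rows are the names the kernel gives to the process’s main stack and to its heap. Toodly-do.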
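One way to check what a row stands for is to look at /proc/<pid>/maps, where every line carries both a start and an end address, so a row is a range rather than a single page. Here is a rough sketch of a parser; the `Mapping` type and function name are made up for illustration, and the end address in the sample line is reconstructed from the 1096 KB Size of the [heap] row above.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    device: str
    inode: int
    path: str

    @property
    def size_kb(self) -> int:
        return (self.end - self.start) // 1024

    def page_count(self, page_size: int = 4096) -> int:
        # 4096 is an assumption; it matches the KernelPageSize column.
        return (self.end - self.start) // page_size


def parse_maps_line(line: str) -> Mapping:
    """Parse one line of /proc/<pid>/maps."""
    parts = line.split(None, 5)
    addr, perms, offset, device, inode = parts[:5]
    path = parts[5].strip() if len(parts) > 5 else ""
    start, end = (int(x, 16) for x in addr.split("-"))
    return Mapping(start, end, perms, int(offset, 16), device, int(inode), path)


if __name__ == "__main__":
    heap = parse_maps_line(
        "55cdeae5e000-55cdeaf70000 rw-p 00000000 00:00 0          [heap]"
    )
    print(heap.path, heap.size_kb, "KB in", heap.page_count(), "pages")
```

A single row can therefore cover hundreds of pages, only some of which may be resident.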
Logical View
Let’s take a step back and try to understand what the different kinds of pages may be.
Here PageA is a shared page. Shared pages, although mapped into the virtual memory of several processes, take up only one page of RAM. Linux also has copy-on-write, so a page that nobody modifies is never copied; a process gets its own copy only when it writes to the page.
PageB, although allocated by the Linux kernel in the address space of the process, has never been used. So no RAM is actually being used for it.
PageC is a private page and is resident in RAM.
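The difference between a page that has been mapped but never used and one that is resident can be observed from inside a process. The sketch below is Linux-only and makes a few assumptions: it reads the resident page count from /proc/self/statm, maps 32 MB of private anonymous memory, and then writes one byte per page. The function names are my own.

```python
import mmap
import os

PAGE = os.sysconf("SC_PAGE_SIZE")


def resident_pages() -> int:
    """Resident pages of this process (second field of /proc/self/statm)."""
    with open("/proc/self/statm") as f:
        return int(f.read().split()[1])


def touch_demo(size: int = 32 * 1024 * 1024):
    """Return (RSS growth after mapping, RSS growth after touching), in pages."""
    before = resident_pages()
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    mapped = resident_pages()
    # Writing one byte per page forces the kernel to back each page with RAM.
    for offset in range(0, size, PAGE):
        region[offset] = 1
    touched = resident_pages()
    region.close()
    return mapped - before, touched - mapped


if __name__ == "__main__":
    after_map, after_touch = touch_demo()
    print("pages added by mapping: ", after_map)
    print("pages added by touching:", after_touch)
```

Mapping the region should barely move the resident count, while touching it should add roughly one resident page per page written. The exact numbers depend on the machine (transparent huge pages, for instance, change the granularity).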
Swapping
Now let’s say you turn on swap. At this point Linux has the luxury of mapping some pages in the process’s address space to disk instead of physical memory. It may choose to do this based on several criteria. Maybe there isn’t a whole lot of RAM left. Maybe PageC hasn’t been used in a while. Then,
The kernel has moved PageC to disk. Of course, if the process tries to use this page (read or write), this causes a page fault and the Linux kernel moves PageC back into RAM (since RAM is the only memory directly addressable by the CPU).
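To see how much of a process has been swapped out, one option is the VmSwap line of /proc/<pid>/status. A minimal sketch follows; the helper name is hypothetical, and the sample text is illustrative (it reuses the Rss and Swap totals from the pmap table above rather than a real status file).

```python
import os
import re

# Illustrative excerpt in the format of /proc/<pid>/status.
SAMPLE = "Name:\ttop\nVmRSS:\t    5340 kB\nVmSwap:\t       0 kB\n"


def status_kb(status_text: str, key: str):
    """Return the value of a 'Key:   N kB' line, or None if it is absent."""
    pattern = rf"^{re.escape(key)}:\s+(\d+) kB$"
    match = re.search(pattern, status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    path = "/proc/self/status"
    if os.path.exists(path):  # Linux only
        with open(path) as f:
            text = f.read()
    else:
        text = SAMPLE
    print("resident:", status_kb(text, "VmRSS"), "kB")
    print("swapped: ", status_kb(text, "VmSwap"), "kB")
```

A non-zero VmSwap means some of the process's pages currently live on the swap device and will fault back in when touched.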
No Swapping (or swap is full)
In case you turn swap off with swapoff, a process will still be able to get virtual pages, but as soon as it writes to a page, space must be made available in RAM. When the pages in use across all processes add up to more than the RAM, and the kernel cannot free enough by dropping caches, the Linux OOM Killer kicks in. It picks its victim based on each process’s oom_score. I know this could lead to some pretty important processes getting shot (including sshd). I’m not sure what would happen if a single process tried using all its virtual memory exceeding the RAM with swap off.
You can limit a process from asking for too much virtual memory by setting ulimits:
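The scores the OOM Killer looks at are exposed per process in /proc/<pid>/oom_score, where a higher number means a more likely victim. Here is a rough sketch that lists the top candidates; the function names are my own, and the `proc` parameter exists only so the code can be pointed at a different directory.

```python
import os


def read_oom_scores(proc: str = "/proc") -> dict:
    """Map pid -> oom_score for every process we are allowed to read."""
    scores = {}
    try:
        entries = os.listdir(proc)
    except OSError:
        return scores  # no procfs here (not Linux)
    for name in entries:
        if not name.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        try:
            with open(os.path.join(proc, name, "oom_score")) as f:
                scores[int(name)] = int(f.read())
        except (OSError, ValueError):
            continue  # process exited or is unreadable
    return scores


def likely_victims(scores: dict, n: int = 5) -> list:
    """The n (pid, score) pairs with the highest scores, highest first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    for pid, score in likely_victims(read_oom_scores()):
        print(f"pid {pid:>7}  oom_score {score}")
```

To keep something like sshd out of the firing line, its /proc/<pid>/oom_score_adj can be lowered (down to -1000, which exempts the process); writing that file needs appropriate privileges.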
$ ulimit -v
unlimited
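The same limit can be read from inside a program. In Python it is exposed as RLIMIT_AS in the standard `resource` module; the value is in bytes, whereas `ulimit -v` reports kilobytes. A small sketch (the helper names are mine):

```python
import resource


def format_limit(value: int) -> str:
    """Render an RLIMIT_AS value (bytes) the way `ulimit -v` does (KB)."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return str(value // 1024)


def address_space_limit():
    """Return the (soft, hard) virtual memory limits, formatted."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = address_space_limit()
    print("soft:", soft)
    print("hard:", hard)
```

Calling `resource.setrlimit(resource.RLIMIT_AS, (soft, hard))` with smaller values lowers the limit for the current process and its children, after which allocations beyond it fail instead of succeeding lazily.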
In addition there are a bunch of things one can control using the proc file system (although I try not to because I don’t know enough). I’ve used drop_caches and swappiness.
$ ls /proc/sys/vm
admin_reserve_kbytes dirty_expire_centisecs hugetlb_shm_group min_free_kbytes nr_hugepages_mempolicy overcommit_memory swappiness
block_dump dirty_ratio laptop_mode min_slab_ratio nr_overcommit_hugepages overcommit_ratio user_reserve_kbytes
compact_memory dirtytime_expire_seconds legacy_va_layout min_unmapped_ratio nr_pdflush_threads page-cluster vfs_cache_pressure
compact_unevictable_allowed dirty_writeback_centisecs lowmem_reserve_ratio mmap_min_addr numa_zonelist_order panic_on_oom watermark_scale_factor
dirty_background_bytes drop_caches max_map_count mmap_rnd_bits oom_dump_tasks percpu_pagelist_fraction zone_reclaim_mode
dirty_background_ratio extfrag_threshold memory_failure_early_kill mmap_rnd_compat_bits oom_kill_allocating_task stat_interval
dirty_bytes hugepages_treat_as_movable memory_failure_recovery nr_hugepages overcommit_kbytes stat_refresh
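Each of these entries is a small text file, so reading the current setting is just a file read. Here is a minimal sketch; the function name is hypothetical and the `root` parameter is there only so it can be pointed elsewhere. Changing a value means writing to the same file as root, which I have left out on purpose.

```python
import os


def read_vm_tunable(name: str, root: str = "/proc/sys/vm") -> str:
    """Return the current value of one vm tunable as a string."""
    with open(os.path.join(root, name)) as f:
        return f.read().strip()


if __name__ == "__main__":
    for name in ("swappiness", "overcommit_memory", "overcommit_ratio"):
        try:
            print(f"{name} = {read_vm_tunable(name)}")
        except OSError as err:  # not Linux, or the file is not readable
            print(f"{name}: {err}")
```

Not every file in the listing is readable (some, like compact_memory, only make sense to write to), which is why the loop tolerates errors.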
This is all in addition to memory-mapped files and caches, which are incredible pieces of engineering deserving their own post.
All content on this website is licensed as Creative Commons-Attribution-ShareAlike 4.0 License. Opinions expressed are solely my own.