.\" Copyright (c) 2016, IBM Corporation.
.\" Written by Mike Rapoport <rppt@linux.vnet.ibm.com>
.\" and Copyright (C) 2017 Michael Kerrisk <mtk.manpages@gmail.com>
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date.  The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein.  The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\" %%%LICENSE_END
.\"
.TH USERFAULTFD 2 2021-03-22 "Linux" "Linux Programmer's Manual"
.SH NAME
userfaultfd \- create a file descriptor for handling page faults in user space
.SH SYNOPSIS
.nf
.B #include <sys/types.h>
.B #include <linux/userfaultfd.h>
.PP
.BI "int userfaultfd(int " flags );
.fi
.PP
.IR Note :
There is no glibc wrapper for this system call; see NOTES.
.SH DESCRIPTION
.BR userfaultfd ()
creates a new userfaultfd object that can be used for delegation of page-fault
handling to a user-space application,
and returns a file descriptor that refers to the new object.
The new userfaultfd object is configured using
.BR ioctl (2).
.PP
Once the userfaultfd object is configured, the application can use
.BR read (2)
to receive userfaultfd notifications.
The reads from userfaultfd may be blocking or non-blocking,
depending on the value of
.I flags
used for the creation of the userfaultfd or subsequent calls to
.BR fcntl (2).
.PP
The following values may be bitwise ORed in
.I flags
to change the behavior of
.BR userfaultfd ():
.TP
.B O_CLOEXEC
Enable the close-on-exec flag for the new userfaultfd file descriptor.
See the description of the
.B O_CLOEXEC
flag in
.BR open (2).
.TP
.B O_NONBLOCK
Enables non-blocking operation for the userfaultfd object.
See the description of the
.B O_NONBLOCK
flag in
.BR open (2).
.PP
When the last file descriptor referring to a userfaultfd object is closed,
all memory ranges that were registered with the object are unregistered
and unread events are flushed.
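.PP
Because there is no glibc wrapper, the object is typically created with
.BR syscall (2).
The following is a minimal sketch of that call;
the helper name
.I uffd_create
is hypothetical and is not part of any library.
.PP
.in +4n
.EX
```c
#define _GNU_SOURCE
#include <fcntl.h>          /* O_CLOEXEC, O_NONBLOCK */
#include <sys/syscall.h>    /* SYS_userfaultfd */
#include <unistd.h>         /* syscall() */

/* Hypothetical helper (not a library function): create a userfaultfd
   object with the given flags.  Returns the new file descriptor,
   or -1 with errno set. */
int
uffd_create(int flags)
{
    return (int) syscall(SYS_userfaultfd, flags);
}
```
.EE
.in
.PP
A caller might use
.I "uffd_create(O_CLOEXEC | O_NONBLOCK)"
and then check the result against \-1 before going on to configure the object.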
.\"
.PP
Userfaultfd supports two modes of registration:
.TP
.BR UFFDIO_REGISTER_MODE_MISSING " (since 4.10)"
When registered with
.B UFFDIO_REGISTER_MODE_MISSING
mode, user-space will receive a page-fault notification
when a missing page is accessed.
The faulted thread will be stopped from execution until the page fault is
resolved from user-space by either an
.B UFFDIO_COPY
or an
.B UFFDIO_ZEROPAGE
ioctl.
.TP
.BR UFFDIO_REGISTER_MODE_WP " (since 5.7)"
When registered with
.B UFFDIO_REGISTER_MODE_WP
mode, user-space will receive a page-fault notification
when a write-protected page is written.
The faulted thread will be stopped from execution
until user-space write-unprotects the page using an
.B UFFDIO_WRITEPROTECT
ioctl.
.PP
Multiple modes can be enabled at the same time for the same memory range.
.PP
Since Linux 4.14, a userfaultfd page-fault notification can selectively embed
faulting thread ID information into the notification.
One needs to enable this feature explicitly using the
.B UFFD_FEATURE_THREAD_ID
feature bit when initializing the userfaultfd context.
By default, thread ID reporting is disabled.
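.PP
As an illustration of the thread ID feature,
the sketch below extracts the thread ID from a page-fault message.
The helper name
.I fault_thread_id
is hypothetical, and
.I enabled_features
is assumed to be the feature mask that the application itself requested
when it initialized the userfaultfd context.
.PP
.in +4n
.EX
```c
#include <stdint.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: return the faulting thread ID carried by a
   page-fault message, or 0 when the message is not a page fault or
   UFFD_FEATURE_THREAD_ID was not among the requested features. */
uint32_t
fault_thread_id(const struct uffd_msg *msg, uint64_t enabled_features)
{
    if (msg->event != UFFD_EVENT_PAGEFAULT)
        return 0;
    if ((enabled_features & UFFD_FEATURE_THREAD_ID) == 0)
        return 0;
    return msg->arg.pagefault.feat.ptid;
}
```
.EE
.in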
.PP
The userfaultfd mechanism is designed to allow a thread in a multithreaded
program to perform user-space paging for the other threads in the process.
When a page fault occurs for one of the regions registered
to the userfaultfd object,
the faulting thread is put to sleep and
an event is generated that can be read via the userfaultfd file descriptor.
The fault-handling thread reads events from this file descriptor and services
them using the operations described in
.BR ioctl_userfaultfd (2).
When servicing the page fault events,
the fault-handling thread can trigger a wake-up for the sleeping thread.
.PP
It is possible for the faulting threads and the fault-handling threads
to run in the context of different processes.
In this case, these threads may belong to different programs,
and the program that executes the faulting threads
will not necessarily cooperate with the program that handles the page faults.
In such non-cooperative mode,
the process that monitors userfaultfd and handles page faults
needs to be aware of the changes in the virtual memory layout
of the faulting process to avoid memory corruption.
.PP
Since Linux 4.11,
userfaultfd can also notify the fault-handling threads about changes
in the virtual memory layout of the faulting process.
In addition, if the faulting process invokes
.BR fork (2),
the userfaultfd objects associated with the parent may be duplicated
into the child process and the userfaultfd monitor will be notified
(via the
.B UFFD_EVENT_FORK
described below)
about the file descriptor associated with the userfault objects
created for the child process,
which allows the userfaultfd monitor to perform user-space paging
for the child process.
Unlike page faults, which have to be synchronous and require an
explicit or implicit wakeup,
all other events are delivered asynchronously and
the non-cooperative process resumes execution as
soon as the userfaultfd manager executes
.BR read (2).
The userfaultfd manager should carefully synchronize calls to
.B UFFDIO_COPY
with the processing of events.
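.PP
To illustrate what staying aware of layout changes can mean in practice,
the sketch below updates a monitor's private record of one registered range
when a
.B UFFD_EVENT_REMAP
or
.B UFFD_EVENT_UNMAP
message arrives.
The structure
.I tracked_range
and the function
.I apply_layout_event
are hypothetical;
the sketch assumes that only whole-range moves and removals matter,
which a real monitor could not assume.
.PP
.in +4n
.EX
```c
#include <stdint.h>
#include <linux/userfaultfd.h>

/* Hypothetical bookkeeping record kept by the monitor for one
   registered range of the monitored process. */
struct tracked_range {
    uint64_t start;
    uint64_t len;       /* 0 means the range is no longer tracked */
};

/* Update the monitor's view of a tracked range after a layout event.
   Returns 1 if the record changed, 0 otherwise.  Partial moves and
   partial removals are deliberately ignored in this sketch. */
int
apply_layout_event(struct tracked_range *r, const struct uffd_msg *msg)
{
    switch (msg->event) {
    case UFFD_EVENT_REMAP:
        /* remap.len is the original mapping length */
        if (msg->arg.remap.from != r->start ||
            msg->arg.remap.len != r->len)
            return 0;
        r->start = msg->arg.remap.to;
        return 1;
    case UFFD_EVENT_UNMAP:
        /* Unmap details are reported in the 'remove' field */
        if (msg->arg.remove.start <= r->start &&
            msg->arg.remove.end >= r->start + r->len) {
            r->len = 0;
            return 1;
        }
        return 0;
    default:
        return 0;
    }
}
```
.EE
.in
.PP
A monitor would apply such an update before issuing further
.B UFFDIO_COPY
operations against the old addresses.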
.PP
The current asynchronous model of the event delivery is optimal for
single threaded non-cooperative userfaultfd manager implementations.
.\" Regarding the preceding sentence, Mike Rapoport says:
.\"     The major point here is that current events delivery model could be
.\"     problematic for multi-threaded monitor. I even suspect that it would be
.\"     impossible to ensure synchronization between page faults and non-page
.\"     fault events in multi-threaded monitor.
.\"
.\" FIXME elaborate about non-cooperating mode, describe its limitations
.\"     for kernels before 4.11, features added in 4.11
.\"     and limitations remaining in 4.11
.\"     Maybe it's worth adding a dedicated sub-section...
.\"
.PP
Since Linux 5.7, userfaultfd is able to do
synchronous page dirty tracking using the new write-protect register mode.
One should check against the feature bit
.B UFFD_FEATURE_PAGEFAULT_FLAG_WP
before using this feature.
Similar to the original userfaultfd missing mode, the write-protect mode will
generate a userfaultfd notification when the protected page is written.
The user needs to resolve the page fault by unprotecting the faulted page and
waking the faulted thread so that it continues.
For more information,
please refer to the "Userfaultfd write-protect mode" section.
.\"
.SS Userfaultfd operation
After the userfaultfd object is created with
.BR userfaultfd (),
the application must enable it using the
.B UFFDIO_API
.BR ioctl (2)
operation.
This operation allows a handshake between the kernel and user space
to determine the API version and supported features.
This operation must be performed before any of the other
.BR ioctl (2)
operations described below (or those operations fail with the
.BR EINVAL
error).
.PP
After a successful
.B UFFDIO_API
operation,
the application then registers memory address ranges using the
.B UFFDIO_REGISTER
.BR ioctl (2)
operation.
After successful completion of a
.B UFFDIO_REGISTER
operation,
a page fault occurring in the requested memory range, and satisfying
the mode defined at the registration time, will be forwarded by the kernel to
the user-space application.
The application can then use the
.B UFFDIO_COPY
or
.B UFFDIO_ZEROPAGE
.BR ioctl (2)
operations to resolve the page fault.
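.PP
The enable, register, and resolve steps can be sketched as three small helpers.
The function names
.RI ( uffd_enable ,
.IR uffd_register_missing ,
.IR uffd_resolve_zero )
are hypothetical;
the structures and request codes are those declared in
.IR <linux/userfaultfd.h> .
This is an outline under those assumptions rather than a complete handler.
.PP
.in +4n
.EX
```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Perform the UFFDIO_API handshake, requesting the given feature bits.
   Returns 0 on success, or -1 with errno set. */
int
uffd_enable(int uffd, uint64_t features)
{
    struct uffdio_api api;

    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    api.features = features;
    return ioctl(uffd, UFFDIO_API, &api);
}

/* Register [addr, addr + len) for missing-page notifications. */
int
uffd_register_missing(int uffd, void *addr, size_t len)
{
    struct uffdio_register reg;

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (uintptr_t) addr;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    return ioctl(uffd, UFFDIO_REGISTER, &reg);
}

/* Resolve a missing-page fault by mapping a zero-filled page at the
   page-aligned address 'page_start'. */
int
uffd_resolve_zero(int uffd, uint64_t page_start, uint64_t page_size)
{
    struct uffdio_zeropage zp;

    memset(&zp, 0, sizeof(zp));
    zp.range.start = page_start;
    zp.range.len = page_size;
    return ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
}
```
.EE
.in
.PP
The helpers are meant to be called in the order shown;
each returns \-1 with
.I errno
set if the underlying
.BR ioctl (2)
fails.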
.PP
Since Linux 4.14, if the application sets the
.B UFFD_FEATURE_SIGBUS
feature bit using the
.B UFFDIO_API
.BR ioctl (2),
no page-fault notification will be forwarded to user space.
Instead a
.B SIGBUS
signal is delivered to the faulting process.
With this feature,
userfaultfd can be used for robustness purposes to simply catch
any access to areas within the registered address range that do not
have pages allocated, without having to listen to userfaultfd events.
No userfaultfd monitor will be required for dealing with such memory
accesses.
For example, this feature can be useful for applications that
want to prevent the kernel from automatically allocating pages and filling
holes in sparse files when the hole is accessed through a memory mapping.
.PP
The
.B UFFD_FEATURE_SIGBUS
feature is implicitly inherited through
.BR fork (2)
if used in combination with
.BR UFFD_FEATURE_EVENT_FORK .
.PP
Details of the various
.BR ioctl (2)
operations can be found in
.BR ioctl_userfaultfd (2).
.PP
Since Linux 4.11, events other than page-fault may be enabled during the
.B UFFDIO_API
operation.
.PP
Up to Linux 4.11,
userfaultfd can be used only with anonymous private memory mappings.
Since Linux 4.11,
userfaultfd can also be used with hugetlbfs and shared memory mappings.
.\"
.SS Userfaultfd write-protect mode (since 5.7)
Since Linux 5.7, userfaultfd supports write-protect mode.
The user needs to first check availability of this feature using the
.B UFFDIO_API
ioctl against the feature bit
.B UFFD_FEATURE_PAGEFAULT_FLAG_WP
before using this feature.
.PP
To register with userfaultfd write-protect mode, the user needs to initiate the
.B UFFDIO_REGISTER
ioctl with mode
.B UFFDIO_REGISTER_MODE_WP
set.
Note that it is legal to monitor the same memory range with multiple modes.
For example, the user can do
.B UFFDIO_REGISTER
with the mode set to
.BR "UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP" .
When there is only
.B UFFDIO_REGISTER_MODE_WP
registered, user-space will
.I not
receive any notification when a missing page is written.
Instead, user-space will receive a write-protect page-fault notification
only when an existing but write-protected page is written.
.PP
After the
.B UFFDIO_REGISTER
ioctl has completed with
.B UFFDIO_REGISTER_MODE_WP
mode set,
the user can write-protect any existing memory within the range using the ioctl
.B UFFDIO_WRITEPROTECT
where
.I uffdio_writeprotect.mode
should be set to
.BR UFFDIO_WRITEPROTECT_MODE_WP .
.PP
When a write-protect event happens,
user-space will receive a page-fault notification whose
.I uffd_msg.pagefault.flags
has the
.B UFFD_PAGEFAULT_FLAG_WP
flag set.
Note: since only writes can trigger this kind of fault,
write-protect notifications will always have the
.B UFFD_PAGEFAULT_FLAG_WRITE
bit set along with the
.BR UFFD_PAGEFAULT_FLAG_WP
bit.
.PP
To resolve a write-protection page fault, the user should initiate another
.B UFFDIO_WRITEPROTECT
ioctl, whose
.I uffdio_writeprotect.mode
should have the flag
.B UFFDIO_WRITEPROTECT_MODE_WP
cleared upon the faulted page or range.
.PP
Write-protect mode supports only private anonymous memory.
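.PP
The protect and unprotect steps differ only in the
.I mode
field, as the following sketch shows.
The helper names
.I uffd_writeprotect
and
.I is_wp_fault
are hypothetical;
the sketch assumes the range has already been registered with
.BR UFFDIO_REGISTER_MODE_WP .
.PP
.in +4n
.EX
```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Write-protect (protect != 0) or write-unprotect (protect == 0)
   the range [start, start + len).
   Returns 0 on success, or -1 with errno set. */
int
uffd_writeprotect(int uffd, uint64_t start, uint64_t len, int protect)
{
    struct uffdio_writeprotect wp;

    memset(&wp, 0, sizeof(wp));
    wp.range.start = start;
    wp.range.len = len;
    wp.mode = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
    return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}

/* Return nonzero if 'msg' reports a write-protect fault. */
int
is_wp_fault(const struct uffd_msg *msg)
{
    return msg->event == UFFD_EVENT_PAGEFAULT &&
           (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) != 0;
}
```
.EE
.in
.PP
A handler that sees
.I is_wp_fault
return nonzero would record the page as dirty and then call
.I uffd_writeprotect
with
.I protect
set to 0 for the faulting page.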
.SS Reading from the userfaultfd structure
Each
.BR read (2)
from the userfaultfd file descriptor returns one or more
.I uffd_msg
structures, each of which describes a page-fault event
or an event required for the non-cooperative userfaultfd usage:
.PP
.in +4n
.EX
struct uffd_msg {
    __u8  event;            /* Type of event */
    ...
    union {
        struct {
            __u64 flags;    /* Flags describing fault */
            __u64 address;  /* Faulting address */
            union {
                __u32 ptid; /* Thread ID of the fault */
            } feat;
        } pagefault;

        struct {            /* Since Linux 4.11 */
            __u32 ufd;      /* Userfault file descriptor
                               of the child process */
        } fork;

        struct {            /* Since Linux 4.11 */
            __u64 from;     /* Old address of remapped area */
            __u64 to;       /* New address of remapped area */
            __u64 len;      /* Original mapping length */
        } remap;

        struct {            /* Since Linux 4.11 */
            __u64 start;    /* Start address of removed area */
            __u64 end;      /* End address of removed area */
        } remove;
        ...
    } arg;
    /* Padding fields omitted */
} __packed;
.EE
.in
.PP
If multiple events are available and the supplied buffer is large enough,
.BR read (2)
returns as many events as will fit in the supplied buffer.
If the buffer supplied to
.BR read (2)
is smaller than the size of the
.I uffd_msg
structure, the
.BR read (2)
fails with the error
.BR EINVAL .
.PP
The fields set in the
.I uffd_msg
structure are as follows:
.TP
.I event
The type of event.
Depending on the event type,
different fields of the
.I arg
union represent details required for the event processing.
The non-page-fault events are generated only when the appropriate feature
is enabled during the API handshake with the
.B UFFDIO_API
.BR ioctl (2).
.IP
The following values can appear in the
.I event
field:
.RS
.TP
.BR UFFD_EVENT_PAGEFAULT " (since Linux 4.3)"
A page-fault event.
The page-fault details are available in the
.I pagefault
field.
.TP
.BR UFFD_EVENT_FORK " (since Linux 4.11)"
Generated when the faulting process invokes
.BR fork (2)
(or
.BR clone (2)
without the
.BR CLONE_VM
flag).
The event details are available in the
.I fork
field.
.\" FIXME describe duplication of userfault file descriptor during fork
.TP
.BR UFFD_EVENT_REMAP " (since Linux 4.11)"
Generated when the faulting process invokes
.BR mremap (2).
The event details are available in the
.I remap
field.
.TP
.BR UFFD_EVENT_REMOVE " (since Linux 4.11)"
Generated when the faulting process invokes
.BR madvise (2)
with
.BR MADV_DONTNEED
or
.BR MADV_REMOVE
advice.
The event details are available in the
.I remove
field.
.TP
.BR UFFD_EVENT_UNMAP " (since Linux 4.11)"
Generated when the faulting process unmaps a memory range,
either explicitly using
.BR munmap (2)
or implicitly during
.BR mmap (2)
or
.BR mremap (2).
The event details are available in the
.I remove
field.
.RE
.TP
.I pagefault.address
The address that triggered the page fault.
.TP
.I pagefault.flags
A bit mask of flags that describe the event.
For
.BR UFFD_EVENT_PAGEFAULT ,
the following flags may appear:
.RS
.TP
.B UFFD_PAGEFAULT_FLAG_WRITE
If the address is in a range that was registered with the
.B UFFDIO_REGISTER_MODE_MISSING
flag (see
.BR ioctl_userfaultfd (2))
and this flag is set, this is a write fault;
otherwise it is a read fault.
.TP
.B UFFD_PAGEFAULT_FLAG_WP
If the address is in a range that was registered with the
.B UFFDIO_REGISTER_MODE_WP
flag and this bit is set, this is a write-protect fault;
otherwise it is a page-missing fault.
.RE
.TP
.I pagefault.feat.ptid
The thread ID that triggered the page fault.
.TP
.I fork.ufd
The file descriptor associated with the userfault object
created for the child created by
.BR fork (2).
.TP
.I remap.from
The original address of the memory range that was remapped using
.BR mremap (2).
.TP
.I remap.to
The new address of the memory range that was remapped using
.BR mremap (2).
.TP
.I remap.len
The original length of the memory range that was remapped using
.BR mremap (2).
.TP
.I remove.start
The start address of the memory range that was freed using
.BR madvise (2)
or unmapped
.TP
.I remove.end
The end address of the memory range that was freed using
.BR madvise (2)
or unmapped
.PP
A
.BR read (2)
on a userfaultfd file descriptor can fail with the following errors:
.TP
.B EINVAL
The userfaultfd object has not yet been enabled using the
.BR UFFDIO_API
.BR ioctl (2)
operation
.PP
If the
.B O_NONBLOCK
flag is enabled in the associated open file description,
the userfaultfd file descriptor can be monitored with
.BR poll (2),
.BR select (2),
and
.BR epoll (7).
When events are available, the file descriptor is indicated as readable.
If the
.B O_NONBLOCK
flag is not enabled, then
.BR poll (2)
(always) indicates the file as having a
.B POLLERR
condition, and
.BR select (2)
indicates the file descriptor as both readable and writable.
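.PP
As a sketch of how a monitor might consume messages,
the helpers below read several
.I uffd_msg
structures with one
.BR read (2)
and map each event code to a name.
Both function names are hypothetical.
The treatment of
.B EAGAIN
as "no events pending" is an assumption that applies only to a descriptor with
.B O_NONBLOCK
enabled.
.PP
.in +4n
.EX
```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Map an event code to a printable name (NULL if unknown). */
const char *
uffd_event_name(unsigned char event)
{
    switch (event) {
    case UFFD_EVENT_PAGEFAULT: return "pagefault";
    case UFFD_EVENT_FORK:      return "fork";
    case UFFD_EVENT_REMAP:     return "remap";
    case UFFD_EVENT_REMOVE:    return "remove";
    case UFFD_EVENT_UNMAP:     return "unmap";
    default:                   return NULL;
    }
}

/* Read up to 'cap' messages with a single read(2).
   Returns the number of whole messages read, 0 if nothing is pending
   on a nonblocking descriptor, or -1 on any other error. */
long
uffd_read_batch(int uffd, struct uffd_msg *buf, size_t cap)
{
    ssize_t n = read(uffd, buf, cap * sizeof(*buf));

    if (n == -1)
        return (errno == EAGAIN) ? 0 : -1;
    return (long) (n / (ssize_t) sizeof(*buf));
}
```
.EE
.in
.PP
A caller would typically wait with
.BR poll (2)
first, then loop over the returned messages and dispatch on
.IR event .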
.\" FIXME What is the reason for this seemingly odd behavior with respect
.\" to the O_NONBLOCK flag? (see userfaultfd_poll() in fs/userfaultfd.c).
.\" Something needs to be said about this.
.SH RETURN VALUE
On success,
.BR userfaultfd ()
returns a new file descriptor that refers to the userfaultfd object.
On error, \-1 is returned, and
.I errno
is set to indicate the error.
.SH ERRORS
.TP
.B EINVAL
An unsupported value was specified in
.IR flags .
.TP
.B EMFILE
The per-process limit on the number of open file descriptors has been
reached.
.TP
.B ENFILE
The system-wide limit on the total number of open files has been
reached.
.TP
.B ENOMEM
Insufficient kernel memory was available.
.TP
.BR EPERM " (since Linux 5.2)"
.\" cefdca0a86be517bc390fc4541e3674b8e7803b0
The caller is not privileged (does not have the
.B CAP_SYS_PTRACE
capability in the initial user namespace), and
.I /proc/sys/vm/unprivileged_userfaultfd
has the value 0.
.SH VERSIONS
The
.BR userfaultfd ()
system call first appeared in Linux 4.3.
.PP
The support for hugetlbfs and shared memory areas and
non-page-fault events was added in Linux 4.11.
.SH CONFORMING TO
.BR userfaultfd ()
is Linux-specific and should not be used in programs intended to be
portable.
.SH NOTES
Glibc does not provide a wrapper for this system call; call it using
.BR syscall (2).
.PP
The userfaultfd mechanism can be used as an alternative to
traditional user-space paging techniques based on the use of the
.B SIGSEGV
signal and
.BR mmap (2).
It can also be used to implement lazy restore
for checkpoint/restore mechanisms,
as well as post-copy migration to allow (nearly) uninterrupted execution
when transferring virtual machines and Linux containers
from one host to another.
.SH BUGS
If the
.B UFFD_FEATURE_EVENT_FORK
is enabled and a system call from the
.BR fork (2)
family is interrupted by a signal or fails, a stale userfaultfd descriptor
might be created.
In this case, a spurious
.B UFFD_EVENT_FORK
will be delivered to the userfaultfd monitor.
.SH EXAMPLES
The program below demonstrates the use of the userfaultfd mechanism.
The program creates two threads, one of which acts as the
page-fault handler for the process, for the pages in a demand-page zero
region created using
.BR mmap (2).
.PP
The program takes one command-line argument,
which is the number of pages that will be created in a mapping
whose page faults will be handled via userfaultfd.
After creating a userfaultfd object,
the program then creates an anonymous private mapping of the specified size
and registers the address range of that mapping using the
.B UFFDIO_REGISTER
.BR ioctl (2)
operation.
The program then creates a second thread that will perform the
task of handling page faults.
.PP
The main thread then walks through the pages of the mapping fetching
bytes from successive pages.
Because the pages have not yet been accessed,
the first access of a byte in each page will trigger a page-fault event
on the userfaultfd file descriptor.
.PP
Each of the page-fault events is handled by the second thread,
which sits in a loop processing input from the userfaultfd file descriptor.
In each loop iteration, the second thread first calls
.BR poll (2)
to check the state of the file descriptor,
and then reads an event from the file descriptor.
All such events should be
.B UFFD_EVENT_PAGEFAULT
events,
which the thread handles by copying a page of data into
the faulting region using the
.B UFFDIO_COPY
.BR ioctl (2)
operation.
.PP
The following is an example of what we see when running the program:
.PP
.in +4n
.EX
$ \fB./userfaultfd_demo 3\fP
Address returned by mmap() = 0x7fd30106c000

fault_handler_thread():
    poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
    UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106c00f
        (uffdio_copy.copy returned 4096)
Read address 0x7fd30106c00f in main(): A
Read address 0x7fd30106c40f in main(): A
Read address 0x7fd30106c80f in main(): A
Read address 0x7fd30106cc0f in main(): A

fault_handler_thread():
    poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
    UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106d00f
        (uffdio_copy.copy returned 4096)
Read address 0x7fd30106d00f in main(): B
Read address 0x7fd30106d40f in main(): B
Read address 0x7fd30106d80f in main(): B
Read address 0x7fd30106dc0f in main(): B

fault_handler_thread():
    poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
    UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106e00f
        (uffdio_copy.copy returned 4096)
Read address 0x7fd30106e00f in main(): C
Read address 0x7fd30106e40f in main(): C
Read address 0x7fd30106e80f in main(): C
Read address 0x7fd30106ec0f in main(): C
.EE
.in
.SS Program source
\&
.EX
/* userfaultfd_demo.c

   Licensed under the GNU General Public License version 2 or later.
*/
#define _GNU_SOURCE
#include <inttypes.h>
#include <sys/types.h>
#include <stdio.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <signal.h>
#include <poll.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \e
                        } while (0)

static int page_size;

static void *
fault_handler_thread(void *arg)
{
    static struct uffd_msg msg;   /* Data read from userfaultfd */
    static int fault_cnt = 0;     /* Number of faults so far handled */
    long uffd;                    /* userfaultfd file descriptor */
    static char *page = NULL;
    struct uffdio_copy uffdio_copy;
    ssize_t nread;

    uffd = (long) arg;

    /* Create a page that will be copied into the faulting region. */

    if (page == NULL) {
        page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, \-1, 0);
        if (page == MAP_FAILED)
            errExit("mmap");
    }

    /* Loop, handling incoming events on the userfaultfd
       file descriptor. */

    for (;;) {

        /* See what poll() tells us about the userfaultfd. */

        struct pollfd pollfd;
        int nready;
        pollfd.fd = uffd;
        pollfd.events = POLLIN;
        nready = poll(&pollfd, 1, \-1);
        if (nready == \-1)
            errExit("poll");

        printf("\enfault_handler_thread():\en");
        printf("    poll() returns: nready = %d; "
                "POLLIN = %d; POLLERR = %d\en", nready,
                (pollfd.revents & POLLIN) != 0,
                (pollfd.revents & POLLERR) != 0);

        /* Read an event from the userfaultfd. */

        nread = read(uffd, &msg, sizeof(msg));
        if (nread == 0) {
            printf("EOF on userfaultfd!\en");
            exit(EXIT_FAILURE);
        }

        if (nread == \-1)
            errExit("read");

        /* We expect only one kind of event; verify that assumption. */

        if (msg.event != UFFD_EVENT_PAGEFAULT) {
            fprintf(stderr, "Unexpected event on userfaultfd\en");
            exit(EXIT_FAILURE);
        }

        /* Display info about the page\-fault event. */

        printf("    UFFD_EVENT_PAGEFAULT event: ");
        printf("flags = %"PRIx64"; ", msg.arg.pagefault.flags);
        printf("address = %"PRIx64"\en", msg.arg.pagefault.address);

        /* Copy the page pointed to by \(aqpage\(aq into the faulting
           region. Vary the contents that are copied in, so that it
           is more obvious that each fault is handled separately. */

        memset(page, \(aqA\(aq + fault_cnt % 20, page_size);
        fault_cnt++;

        uffdio_copy.src = (unsigned long) page;

        /* We need to handle page faults in units of pages(!).
           So, round faulting address down to page boundary. */

        uffdio_copy.dst = (unsigned long) msg.arg.pagefault.address &
                                           \(ti(page_size \- 1);
        uffdio_copy.len = page_size;
        uffdio_copy.mode = 0;
        uffdio_copy.copy = 0;
        if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == \-1)
            errExit("ioctl\-UFFDIO_COPY");

        printf("        (uffdio_copy.copy returned %"PRId64")\en",
                uffdio_copy.copy);
    }
}

int
main(int argc, char *argv[])
{
    long uffd;          /* userfaultfd file descriptor */
    char *addr;         /* Start of region handled by userfaultfd */
    uint64_t len;       /* Length of region handled by userfaultfd */
    pthread_t thr;      /* ID of thread that handles page faults */
    struct uffdio_api uffdio_api;
    struct uffdio_register uffdio_register;
    int s;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s num\-pages\en", argv[0]);
        exit(EXIT_FAILURE);
    }

    page_size = sysconf(_SC_PAGE_SIZE);
    len = strtoull(argv[1], NULL, 0) * page_size;

    /* Create and enable userfaultfd object. */

    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd == \-1)
        errExit("userfaultfd");

    uffdio_api.api = UFFD_API;
    uffdio_api.features = 0;
    if (ioctl(uffd, UFFDIO_API, &uffdio_api) == \-1)
        errExit("ioctl\-UFFDIO_API");

    /* Create a private anonymous mapping. The memory will be
       demand\-zero paged\-\-that is, not yet allocated. When we
       actually touch the memory, it will be allocated via
       the userfaultfd. */

    addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, \-1, 0);
    if (addr == MAP_FAILED)
        errExit("mmap");

    printf("Address returned by mmap() = %p\en", addr);

    /* Register the memory range of the mapping we just created for
       handling by the userfaultfd object. In the mode field, we ask
       to track missing pages (i.e., pages that have not yet been
       faulted in). */

    uffdio_register.range.start = (unsigned long) addr;
    uffdio_register.range.len = len;
    uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == \-1)
        errExit("ioctl\-UFFDIO_REGISTER");

    /* Create a thread that will process the userfaultfd events. */

    s = pthread_create(&thr, NULL, fault_handler_thread, (void *) uffd);
    if (s != 0) {
        errno = s;
        errExit("pthread_create");
    }

    /* Main thread now touches memory in the mapping, touching
       locations 1024 bytes apart. This will trigger userfaultfd
       events for all pages in the region. */

    int l;
    l = 0xf;    /* Ensure that faulting address is not on a page
                   boundary, in order to test that we correctly
                   handle that case in fault_handler_thread(). */
    while (l < len) {
        char c = addr[l];
        printf("Read address %p in main(): ", addr + l);
        printf("%c\en", c);
        l += 1024;
        usleep(100000);         /* Slow things down a little */
    }

    exit(EXIT_SUCCESS);
}
.EE
.SH SEE ALSO
.BR fcntl (2),
.BR ioctl (2),
.BR ioctl_userfaultfd (2),
.BR madvise (2),
.BR mmap (2)
.PP
.IR Documentation/admin\-guide/mm/userfaultfd.rst
in the Linux kernel source tree