1 .\" Copyright (C) 2020 Michael Kerrisk <mtk.manpages@gmail.com>
3 .\" %%%LICENSE_START(VERBATIM)
4 .\" Permission is granted to make and distribute verbatim copies of this
5 .\" manual provided the copyright notice and this permission notice are
6 .\" preserved on all copies.
8 .\" Permission is granted to copy and distribute modified versions of this
9 .\" manual under the conditions for verbatim copying, provided that the
10 .\" entire resulting derived work is distributed under the terms of a
11 .\" permission notice identical to this one.
13 .\" Since the Linux kernel and libraries are constantly changing, this
14 .\" manual page may be incorrect or out-of-date. The author(s) assume no
15 .\" responsibility for errors or omissions, or for damages resulting from
16 .\" the use of the information contained herein. The author(s) may not
17 .\" have taken the same level of care in the production of this manual,
18 .\" which is licensed free of charge, as they might when working
21 .\" Formatted or processed versions of this manual, if unaccompanied by
22 .\" the source, must acknowledge the copyright and authors of this work.
25 .TH SECCOMP_UNOTIFY 2 2020-10-01 "Linux" "Linux Programmer's Manual"
27 seccomp_unotify \- Seccomp user-space notification mechanism
30 .B #include <linux/seccomp.h>
31 .B #include <linux/filter.h>
32 .B #include <linux/audit.h>
34 .BI "int seccomp(unsigned int " operation ", unsigned int " flags \
37 .B #include <sys/ioctl.h>
39 .BI "int ioctl(int " fd ", SECCOMP_IOCTL_NOTIF_RECV,"
40 .BI " struct seccomp_notif *" req );
41 .BI "int ioctl(int " fd ", SECCOMP_IOCTL_NOTIF_SEND,"
42 .BI " struct seccomp_notif_resp *" resp );
43 .BI "int ioctl(int " fd ", SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *" id );
44 .BI "int ioctl(int " fd ", SECCOMP_IOCTL_NOTIF_ADDFD,"
45 .BI " struct seccomp_notif_addfd *" addfd );
48 This page describes the user-space notification mechanism provided by the
49 Secure Computing (seccomp) facility.
50 As well as the use of the
51 .B SECCOMP_FILTER_FLAG_NEW_LISTENER
53 .BR SECCOMP_RET_USER_NOTIF
55 .B SECCOMP_GET_NOTIF_SIZES
56 operation described in
58 this mechanism involves the use of a number of related
60 operations (described below).
63 In conventional usage of a seccomp filter,
64 the decision about how to treat a system call is made by the filter itself.
65 By contrast, the user-space notification mechanism allows
66 the seccomp filter to delegate
67 the handling of the system call to another user-space process.
68 Note that this mechanism is explicitly
70 intended as a method implementing security policy; see NOTES.
72 In the discussion that follows,
73 the thread(s) on which the seccomp filter is installed is (are)
76 and the process that is notified by the user-space notification
77 mechanism is referred to as the
80 A suitably privileged supervisor can use the user-space notification
81 mechanism to perform actions on behalf of the target.
82 The advantage of the user-space notification mechanism is that
84 usually be able to retrieve information about the target and the
85 performed system call that the seccomp filter itself cannot.
86 (A seccomp filter is limited in the information it can obtain and
87 the actions that it can perform because it
88 is running on a virtual machine inside the kernel.)
90 An overview of the steps performed by the target and the supervisor
92 .\"-------------------------------------
94 The target establishes a seccomp filter in the usual manner,
95 but with two differences:
101 argument includes the flag
102 .BR SECCOMP_FILTER_FLAG_NEW_LISTENER .
103 Consequently, the return value of the (successful)
105 call is a new "listening"
106 file descriptor that can be used to receive notifications.
107 Only one "listening" seccomp filter can be installed for a thread.
109 .\" Is the last sentence above correct?
111 .\" Kees Cook (25 Oct 2020) notes:
113 .\" I like this limitation, but I expect that it'll need to change in the
114 .\" future. Even with LSMs, we see the need for arbitrary stacking, and the
115 .\" idea of there being only 1 supervisor will eventually break down. Right
116 .\" now there is only 1 because only container managers are using this
117 .\" feature. But if some daemon starts using it to isolate some thread,
118 .\" suddenly it might break if a container manager is trying to listen to it
119 .\" too, etc. I expect it won't be needed soon, but I do think it'll change.
122 In cases where it is appropriate, the seccomp filter returns the action value
123 .BR SECCOMP_RET_USER_NOTIF .
124 This return value will trigger a notification event.
126 .\"-------------------------------------
128 In order that the supervisor can obtain notifications
129 using the listening file descriptor,
130 (a duplicate of) that file descriptor must be passed from
131 the target to the supervisor.
132 One way in which this could be done is by passing the file descriptor
133 over a UNIX domain socket connection between the target and the supervisor
136 ancillary message type described in
138 Another way to do this is through the use of
141 .\" Instead of using unix domain sockets to send the fd to the
142 .\" parent, I think you could also use clone3() with
143 .\" flags==CLONE_FILES|SIGCHLD, dup2() the seccomp fd to an fd
144 .\" that was reserved in the parent, call unshare(CLONE_FILES)
145 .\" in the child after setting up the seccomp fd, and wake
146 .\" up the parent with something like pthread_cond_signal()?
147 .\" I'm not sure whether that'd look better or worse in the
148 .\" end though, so maybe just ignore this comment.
149 .\"-------------------------------------
151 The supervisor will receive notification events
152 on the listening file descriptor.
153 These events are returned as structures of type
155 Because this structure and its size may evolve over kernel versions,
156 the supervisor must first determine the size of this structure
159 .B SECCOMP_GET_NOTIF_SIZES
160 operation, which returns a structure of type
161 .IR seccomp_notif_sizes .
162 The supervisor allocates a buffer of size
163 .I seccomp_notif_sizes.seccomp_notif
164 bytes to receive notification events.
165 In addition,the supervisor allocates another buffer of size
166 .I seccomp_notif_sizes.seccomp_notif_resp
167 bytes for the response (a
168 .I struct seccomp_notif_resp
170 that it will provide to the kernel (and thus the target).
171 .\"-------------------------------------
173 The target then performs its workload,
174 which includes system calls that will be controlled by the seccomp filter.
175 Whenever one of these system calls causes the filter to return the
176 .B SECCOMP_RET_USER_NOTIF
177 action value, the kernel does
179 (yet) execute the system call;
180 instead, execution of the target is temporarily blocked inside
181 the kernel (in a sleep state that is interruptible by signals)
182 and a notification event is generated on the listening file descriptor.
183 .\"-------------------------------------
185 The supervisor can now repeatedly monitor the
186 listening file descriptor for
187 .BR SECCOMP_RET_USER_NOTIF -triggered
189 To do this, the supervisor uses the
190 .B SECCOMP_IOCTL_NOTIF_RECV
192 operation to read information about a notification event;
193 this operation blocks until an event is available.
194 The operation returns a
196 structure containing information about the system call
197 that is being attempted by the target.
198 (As described in NOTES,
199 the file descriptor can also be monitored with
205 .\" Christian Brauner:
207 .\" Do we support O_NONBLOCK with SECCOMP_IOCTL_NOTIF_RECV and if
212 .\" A quick test suggests that O_NONBLOCK has no effect on the blocking
213 .\" behavior of SECCOMP_IOCTL_NOTIF_RECV.
215 .\"-------------------------------------
219 structure returned by the
220 .B SECCOMP_IOCTL_NOTIF_RECV
221 operation includes the same information (a
223 structure) that was passed to the seccomp filter.
224 This information allows the supervisor to discover the system call number and
225 the arguments for the target's system call.
226 In addition, the notification event contains the ID of the thread
227 that triggered the notification and a unique cookie value that
228 is used in subsequent
229 .B SECCOMP_IOCTL_NOTIF_ID_VALID
231 .B SECCOMP_IOCTL_NOTIF_SEND
234 The information in the notification can be used to discover the
235 values of pointer arguments for the target's system call.
236 (This is something that can't be done from within a seccomp filter.)
237 One way in which the supervisor can do this is to open the corresponding
241 and read bytes from the location that corresponds to one of
242 the pointer arguments whose value is supplied in the notification event.
243 .\" Tycho Andersen mentioned that there are alternatives to /proc/PID/mem,
244 .\" such as ptrace() and /proc/PID/map_files
245 (The supervisor must be careful to avoid
246 a race condition that can occur when doing this;
247 see the description of the
248 .BR SECCOMP_IOCTL_NOTIF_ID_VALID
252 the supervisor can access other system information that is visible
253 in user space but which is not accessible from a seccomp filter.
254 .\"-------------------------------------
256 Having obtained information as per the previous step,
257 the supervisor may then choose to perform an action in response
258 to the target's system call
259 (which, as noted above, is not executed when the seccomp filter returns the
260 .B SECCOMP_RET_USER_NOTIF
263 One example use case here relates to containers.
264 The target may be located inside a container where
265 it does not have sufficient capabilities to mount a filesystem
266 in the container's mount namespace.
267 However, the supervisor may be a more privileged process that
268 does have sufficient capabilities to perform the mount operation.
269 .\"-------------------------------------
271 The supervisor then sends a response to the notification.
272 The information in this response is used by the kernel to construct
273 a return value for the target's system call and provide
274 a value that will be assigned to the
276 variable of the target.
278 The response is sent using the
279 .B SECCOMP_IOCTL_NOTIF_SEND
281 operation, which is used to transmit a
282 .I seccomp_notif_resp
283 structure to the kernel.
284 This structure includes a cookie value that the supervisor obtained in the
286 structure returned by the
287 .B SECCOMP_IOCTL_NOTIF_RECV
289 This cookie value allows the kernel to associate the response with the
291 This structure must include the cookie value that the supervisor
294 structure returned by the
295 .B SECCOMP_IOCTL_NOTIF_RECV
297 the cookie allows the kernel to associate the response with the target.
298 .\"-------------------------------------
300 Once the notification has been sent,
301 the system call in the target thread unblocks,
302 returning the information that was provided by the supervisor
303 in the notification response.
304 .\"-------------------------------------
306 As a variation on the last two steps,
307 the supervisor can send a response that tells the kernel that it
308 should execute the target thread's system call; see the discussion of
309 .BR SECCOMP_USER_NOTIF_FLAG_CONTINUE ,
312 .SS ioctl(2) operations
315 operations are supported by the seccomp user-space
316 notification file descriptor.
317 For each of these operations, the first (file descriptor) argument of
319 is the listening file descriptor returned by a call to
322 .BR SECCOMP_FILTER_FLAG_NEW_LISTENER
325 .BR SECCOMP_IOCTL_NOTIF_RECV " (since Linux 5.0)"
326 This operation is used to obtain a user-space
328 If no such event is currently pending,
329 the operation blocks until an event occurs.
332 argument is a pointer to a structure of the following form
333 which contains information about the event.
334 This structure must be zeroed out before the call.
338 struct seccomp_notif {
339 __u64 id; /* Cookie */
340 __u32 pid; /* TID of target thread */
341 __u32 flags; /* Currently unused (0) */
342 struct seccomp_data data; /* See seccomp(2) */
347 The fields in this structure are as follows:
351 This is a cookie for the notification.
352 Each such cookie is guaranteed to be unique for the corresponding
356 The cookie can be used with the
357 .B SECCOMP_IOCTL_NOTIF_ID_VALID
359 operation described below.
361 When returning a notification response to the kernel,
362 the supervisor must include the cookie value in the
363 .IR seccomp_notif_resp
364 structure that is specified as the argument of the
365 .BR SECCOMP_IOCTL_NOTIF_SEND
370 This is the thread ID of the target thread that triggered
371 the notification event.
374 This is a bit mask of flags providing further information on the event.
375 In the current implementation, this field is always zero.
380 structure containing information about the system call that
381 triggered the notification.
382 This is the same structure that is passed to the seccomp filter.
385 for details of this structure.
388 On success, this operation returns 0; on failure, \-1 is returned, and
390 is set to indicate the cause of the error.
391 This operation can fail with the following errors:
394 .BR EINVAL " (since Linux 5.5)"
395 .\" commit 2882d53c9c6f3b8311d225062522f03772cf0179
398 structure that was passed to the call contained nonzero fields.
401 The target thread was killed by a signal as the notification information
403 or the target's (blocked) system call was interrupted by a signal handler.
406 .\" From my experiments,
407 .\" it appears that if a SECCOMP_IOCTL_NOTIF_RECV is done after
408 .\" the target thread terminates, then the ioctl() simply
409 .\" blocks (rather than returning an error to indicate that the
410 .\" target no longer exists).
412 .\" I found that surprising, and it required some contortions in
413 .\" the example program. It was not possible to code my SIGCHLD
414 .\" handler (which reaps the zombie when the worker/target
415 .\" terminates) to simply set a flag checked in the main
416 .\" handleNotifications() loop, since this created an
417 .\" unavoidable race where the child might terminate just after
418 .\" I had checked the flag, but before I blocked (forever!) in the
419 .\" SECCOMP_IOCTL_NOTIF_RECV operation. Instead, I had to code
420 .\" the signal handler to simply call _exit(2) in order to
421 .\" terminate the parent process (the supervisor).
423 .\" Is this expected behavior? It seems to me rather
424 .\" desirable that SECCOMP_IOCTL_NOTIF_RECV should give an error
425 .\" if the target has terminated.
427 .\" Jann posted a patch to rectify this, but there was no response
428 .\" (Lore link: https://bit.ly/3jvUBxk) to his question about fixing
429 .\" this issue. (I've tried building with the patch, but encountered
430 .\" an issue with the target process entering D state after a signal.)
432 .\" For now, this behavior is documented in BUGS.
434 .\" Kees Cook commented: Let's change [this] ASAP!
436 .BR SECCOMP_IOCTL_NOTIF_ID_VALID " (since Linux 5.0)"
437 This operation can be used to check that a notification ID
438 returned by an earlier
439 .B SECCOMP_IOCTL_NOTIF_RECV
440 operation is still valid
441 (i.e., that the target still exists and its system call
442 is still blocked waiting for a response).
446 argument is a pointer to the cookie
449 .B SECCOMP_IOCTL_NOTIF_RECV
452 This operation is necessary to avoid race conditions that can occur when the
455 .B SECCOMP_IOCTL_NOTIF_RECV
456 operation terminates, and that process ID is reused by another process.
457 An example of this kind of race is the following
460 A notification is generated on the listening file descriptor.
463 contains the TID of the target thread (in the
465 field of the structure).
467 The target terminates.
469 Another thread or process is created on the system that by chance reuses the
470 TID that was freed when the target terminated.
476 file for the TID obtained in step 1, with the intention of (say)
477 inspecting the memory location(s) that containing the argument(s) of
478 the system call that triggered the notification in step 1.
481 In the above scenario, the risk is that the supervisor may try
482 to access the memory of a process other than the target.
483 This race can be avoided by following the call to
486 .B SECCOMP_IOCTL_NOTIF_ID_VALID
487 operation to verify that the process that generated the notification
489 (Note that if the target terminates after the latter step,
492 from the file descriptor may return 0, indicating end of file.)
494 .\" the PID can be reused, but the /proc/$pid directory is
495 .\" internally not associated with the numeric PID, but,
496 .\" conceptually speaking, with a specific incarnation of the
497 .\" PID, or something like that. (Actually, it is associated
498 .\" with the "struct pid", which is not reused, instead of the
501 See NOTES for a discussion of other cases where
502 .B SECCOMP_IOCTL_NOTIF_ID_VALID
503 checks must be performed.
505 On success (i.e., the notification ID is still valid),
506 this operation returns 0.
507 On failure (i.e., the notification ID is no longer valid),
513 .BR SECCOMP_IOCTL_NOTIF_SEND " (since Linux 5.0)"
514 This operation is used to send a notification response back to the kernel.
517 argument of this structure is a pointer to a structure of the following form:
521 struct seccomp_notif_resp {
522 __u64 id; /* Cookie value */
523 __s64 val; /* Success return value */
524 __s32 error; /* 0 (success) or negative
526 __u32 flags; /* See below */
531 The fields of this structure are as follows:
535 This is the cookie value that was obtained using the
536 .B SECCOMP_IOCTL_NOTIF_RECV
538 This cookie value allows the kernel to correctly associate this response
539 with the system call that triggered the user-space notification.
542 This is the value that will be used for a spoofed
543 success return for the target's system call; see below.
546 This is the value that will be used as the error number
548 for a spoofed error return for the target's system call; see below.
551 This is a bit mask that includes zero or more of the following flags:
554 .BR SECCOMP_USER_NOTIF_FLAG_CONTINUE " (since Linux 5.5)"
555 Tell the kernel to execute the target's system call.
556 .\" commit fb3c5386b382d4097476ce9647260fc89b34afdb
560 Two kinds of response are possible:
563 A response to the kernel telling it to execute the
564 target's system call.
568 .B SECCOMP_USER_NOTIF_FLAG_CONTINUE
575 This kind of response can be useful in cases where the supervisor needs
576 to do deeper analysis of the target's system call than is possible
577 from a seccomp filter (e.g., examining the values of pointer arguments),
578 and, having decided that the system call does not require emulation
579 by the supervisor, the supervisor wants the system call to
580 be executed normally in the target.
583 .B SECCOMP_USER_NOTIF_FLAG_CONTINUE
584 flag should be used with caution; see NOTES.
586 A spoofed return value for the target's system call.
587 In this case, the kernel does not execute the target's system call,
588 instead causing the system call to return a spoofed value as specified by
590 .I seccomp_notif_resp
592 The supervisor should set the fields of this structure as follows:
597 .BR SECCOMP_USER_NOTIF_FLAG_CONTINUE .
600 is set either to 0 for a spoofed "success" return or to a negative
601 error number for a spoofed "failure" return.
602 In the former case, the kernel causes the target's system call
603 to return the value specified in the
606 In the latter case, the kernel causes the target's system call
609 is assigned the negated
614 is set to a value that will be used as the return value for a spoofed
615 "success" return for the target's system call.
616 The value in this field is ignored if the
618 field contains a nonzero value.
620 .\" Kees Cook suggested:
622 .\" Strictly speaking, this is architecture specific, but
623 .\" all architectures do it this way. Should seccomp enforce
624 .\" val == 0 when err != 0 ?
626 .\" Christian Brauner
628 .\" Feels like it should, at least for the SEND ioctl where we already
629 .\" verify that val and err are both 0 when CONTINUE is specified (as you
630 .\" pointed out correctly above).
634 On success, this operation returns 0; on failure, \-1 is returned, and
636 is set to indicate the cause of the error.
637 This operation can fail with the following errors:
641 A response to this notification has already been sent.
644 An invalid value was specified in the
652 .BR SECCOMP_USER_NOTIF_FLAG_CONTINUE ,
660 The blocked system call in the target
661 has been interrupted by a signal handler
662 or the target has terminated.
664 .\" you could also get this [ENOENT] if a response has already
665 .\" been sent, instead of EINPROGRESS - the only difference is
666 .\" whether the target thread has picked up the response yet
669 .BR SECCOMP_IOCTL_NOTIF_ADDFD " (since Linux 5.9)"
670 This operation allows the supervisor to install a file descriptor
671 into the target's file descriptor table.
674 messages described in
676 this operation is semantically equivalent to duplicating
677 a file descriptor from the supervisor's file descriptor table
678 into the target's file descriptor table.
681 .BR SECCOMP_IOCTL_NOTIF_ADDFD
682 operation permits the supervisor to emulate a target system call (such as
686 that generates a file descriptor.
687 The supervisor can perform the system call that generates
688 the file descriptor (and associated open file description)
689 and then use this operation to allocate
690 a file descriptor that refers to the same open file description in the target.
691 (For an explanation of open file descriptions, see
694 Once this operation has been performed,
695 the supervisor can close its copy of the file descriptor.
698 the received file descriptor is subject to the same
699 Linux Security Module (LSM) checks as are applied to a file descriptor
700 that is received in an
703 If the file descriptor refers to a socket,
704 it inherits the cgroup version 1 network controller settings
712 argument is a pointer to a structure of the following form:
716 struct seccomp_notif_addfd {
717 __u64 id; /* Cookie value */
718 __u32 flags; /* Flags */
719 __u32 srcfd; /* Local file descriptor number */
720 __u32 newfd; /* 0 or desired file descriptor
722 __u32 newfd_flags; /* Flags to set on target file
728 The fields in this structure are as follows:
732 This field should be set to the notification ID
733 (cookie value) that was obtained via
734 .BR SECCOMP_IOCTL_NOTIF_RECV .
737 This field is a bit mask of flags that modify the behavior of the operation.
738 Currently, only one flag is supported:
741 .BR SECCOMP_ADDFD_FLAG_SETFD
742 When allocating the file descriptor in the target,
743 use the file descriptor number specified in the
749 This field should be set to the number of the file descriptor
750 in the supervisor that is to be duplicated.
753 This field determines which file descriptor number is allocated in the target.
755 .BR SECCOMP_ADDFD_FLAG_SETFD
757 then this field specifies which file descriptor number should be allocated.
758 If this file descriptor number is already open in the target,
759 it is atomically closed and reused.
760 If the descriptor duplication fails due to an LSM check, or if
762 is not a valid file descriptor,
765 will not be closed in the target process.
768 .BR SECCOMP_ADDFD_FLAG_SETFD
769 flag it not set, then this field must be 0,
770 and the kernel allocates the lowest unused file descriptor number
774 This field is a bit mask specifying flags that should be set on
775 the file descriptor that is received in the target process.
776 Currently, only the following flag is implemented:
780 Set the close-on-exec flag on the received file descriptor.
786 call returns the number of the file descriptor that was allocated
788 Assuming that the emulated system call is one that returns
789 a file descriptor as its function result (e.g.,
791 this value can be used as the return value
793 that is supplied in the response that is subsequently sent with the
794 .BR SECCOMP_IOCTL_NOTIF_SEND
797 On error, \-1 is returned and
799 is set to indicate the cause of the error.
801 This operation can fail with the following errors:
805 Allocating the file descriptor in the target would cause the target's
807 limit to be exceeded (see
811 The user-space notification specified in the
813 field exists but has not yet been fetched (by a
814 .BR SECCOMP_IOCTL_NOTIF_RECV )
815 or has already been responded to (by a
816 .BR SECCOMP_IOCTL_NOTIF_SEND ).
819 An invalid flag was specified in the
825 field is nonzero and the
826 .B SECCOMP_ADDFD_FLAG_SETFD
827 flag was not specified in the
832 The file descriptor number specified in
834 exceeds the limit specified in
835 .IR /proc/sys/fs/nr_open .
838 The blocked system call in the target
839 has been interrupted by a signal handler
840 or the target has terminated.
843 Here is some sample code (with error handling omitted) that uses the
844 .B SECCOMP_ADDFD_FLAG_SETFD
845 operation (here, to emulate a call to
852 fd = openat(req->data.args[0], path, req->data.args[2],
855 struct seccomp_notif_addfd addfd;
856 addfd.id = req->id; /* Cookie from
857 SECCOMP_IOCTL_NOTIF_RECV */
861 addfd.newfd_flags = O_CLOEXEC;
863 targetFd = ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ADDFD,
866 close(fd); /* No longer needed in supervisor */
868 struct seccomp_notif_resp *resp;
869 /* Code to allocate 'resp' omitted */
871 resp->error = 0; /* "Success" */
872 resp->val = targetFd;
874 ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp);
878 One example use case for the user-space notification
879 mechanism is to allow a container manager
880 (a process which is typically running with more privilege than
881 the processes inside the container)
882 to mount block devices or create device nodes for the container.
883 The mount use case provides an example of where the
884 .BR SECCOMP_USER_NOTIF_FLAG_CONTINUE
887 Upon receiving a notification for the
889 system call, the container manager (the "supervisor") can distinguish
890 a request to mount a block filesystem
891 (which would not be possible for a "target" process inside the container)
892 and mount that file system.
893 If, on the other hand, the container manager detects that the operation
894 could be performed by the process inside the container
897 filesystem), it can notify the kernel that the target process's
899 system call can continue.
901 .SS select()/poll()/epoll semantics
902 The file descriptor returned when
905 .B SECCOMP_FILTER_FLAG_NEW_LISTENER
906 flag can be monitored using
911 These interfaces indicate that the file descriptor is ready as follows:
913 When a notification is pending,
914 these interfaces indicate that the file descriptor is readable.
915 Following such an indication, a subsequent
916 .B SECCOMP_IOCTL_NOTIF_RECV
918 will not block, returning either information about a notification
919 or else failing with the error
921 if the target has been killed by a signal or its system call
922 has been interrupted by a signal handler.
924 After the notification has been received (i.e., by the
925 .B SECCOMP_IOCTL_NOTIF_RECV
927 operation), these interfaces indicate that the file descriptor is writable,
928 meaning that a notification response can be sent using the
929 .B SECCOMP_IOCTL_NOTIF_SEND
933 After the last thread using the filter has terminated and been reaped using
936 the file descriptor indicates an end-of-file condition (readable in
938 .BR POLLHUP / EPOLLHUP
942 .SS Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
943 The intent of the user-space notification feature is
944 to allow system calls to be performed on behalf of the target.
945 The target's system call should either be handled by the supervisor or
946 allowed to continue normally in the kernel (where standard security
947 policies will be applied).
950 this mechanism must not be used to make security policy decisions
951 about the system call,
952 which would be inherently race-prone for reasons described next.
955 .B SECCOMP_USER_NOTIF_FLAG_CONTINUE
956 flag must be used with caution.
957 If set by the supervisor, the target's system call will continue.
958 However, there is a time-of-check, time-of-use race here,
959 since an attacker could exploit the interval of time where the target is
960 blocked waiting on the "continue" response to do things such as
961 rewriting the system call arguments.
963 Note furthermore that a user-space notifier can be bypassed if
964 the existing filters allow the use of
968 to install a filter that returns an action value with a higher precedence than
969 .B SECCOMP_RET_USER_NOTIF
973 It should thus be absolutely clear that the
974 seccomp user-space notification mechanism
976 be used to implement a security policy!
977 It should only ever be used in scenarios where a more privileged process
978 supervises the system calls of a lesser privileged target to
979 get around kernel-enforced security restrictions when
980 the supervisor deems this safe.
982 in order to continue a system call, the supervisor should be sure that
983 another security mechanism or the kernel itself will sufficiently block
984 the system call if its arguments are rewritten to something unsafe.
986 .SS Caveats regarding the use of /proc/[tid]/mem
987 The discussion above noted the need to use the
988 .BR SECCOMP_IOCTL_NOTIF_ID_VALID
993 to avoid the possibility of accessing the memory of the wrong process
994 in the event that the target terminates and its ID
995 is recycled by another (unrelated) thread.
996 However, the use of this
998 operation is also necessary in other situations,
999 as explained in the following paragraphs.
1001 Consider the following scenario, where the supervisor
1002 tries to read the pathname argument of a target's blocked
1006 From one of its functions
1010 which triggers a user-space notification and causes the target to block.
1012 The supervisor receives the notification, opens
1013 .IR /proc/[tid]/mem ,
1014 and (successfully) performs the
1015 .BR SECCOMP_IOCTL_NOTIF_ID_VALID
1018 The target receives a signal, which causes the
1022 The signal handler executes in the target, and returns.
1024 Upon return from the handler, the execution of
1026 resumes, and it returns (and perhaps other functions are called,
1027 overwriting the memory that had been used for the stack frame of
1030 Using the address provided in the notification information,
1031 the supervisor reads from the target's memory location that used to
1032 contain the pathname.
1034 The supervisor now calls
1036 with some arbitrary bytes obtained in the previous step.
1038 The conclusion from the above scenario is this:
1039 since the target's blocked system call may be interrupted by a signal handler,
1040 the supervisor must be written to expect that the
1041 target may abandon its system call at
1044 in such an event, any information that the supervisor obtained from
1045 the target's memory must be considered invalid.
1047 To prevent such scenarios,
1048 every read from the target's memory must be separated from use of
1049 the bytes so obtained by a
1050 .BR SECCOMP_IOCTL_NOTIF_ID_VALID
1052 In the above example, the check would be placed between the two final steps.
1053 An example of such a check is shown in EXAMPLES.
1055 Following on from the above, it should be clear that
1056 a write by the supervisor into the target's memory can
1060 .SS Interaction with SA_RESTART signal handlers
1061 Consider the following scenario:
1063 The target process has used
1065 to install a signal handler with the
1069 The target has made a system call that triggered a seccomp
1070 user-space notification and the target is currently blocked
1071 until the supervisor sends a notification response.
1073 A signal is delivered to the target and the signal handler is executed.
1075 When (if) the supervisor attempts to send a notification response, the
1076 .B SECCOMP_IOCTL_NOTIF_SEND
1078 operation will fail with the
1082 In this scenario, the kernel will restart the target's system call.
1083 Consequently, the supervisor will receive another user-space notification.
1084 Thus, depending on how many times the blocked system call
1085 is interrupted by a signal handler,
1086 the supervisor may receive multiple notifications for
1087 the same instance of a system call in the target.
1089 One oddity is that system call restarting as described in this scenario
1090 will occur even for the blocking system calls listed in
1094 normally be restarted by the
1098 .\" About the above, Kees Cook commented:
1100 .\" Does this need fixing? I imagine the correct behavior for this case
1101 .\" would be a response to _SEND of EINPROGRESS and the target would see
1104 .\" I mean, it's not like seccomp doesn't already expose weirdness with
1105 .\" syscall restarts. Not even arm64 compat agrees[3] with arm32 in this
1109 .\" Michael Kerrisk:
1110 .\" I wonder about the effect of this oddity for system calls that
1111 .\" are normally nonrestartable because they have timeouts. My
1112 .\" understanding is that the kernel doesn't restart those system
1113 .\" calls because it's impossible for the kernel to restart the call
1114 .\" with the right timeout value. I wonder what happens when those
1115 .\" system calls are restarted in the scenario we're discussing.)
1118 .BR SECCOMP_IOCTL_NOTIF_RECV
1121 .\" or a poll/epoll/select
1122 is performed after the target terminates, then the
1124 call simply blocks (rather than returning an error to indicate that the
1125 target no longer exists).
1127 .\" Comment from Kees Cook:
1129 .\" I want this fixed. It caused me no end of pain when building the
1130 .\" selftests, and ended up spawning my implementing a global test timeout
1131 .\" in kselftest. :P Before the usage counter refactor, there was no sane
1132 .\" way to deal with this, but now I think we're close.
1135 The (somewhat contrived) program shown below demonstrates the use of
1136 the interfaces described in this page.
1137 The program creates a child process that serves as the "target" process.
1138 The child process installs a seccomp filter that returns the
1139 .B SECCOMP_RET_USER_NOTIF
1140 action value if a call is made to
1142 The child process then calls
1144 once for each of the supplied command-line arguments,
1145 and reports the result returned by the call.
1146 After processing all arguments, the child process terminates.
1148 The parent process acts as the supervisor, listening for the notifications
1149 that are generated when the target process calls
1151 When such a notification occurs,
1152 the supervisor examines the memory of the target process (using
1153 .IR /proc/[pid]/mem )
1154 to discover the pathname argument that was supplied to the
1156 call, and performs one of the following actions:
1158 If the pathname begins with the prefix "/tmp/",
1159 then the supervisor attempts to create the specified directory,
1160 and then spoofs a return for the target process based on the return
1161 value of the supervisor's
1164 In the event that that call succeeds,
1165 the spoofed success return value is the length of the pathname.
1167 If the pathname begins with "./" (i.e., it is a relative pathname),
1168 the supervisor sends a
1169 .B SECCOMP_USER_NOTIF_FLAG_CONTINUE
1170 response to the kernel to say that the kernel should execute
1171 the target process's
1175 If the pathname begins with some other prefix,
1176 the supervisor spoofs an error return for the target process,
1177 so that the target process's
1179 call appears to fail with the error
1181 ("Operation not supported").
1182 Additionally, if the specified pathname is exactly "/bye",
1183 then the supervisor terminates.
1185 This program can be used to demonstrate various aspects of the
1186 behavior of the seccomp user-space notification mechanism.
1187 To help aid such demonstrations,
1188 the program logs various messages to show the operation
1189 of the target process (lines prefixed "T:") and the supervisor
1190 (indented lines prefixed "S:").
1192 In the following example, the target attempts to create the directory
1194 Upon receiving the notification, the supervisor creates the directory on the
1196 and spoofs a success return to be received by the target process's
1202 $ \fB./seccomp_unotify /tmp/x\fP
1205 T: about to mkdir("/tmp/x")
1206 S: got notification (ID 0x17445c4a0f4e0e3c) for PID 23168
1207 S: executing: mkdir("/tmp/x", 0700)
1208 S: success! spoofed return = 6
1209 S: sending response (flags = 0; val = 6; error = 0)
1210 T: SUCCESS: mkdir(2) returned 6
1213 S: target has terminated; bye
1217 In the above output, note that the spoofed return value seen by the target
1218 process is 6 (the length of the pathname
1222 call returns 0 on success.
1224 In the next example, the target attempts to create a directory using the
1227 Since this pathname starts with "./",
1228 the supervisor sends a
1229 .B SECCOMP_USER_NOTIF_FLAG_CONTINUE
1230 response to the kernel,
1231 and the kernel then (successfully) executes the target process's
1237 $ \fB./seccomp_unotify ./sub\fP
1240 T: about to mkdir("./sub")
1241 S: got notification (ID 0xddb16abe25b4c12) for PID 23204
1242 S: target can execute system call
1243 S: sending response (flags = 0x1; val = 0; error = 0)
1244 T: SUCCESS: mkdir(2) returned 0
1247 S: target has terminated; bye
1251 If the target process attempts to create a directory with
1252 a pathname that doesn't start with "." and doesn't begin with the prefix
1253 "/tmp/", then the supervisor spoofs an error return
1255 "Operation not supported")
1258 call (which is not executed):
1262 $ \fB./seccomp_unotify /xxx\fP
1265 T: about to mkdir("/xxx")
1266 S: got notification (ID 0xe7dc095d1c524e80) for PID 23178
1267 S: spoofing error response (Operation not supported)
1268 S: sending response (flags = 0; val = 0; error = \-95)
1269 T: ERROR: mkdir(2): Operation not supported
1272 S: target has terminated; bye
1276 In the next example,
1277 the target process attempts to create a directory with the pathname
1278 .BR /tmp/nosuchdir/b .
1279 Upon receiving the notification,
1280 the supervisor attempts to create that directory, but the
1282 call fails because the directory
1285 Consequently, the supervisor spoofs an error return that passes the error
1286 that it received back to the target process's
1292 $ \fB./seccomp_unotify /tmp/nosuchdir/b\fP
1295 T: about to mkdir("/tmp/nosuchdir/b")
1296 S: got notification (ID 0x8744454293506046) for PID 23199
1297 S: executing: mkdir("/tmp/nosuchdir/b", 0700)
1298 S: failure! (errno = 2; No such file or directory)
1299 S: sending response (flags = 0; val = 0; error = \-2)
1300 T: ERROR: mkdir(2): No such file or directory
1303 S: target has terminated; bye
1307 If the supervisor receives a notification and sees that the
1308 argument of the target's
1310 is the string "/bye", then (as well as spoofing an
1312 error), the supervisor terminates.
1313 If the target process subsequently executes another
1315 that triggers its seccomp filter to return the
1316 .B SECCOMP_RET_USER_NOTIF
1317 action value, then the kernel causes the target process's system call to
1320 ("Function not implemented").
1321 This is demonstrated by the following example:
1325 $ \fB./seccomp_unotify /bye /tmp/y\fP
1328 T: about to mkdir("/bye")
1329 S: got notification (ID 0xa81236b1d2f7b0f4) for PID 23185
1330 S: spoofing error response (Operation not supported)
1331 S: sending response (flags = 0; val = 0; error = \-95)
1332 S: terminating **********
1333 T: ERROR: mkdir(2): Operation not supported
1335 T: about to mkdir("/tmp/y")
1336 T: ERROR: mkdir(2): Function not implemented
1348 #include <linux/audit.h>
1349 #include <linux/filter.h>
1350 #include <linux/seccomp.h>
1352 #include <stdbool.h>
1357 #include <sys/socket.h>
1358 #include <sys/ioctl.h>
1359 #include <sys/prctl.h>
1360 #include <sys/stat.h>
1361 #include <sys/types.h>
1363 #include <sys/syscall.h>
1366 #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \e
1369 /* Send the file descriptor \(aqfd\(aq over the connected UNIX domain socket
1370 \(aqsockfd\(aq. Returns 0 on success, or \-1 on error. */
1373 sendfd(int sockfd, int fd)
1378 struct cmsghdr *cmsgp;
1380 /* Allocate a char array of suitable size to hold the ancillary data.
1381 However, since this buffer is in reality a \(aqstruct cmsghdr\(aq, use a
1382 union to ensure that it is suitably aligned. */
1384 char buf[CMSG_SPACE(sizeof(int))];
1385 /* Space large enough to hold an \(aqint\(aq */
1386 struct cmsghdr align;
1389 /* The \(aqmsg_name\(aq field can be used to specify the address of the
1390 destination socket when sending a datagram. However, we do not
1391 need to use this field because \(aqsockfd\(aq is a connected socket. */
1393 msgh.msg_name = NULL;
1394 msgh.msg_namelen = 0;
1396 /* On Linux, we must transmit at least one byte of real data in
1397 order to send ancillary data. We transmit an arbitrary integer
1398 whose value is ignored by recvfd(). */
1400 msgh.msg_iov = &iov;
1401 msgh.msg_iovlen = 1;
1402 iov.iov_base = &data;
1403 iov.iov_len = sizeof(int);
1406 /* Set \(aqmsghdr\(aq fields that describe ancillary data */
1408 msgh.msg_control = controlMsg.buf;
1409 msgh.msg_controllen = sizeof(controlMsg.buf);
1411 /* Set up ancillary data describing file descriptor to send */
1413 cmsgp = CMSG_FIRSTHDR(&msgh);
1414 cmsgp\->cmsg_level = SOL_SOCKET;
1415 cmsgp\->cmsg_type = SCM_RIGHTS;
1416 cmsgp\->cmsg_len = CMSG_LEN(sizeof(int));
1417 memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));
1419 /* Send real plus ancillary data */
1421 if (sendmsg(sockfd, &msgh, 0) == \-1)
1427 /* Receive a file descriptor on a connected UNIX domain socket. Returns
1428 the received file descriptor on success, or \-1 on error. */
1438 /* Allocate a char buffer for the ancillary data. See the comments
1441 char buf[CMSG_SPACE(sizeof(int))];
1442 struct cmsghdr align;
1444 struct cmsghdr *cmsgp;
1446 /* The \(aqmsg_name\(aq field can be used to obtain the address of the
1447 sending socket. However, we do not need this information. */
1449 msgh.msg_name = NULL;
1450 msgh.msg_namelen = 0;
1452 /* Specify buffer for receiving real data */
1454 msgh.msg_iov = &iov;
1455 msgh.msg_iovlen = 1;
1456 iov.iov_base = &data; /* Real data is an \(aqint\(aq */
1457 iov.iov_len = sizeof(int);
1459 /* Set \(aqmsghdr\(aq fields that describe ancillary data */
1461 msgh.msg_control = controlMsg.buf;
1462 msgh.msg_controllen = sizeof(controlMsg.buf);
1464 /* Receive real plus ancillary data; real data is ignored */
1466 nr = recvmsg(sockfd, &msgh, 0);
1470 cmsgp = CMSG_FIRSTHDR(&msgh);
1472 /* Check the validity of the \(aqcmsghdr\(aq */
1474 if (cmsgp == NULL ||
1475 cmsgp\->cmsg_len != CMSG_LEN(sizeof(int)) ||
1476 cmsgp\->cmsg_level != SOL_SOCKET ||
1477 cmsgp\->cmsg_type != SCM_RIGHTS) {
1482 /* Return the received file descriptor to our caller */
1484 memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
1489 sigchldHandler(int sig)
1491 char msg[] = "\etS: target has terminated; bye\en";
1493 write(STDOUT_FILENO, msg, sizeof(msg) - 1);
1494 _exit(EXIT_SUCCESS);
1498 seccomp(unsigned int operation, unsigned int flags, void *args)
1500 return syscall(__NR_seccomp, operation, flags, args);
1503 /* The following is the x86\-64\-specific BPF boilerplate code for checking
1504 that the BPF program is running on the right architecture + ABI. At
1505 completion of these instructions, the accumulator contains the system
1508 /* For the x32 ABI, all system call numbers have bit 30 set */
1510 #define X32_SYSCALL_BIT 0x40000000
1512 #define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \e
1513 BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \e
1514 (offsetof(struct seccomp_data, arch))), \e
1515 BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \e
1516 BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \e
1517 (offsetof(struct seccomp_data, nr))), \e
1518 BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \e
1519 BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)
1521 /* installNotifyFilter() installs a seccomp filter that generates
1522 user\-space notifications (SECCOMP_RET_USER_NOTIF) when the process
1523 calls mkdir(2); the filter allows all other system calls.
1525 The function return value is a file descriptor from which the
1526 user\-space notifications can be fetched. */
1529 installNotifyFilter(void)
1531 struct sock_filter filter[] = {
1532 X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,
1534 /* mkdir() triggers notification to user\-space supervisor */
1536 BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdir, 0, 1),
1537 BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_USER_NOTIF),
1539 /* Every other system call is allowed */
1541 BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
1544 struct sock_fprog prog = {
1545 .len = sizeof(filter) / sizeof(filter[0]),
1549 /* Install the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
1550 as a result, seccomp() returns a notification file descriptor. */
1552 int notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
1553 SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
1554 if (notifyFd == \-1)
1555 errExit("seccomp\-install\-notify\-filter");
1560 /* Close a pair of sockets created by socketpair() */
1563 closeSocketPair(int sockPair[2])
1565 if (close(sockPair[0]) == \-1)
1566 errExit("closeSocketPair\-close\-0");
1567 if (close(sockPair[1]) == \-1)
1568 errExit("closeSocketPair\-close\-1");
1571 /* Implementation of the target process; create a child process that:
1573 (1) installs a seccomp filter with the
1574 SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
1575 (2) writes the seccomp notification file descriptor returned from
1576 the previous step onto the UNIX domain socket, \(aqsockPair[0]\(aq;
1577 (3) calls mkdir(2) for each element of \(aqargv\(aq.
1579 The function return value in the parent is the PID of the child
1580 process; the child does not return from this function. */
1583 targetProcess(int sockPair[2], char *argv[])
1585 pid_t targetPid = fork();
1586 if (targetPid == \-1)
1589 if (targetPid > 0) /* In parent, return PID of child */
1592 /* Child falls through to here */
1594 printf("T: PID = %ld\en", (long) getpid());
1596 /* Install seccomp filter(s) */
1598 if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
1601 int notifyFd = installNotifyFilter();
1603 /* Pass the notification file descriptor to the tracing process over
1604 a UNIX domain socket */
1606 if (sendfd(sockPair[0], notifyFd) == \-1)
1609 /* Notification and socket FDs are no longer needed in target */
1611 if (close(notifyFd) == \-1)
1612 errExit("close\-target\-notify\-fd");
1614 closeSocketPair(sockPair);
1616 /* Perform a mkdir() call for each of the command\-line arguments */
1618 for (char **ap = argv; *ap != NULL; ap++) {
1619 printf("\enT: about to mkdir(\e"%s\e")\en", *ap);
1621 int s = mkdir(*ap, 0700);
1623 perror("T: ERROR: mkdir(2)");
1625 printf("T: SUCCESS: mkdir(2) returned %d\en", s);
1628 printf("\enT: terminating\en");
1632 /* Check that the notification ID provided by a SECCOMP_IOCTL_NOTIF_RECV
1633 operation is still valid. It will no longer be valid if the target
1634 process has terminated or is no longer blocked in the system call that
1635 generated the notification (because it was interrupted by a signal).
1637 This operation can be used when doing such things as accessing
1638 /proc/PID files in the target process in order to avoid TOCTOU race
1639 conditions where the PID that is returned by SECCOMP_IOCTL_NOTIF_RECV
1640 terminates and is reused by another process. */
1643 cookieIsValid(int notifyFd, uint64_t id)
1645 return ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
1648 /* Access the memory of the target process in order to fetch the
1649 pathname referred to by the system call argument \(aqargNum\(aq in
1650 \(aqreq\->data.args[]\(aq. The pathname is returned in \(aqpath\(aq,
1651 a buffer of \(aqlen\(aq bytes allocated by the caller.
1653 Returns true if the pathname is successfully fetched, and false
1654 otherwise. For possible causes of failure, see the comments below. */
1657 getTargetPathname(struct seccomp_notif *req, int notifyFd,
1658 int argNum, char *path, size_t len)
1660 char procMemPath[PATH_MAX];
1662 snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req\->pid);
1664 int procMemFd = open(procMemPath, O_RDONLY | O_CLOEXEC);
1665 if (procMemFd == \-1)
1668 /* Check that the process whose info we are accessing is still alive
1669 and blocked in the system call that caused the notification.
1670 If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed in
1671 cookieIsValid()) succeeded, we know that the /proc/PID/mem file
1672 descriptor that we opened corresponded to the process for which we
1673 received a notification. If that process subsequently terminates,
1674 then read() on that file descriptor will return 0 (EOF). */
1676 if (!cookieIsValid(notifyFd, req\->id)) {
1681 /* Read bytes at the location containing the pathname argument */
1683 ssize_t nread = pread(procMemFd, path, len, req\->data.args[argNum]);
1690 /* Once again check that the notification ID is still valid. The
1691 case we are particularly concerned about here is that just
1692 before we fetched the pathname, the target\(aqs blocked system
1693 call was interrupted by a signal handler, and after the handler
1694 returned, the target carried on execution (past the interrupted
1695 system call). In that case, we have no guarantees about what we
1696 are reading, since the target\(aqs memory may have been arbitrarily
1697 changed by subsequent operations. */
1699 if (!cookieIsValid(notifyFd, req\->id)) {
1700 perror("\etS: notification ID check failed!!!");
1704 /* Even if the target\(aqs system call was not interrupted by a signal,
1705 we have no guarantees about what was in the memory of the target
1706 process. (The memory may have been modified by another thread, or
1707 even by an external attacking process.) We therefore treat the
1708 buffer returned by pread() as untrusted input. The buffer should
1709 contain a terminating null byte; if not, then we will trigger an
1710 error for the target process. */
1712 if (strnlen(path, nread) < nread)
1718 /* Allocate buffers for the seccomp user\-space notification request and
1719 response structures. It is the caller\(aqs responsibility to free the
1720 buffers returned via \(aqreq\(aq and \(aqresp\(aq. */
1723 allocSeccompNotifBuffers(struct seccomp_notif **req,
1724 struct seccomp_notif_resp **resp,
1725 struct seccomp_notif_sizes *sizes)
1727 /* Discover the sizes of the structures that are used to receive
1728 notifications and send notification responses, and allocate
1729 buffers of those sizes. */
1731 if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, sizes) == \-1)
1732 errExit("seccomp\-SECCOMP_GET_NOTIF_SIZES");
1734 *req = malloc(sizes\->seccomp_notif);
1736 errExit("malloc\-seccomp_notif");
1738 /* When allocating the response buffer, we must allow for the fact
1739 that the user\-space binary may have been built with user\-space
1740 headers where \(aqstruct seccomp_notif_resp\(aq is bigger than the
1741 response buffer expected by the (older) kernel. Therefore, we
1742 allocate a buffer that is the maximum of the two sizes. This
1743 ensures that if the supervisor places bytes into the response
1744 structure that are past the response size that the kernel expects,
1745 then the supervisor is not touching an invalid memory location. */
1747 size_t resp_size = sizes\->seccomp_notif_resp;
1748 if (sizeof(struct seccomp_notif_resp) > resp_size)
1749 resp_size = sizeof(struct seccomp_notif_resp);
1751 *resp = malloc(resp_size);
1753 errExit("malloc\-seccomp_notif_resp");
1757 /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
1758 descriptor, \(aqnotifyFd\(aq. */
1761 handleNotifications(int notifyFd)
1763 struct seccomp_notif_sizes sizes;
1764 struct seccomp_notif *req;
1765 struct seccomp_notif_resp *resp;
1766 char path[PATH_MAX];
1768 allocSeccompNotifBuffers(&req, &resp, &sizes);
1770 /* Loop handling notifications */
1774 /* Wait for next notification, returning info in \(aq*req\(aq */
1776 memset(req, 0, sizes.seccomp_notif);
1777 if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == \-1) {
1780 errExit("\etS: ioctl\-SECCOMP_IOCTL_NOTIF_RECV");
1783 printf("\etS: got notification (ID %#llx) for PID %d\en",
1784 req\->id, req\->pid);
1786 /* The only system call that can generate a notification event
1787 is mkdir(2). Nevertheless, we check that the notified system
1788 call is indeed mkdir() as kind of future\-proofing of this
1789 code in case the seccomp filter is later modified to
1790 generate notifications for other system calls. */
1792 if (req\->data.nr != __NR_mkdir) {
1793 printf("\etS: notification contained unexpected "
1794 "system call number; bye!!!\en");
1798 bool pathOK = getTargetPathname(req, notifyFd, 0, path,
1801 /* Prepopulate some fields of the response */
1803 resp\->id = req\->id; /* Response includes notification ID */
1807 /* If getTargetPathname() failed, trigger an EINVAL error
1808 response (sending this response may yield an error if the
1809 failure occurred because the notification ID was no longer
1810 valid); if the directory is in /tmp, then create it on behalf
1811 of the supervisor; if the pathname starts with \(aq.\(aq, tell the
1812 kernel to let the target process execute the mkdir();
1813 otherwise, give an error for a directory pathname in any other
1817 resp->error = -EINVAL;
1818 printf("\etS: spoofing error for invalid pathname (%s)\en",
1819 strerror(-resp->error));
1820 } else if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0) {
1821 printf("\etS: executing: mkdir(\e"%s\e", %#llo)\en",
1822 path, req\->data.args[1]);
1824 if (mkdir(path, req\->data.args[1]) == 0) {
1825 resp\->error = 0; /* "Success" */
1826 resp\->val = strlen(path); /* Used as return value of
1827 mkdir() in target */
1828 printf("\etS: success! spoofed return = %lld\en",
1832 /* If mkdir() failed in the supervisor, pass the error
1833 back to the target */
1835 resp\->error = \-errno;
1836 printf("\etS: failure! (errno = %d; %s)\en", errno,
1839 } else if (strncmp(path, "./", strlen("./")) == 0) {
1840 resp\->error = resp\->val = 0;
1841 resp\->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
1842 printf("\etS: target can execute system call\en");
1844 resp\->error = \-EOPNOTSUPP;
1845 printf("\etS: spoofing error response (%s)\en",
1846 strerror(\-resp\->error));
1849 /* Send a response to the notification */
1851 printf("\etS: sending response "
1852 "(flags = %#x; val = %lld; error = %d)\en",
1853 resp\->flags, resp\->val, resp\->error);
1855 if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == \-1) {
1856 if (errno == ENOENT)
1857 printf("\etS: response failed with ENOENT; "
1858 "perhaps target process\(aqs syscall was "
1859 "interrupted by a signal?\en");
1861 perror("ioctl\-SECCOMP_IOCTL_NOTIF_SEND");
1864 /* If the pathname is just "/bye", then the supervisor breaks out
1865 of the loop and terminates. This allows us to see what happens
1866 if the target process makes further calls to mkdir(2). */
1868 if (strcmp(path, "/bye") == 0)
1874 printf("\etS: terminating **********\en");
1878 /* Implementation of the supervisor process:
1880 (1) obtains the notification file descriptor from \(aqsockPair[1]\(aq
1881 (2) handles notifications that arrive on that file descriptor. */
1884 supervisor(int sockPair[2])
1886 int notifyFd = recvfd(sockPair[1]);
1887 if (notifyFd == \-1)
1890 closeSocketPair(sockPair); /* We no longer need the socket pair */
1892 handleNotifications(notifyFd);
1896 main(int argc, char *argv[])
1900 setbuf(stdout, NULL);
1903 fprintf(stderr, "At least one pathname argument is required\en");
1907 /* Create a UNIX domain socket that is used to pass the seccomp
1908 notification file descriptor from the target process to the
1909 supervisor process. */
1911 if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == \-1)
1912 errExit("socketpair");
1914 /* Create a child process\-\-the "target"\-\-that installs seccomp
1915 filtering. The target process writes the seccomp notification
1916 file descriptor onto \(aqsockPair[0]\(aq and then calls mkdir(2) for
1917 each directory in the command\-line arguments. */
1919 (void) targetProcess(sockPair, &argv[optind]);
1921 /* Catch SIGCHLD when the target terminates, so that the
1922 supervisor can also terminate. */
1924 struct sigaction sa;
1925 sa.sa_handler = sigchldHandler;
1927 sigemptyset(&sa.sa_mask);
1928 if (sigaction(SIGCHLD, &sa, NULL) == \-1)
1929 errExit("sigaction");
1931 supervisor(sockPair);
1939 .BR pidfd_getfd (2),
1942 A further example program can be found in the kernel source file
1943 .IR samples/seccomp/user-trap.c .