.\" Copyright (c) 2013 by Michael Kerrisk <mtk.manpages@gmail.com>
.\" and Copyright (c) 2012 by Eric W. Biederman <ebiederm@xmission.com>
.\"
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date.  The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein.  The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\"
.TH USER_NAMESPACES 7 2013-01-14 "Linux" "Linux Programmer's Manual"
.SH NAME
user_namespaces \- overview of Linux user namespaces
.SH DESCRIPTION
For an overview of namespaces, see
.BR namespaces (7).
.PP
User namespaces isolate security-related identifiers, in particular,
user IDs and group IDs (see
.BR credentials (7)),
keys (see
.BR keyctl (2)),
.\" FIXME: This page says very little about the interaction
.\" of user namespaces and keys. Add something on this topic.
and capabilities (see
.BR capabilities (7)).
A process's user and group IDs can be different
inside and outside a user namespace.
In particular,
a process can have a normal unprivileged user ID outside a user namespace
while at the same time having a user ID of 0 inside the namespace;
in other words,
the process has full privileges for operations inside the user namespace,
but is unprivileged for operations outside the namespace.
.\"
.\" ============================================================
.\"
.SS Nested namespaces, namespace membership
User namespaces can be nested;
that is, each user namespace\(emexcept the initial ("root")
namespace\(emhas a parent user namespace,
and can have zero or more child user namespaces.
The parent user namespace is the user namespace
of the process that creates the user namespace via a call to
.BR unshare (2)
or
.BR clone (2)
with the
.BR CLONE_NEWUSER
flag.
.PP
Each process is a member of exactly one user namespace.
A process created via
.BR fork (2)
or
.BR clone (2)
without the
.BR CLONE_NEWUSER
flag is a member of the same user namespace as its parent.
A process can join another user namespace with
.BR setns (2)
if it has the
.BR CAP_SYS_ADMIN
capability in that namespace;
upon doing so, it gains a full set of capabilities in that namespace.
.PP
A call to
.BR clone (2)
or
.BR unshare (2)
with the
.BR CLONE_NEWUSER
flag makes the new child process (for
.BR clone (2))
or the caller (for
.BR unshare (2))
a member of the new user namespace created by the call.
.\"
.\" ============================================================
.\"
.SS Capabilities
The child process created by
.BR clone (2)
with the
.BR CLONE_NEWUSER
flag starts out with a complete set
of capabilities in the new user namespace.
Likewise, a process that creates a new user namespace using
.BR unshare (2)
or joins an existing user namespace using
.BR setns (2)
gains a full set of capabilities in that namespace.
On the other hand,
that process has no capabilities in the parent (in the case of
.BR clone (2))
or previous (in the case of
.BR unshare (2)
and
.BR setns (2))
user namespace,
even if the new namespace is created or joined by the root user
(i.e., a process with user ID 0 in the root namespace).
Nevertheless, a process owned by the root user
will be able to access resources such as
files that are owned by user ID 0,
and will be able to do things such as sending signals
to processes belonging to user ID 0.
.PP
A call to
.BR execve (2)
will cause a process to lose any capabilities that it has,
unless it has a user ID of 0 within the namespace.
For the process to keep its capabilities across
.BR execve (2),
a user ID mapping for ID 0 must be defined,
and the caller may also need to use
.BR setuid (2)
or similar to set its user ID to 0.
.PP
A call to
.BR clone (2),
.BR unshare (2),
or
.BR setns (2)
using the
.BR CLONE_NEWUSER
flag sets the "securebits" flags
(see
.BR capabilities (7))
to their default values (all flags disabled) in the child (for
.BR clone (2))
or caller (for
.BR unshare (2)
or
.BR setns (2)).
Note that because the caller no longer has capabilities
in its original user namespace after a call to
.BR setns (2),
it is not possible for a process to reset its "securebits" flags while
retaining its user namespace membership by using a pair of
.BR setns (2)
calls to move to another user namespace and then return to
its original user namespace.
.PP
Having a capability inside a user namespace
permits a process to perform operations (that require privilege)
only on resources governed by that namespace.
The rules for determining whether or not a process has a capability
in a particular user namespace are as follows:
.IP 1. 3
A process has a capability inside a user namespace
if it is a member of that namespace and
it has the capability in its effective capability set.
A process can gain capabilities in its effective capability
set in various ways.
For example, it may execute a set-user-ID program or an
executable with associated file capabilities.
In addition,
a process may gain capabilities via the effect of
.BR clone (2),
.BR unshare (2),
or
.BR setns (2),
as already described.
.\" In the 3.8 sources, see security/commoncap.c::cap_capable():
.IP 2.
If a process has a capability in a user namespace,
then it has that capability in all child (and further removed descendant)
namespaces as well.
.IP 3.
.\" * The owner of the user namespace in the parent of the
.\" * user namespace has all caps.
When a user namespace is created, the kernel records the effective
user ID of the creating process as being the "owner" of the namespace.
.\" (and likewise associates the effective group ID of the creating process
.\" with the namespace).
A process that resides
in the parent of the user namespace
.\" See kernel commit 520d9eabce18edfef76a60b7b839d54facafe1f9 for a fix
.\" on this point
and whose effective user ID matches the owner of the namespace
has all capabilities in the namespace.
.\" This includes the case where the process executes a set-user-ID
.\" program that confers the effective UID of the creator of the namespace.
By virtue of the previous rule,
this means that the process has all capabilities in all
further removed descendant user namespaces as well.
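.PP
The three rules above can be read as a walk up the namespace tree.
The following is only an illustrative user-space model, not kernel code:
the types
.I struct userns
and
.I struct proc_cred
and the function
.I has_capability()
are hypothetical names introduced for this sketch,
and user IDs are assumed to be expressed in one common ID space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a user namespace (not a kernel structure) */
struct userns {
    const struct userns *parent;   /* NULL for the initial namespace */
    unsigned int owner;            /* Effective UID of the creator */
};

/* Hypothetical model of the credentials of one process */
struct proc_cred {
    const struct userns *ns;       /* Namespace the process belongs to */
    unsigned int euid;             /* Effective UID (same ID space as owner) */
    uint64_t effective;            /* Effective set, one bit per capability */
};

/* Return true if the process has capability number cap in namespace target */
static bool
has_capability(const struct proc_cred *p, const struct userns *target, int cap)
{
    const struct userns *ns;

    assert(cap >= 0 && cap < 64);

    /* Walking from target toward the root implements rule 2 */
    for (ns = target; ns != NULL; ns = ns->parent) {
        /* Rule 1: member of this namespace, capability in effective set */
        if (ns == p->ns)
            return (p->effective >> cap) & 1;

        /* Rule 3: process is in the parent and owns this namespace */
        if (ns->parent == p->ns && ns->owner == p->euid)
            return true;
    }

    return false;   /* target is not the process's namespace or a descendant */
}
```

With this model, a process in the parent namespace whose effective user ID
equals the owner of a child namespace is reported as fully capable
in that child and in every namespace below it,
while a process inside the child never gains capabilities in the parent.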
.\"
.\" ============================================================
.\"
.SS Interaction of user namespaces and other types of namespaces
Starting in Linux 3.8, unprivileged processes can create user namespaces,
and mount, PID, IPC, network, and UTS namespaces can be created with just the
.B CAP_SYS_ADMIN
capability in the caller's user namespace.
.PP
If
.BR CLONE_NEWUSER
is specified along with other
.B CLONE_NEW*
flags in a single
.BR clone (2)
or
.BR unshare (2)
call, the user namespace is guaranteed to be created first,
giving the child
.RB ( clone (2))
or caller
.RB ( unshare (2))
privileges over the remaining namespaces created by the call.
Thus, it is possible for an unprivileged caller to specify this combination
of flags.
.PP
When a new IPC, mount, network, PID, or UTS namespace is created via
.BR clone (2)
or
.BR unshare (2),
the kernel records the user namespace of the creating process against
the new namespace.
(This association can't be changed.)
When a process in the new namespace subsequently performs
privileged operations that operate on global
resources isolated by the namespace,
the permission checks are performed according to the process's capabilities
in the user namespace that the kernel associated with the new namespace.
.\"
.\" ============================================================
.\"
.SS User and group ID mappings: uid_map and gid_map
When a user namespace is created,
it starts out without a mapping of user IDs (group IDs)
to the parent user namespace.
The
.IR /proc/[pid]/uid_map
and
.IR /proc/[pid]/gid_map
files (available since Linux 3.5)
.\" commit 22d917d80e842829d0ca0a561967d728eb1d6303
expose the mappings for user and group IDs
inside the user namespace for the process
.IR pid .
These files can be read to view the mappings in a user namespace and
written to (once) to define the mappings.
.PP
The description in the following paragraphs explains the details for
.IR uid_map ;
.IR gid_map
is exactly the same,
but each instance of "user ID" is replaced by "group ID".
.PP
The
.I uid_map
file exposes the mapping of user IDs from the user namespace
of the process
.IR pid
to the user namespace of the process that opened
.IR uid_map
(but see a qualification to this point below).
In other words, processes that are in different user namespaces
will potentially see different values when reading from a particular
.I uid_map
file, depending on the user ID mappings for the user namespaces
of the reading processes.
.PP
Each line in the
.I uid_map
file specifies a 1-to-1 mapping of a range of contiguous
user IDs between two user namespaces.
(When a user namespace is first created, this file is empty.)
The specification in each line takes the form of
three numbers delimited by white space.
The first two numbers specify the starting user ID in
each of the two user namespaces.
The third number specifies the length of the mapped range.
In detail, the fields are interpreted as follows:
.IP (1) 4
The start of the range of user IDs in
the user namespace of the process
.IR pid .
.IP (2)
The start of the range of user
IDs to which the user IDs specified by field one map.
How field two is interpreted depends on whether the process that opened
.IR uid_map
and the process
.IR pid
are in the same user namespace, as follows:
.RS
.IP a) 3
If the two processes are in different user namespaces:
field two is the start of a range of
user IDs in the user namespace of the process that opened
.IR uid_map .
.IP b)
If the two processes are in the same user namespace:
field two is the start of the range of
user IDs in the parent user namespace of the process
.IR pid .
This case enables the opener of
.I uid_map
(the common case here is opening
.IR /proc/self/uid_map )
to see the mapping of user IDs into the user namespace of the process
that created this user namespace.
.RE
.IP (3)
The length of the range of user IDs that is mapped between the two
user namespaces.
.PP
System calls that return user IDs (group IDs)\(emfor example,
.BR getuid (2),
.BR getgid (2),
and the credential fields in the structure returned by
.BR stat (2)\(emreturn
the user ID (group ID) mapped into the caller's user namespace.
.PP
When a process accesses a file, its user and group IDs
are mapped into the initial user namespace for the purpose of permission
checking and assigning IDs when creating a file.
When a process retrieves file user and group IDs via
.BR stat (2),
the IDs are mapped in the opposite direction,
to produce values relative to the process user and group ID mappings.
.PP
The initial user namespace has no parent namespace,
but, for consistency, the kernel provides dummy user and group
ID mapping files for this namespace.
Looking at the
.I uid_map
file
.RI ( gid_map
is the same) from a shell in the initial namespace shows:
.PP
.in +4n
.nf
$ \fBcat /proc/$$/uid_map\fP
         0          0 4294967295
.fi
.in
.PP
This mapping tells us
that the range starting at user ID 0 in this namespace
maps to a range starting at 0 in the (nonexistent) parent namespace,
and the length of the range is the largest 32-bit unsigned integer.
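.PP
To make the three-field format concrete,
the following sketch parses one map line and translates an ID
from field one's namespace into field two's namespace.
It is a minimal illustration only:
.IR "struct id_extent" ,
.IR parse_extent (),
and
.IR map_inside_to_outside ()
are hypothetical names, not part of any kernel or library interface.

```c
#include <assert.h>
#include <stdio.h>

/* One line of a uid_map or gid_map file (hypothetical helper type) */
struct id_extent {
    unsigned long inside;     /* Field 1: first ID inside the namespace */
    unsigned long outside;    /* Field 2: first ID in the other namespace */
    unsigned long count;      /* Field 3: length of the range */
};

/* Parse "inside outside count"; return 0 on success, -1 on error */
static int
parse_extent(const char *line, struct id_extent *e)
{
    assert(line != NULL && e != NULL);

    if (sscanf(line, "%lu %lu %lu", &e->inside, &e->outside, &e->count) != 3)
        return -1;
    return 0;
}

/* Translate an ID using one extent; return -1 if the ID is not covered */
static long long
map_inside_to_outside(const struct id_extent *e, unsigned long id)
{
    if (id < e->inside || id - e->inside >= e->count)
        return -1;
    return (long long) e->outside + (long long) (id - e->inside);
}
```

For the line "0 1000 1", ID 0 translates to 1000 and every other ID
is reported as unmapped.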
.\"
.\" ============================================================
.\"
.SS Defining user and group ID mappings: writing to uid_map and gid_map
.PP
After the creation of a new user namespace, the
.I uid_map
file of
.I one
of the processes in the namespace may be written to
.I once
to define the mapping of user IDs in the new user namespace.
An attempt to write more than once to a
.I uid_map
file in a user namespace fails with the error
.BR EPERM .
Similar rules apply for
.I gid_map
files.
.PP
The lines written to
.IR uid_map
.RI ( gid_map )
must conform to the following rules:
.IP * 3
The three fields must be valid numbers,
and the last field must be greater than 0.
.IP *
Lines are terminated by newline characters.
.IP *
There is an (arbitrary) limit on the number of lines in the file.
As of Linux 3.8, the limit is five lines.
In addition, the number of bytes written to
the file must be less than the system page size,
.\" FIXME(Eric): the restriction "less than" rather than "less than or equal"
.\" seems strangely arbitrary. Furthermore, the comment does not agree
.\" with the code in kernel/user_namespace.c. Which is correct.
and the write must be performed at the start of the file (i.e.,
.BR lseek (2)
and
.BR pwrite (2)
can't be used to write to nonzero offsets in the file).
.IP *
The range of user IDs (group IDs)
specified in each line cannot overlap with the ranges
specified by any other line.
In the initial implementation (Linux 3.8), this requirement was
satisfied by a simplistic implementation that imposed the further
requirement that
the values in both field 1 and field 2 of successive lines must be
in ascending numerical order,
which prevented some otherwise valid maps from being created.
Linux 3.9 and later
.\" commit 0bd14b4fd72afd5df41e9fd59f356740f22fceba
fix this limitation, allowing any valid set of nonoverlapping maps.
.IP *
At least one line must be written to the file.
.PP
Writes that violate the above rules fail with the error
.BR EINVAL .
.PP
In order for a process to write to the
.I /proc/[pid]/uid_map
.RI ( /proc/[pid]/gid_map )
file, all of the following requirements must be met:
.IP 1. 3
The writing process must have the
.B CAP_SETUID
.RB ( CAP_SETGID )
capability in the user namespace of the process
.IR pid .
.IP 2.
The writing process must be in either the user namespace of the process
.IR pid
or inside the parent user namespace of the process
.IR pid .
.IP 3.
The mapped user IDs (group IDs) must in turn have a mapping
in the parent user namespace.
.IP 4.
One of the following is true:
.RS
.IP * 3
The data written to
.I uid_map
.RI ( gid_map )
consists of a single line that maps the writing process's file system user ID
(group ID) in the parent user namespace to a user ID (group ID)
in the user namespace.
The usual case here is that this single line provides a mapping for the
user ID of the process that created the namespace.
.IP * 3
The process has the
.BR CAP_SETUID
.RB ( CAP_SETGID )
capability in the parent user namespace.
Thus, a privileged process can make mappings to arbitrary user IDs (group IDs)
in the parent user namespace.
.RE
.PP
Writes that violate the above rules fail with the error
.BR EPERM .
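.PP
A program that builds a map may want to check the format rules itself
before writing, so that it can report a precise error.
The following sketch checks the line count, field syntax, nonzero length,
and overlap rules for a map given as an array of lines.
It is an assumption-laden illustration:
.IR map_lines_valid ()
is a hypothetical helper,
the overlap test is applied to both field 1 and field 2 ranges,
and the page-size, file-offset, and permission rules are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_MAP_LINES 5     /* Line limit stated for Linux 3.8 */

struct id_extent {
    unsigned long long inside, outside, count;
};

/* True if [a, a+alen) and [b, b+blen) share at least one ID */
static bool
overlaps(unsigned long long a, unsigned long long alen,
         unsigned long long b, unsigned long long blen)
{
    return a < b + blen && b < a + alen;
}

/* Check the lines of a proposed map against the format rules */
static bool
map_lines_valid(const char *const lines[], size_t nlines)
{
    struct id_extent ext[MAX_MAP_LINES];
    size_t i, j;

    /* At least one line, and no more than the limit */
    if (nlines == 0 || nlines > MAX_MAP_LINES)
        return false;

    for (i = 0; i < nlines; i++) {
        int used = 0;

        assert(lines[i] != NULL);

        /* Exactly three numbers, nothing after them, length > 0 */
        if (sscanf(lines[i], "%llu %llu %llu%n", &ext[i].inside,
                   &ext[i].outside, &ext[i].count, &used) != 3)
            return false;
        if (lines[i][used] != 0 || ext[i].count == 0)
            return false;

        /* No overlap with any earlier line, on either side of the map */
        for (j = 0; j < i; j++)
            if (overlaps(ext[i].inside, ext[i].count,
                         ext[j].inside, ext[j].count) ||
                overlaps(ext[i].outside, ext[i].count,
                         ext[j].outside, ext[j].count))
                return false;
    }

    return true;
}
```

Because the check compares every pair of lines,
it accepts lines in any order, matching the Linux 3.9 behavior
rather than the ascending-order restriction of Linux 3.8.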
.\"
.\" ============================================================
.\"
.SS Unmapped user and group IDs
.PP
There are various places where an unmapped user ID (group ID)
may be exposed to user space.
For example, the first process in a new user namespace may call
.BR getuid (2)
before a user ID mapping has been defined for the namespace.
In most such cases, an unmapped user ID is converted
.\" from_kuid_munged(), from_kgid_munged()
to the overflow user ID (group ID);
the default value for the overflow user ID (group ID) is 65534.
See the descriptions of
.IR /proc/sys/kernel/overflowuid
and
.IR /proc/sys/kernel/overflowgid
in
.BR proc (5).
.PP
The cases where unmapped IDs are mapped in this fashion include
system calls that return user IDs
.RB ( getuid (2),
.BR getgid (2),
and similar),
credentials passed over a UNIX domain socket,
credentials returned by
.BR stat (2),
.BR waitid (2),
and the System V IPC "ctl"
.B IPC_STAT
operations,
credentials exposed by
.IR /proc/PID/status
and the files in
.IR /proc/sysvipc/* ,
credentials returned via the
.I si_uid
field in the
.I siginfo_t
received with a signal (see
.BR sigaction (2)),
credentials written to the process accounting file (see
.BR acct (5)),
and credentials returned with POSIX message queue notifications (see
.BR mq_notify (3)).
.PP
There is one notable case where unmapped user and group IDs are
.I not
.\" from_kuid(), from_kgid()
.\" Also F_GETOWNER_UIDS is an exception
converted to the corresponding overflow ID value.
When viewing a
.I uid_map
or
.I gid_map
file in which there is no mapping for the second field,
that field is displayed as 4294967295 (\-1 as an unsigned integer).
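.PP
The conversion to the overflow ID can be pictured as a lookup with a
fallback value.
The sketch below is a user-space illustration only:
.IR outside_to_inside_munged ()
is a hypothetical name,
and the constant assumes the default overflow ID of 65534
rather than reading
.IR /proc/sys/kernel/overflowuid .

```c
#include <assert.h>
#include <stddef.h>

#define OVERFLOW_ID 65534UL   /* Default overflow ID; assumed, not read from /proc */

struct id_extent {
    unsigned long inside;     /* First ID inside the namespace */
    unsigned long outside;    /* First ID outside the namespace */
    unsigned long count;      /* Length of the range */
};

/* Translate an outside ID to the value seen inside the namespace,
   substituting the overflow ID when no extent covers it */
static unsigned long
outside_to_inside_munged(const struct id_extent *map, size_t n,
                         unsigned long id)
{
    size_t i;

    assert(map != NULL || n == 0);

    for (i = 0; i < n; i++)
        if (id >= map[i].outside && id - map[i].outside < map[i].count)
            return map[i].inside + (id - map[i].outside);

    return OVERFLOW_ID;
}
```

With an empty map (the state of a namespace before its
.I uid_map
has been written) every ID is reported as 65534.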
.\"
.\" ============================================================
.\"
.SS Set-user-ID and set-group-ID programs
.PP
When a process inside a user namespace executes
a set-user-ID (set-group-ID) program,
the process's effective user (group) ID inside the namespace is changed
to whatever value is mapped for the user (group) ID of the file.
However, if either the user
.I or
the group ID of the file has no mapping inside the namespace,
the set-user-ID (set-group-ID) bit is silently ignored:
the new program is executed,
but the process's effective user (group) ID is left unchanged.
(This mirrors the semantics of executing a set-user-ID or set-group-ID
program that resides on a file system that was mounted with the
.B MS_NOSUID
flag, as described in
.BR mount (2).)
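.PP
The rule above can be expressed as a small function over the two maps.
This is a sketch of the described behavior, not kernel code:
.IR exec_set_id (),
.IR lookup (),
and the structures are hypothetical,
and file IDs are assumed to be given as values outside the namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct id_extent {
    unsigned long inside, outside, count;
};

struct ids {
    unsigned long euid, egid;   /* Effective IDs as seen inside the namespace */
};

/* Find the inside ID for an outside ID; return false if unmapped */
static bool
lookup(const struct id_extent *map, size_t n, unsigned long outside,
       unsigned long *inside)
{
    size_t i;

    assert(inside != NULL);

    for (i = 0; i < n; i++)
        if (outside >= map[i].outside &&
            outside - map[i].outside < map[i].count) {
            *inside = map[i].inside + (outside - map[i].outside);
            return true;
        }
    return false;
}

/* Effective IDs after executing a file with the given owner and mode bits */
static struct ids
exec_set_id(struct ids cur, bool set_uid, bool set_gid,
            unsigned long file_uid, unsigned long file_gid,
            const struct id_extent *uid_map, size_t n_uid,
            const struct id_extent *gid_map, size_t n_gid)
{
    unsigned long u, g;

    /* If either file ID is unmapped, the bits are silently ignored */
    if (!lookup(uid_map, n_uid, file_uid, &u) ||
        !lookup(gid_map, n_gid, file_gid, &g))
        return cur;

    if (set_uid)
        cur.euid = u;
    if (set_gid)
        cur.egid = g;
    return cur;
}
```

For example, a set-user-ID file owned by user ID 0 of the initial namespace
leaves the effective user ID unchanged when only ID 1000 is mapped.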
.\"
.\" ============================================================
.\"
.SS Miscellaneous
.PP
When a process's user and group IDs are passed over a UNIX domain socket
to a process in a different user namespace (see the description of
.B SCM_CREDENTIALS
in
.BR unix (7)),
they are translated into the corresponding values as per the
receiving process's user and group ID mappings.
.SH CONFORMING TO
Namespaces are a Linux-specific feature.
.SH NOTES
Over the years, many features have been added to the Linux kernel
that are available only to privileged users
because of their potential to confuse set-user-ID-root applications.
In general, it becomes safe to allow the root user in a user namespace to
use those features because it is impossible, while in a user namespace,
to gain more privilege than the root user of a user namespace has.
.SS Availability
Use of user namespaces requires a kernel that is configured with the
.B CONFIG_USER_NS
option.
User namespaces require support in a range of subsystems across
the kernel.
When an unsupported subsystem is configured into the kernel,
it is not possible to configure user namespaces support.
As of Linux 3.8, most relevant subsystems support user namespaces,
but there are a number of file systems that do not.
Linux 3.9 added user namespaces support for many of the remaining
unsupported file systems:
Plan 9 (9P), Andrew File System (AFS), Ceph, CIFS, CODA, NFS, and OCFS2.
XFS support for user namespaces is not yet available.
.SH EXAMPLE
The program below is designed to allow experimenting with
user namespaces, as well as other types of namespaces.
It creates namespaces as specified by command-line options and then executes
a command inside those namespaces.
The comments and
.I usage()
function inside the program provide a full explanation of the program.
The following shell session demonstrates its use.
.PP
First, we look at the run-time environment:
.PP
.in +4n
.nf
$ \fBuname -rs\fP     # Need Linux 3.8 or later
Linux 3.8.0
$ \fBid -u\fP         # Running as unprivileged user
1000
$ \fBid -g\fP
1000
.fi
.in
.PP
Now start a new shell in new user
.RI ( \-U ),
mount
.RI ( \-m ),
and PID
.RI ( \-p )
namespaces, with user ID
.RI ( \-M )
and group ID
.RI ( \-G )
1000 mapped to 0 inside the user namespace:
.PP
.in +4n
.nf
$ \fB./userns_child_exec -p -m -U -M '0 1000 1' -G '0 1000 1' bash\fP
.fi
.in
.PP
The shell has PID 1, because it is the first process in the new
PID namespace:
.PP
.in +4n
.nf
bash$ \fBecho $$\fP
1
.fi
.in
.PP
Inside the user namespace, the shell has user and group ID 0,
and a full set of permitted and effective capabilities:
.PP
.in +4n
.nf
bash$ \fBcat /proc/$$/status | egrep '^[UG]id'\fP
Uid:    0    0    0    0
Gid:    0    0    0    0
bash$ \fBcat /proc/$$/status | egrep '^Cap(Prm|Inh|Eff)'\fP
CapInh: 0000000000000000
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
.fi
.in
.PP
Mounting a new
.I /proc
file system and listing all of the processes visible
in the new PID namespace shows that the shell can't see
any processes outside the PID namespace:
.PP
.in +4n
.nf
bash$ \fBmount -t proc proc /proc\fP
bash$ \fBps ax\fP
  PID TTY      STAT   TIME COMMAND
    1 pts/3    S      0:00 bash
   22 pts/3    R+     0:00 ps ax
.fi
.in
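.PP
The
.I CapPrm
and
.I CapEff
values shown in the session are hexadecimal bit masks
with one bit per capability.
The following sketch decodes such a mask;
the helper names are hypothetical,
and the capability number 21 used in the usage note is the value of
.B CAP_SYS_ADMIN
in
.IR <linux/capability.h> .

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Convert a mask as printed in /proc/[pid]/status to an integer */
static uint64_t
parse_cap_mask(const char *hex)
{
    assert(hex != NULL);
    return strtoull(hex, NULL, 16);
}

/* Number of capabilities present in the mask */
static int
count_caps(uint64_t mask)
{
    int n = 0;

    while (mask != 0) {
        n += (int) (mask & 1);
        mask >>= 1;
    }
    return n;
}

/* Return 1 if capability number cap is set in the mask, else 0 */
static int
has_cap(uint64_t mask, int cap)
{
    assert(cap >= 0 && cap < 64);
    return (int) ((mask >> cap) & 1);
}
```

Applied to the mask 0000001fffffffff,
.IR count_caps ()
yields 37, that is, capability numbers 0 through 36,
and
.IR has_cap ()
reports that capability 21 is present.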
.SS Program source
.nf
/* userns_child_exec.c

   Licensed under GNU General Public License v2 or later

   Create a child process that executes a shell command in new
   namespace(s); allow UID and GID mappings to be specified when
   creating a user namespace.
*/
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <signal.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <errno.h>

/* A simple error\-handling function: print an error message based
   on the value in \(aqerrno\(aq and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \\
                        } while (0)

struct child_args {
    char **argv;        /* Command to be executed by child, with args */
    int    pipe_fd[2];  /* Pipe used to synchronize parent and child */
};

static int verbose;

static void
usage(char *pname)
{
    fprintf(stderr, "Usage: %s [options] cmd [arg...]\\n\\n", pname);
    fprintf(stderr, "Create a child process that executes a shell "
            "command in a new user namespace,\\n"
            "and possibly also other new namespace(s).\\n\\n");
    fprintf(stderr, "Options can be:\\n\\n");
#define fpe(str) fprintf(stderr, "    %s", str);
    fpe("\-i          New IPC namespace\\n");
    fpe("\-m          New mount namespace\\n");
    fpe("\-n          New network namespace\\n");
    fpe("\-p          New PID namespace\\n");
    fpe("\-u          New UTS namespace\\n");
    fpe("\-U          New user namespace\\n");
    fpe("\-M uid_map  Specify UID map for user namespace\\n");
    fpe("\-G gid_map  Specify GID map for user namespace\\n");
    fpe("\-z          Map user\(aqs UID and GID to 0 in user namespace\\n");
    fpe("            (equivalent to: \-M \(aq0 <uid> 1\(aq \-G \(aq0 <gid> 1\(aq)\\n");
    fpe("\-v          Display verbose messages\\n");
    fpe("\\n");
    fpe("If \-z, \-M, or \-G is specified, \-U is required.\\n");
    fpe("It is not permitted to specify both \-z and either \-M or \-G.\\n");
    fpe("\\n");
    fpe("Map strings for \-M and \-G consist of records of the form:\\n");
    fpe("\\n");
    fpe("    ID\-inside\-ns   ID\-outside\-ns   len\\n");
    fpe("\\n");
    fpe("A map string can contain multiple records, separated"
        " by commas;\\n");
    fpe("the commas are replaced by newlines before writing"
        " to map files.\\n");

    exit(EXIT_FAILURE);
}

/* Update the mapping file \(aqmap_file\(aq, with the value provided in
   \(aqmapping\(aq, a string that defines a UID or GID mapping. A UID or
   GID mapping consists of one or more newline\-delimited records
   of the form:

       ID\-inside\-ns    ID\-outside\-ns   length

   Requiring the user to supply a string that contains newlines is
   of course inconvenient for command\-line use. Thus, we permit the
   use of commas to delimit records in this string, and replace them
   with newlines before writing the string to the file. */

static void
update_map(char *mapping, char *map_file)
{
    int fd;
    size_t j;
    size_t map_len;     /* Length of \(aqmapping\(aq */

    /* Replace commas in mapping string with newlines */

    map_len = strlen(mapping);
    for (j = 0; j < map_len; j++)
        if (mapping[j] == \(aq,\(aq)
            mapping[j] = \(aq\\n\(aq;

    /* Failures are reported but are not fatal: the child then
       simply runs with unmapped IDs */

    fd = open(map_file, O_RDWR);
    if (fd == \-1) {
        fprintf(stderr, "ERROR: open %s: %s\\n", map_file, strerror(errno));
        return;         /* No descriptor to write to or close */
    }

    if (write(fd, mapping, map_len) != (ssize_t) map_len)
        fprintf(stderr, "ERROR: write %s: %s\\n", map_file, strerror(errno));

    close(fd);
}

static int              /* Start function for cloned child */
childFunc(void *arg)
{
    struct child_args *args = (struct child_args *) arg;
    char ch;

    /* Wait until the parent has updated the UID and GID mappings.
       See the comment in main(). We wait for end of file on a
       pipe that will be closed by the parent process once it has
       updated the mappings. */

    close(args\->pipe_fd[1]);    /* Close our descriptor for the write
                                   end of the pipe so that we see EOF
                                   when parent closes its descriptor */
    if (read(args\->pipe_fd[0], &ch, 1) != 0) {
        fprintf(stderr,
                "Failure in child: read from pipe returned != 0\\n");
        exit(EXIT_FAILURE);
    }

    /* Execute a shell command */

    printf("About to exec %s\\n", args\->argv[0]);
    execvp(args\->argv[0], args\->argv);
    errExit("execvp");
}

#define STACK_SIZE (1024 * 1024)
#define MAP_BUF_SIZE 100

static char child_stack[STACK_SIZE];    /* Space for child\(aqs stack */

int
main(int argc, char *argv[])
{
    int flags, opt, map_zero;
    pid_t child_pid;
    struct child_args args;
    char *uid_map, *gid_map;
    char map_buf[MAP_BUF_SIZE];
    char map_path[PATH_MAX];

    /* Parse command\-line options. The initial \(aq+\(aq character in
       the final getopt() argument prevents GNU\-style permutation
       of command\-line options. That\(aqs useful, since sometimes
       the \(aqcommand\(aq to be executed by this program itself
       has command\-line options. We don\(aqt want getopt() to treat
       those as options to this program. */

    flags = 0;
    verbose = 0;
    gid_map = NULL;
    uid_map = NULL;
    map_zero = 0;
    while ((opt = getopt(argc, argv, "+imnpuUM:G:zv")) != \-1) {
        switch (opt) {
        case \(aqi\(aq: flags |= CLONE_NEWIPC;        break;
        case \(aqm\(aq: flags |= CLONE_NEWNS;         break;
        case \(aqn\(aq: flags |= CLONE_NEWNET;        break;
        case \(aqp\(aq: flags |= CLONE_NEWPID;        break;
        case \(aqu\(aq: flags |= CLONE_NEWUTS;        break;
        case \(aqv\(aq: verbose = 1;                  break;
        case \(aqz\(aq: map_zero = 1;                 break;
        case \(aqM\(aq: uid_map = optarg;             break;
        case \(aqG\(aq: gid_map = optarg;             break;
        case \(aqU\(aq: flags |= CLONE_NEWUSER;       break;
        default:  usage(argv[0]);
        }
    }

    /* \-M or \-G without \-U is nonsensical */

    if (((uid_map != NULL || gid_map != NULL || map_zero) &&
                !(flags & CLONE_NEWUSER)) ||
            (map_zero && (uid_map != NULL || gid_map != NULL)))
        usage(argv[0]);

    args.argv = &argv[optind];

    /* We use a pipe to synchronize the parent and child, in order to
       ensure that the parent sets the UID and GID maps before the child
       calls execve(). This ensures that the child maintains its
       capabilities during the execve() in the common case where we
       want to map the child\(aqs effective user ID to 0 in the new user
       namespace. Without this synchronization, the child would lose
       its capabilities if it performed an execve() with nonzero
       user IDs (see the capabilities(7) man page for details of the
       transformation of a process\(aqs capabilities during execve()). */

    if (pipe(args.pipe_fd) == \-1)
        errExit("pipe");

    /* Create the child in new namespace(s) */

    child_pid = clone(childFunc, child_stack + STACK_SIZE,
                      flags | SIGCHLD, &args);
    if (child_pid == \-1)
        errExit("clone");

    /* Parent falls through to here */

    if (verbose)
        printf("%s: PID of child created by clone() is %ld\\n",
                argv[0], (long) child_pid);

    /* Update the UID and GID maps in the child */

    if (uid_map != NULL || map_zero) {
        snprintf(map_path, PATH_MAX, "/proc/%ld/uid_map",
                (long) child_pid);
        if (map_zero) {
            snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getuid());
            uid_map = map_buf;
        }
        update_map(uid_map, map_path);
    }
    if (gid_map != NULL || map_zero) {
        snprintf(map_path, PATH_MAX, "/proc/%ld/gid_map",
                (long) child_pid);
        if (map_zero) {
            snprintf(map_buf, MAP_BUF_SIZE, "0 %ld 1", (long) getgid());
            gid_map = map_buf;
        }
        update_map(gid_map, map_path);
    }

    /* Close the write end of the pipe, to signal to the child that we
       have updated the UID and GID maps */

    close(args.pipe_fd[1]);

    if (waitpid(child_pid, NULL, 0) == \-1)      /* Wait for child */
        errExit("waitpid");

    if (verbose)
        printf("%s: terminating\\n", argv[0]);

    exit(EXIT_SUCCESS);
}
.fi
.SH SEE ALSO
.BR newgidmap (1), \" From the shadow package
.BR newuidmap (1), \" From the shadow package
.BR clone (2),
.BR setns (2),
.BR unshare (2),
.BR proc (5),
.BR subgid (5), \" From the shadow package
.BR subuid (5), \" From the shadow package
.BR capabilities (7),
.BR credentials (7),
.BR namespaces (7),
.BR pid_namespaces (7)
.sp
The kernel source file
.IR Documentation/namespaces/resource-control.txt .