   1 .\" Copyright (C) 2014 Michael Kerrisk <mtk.manpages@gmail.com>
   2 .\" and Copyright (C) 2014 Peter Zijlstra <peterz@infradead.org>
   3 .\" and Copyright (C) 2014 Juri Lelli <juri.lelli@gmail.com>
   4 .\" Various pieces from the old sched_setscheduler(2) page
   5 .\"     Copyright (C) Tom Bjorkholm, Markus Kuhn & David A. Wheeler 1996-1999
   6 .\"     and Copyright (C) 2007 Carsten Emde <Carsten.Emde@osadl.org>
   7 .\"     and Copyright (C) 2008 Michael Kerrisk <mtk.manpages@gmail.com>
   8 .\"
   9 .\" SPDX-License-Identifier: GPL-2.0-or-later
  10 .\"
  11 .\" Worth looking at: http://rt.wiki.kernel.org/index.php
  12 .\"
  13 .TH sched 7 (date) "Linux man-pages (unreleased)"
  14 .SH NAME
  15 sched \- overview of CPU scheduling
  16 .SH DESCRIPTION
  17 Since Linux 2.6.23, the default scheduler is CFS,
  18 the "Completely Fair Scheduler".
  19 The CFS scheduler replaced the earlier "O(1)" scheduler.
  20 .\"
  21 .SS API summary
  22 Linux provides the following system calls for controlling
  23 the CPU scheduling behavior, policy, and priority of processes
  24 (or, more precisely, threads).
  25 .TP
  26 .BR nice (2)
  27 Set a new nice value for the calling thread,
  28 and return the new nice value.
  29 .TP
  30 .BR getpriority (2)
  31 Return the nice value of a thread, a process group,
  32 or the set of threads owned by a specified user.
  33 .TP
  34 .BR setpriority (2)
  35 Set the nice value of a thread, a process group,
  36 or the set of threads owned by a specified user.
  37 .TP
  38 .BR sched_setscheduler (2)
  39 Set the scheduling policy and parameters of a specified thread.
  40 .TP
  41 .BR sched_getscheduler (2)
  42 Return the scheduling policy of a specified thread.
  43 .TP
  44 .BR sched_setparam (2)
  45 Set the scheduling parameters of a specified thread.
  46 .TP
  47 .BR sched_getparam (2)
  48 Fetch the scheduling parameters of a specified thread.
  49 .TP
  50 .BR sched_get_priority_max (2)
  51 Return the maximum priority available in a specified scheduling policy.
  52 .TP
  53 .BR sched_get_priority_min (2)
  54 Return the minimum priority available in a specified scheduling policy.
  55 .TP
  56 .BR sched_rr_get_interval (2)
  57 Fetch the quantum used for threads that are scheduled under
  58 the "round-robin" scheduling policy.
  59 .TP
  60 .BR sched_yield (2)
  61 Cause the caller to relinquish the CPU,
   62 so that some other thread can be executed.
  63 .TP
  64 .BR sched_setaffinity (2)
  65 (Linux-specific)
  66 Set the CPU affinity of a specified thread.
  67 .TP
  68 .BR sched_getaffinity (2)
  69 (Linux-specific)
  70 Get the CPU affinity of a specified thread.
  71 .TP
  72 .BR sched_setattr (2)
  73 Set the scheduling policy and parameters of a specified thread.
  74 This (Linux-specific) system call provides a superset of the functionality of
  75 .BR sched_setscheduler (2)
  76 and
  77 .BR sched_setparam (2).
  78 .TP
  79 .BR sched_getattr (2)
  80 Fetch the scheduling policy and parameters of a specified thread.
  81 This (Linux-specific) system call provides a superset of the functionality of
  82 .BR sched_getscheduler (2)
  83 and
  84 .BR sched_getparam (2).
  85 .\"
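.P
As a rough illustration of the read-only calls listed above,
the following sketch assumes Python 3 on Linux,
where the standard os module wraps several of these system calls.
The helper name describe_scheduling and the POLICY_NAMES table are
hypothetical and exist only for this example.

```python
import os

# Policy constants that Python's os module exposes on Linux
# (SCHED_DEADLINE is not among them).
POLICY_NAMES = {
    getattr(os, name): name
    for name in ("SCHED_OTHER", "SCHED_FIFO", "SCHED_RR",
                 "SCHED_BATCH", "SCHED_IDLE")
    if hasattr(os, name)
}


def describe_scheduling(pid=0):
    """Summarize the scheduling state of thread `pid` (0 means the caller)."""
    policy = os.sched_getscheduler(pid)
    # The reset-on-fork flag may be reported ORed into the policy value.
    policy &= ~getattr(os, "SCHED_RESET_ON_FORK", 0)
    return {
        "policy": POLICY_NAMES.get(policy, f"unknown ({policy})"),
        "sched_priority": os.sched_getparam(pid).sched_priority,
        "nice": os.getpriority(os.PRIO_PROCESS, pid),
        "cpus": sorted(os.sched_getaffinity(pid)),
    }


if __name__ == "__main__":
    print(describe_scheduling())
```

The sketch only reads attributes, so it needs no privileges.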
  86 .SS Scheduling policies
  87 The scheduler is the kernel component that decides which runnable thread
  88 will be executed by the CPU next.
  89 Each thread has an associated scheduling policy and a \fIstatic\fP
  90 scheduling priority,
  91 .IR sched_priority .
  92 The scheduler makes its decisions based on knowledge of the scheduling
  93 policy and static priority of all threads on the system.
  94 .P
  95 For threads scheduled under one of the normal scheduling policies
  96 (\fBSCHED_OTHER\fP, \fBSCHED_IDLE\fP, \fBSCHED_BATCH\fP),
  97 \fIsched_priority\fP is not used in scheduling
  98 decisions (it must be specified as 0).
  99 .P
  100 Threads scheduled under one of the real-time policies
 101 (\fBSCHED_FIFO\fP, \fBSCHED_RR\fP) have a
 102 \fIsched_priority\fP value in the range 1 (low) to 99 (high).
 103 (As the numbers imply, real-time threads always have higher priority
 104 than normal threads.)
 105 Note well: POSIX.1 requires an implementation to support only a
  106 minimum of 32 distinct priority levels for the real-time policies,
 107 and some systems supply just this minimum.
 108 Portable programs should use
 109 .BR sched_get_priority_min (2)
 110 and
 111 .BR sched_get_priority_max (2)
 112 to find the range of priorities supported for a particular policy.
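.P
A minimal sketch of such a portable range query,
assuming Python 3 on Linux (the os module wraps both system calls).
The function name priority_range is hypothetical.

```python
import os


def priority_range(policy):
    """Return the (min, max) static priority supported for `policy`."""
    return os.sched_get_priority_min(policy), os.sched_get_priority_max(policy)


# Query the range instead of hard-coding it.
for name in ("SCHED_OTHER", "SCHED_FIFO", "SCHED_RR"):
    low, high = priority_range(getattr(os, name))
    print(f"{name}: {low}..{high}")
```

A program would then scale its own priority levels into the reported range
rather than assuming a fixed upper bound.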
 113 .P
 114 Conceptually, the scheduler maintains a list of runnable
 115 threads for each possible \fIsched_priority\fP value.
 116 In order to determine which thread runs next, the scheduler looks for
 117 the nonempty list with the highest static priority and selects the
 118 thread at the head of this list.
 119 .P
 120 A thread's scheduling policy determines
 121 where it will be inserted into the list of threads
 122 with equal static priority and how it will move inside this list.
 123 .P
 124 All scheduling is preemptive: if a thread with a higher static
 125 priority becomes ready to run, the currently running thread
 126 will be preempted and
 127 returned to the wait list for its static priority level.
 128 The scheduling policy determines the
 129 ordering only within the list of runnable threads with equal static
 130 priority.
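.P
The conceptual model above can be sketched as a toy data structure.
This is only an illustration of the description,
not of how the kernel is implemented;
the class and method names are hypothetical.

```python
from collections import deque


class RunLists:
    """Toy model: one list of runnable threads per static priority."""

    def __init__(self):
        self.lists = {}  # static priority -> deque of thread names

    def make_runnable(self, name, priority):
        # A thread that becomes runnable joins the end of its priority list.
        self.lists.setdefault(priority, deque()).append(name)

    def pick_next(self):
        # Select the head of the nonempty list with the highest priority.
        nonempty = [prio for prio, queue in self.lists.items() if queue]
        if not nonempty:
            return None
        return self.lists[max(nonempty)][0]


run_lists = RunLists()
run_lists.make_runnable("editor", 0)    # normal policy, static priority 0
run_lists.make_runnable("audio", 50)    # real-time thread
run_lists.make_runnable("logger", 50)   # same priority, queued behind "audio"
print(run_lists.pick_next())
```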
  131 .SS SCHED_FIFO: First in, first out scheduling
 132 \fBSCHED_FIFO\fP can be used only with static priorities higher than
 133 0, which means that when a \fBSCHED_FIFO\fP thread becomes runnable,
 134 it will always immediately preempt any currently running
 135 \fBSCHED_OTHER\fP, \fBSCHED_BATCH\fP, or \fBSCHED_IDLE\fP thread.
 136 \fBSCHED_FIFO\fP is a simple scheduling
 137 algorithm without time slicing.
 138 For threads scheduled under the
 139 \fBSCHED_FIFO\fP policy, the following rules apply:
 140 .IP \[bu] 3
 141 A running \fBSCHED_FIFO\fP thread that has been preempted by another thread of
 142 higher priority will stay at the head of the list for its priority and
 143 will resume execution as soon as all threads of higher priority are
 144 blocked again.
 145 .IP \[bu]
 146 When a blocked \fBSCHED_FIFO\fP thread becomes runnable, it
 147 will be inserted at the end of the list for its priority.
 148 .IP \[bu]
 149 If a call to
 150 .BR sched_setscheduler (2),
 151 .BR sched_setparam (2),
 152 .BR sched_setattr (2),
 153 .BR pthread_setschedparam (3),
 154 or
 155 .BR pthread_setschedprio (3)
 156 changes the priority of the running or runnable
 157 .B SCHED_FIFO
 158 thread identified by
  159 .IR pid ,
 160 the effect on the thread's position in the list depends on
 161 the direction of the change to the thread's priority:
 162 .RS
 163 .IP (a) 5
 164 If the thread's priority is raised,
 165 it is placed at the end of the list for its new priority.
 166 As a consequence,
 167 it may preempt a currently running thread with the same priority.
 168 .IP (b)
 169 If the thread's priority is unchanged,
 170 its position in the run list is unchanged.
 171 .IP (c)
 172 If the thread's priority is lowered,
 173 it is placed at the front of the list for its new priority.
 174 .RE
 175 .IP
 176 According to POSIX.1-2008,
 177 changes to a thread's priority (or policy) using any mechanism other than
 178 .BR pthread_setschedprio (3)
 179 should result in the thread being placed at the end of
 180 the list for its priority.
 181 .\" In Linux 2.2.x and Linux 2.4.x, the thread is placed at the front of the queue
 182 .\" In Linux 2.0.x, the Right Thing happened: the thread went to the back -- MTK
 183 .IP \[bu]
 184 A thread calling
 185 .BR sched_yield (2)
 186 will be put at the end of the list.
 187 .P
 188 No other events will move a thread
 189 scheduled under the \fBSCHED_FIFO\fP policy in the wait list of
 190 runnable threads with equal static priority.
 191 .P
 192 A \fBSCHED_FIFO\fP
 193 thread runs until either it is blocked by an I/O request, it is
 194 preempted by a higher priority thread, or it calls
 195 .BR sched_yield (2).
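.P
The list-movement rules for priority changes and for yielding
can be modeled with a few lines of Python.
This is a sketch of the rules as worded above, under the assumption that
each priority level is represented by a deque;
the function names are hypothetical.

```python
from collections import deque


def change_priority(lists, name, old, new):
    """Move `name` between per-priority lists following rules (a) to (c)."""
    if new == old:
        return                       # (b) unchanged: position is kept
    lists[old].remove(name)
    target = lists.setdefault(new, deque())
    if new > old:
        target.append(name)          # (a) raised: end of the new list
    else:
        target.appendleft(name)      # (c) lowered: front of the new list


def yield_cpu(lists, priority):
    """Model sched_yield(2): the head of the list moves to the end."""
    lists[priority].rotate(-1)


lists = {10: deque(["a", "b"]), 20: deque(["c"])}
change_priority(lists, "a", 10, 20)
print({prio: list(queue) for prio, queue in lists.items()})
```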
 196 .SS SCHED_RR: Round-robin scheduling
 197 \fBSCHED_RR\fP is a simple enhancement of \fBSCHED_FIFO\fP.
 198 Everything
 199 described above for \fBSCHED_FIFO\fP also applies to \fBSCHED_RR\fP,
 200 except that each thread is allowed to run only for a maximum time
 201 quantum.
 202 If a \fBSCHED_RR\fP thread has been running for a time
 203 period equal to or longer than the time quantum, it will be put at the
 204 end of the list for its priority.
 205 A \fBSCHED_RR\fP thread that has
 206 been preempted by a higher priority thread and subsequently resumes
 207 execution as a running thread will complete the unexpired portion of
 208 its round-robin time quantum.
 209 The length of the time quantum can be
 210 retrieved using
 211 .BR sched_rr_get_interval (2).
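.P
A short sketch of querying the quantum, assuming Python 3 on Linux,
where os.sched_rr_get_interval() returns the interval in seconds.
The helper name rr_quantum_ms is hypothetical.
For a thread that is not scheduled under SCHED_RR,
the reported value should not be read as a round-robin quantum.

```python
import os


def rr_quantum_ms(pid=0):
    """Return the interval reported for `pid` (0 = caller) in milliseconds."""
    return os.sched_rr_get_interval(pid) * 1000.0


print(f"reported quantum: {rr_quantum_ms():.3f} ms")
```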
 212 .\" On Linux 2.4, the length of the RR interval is influenced
 213 .\" by the process nice value -- MTK
 214 .\"
 215 .SS SCHED_DEADLINE: Sporadic task model deadline scheduling
 216 Since Linux 3.14, Linux provides a deadline scheduling policy
 217 .RB ( SCHED_DEADLINE ).
 218 This policy is currently implemented using
 219 GEDF (Global Earliest Deadline First)
 220 in conjunction with CBS (Constant Bandwidth Server).
 221 To set and fetch this policy and associated attributes,
 222 one must use the Linux-specific
 223 .BR sched_setattr (2)
 224 and
 225 .BR sched_getattr (2)
 226 system calls.
 227 .P
 228 A sporadic task is one that has a sequence of jobs, where each
 229 job is activated at most once per period.
 230 Each job also has a
 231 .IR "relative deadline" ,
 232 before which it should finish execution, and a
 233 .IR "computation time" ,
 234 which is the CPU time necessary for executing the job.
 235 The moment when a task wakes up
 236 because a new job has to be executed is called the
 237 .I arrival time
 238 (also referred to as the request time or release time).
 239 The
 240 .I start time
 241 is the time at which a task starts its execution.
 242 The
 243 .I absolute deadline
 244 is thus obtained by adding the relative deadline to the arrival time.
 245 .P
 246 The following diagram clarifies these terms:
 247 .P
 248 .in +4n
 249 .EX
 250 arrival/wakeup                    absolute deadline
 251      |    start time                    |
 252      |        |                         |
 253      v        v                         v
 254 -----x--------xooooooooooooooooo--------x--------x---
 255               |<- comp. time ->|
 256      |<------- relative deadline ------>|
 257      |<-------------- period ------------------->|
 258 .EE
 259 .in
 260 .P
 261 When setting a
 262 .B SCHED_DEADLINE
 263 policy for a thread using
 264 .BR sched_setattr (2),
 265 one can specify three parameters:
 266 .IR Runtime ,
 267 .IR Deadline ,
 268 and
 269 .IR Period .
 270 These parameters do not necessarily correspond to the aforementioned terms:
 271 usual practice is to set Runtime to something bigger than the average
 272 computation time (or worst-case execution time for hard real-time tasks),
 273 Deadline to the relative deadline, and Period to the period of the task.
 274 Thus, for
 275 .B SCHED_DEADLINE
 276 scheduling, we have:
 277 .P
 278 .in +4n
 279 .EX
 280 arrival/wakeup                    absolute deadline
 281      |    start time                    |
 282      |        |                         |
 283      v        v                         v
 284 -----x--------xooooooooooooooooo--------x--------x---
 285               |<-- Runtime ------->|
 286      |<----------- Deadline ----------->|
 287      |<-------------- Period ------------------->|
 288 .EE
 289 .in
 290 .P
 291 The three deadline-scheduling parameters correspond to the
 292 .IR sched_runtime ,
 293 .IR sched_deadline ,
 294 and
 295 .I sched_period
 296 fields of the
 297 .I sched_attr
 298 structure; see
 299 .BR sched_setattr (2).
 300 These fields express values in nanoseconds.
 301 .\" FIXME It looks as though specifying sched_period as 0 means
 302 .\" "make sched_period the same as sched_deadline".
 303 .\" This needs to be documented.
 304 If
 305 .I sched_period
 306 is specified as 0, then it is made the same as
 307 .IR sched_deadline .
 308 .P
 309 The kernel requires that:
 310 .P
 311 .in +4n
 312 .EX
 313 sched_runtime <= sched_deadline <= sched_period
 314 .EE
 315 .in
 316 .P
 317 .\" See __checkparam_dl in kernel/sched/core.c
 318 In addition, under the current implementation,
 319 all of the parameter values must be at least 1024
 320 (i.e., just over one microsecond,
 321 which is the resolution of the implementation), and less than 2\[ha]63.
 322 If any of these checks fails,
 323 .BR sched_setattr (2)
 324 fails with the error
 325 .BR EINVAL .
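.P
The checks described above can be restated as a small predicate.
This sketch mirrors the prose only and is not the kernel's code;
the function name check_deadline_params is hypothetical.
All values are in nanoseconds.

```python
def check_deadline_params(runtime, deadline, period=0):
    """Return True if the values would pass the checks described in the text."""
    if period == 0:
        period = deadline            # 0 means "same as sched_deadline"
    ordered = runtime <= deadline <= period
    in_range = all(1024 <= value < 2**63
                   for value in (runtime, deadline, period))
    return ordered and in_range


# 10 ms of runtime, 30 ms deadline, 100 ms period
print(check_deadline_params(10_000_000, 30_000_000, 100_000_000))
```

A False result corresponds to the cases in which the system call
would be expected to fail with EINVAL.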
 326 .P
 327 The CBS guarantees non-interference between tasks, by throttling
 328 threads that attempt to over-run their specified Runtime.
 329 .P
 330 To ensure deadline scheduling guarantees,
 331 the kernel must prevent situations where the set of
 332 .B SCHED_DEADLINE
 333 threads is not feasible (schedulable) within the given constraints.
  334 The kernel thus performs an admission test when setting or changing
 335 .B SCHED_DEADLINE
 336 policy and attributes.
 337 This admission test calculates whether the change is feasible;
 338 if it is not,
 339 .BR sched_setattr (2)
 340 fails with the error
 341 .BR EBUSY .
 342 .P
  343 For example, it is necessary (but not necessarily sufficient) for
 344 the total utilization to be less than or equal to the total number of
 345 CPUs available, where, since each thread can maximally run for
 346 Runtime per Period, that thread's utilization is its
 347 Runtime divided by its Period.
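.P
The utilization condition can be worked through as follows.
This sketch checks only the necessary condition stated above;
the kernel's actual admission test may reject task sets that pass it.
The function name passes_necessary_test is hypothetical.

```python
from fractions import Fraction


def passes_necessary_test(tasks, ncpus):
    """tasks: iterable of (runtime, period) pairs in the same time unit."""
    # Each thread's utilization is its Runtime divided by its Period.
    total = sum(Fraction(runtime, period) for runtime, period in tasks)
    return total <= ncpus


tasks = [(10, 100), (30, 100), (60, 100)]    # utilizations 0.1, 0.3, 0.6
print(passes_necessary_test(tasks, ncpus=1))
```

Exact fractions are used so that a total utilization of exactly one CPU
is not rejected because of floating-point rounding.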
 348 .P
 349 In order to fulfill the guarantees that are made when
 350 a thread is admitted to the
 351 .B SCHED_DEADLINE
 352 policy,
 353 .B SCHED_DEADLINE
 354 threads are the highest priority (user controllable) threads in the
 355 system; if any
 356 .B SCHED_DEADLINE
 357 thread is runnable,
 358 it will preempt any thread scheduled under one of the other policies.
 359 .P
 360 A call to
 361 .BR fork (2)
 362 by a thread scheduled under the
 363 .B SCHED_DEADLINE
 364 policy fails with the error
 365 .BR EAGAIN ,
 366 unless the thread has its reset-on-fork flag set (see below).
 367 .P
 368 A
 369 .B SCHED_DEADLINE
 370 thread that calls
 371 .BR sched_yield (2)
 372 will yield the current job and wait for a new period to begin.
 373 .\"
 374 .\" FIXME Calling sched_getparam() on a SCHED_DEADLINE thread
 375 .\" fails with EINVAL, but sched_getscheduler() succeeds.
 376 .\" Is that intended? (Why?)
 377 .\"
 378 .SS SCHED_OTHER: Default Linux time-sharing scheduling
  379 \fBSCHED_OTHER\fP can be used only at static priority 0
 380 (i.e., threads under real-time policies always have priority over
 381 .B SCHED_OTHER
 382 processes).
 383 \fBSCHED_OTHER\fP is the standard Linux time-sharing scheduler that is
 384 intended for all threads that do not require the special
 385 real-time mechanisms.
 386 .P
 387 The thread to run is chosen from the static
 388 priority 0 list based on a \fIdynamic\fP priority that is determined only
 389 inside this list.
 390 The dynamic priority is based on the nice value (see below)
  391 and is increased for each time quantum in which the thread is ready to run
  392 but is not selected to run by the scheduler.
 393 This ensures fair progress among all \fBSCHED_OTHER\fP threads.
 394 .P
 395 In the Linux kernel source code, the
 396 .B SCHED_OTHER
 397 policy is actually named
 398 .BR SCHED_NORMAL .
 399 .\"
 400 .SS The nice value
 401 The nice value is an attribute
 402 that can be used to influence the CPU scheduler to
 403 favor or disfavor a process in scheduling decisions.
 404 It affects the scheduling of
 405 .B SCHED_OTHER
 406 and
 407 .B SCHED_BATCH
 408 (see below) processes.
 409 The nice value can be modified using
 410 .BR nice (2),
 411 .BR setpriority (2),
 412 or
 413 .BR sched_setattr (2).
 414 .P
 415 According to POSIX.1, the nice value is a per-process attribute;
 416 that is, the threads in a process should share a nice value.
 417 However, on Linux, the nice value is a per-thread attribute:
 418 different threads in the same process may have different nice values.
 419 .P
 420 The range of the nice value
 421 varies across UNIX systems.
 422 On modern Linux, the range is \-20 (high priority) to +19 (low priority).
 423 On some other systems, the range is \-20..20.
 424 Very early Linux kernels (before Linux 2.0) had the range \-infinity..15.
 425 .\" Linux before 1.3.36 had \-infinity..15.
 426 .\" Since Linux 1.3.43, Linux has the range \-20..19.
 427 .P
 428 The degree to which the nice value affects the relative scheduling of
 429 .B SCHED_OTHER
 430 processes likewise varies across UNIX systems and
 431 across Linux kernel versions.
 432 .P
 433 With the advent of the CFS scheduler in Linux 2.6.23,
 434 Linux adopted an algorithm that causes
 435 relative differences in nice values to have a much stronger effect.
 436 In the current implementation, each unit of difference in the
 437 nice values of two processes results in a factor of 1.25
 438 in the degree to which the scheduler favors the higher priority process.
  439 This causes low-priority nice values (+19) to truly provide little CPU
  440 to a process whenever there is any other
  441 higher priority load on the system,
  442 and makes high-priority nice values (\-20) deliver most of the CPU to applications
 443 that require it (e.g., some audio applications).
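.P
As a worked example of the factor of 1.25,
the following sketch treats the factor as a multiplicative weight per nice
level and computes the resulting split between two always-runnable threads
competing for one CPU.
That weighting model is an assumption made for illustration,
and the function name cpu_share is hypothetical.

```python
def cpu_share(nice_a, nice_b, factor=1.25):
    """Approximate CPU shares of two always-runnable threads on one CPU."""
    # A lower nice value means a larger weight.
    weight_a = factor ** -nice_a
    weight_b = factor ** -nice_b
    total = weight_a + weight_b
    return weight_a / total, weight_b / total


for other in (1, 5, 19):
    share_a, share_b = cpu_share(0, other)
    print(f"nice 0 vs nice {other}: {share_a:.1%} / {share_b:.1%}")
```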
 444 .P
 445 On Linux, the
 446 .B RLIMIT_NICE
 447 resource limit can be used to define a limit to which
 448 an unprivileged process's nice value can be raised; see
 449 .BR setrlimit (2)
 450 for details.
 451 .P
 452 For further details on the nice value, see the subsections on
 453 the autogroup feature and group scheduling, below.
 454 .\"
 455 .SS SCHED_BATCH: Scheduling batch processes
 456 (Since Linux 2.6.16.)
 457 \fBSCHED_BATCH\fP can be used only at static priority 0.
 458 This policy is similar to \fBSCHED_OTHER\fP in that it schedules
 459 the thread according to its dynamic priority
 460 (based on the nice value).
 461 The difference is that this policy
 462 will cause the scheduler to always assume
 463 that the thread is CPU-intensive.
 464 Consequently, the scheduler will apply a small scheduling
 465 penalty with respect to wakeup behavior,
 466 so that this thread is mildly disfavored in scheduling decisions.
 467 .P
 468 .\" The following paragraph is drawn largely from the text that
 469 .\" accompanied Ingo Molnar's patch for the implementation of
 470 .\" SCHED_BATCH.
 471 .\" commit b0a9499c3dd50d333e2aedb7e894873c58da3785
 472 This policy is useful for workloads that are noninteractive,
 473 but do not want to lower their nice value,
 474 and for workloads that want a deterministic scheduling policy without
 475 interactivity causing extra preemptions (between the workload's tasks).
 476 .\"
 477 .SS SCHED_IDLE: Scheduling very low priority jobs
 478 (Since Linux 2.6.23.)
 479 \fBSCHED_IDLE\fP can be used only at static priority 0;
 480 the process nice value has no influence for this policy.
 481 .P
 482 This policy is intended for running jobs at extremely low
 483 priority (lower even than a +19 nice value with the
 484 .B SCHED_OTHER
 485 or
 486 .B SCHED_BATCH
 487 policies).
 488 .\"
 489 .SS Resetting scheduling policy for child processes
 490 Each thread has a reset-on-fork scheduling flag.
 491 When this flag is set, children created by
 492 .BR fork (2)
 493 do not inherit privileged scheduling policies.
 494 The reset-on-fork flag can be set by either:
 495 .IP \[bu] 3
 496 ORing the
 497 .B SCHED_RESET_ON_FORK
 498 flag into the
 499 .I policy
 500 argument when calling
 501 .BR sched_setscheduler (2)
 502 (since Linux 2.6.32);
 503 or
 504 .IP \[bu]
 505 specifying the
 506 .B SCHED_FLAG_RESET_ON_FORK
 507 flag in
 508 .I attr.sched_flags
 509 when calling
 510 .BR sched_setattr (2).
 511 .P
 512 Note that the constants used with these two APIs have different names.
 513 The state of the reset-on-fork flag can analogously be retrieved using
 514 .BR sched_getscheduler (2)
 515 and
 516 .BR sched_getattr (2).
 517 .P
 518 The reset-on-fork feature is intended for media-playback applications,
 519 and can be used to prevent applications evading the
 520 .B RLIMIT_RTTIME
 521 resource limit (see
 522 .BR getrlimit (2))
 523 by creating multiple child processes.
 524 .P
 525 More precisely, if the reset-on-fork flag is set,
 526 the following rules apply for subsequently created children:
 527 .IP \[bu] 3
 528 If the calling thread has a scheduling policy of
 529 .B SCHED_FIFO
 530 or
 531 .BR SCHED_RR ,
 532 the policy is reset to
 533 .B SCHED_OTHER
 534 in child processes.
 535 .IP \[bu]
 536 If the calling process has a negative nice value,
 537 the nice value is reset to zero in child processes.
 538 .P
 539 After the reset-on-fork flag has been enabled,
  540 it can be cleared only if the thread has the
 541 .B CAP_SYS_NICE
 542 capability.
 543 This flag is disabled in child processes created by
 544 .BR fork (2).
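.P
The reset-on-fork rules can be summarized in a small model.
This is a sketch of the rules as listed above, using policy names as
strings; the function name child_attributes is hypothetical.

```python
REALTIME = {"SCHED_FIFO", "SCHED_RR"}


def child_attributes(policy, nice, reset_on_fork):
    """Return the (policy, nice, reset_on_fork) a fork(2) child starts with."""
    if not reset_on_fork:
        return policy, nice, False   # everything is inherited unchanged
    if policy in REALTIME:
        policy = "SCHED_OTHER"       # real-time policy is not inherited
    if nice < 0:
        nice = 0                     # negative nice value is reset to zero
    return policy, nice, False       # the flag is disabled in the child


print(child_attributes("SCHED_FIFO", 0, reset_on_fork=True))
print(child_attributes("SCHED_OTHER", -5, reset_on_fork=True))
```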
 545 .\"
 546 .SS Privileges and resource limits
 547 Before Linux 2.6.12, only privileged
 548 .RB ( CAP_SYS_NICE )
 549 threads can set a nonzero static priority (i.e., set a real-time
 550 scheduling policy).
 551 The only change that an unprivileged thread can make is to set the
 552 .B SCHED_OTHER
 553 policy, and this can be done only if the effective user ID of the caller
 554 matches the real or effective user ID of the target thread
 555 (i.e., the thread specified by
 556 .IR pid )
 557 whose policy is being changed.
 558 .P
 559 A thread must be privileged
 560 .RB ( CAP_SYS_NICE )
 561 in order to set or modify a
 562 .B SCHED_DEADLINE
 563 policy.
 564 .P
 565 Since Linux 2.6.12, the
 566 .B RLIMIT_RTPRIO
 567 resource limit defines a ceiling on an unprivileged thread's
 568 static priority for the
 569 .B SCHED_RR
 570 and
 571 .B SCHED_FIFO
 572 policies.
 573 The rules for changing scheduling policy and priority are as follows:
 574 .IP \[bu] 3
 575 If an unprivileged thread has a nonzero
 576 .B RLIMIT_RTPRIO
 577 soft limit, then it can change its scheduling policy and priority,
 578 subject to the restriction that the priority cannot be set to a
 579 value higher than the maximum of its current priority and its
 580 .B RLIMIT_RTPRIO
 581 soft limit.
 582 .IP \[bu]
 583 If the
 584 .B RLIMIT_RTPRIO
 585 soft limit is 0, then the only permitted changes are to lower the priority,
 586 or to switch to a non-real-time policy.
 587 .IP \[bu]
 588 Subject to the same rules,
 589 another unprivileged thread can also make these changes,
 590 as long as the effective user ID of the thread making the change
 591 matches the real or effective user ID of the target thread.
 592 .IP \[bu]
 593 Special rules apply for the
 594 .B SCHED_IDLE
 595 policy.
 596 Before Linux 2.6.39,
 597 an unprivileged thread operating under this policy cannot
 598 change its policy, regardless of the value of its
 599 .B RLIMIT_RTPRIO
 600 resource limit.
 601 Since Linux 2.6.39,
 602 .\" commit c02aa73b1d18e43cfd79c2f193b225e84ca497c8
 603 an unprivileged thread can switch to either the
 604 .B SCHED_BATCH
 605 or the
 606 .B SCHED_OTHER
 607 policy so long as its nice value falls within the range permitted by its
 608 .B RLIMIT_NICE
 609 resource limit (see
 610 .BR getrlimit (2)).
 611 .P
 612 Privileged
 613 .RB ( CAP_SYS_NICE )
 614 threads ignore the
 615 .B RLIMIT_RTPRIO
 616 limit; as with older kernels,
 617 they can make arbitrary changes to scheduling policy and priority.
 618 See
 619 .BR getrlimit (2)
 620 for further information on
 621 .BR RLIMIT_RTPRIO .
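.P
The priority ceiling rules for the real-time policies can be sketched as a
predicate.
The sketch deliberately ignores the user ID matching and the
SCHED_IDLE special cases described above;
the function name may_set_rt_priority is hypothetical.

```python
def may_set_rt_priority(requested, current, rtprio_soft_limit,
                        privileged=False):
    """Apply the rules for SCHED_FIFO/SCHED_RR priority changes."""
    if privileged:                   # CAP_SYS_NICE ignores RLIMIT_RTPRIO
        return True
    if rtprio_soft_limit == 0:
        return requested <= current  # the priority may only be lowered
    # Ceiling: the maximum of the current priority and the soft limit.
    return requested <= max(current, rtprio_soft_limit)


print(may_set_rt_priority(10, current=0, rtprio_soft_limit=20))
print(may_set_rt_priority(30, current=0, rtprio_soft_limit=20))
```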
 622 .SS Limiting the CPU usage of real-time and deadline processes
 623 A nonblocking infinite loop in a thread scheduled under the
 624 .BR SCHED_FIFO ,
 625 .BR SCHED_RR ,
 626 or
 627 .B SCHED_DEADLINE
 628 policy can potentially block all other threads from accessing
 629 the CPU forever.
 630 Before Linux 2.6.25, the only way of preventing a runaway real-time
 631 process from freezing the system was to run (at the console)
 632 a shell scheduled under a higher static priority than the tested application.
 633 This allows an emergency kill of tested
 634 real-time applications that do not block or terminate as expected.
 635 .P
 636 Since Linux 2.6.25, there are other techniques for dealing with runaway
 637 real-time and deadline processes.
 638 One of these is to use the
 639 .B RLIMIT_RTTIME
 640 resource limit to set a ceiling on the CPU time that
 641 a real-time process may consume.
 642 See
 643 .BR getrlimit (2)
 644 for details.
 645 .P
 646 Since Linux 2.6.25, Linux also provides two
 647 .I /proc
 648 files that can be used to reserve a certain amount of CPU time
 649 to be used by non-real-time processes.
 650 Reserving CPU time in this fashion allows some CPU time to be
 651 allocated to (say) a root shell that can be used to kill a runaway process.
 652 Both of these files specify time values in microseconds:
 653 .TP
 654 .I /proc/sys/kernel/sched_rt_period_us
 655 This file specifies a scheduling period that is equivalent to
 656 100% CPU bandwidth.
 657 The value in this file can range from 1 to
 658 .BR INT_MAX ,
 659 giving an operating range of 1 microsecond to around 35 minutes.
 660 The default value in this file is 1,000,000 (1 second).
 661 .TP
 662 .I /proc/sys/kernel/sched_rt_runtime_us
 663 The value in this file specifies how much of the "period" time
 664 can be used by all real-time and deadline scheduled processes
 665 on the system.
 666 The value in this file can range from \-1 to
 667 .BR INT_MAX \-1.
 668 Specifying \-1 makes the run time the same as the period;
 669 that is, no CPU time is set aside for non-real-time processes
 670 (which was the behavior before Linux 2.6.25).
 671 The default value in this file is 950,000 (0.95 seconds),
 672 meaning that 5% of the CPU time is reserved for processes that
 673 don't run under a real-time or deadline scheduling policy.
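.P
The relationship between the two files can be worked through in a few lines.
This sketch assumes Python 3;
the function names and the procdir parameter are hypothetical,
and reading the files requires a Linux system that provides them.

```python
def reserved_fraction(period_us, runtime_us):
    """Fraction of CPU time left for non-real-time processes."""
    if runtime_us == -1:             # -1: the run time equals the period
        return 0.0
    return 1.0 - runtime_us / period_us


def read_rt_bandwidth(procdir="/proc/sys/kernel"):
    """Read both files and return (period_us, runtime_us)."""
    values = []
    for name in ("sched_rt_period_us", "sched_rt_runtime_us"):
        with open(f"{procdir}/{name}") as f:
            values.append(int(f.read()))
    return tuple(values)


# Default values from the text: 1,000,000 and 950,000 microseconds.
print(f"{reserved_fraction(1_000_000, 950_000):.0%}")
```

On a running system, reserved_fraction(*read_rt_bandwidth()) would report
the fraction currently in effect.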
 674 .SS Response time
 675 A blocked high priority thread waiting for I/O has a certain
 676 response time before it is scheduled again.
 677 The device driver writer
 678 can greatly reduce this response time by using a "slow interrupt"
 679 interrupt handler.
 680 .\" as described in
 681 .\" .BR request_irq (9).
 682 .SS Miscellaneous
 683 Child processes inherit the scheduling policy and parameters across a
 684 .BR fork (2).
 685 The scheduling policy and parameters are preserved across
 686 .BR execve (2).
 687 .P
 688 Memory locking is usually needed for real-time processes to avoid
 689 paging delays; this can be done with
 690 .BR mlock (2)
 691 or
 692 .BR mlockall (2).
 693 .\"
 694 .SS The autogroup feature
 695 .\" commit 5091faa449ee0b7d73bc296a93bca9540fc51d0a
 696 Since Linux 2.6.38,
 697 the kernel provides a feature known as autogrouping to improve interactive
 698 desktop performance in the face of multiprocess, CPU-intensive
 699 workloads such as building the Linux kernel with large numbers of
 700 parallel build processes (i.e., the
 701 .BR make (1)
 702 .B \-j
 703 flag).
 704 .P
 705 This feature operates in conjunction with the
 706 CFS scheduler and requires a kernel that is configured with
 707 .BR CONFIG_SCHED_AUTOGROUP .
 708 On a running system, this feature is enabled or disabled via the file
 709 .IR /proc/sys/kernel/sched_autogroup_enabled ;
 710 a value of 0 disables the feature, while a value of 1 enables it.
 711 The default value in this file is 1, unless the kernel was booted with the
 712 .I noautogroup
 713 parameter.
 714 .P
 715 A new autogroup is created when a new session is created via
 716 .BR setsid (2);
 717 this happens, for example, when a new terminal window is started.
 718 A new process created by
 719 .BR fork (2)
 720 inherits its parent's autogroup membership.
 721 Thus, all of the processes in a session are members of the same autogroup.
 722 An autogroup is automatically destroyed when the last process
 723 in the group terminates.
 724 .P
 725 When autogrouping is enabled, all of the members of an autogroup
 726 are placed in the same kernel scheduler "task group".
 727 The CFS scheduler employs an algorithm that equalizes the
 728 distribution of CPU cycles across task groups.
 729 The benefits of this for interactive desktop performance
 730 can be described via the following example.
 731 .P
 732 Suppose that there are two autogroups competing for the same CPU
 733 (i.e., presume either a single CPU system or the use of
 734 .BR taskset (1)
 735 to confine all the processes to the same CPU on an SMP system).
 736 The first group contains ten CPU-bound processes from
 737 a kernel build started with
 738 .IR "make\~\-j10" .
 739 The other contains a single CPU-bound process: a video player.
 740 The effect of autogrouping is that the two groups will
 741 each receive half of the CPU cycles.
 742 That is, the video player will receive 50% of the CPU cycles,
 743 rather than just 9% of the cycles,
 744 which would likely lead to degraded video playback.
 745 The situation on an SMP system is more complex,
 746 .\" Mike Galbraith, 25 Nov 2016:
 747 .\"     I'd say something more wishy-washy here, like cycles are
 748 .\"     distributed fairly across groups and leave it at that, as your
 749 .\"     detailed example is incorrect due to SMP fairness (which I don't
 750 .\"     like much because [very unlikely] worst case scenario
 751 .\"     renders a box sized group incapable of utilizing more that
 752 .\"     a single CPU total).  For example, if a group of NR_CPUS
 753 .\"     size competes with a singleton, load balancing will try to give
 754 .\"     the singleton a full CPU of its very own.  If groups intersect for
 755 .\"     whatever reason on say my quad lappy, distribution is 80/20 in
 756 .\"     favor of the singleton.
 757 but the general effect is the same:
 758 the scheduler distributes CPU cycles across task groups such that
 759 an autogroup that contains a large number of CPU-bound processes
 760 does not end up hogging CPU cycles at the expense of the other
 761 jobs on the system.
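.P
The arithmetic behind the single-CPU example can be checked with a short
sketch.
It assumes that every process is CPU-bound and that all nice values are
equal; the function names are hypothetical.

```python
def per_process_share(group_sizes):
    """Share of one CPU per process when groups split the CPU equally."""
    group_share = 1.0 / len(group_sizes)
    return [group_share / size for size in group_sizes]


def flat_share(group_sizes):
    """Share per process when all processes compete without grouping."""
    return 1.0 / sum(group_sizes)


groups = [10, 1]     # ten build processes, one video player
build, player = per_process_share(groups)
print(f"with autogrouping:    player {player:.0%}, each build job {build:.0%}")
print(f"without autogrouping: every process {flat_share(groups):.0%}")
```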
 762 .P
 763 A process's autogroup (task group) membership can be viewed via the file
 764 .IR /proc/ pid /autogroup :
 765 .P
 766 .in +4n
 767 .EX
 768 $ \fBcat /proc/1/autogroup\fP
 769 /autogroup\-1 nice 0
 770 .EE
 771 .in
 772 .P
 773 This file can also be used to modify the CPU bandwidth allocated
 774 to an autogroup.
 775 This is done by writing a number in the "nice" range to the file
 776 to set the autogroup's nice value.
 777 The allowed range is from +19 (low priority) to \-20 (high priority).
 778 (Writing values outside of this range causes
 779 .BR write (2)
 780 to fail with the error
 781 .BR EINVAL .)
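.P
A small sketch of parsing the contents of this file,
based on the output format shown in the example above.
The function name parse_autogroup is hypothetical.

```python
def parse_autogroup(text):
    """Parse a line such as '/autogroup-1 nice 0' into (group_id, nice)."""
    name, keyword, nice = text.split()
    prefix = "/autogroup-"
    if not name.startswith(prefix) or keyword != "nice":
        raise ValueError(f"unexpected format: {text!r}")
    return int(name[len(prefix):]), int(nice)


print(parse_autogroup("/autogroup-1 nice 0\n"))
```

On a kernel with autogrouping configured, the input could be read from
/proc/self/autogroup.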
 782 .\" FIXME .
 783 .\" Because of a bug introduced in Linux 4.7
 784 .\" (commit 2159197d66770ec01f75c93fb11dc66df81fd45b made changes
 785 .\" that exposed the fact that autogroup didn't call scale_load()),
 786 .\" it happened that *all* values in this range caused a task group
 787 .\" to be further disfavored by the scheduler, with \-20 resulting
 788 .\" in the scheduler mildly disfavoring the task group and +19 greatly
 789 .\" disfavoring it.
 790 .\"
 791 .\" A patch was posted on 23 Nov 2016
 792 .\" ("sched/autogroup: Fix 64bit kernel nice adjustment";
 793 .\" check later to see in which kernel version it lands.
 794 .P
 795 The autogroup nice setting has the same meaning as the process nice value,
 796 but applies to distribution of CPU cycles to the autogroup as a whole,
 797 based on the relative nice values of other autogroups.
 798 For a process inside an autogroup, the CPU cycles that it receives
 799 will be a product of the autogroup's nice value
 800 (compared to other autogroups)
 801 and the process's nice value
  802 (compared to other processes in the same autogroup).
 803 .P
 804 The use of the
 805 .BR cgroups (7)
 806 CPU controller to place processes in cgroups other than the
 807 root CPU cgroup overrides the effect of autogrouping.
 808 .P
 809 The autogroup feature groups only processes scheduled under
 810 non-real-time policies
 811 .RB ( SCHED_OTHER ,
 812 .BR SCHED_BATCH ,
 813 and
 814 .BR SCHED_IDLE ).
 815 It does not group processes scheduled under real-time and
 816 deadline policies.
 817 Those processes are scheduled according to the rules described earlier.
 818 .\"
 819 .SS The nice value and group scheduling
 820 When scheduling non-real-time processes (i.e., those scheduled under the
 821 .BR SCHED_OTHER ,
 822 .BR SCHED_BATCH ,
 823 and
 824 .B SCHED_IDLE
 825 policies), the CFS scheduler employs a technique known as "group scheduling",
 826 if the kernel was configured with the
 827 .B CONFIG_FAIR_GROUP_SCHED
 828 option (which is typical).
 829 .P
 830 Under group scheduling, threads are scheduled in "task groups".
 831 Task groups have a hierarchical relationship,
 832 rooted under the initial task group on the system,
 833 known as the "root task group".
 834 Task groups are formed in the following circumstances:
 835 .IP \[bu] 3
 836 All of the threads in a CPU cgroup form a task group.
 837 The parent of this task group is the task group of the
 838 corresponding parent cgroup.
 839 .IP \[bu]
 840 If autogrouping is enabled,
 841 then all of the threads that are (implicitly) placed in an autogroup
 842 (i.e., the same session, as created by
 843 .BR setsid (2))
 844 form a task group.
 845 Each new autogroup is thus a separate task group.
 846 The root task group is the parent of all such autogroups.
 847 .IP \[bu]
 848 If autogrouping is enabled, then the root task group consists of
 849 all processes in the root CPU cgroup that were not
 850 otherwise implicitly placed into a new autogroup.
 851 .IP \[bu]
 852 If autogrouping is disabled, then the root task group consists of
 853 all processes in the root CPU cgroup.
 854 .IP \[bu]
 855 If group scheduling was disabled (i.e., the kernel was configured without
 856 .BR CONFIG_FAIR_GROUP_SCHED ),
 857 then all of the processes on the system are notionally placed
 858 in a single task group.
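.P
A minimal sketch of how these cases might be told apart for the current shell,
assuming that
.I /proc
is mounted and that the kernel was built with autogroup and cgroup support
(the output varies from system to system and is omitted here):
.P
.in +4n
.EX
$ \fBcat /proc/sys/kernel/sched_autogroup_enabled\fP
$ \fBcat /proc/self/autogroup\fP
$ \fBcat /proc/self/cgroup\fP
.EE
.in
.P
The first file contains 1 if autogrouping is enabled and 0 if it is disabled;
the second names the autogroup of the caller;
and the third lists the caller's cgroup memberships, as described in
.BR cgroups (7).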
 859 .P
 860 Under group scheduling,
 861 a thread's nice value has an effect for scheduling decisions
 862 .IR "only relative to other threads in the same task group" .
 863 This has some surprising consequences in terms of the traditional semantics
 864 of the nice value on UNIX systems.
 865 In particular, if autogrouping
 866 is enabled (which is the default in various distributions), then employing
 867 .BR setpriority (2)
 868 or
 869 .BR nice (1)
 870 on a process has an effect only for scheduling relative
 871 to other processes executed in the same session
 872 (typically: the same terminal window).
 873 .P
 874 Conversely, for two processes that are (for example)
 875 the sole CPU-bound processes in different sessions
 876 (e.g., different terminal windows,
 877 each of whose jobs are tied to different autogroups),
 878 .I modifying the nice value of the process in one of the sessions
 879 .I has no effect
 880 in terms of the scheduler's decisions relative to the
 881 process in the other session.
 882 .\" More succinctly: the nice(1) command is in many cases a no-op since
 883 .\" Linux 2.6.38.
 884 .\"
 885 A possibly useful workaround here is to use a command such as
 886 the following to modify the autogroup nice value for
 887 .I all
 888 of the processes in a terminal session:
 889 .P
 890 .in +4n
 891 .EX
 892 $ \fBecho 10 > /proc/self/autogroup\fP
 893 .EE
 894 .in
 895 .SS Real-time features in the mainline Linux kernel
 896 .\" FIXME . Probably this text will need some minor tweaking
 897 .\" ask Carsten Emde about this.
Since Linux 2.6.18, the kernel has gradually been
gaining real-time capabilities,
 900 most of which are derived from the former
 901 .I realtime\-preempt
 902 patch set.
 903 Until the patches have been completely merged into the
 904 mainline kernel,
 905 they must be installed to achieve the best real-time performance.
 906 These patches are named:
 907 .P
 908 .in +4n
 909 .EX
 910 patch\-\fIkernelversion\fP\-rt\fIpatchversion\fP
 911 .EE
 912 .in
 913 .P
 914 and can be downloaded from
 915 .UR http://www.kernel.org\:/pub\:/linux\:/kernel\:/projects\:/rt/
 916 .UE .
 917 .P
 918 Without the patches and prior to their full inclusion into the mainline
 919 kernel, the kernel configuration offers only the three preemption classes
 920 .BR CONFIG_PREEMPT_NONE ,
 921 .BR CONFIG_PREEMPT_VOLUNTARY ,
 922 and
.BR CONFIG_PREEMPT_DESKTOP ,
 924 which respectively provide no, some, and considerable
 925 reduction of the worst-case scheduling latency.
 926 .P
 927 With the patches applied or after their full inclusion into the mainline
 928 kernel, the additional configuration item
 929 .B CONFIG_PREEMPT_RT
 930 becomes available.
 931 If this is selected, Linux is transformed into a regular
 932 real-time operating system.
 933 The FIFO and RR scheduling policies are then used to run a thread
 934 with true real-time priority and a minimum worst-case scheduling latency.
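.P
As an illustration
(a sketch that assumes the
.BR chrt (1)
utility, sufficient privilege to use real-time policies,
and a hypothetical program named
.IR ./rt_app ),
a program can be started under the
.B SCHED_FIFO
policy with priority 50, and its policy and priority then displayed,
as follows:
.P
.in +4n
.EX
# \fBchrt \-\-fifo 50 ./rt_app &\fP
# \fBchrt \-p $!\fP
.EE
.in
.P
These commands work with or without
.BR CONFIG_PREEMPT_RT ;
the option affects the worst-case latency, not the interface.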
 935 .SH NOTES
 936 The
 937 .BR cgroups (7)
 938 CPU controller can be used to limit the CPU consumption of
 939 groups of processes.
 940 .P
Originally, standard Linux was intended as a general-purpose operating
system able to handle background processes, interactive
applications, and less demanding real-time applications (applications that
usually need to meet timing deadlines).
Although Linux 2.6
allowed for kernel preemption, and the newly introduced O(1) scheduler
ensured that the time needed to schedule was fixed and deterministic
irrespective of the number of active tasks, true real-time computing
was not possible up to Linux 2.6.17.
 950 .SH SEE ALSO
 951 .ad l
 952 .nh
 953 .BR chcpu (1),
 954 .BR chrt (1),
 955 .BR lscpu (1),
 956 .BR ps (1),
 957 .BR taskset (1),
 958 .BR top (1),
 959 .BR getpriority (2),
 960 .BR mlock (2),
 961 .BR mlockall (2),
 962 .BR munlock (2),
 963 .BR munlockall (2),
 964 .BR nice (2),
 965 .BR sched_get_priority_max (2),
 966 .BR sched_get_priority_min (2),
 967 .BR sched_getaffinity (2),
 968 .BR sched_getparam (2),
 969 .BR sched_getscheduler (2),
 970 .BR sched_rr_get_interval (2),
 971 .BR sched_setaffinity (2),
 972 .BR sched_setparam (2),
 973 .BR sched_setscheduler (2),
 974 .BR sched_yield (2),
 975 .BR setpriority (2),
 976 .BR pthread_getaffinity_np (3),
 977 .BR pthread_getschedparam (3),
 978 .BR pthread_setaffinity_np (3),
 979 .BR sched_getcpu (3),
 980 .BR capabilities (7),
 981 .BR cpuset (7)
 982 .ad
 983 .P
 984 .I Programming for the real world \- POSIX.4
 985 by Bill O.\& Gallmeister, O'Reilly & Associates, Inc., ISBN 1-56592-074-0.
 986 .P
 987 The Linux kernel source files
 988 .IR \%Documentation/\:scheduler/\:sched\-deadline\:.txt ,
 989 .IR \%Documentation/\:scheduler/\:sched\-rt\-group\:.txt ,
 990 .IR \%Documentation/\:scheduler/\:sched\-design\-CFS\:.txt ,
 991 and
 992 .I \%Documentation/\:scheduler/\:sched\-nice\-design\:.txt