Documentation/DocBook/kernel-locking.tmpl

   1 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V3.1//EN"[]>
   2
   3 <book id="LKLockingGuide">
   4  <bookinfo>
   5   <title>Unreliable Guide To Locking</title>
   6
   7   <authorgroup>
   8    <author>
   9     <firstname>Paul</firstname>
  10     <othername>Rusty</othername>
  11     <surname>Russell</surname>
  12     <affiliation>
  13      <address>
  14       <email>rusty@linuxcare.com</email>
  15      </address>
  16     </affiliation>
  17    </author>
  18   </authorgroup>
  19
  20   <copyright>
  21    <year>2000</year>
  22    <holder>Paul Russell</holder>
  23   </copyright>
  24
  25   <legalnotice>
  26    <para>
  27      This documentation is free software; you can redistribute
  28      it and/or modify it under the terms of the GNU General Public
  29      License as published by the Free Software Foundation; either
  30      version 2 of the License, or (at your option) any later
  31      version.
  32    </para>
  33
  34    <para>
  35      This program is distributed in the hope that it will be
  36      useful, but WITHOUT ANY WARRANTY; without even the implied
  37      warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
  38      See the GNU General Public License for more details.
  39    </para>
  40
  41    <para>
  42      You should have received a copy of the GNU General Public
  43      License along with this program; if not, write to the Free
  44      Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
  45      MA 02111-1307 USA
  46    </para>
  47
  48    <para>
  49      For more details see the file COPYING in the source
  50      distribution of Linux.
  51    </para>
  52   </legalnotice>
  53  </bookinfo>
  54
  55  <toc></toc>
  56   <chapter id="intro">
  57    <title>Introduction</title>
  58    <para>
  59      Welcome, to Rusty's Remarkably Unreliable Guide to Kernel
  60      Locking issues.  This document describes the locking systems in
  61      the Linux Kernel as we approach 2.4.
  62    </para>
  63    <para>
  64      It looks like <firstterm linkend="gloss-smp"><acronym>SMP</acronym>
  65      </firstterm> is here to stay; so everyone hacking on the kernel
  66      these days needs to know the fundamentals of concurrency and locking
  67      for SMP.
  68    </para>
  69
  70    <sect1 id="races">
  71     <title>The Problem With Concurrency</title>
  72     <para>
  73       (Skip this if you know what a Race Condition is).
  74     </para>
  75     <para>
  76       In a normal program, you can increment a counter like so:
  77     </para>
  78     <programlisting>
  79       very_important_count++;
  80     </programlisting>
  81
  82     <para>
  83       This is what they would expect to happen:
  84     </para>
  85
  86     <table>
  87      <title>Expected Results</title>
  88
  89      <tgroup cols=2 align=left>
  90
  91       <thead>
  92        <row>
  93         <entry>Instance 1</entry>
  94         <entry>Instance 2</entry>
  95        </row>
  96       </thead>
  97
  98       <tbody>
  99        <row>
 100         <entry>read very_important_count (5)</entry>
 101         <entry></entry>
 102        </row>
 103        <row>
 104         <entry>add 1 (6)</entry>
 105         <entry></entry>
 106        </row>
 107        <row>
 108         <entry>write very_important_count (6)</entry>
 109         <entry></entry>
 110        </row>
 111        <row>
 112         <entry></entry>
 113         <entry>read very_important_count (6)</entry>
 114        </row>
 115        <row>
 116         <entry></entry>
 117         <entry>add 1 (7)</entry>
 118        </row>
 119        <row>
 120         <entry></entry>
 121         <entry>write very_important_count (7)</entry>
 122        </row>
 123       </tbody>
 124
 125      </tgroup>
 126     </table>
 127
 128     <para>
 129      This is what might happen:
 130     </para>
 131
 132     <table>
 133      <title>Possible Results</title>
 134
 135      <tgroup cols=2 align=left>
 136       <thead>
 137        <row>
 138         <entry>Instance 1</entry>
 139         <entry>Instance 2</entry>
 140        </row>
 141       </thead>
 142
 143       <tbody>
 144        <row>
 145         <entry>read very_important_count (5)</entry>
 146         <entry></entry>
 147        </row>
 148        <row>
 149         <entry></entry>
 150         <entry>read very_important_count (5)</entry>
 151        </row>
 152        <row>
 153         <entry>add 1 (6)</entry>
 154         <entry></entry>
 155        </row>
 156        <row>
 157         <entry></entry>
 158         <entry>add 1 (6)</entry>
 159        </row>
 160        <row>
 161         <entry>write very_important_count (6)</entry>
 162         <entry></entry>
 163        </row>
 164        <row>
 165         <entry></entry>
 166         <entry>write very_important_count (6)</entry>
 167        </row>
 168       </tbody>
 169      </tgroup>
 170     </table>
 171
 172     <para>
 173       This overlap, where what actually happens depends on the
 174       relative timing of multiple tasks, is called a race condition.
 175       The piece of code containing the concurrency issue is called a
 176       critical region.  And especially since Linux starting running
 177       on SMP machines, they became one of the major issues in kernel
 178       design and implementation.
 179     </para>
 180     <para>
 181       The solution is to recognize when these simultaneous accesses
 182       occur, and use locks to make sure that only one instance can
 183       enter the critical region at any time.  There are many
 184       friendly primitives in the Linux kernel to help you do this.
 185       And then there are the unfriendly primitives, but I'll pretend
 186       they don't exist.
 187     </para>
 188    </sect1>
 189   </chapter>
 190
 191   <chapter id="locks">
 192    <title>Two Main Types of Kernel Locks: Spinlocks and Semaphores</title>
 193
 194    <para>
 195      There are two main types of kernel locks.  The fundamental type
 196      is the spinlock
 197      (<filename class=headerfile>include/asm/spinlock.h</filename>),
 198      which is a very simple single-holder lock: if you can't get the
 199      spinlock, you keep trying (spinning) until you can.  Spinlocks are
 200      very small and fast, and can be used anywhere.
 201    </para>
 202    <para>
 203      The second type is a semaphore
 204      (<filename class=headerfile>include/asm/semaphore.h</filename>): it
 205      can have more than one holder at any time (the number decided at
 206      initialization time), although it is most commonly used as a
 207      single-holder lock (a mutex).  If you can't get a semaphore,
 208      your task will put itself on the queue, and be woken up when the
 209      semaphore is released.  This means the CPU will do something
 210      else while you are waiting, but there are many cases when you
 211      simply can't sleep (see <xref linkend="sleeping-things">), and so
 212      have to use a spinlock instead.
 213    </para>
 214    <para>
 215      Neither type of lock is recursive: see
 216      <xref linkend="techniques-deadlocks">.
 217    </para>
 218
 219    <sect1 id="uniprocessor">
 220     <title>Locks and Uniprocessor Kernels</title>
 221
 222     <para>
 223       For kernels compiled without <symbol>CONFIG_SMP</symbol>, spinlocks
 224       do not exist at all.  This is an excellent design decision: when
 225       no-one else can run at the same time, there is no reason to
 226       have a lock at all.
 227     </para>
 228
 229     <para>
 230       You should always test your locking code with <symbol>CONFIG_SMP</symbol>
 231       enabled, even if you don't have an SMP test box, because it
 232       will still catch some (simple) kinds of deadlock.
 233     </para>
 234
 235     <para>
 236       Semaphores still exist, because they are required for
 237       synchronization between <firstterm linkend="gloss-usercontext">user
 238       contexts</firstterm>, as we will see below.
 239     </para>
 240    </sect1>
 241
 242    <sect1 id="rwlocks">
 243     <title>Read/Write Lock Variants</title>
 244
 245     <para>
 246       Both spinlocks and semaphores have read/write variants:
 247       <type>rwlock_t</type> and <structname>struct rw_semaphore</structname>.
 248       These divide users into two classes: the readers and the writers.  If
 249       you are only reading the data, you can get a read lock, but to write to
 250       the data you need the write lock.  Many people can hold a read lock,
 251       but a writer must be sole holder.
 252     </para>
 253
 254     <para>
 255       This means much smoother locking if your code divides up
 256       neatly along reader/writer lines.  All the discussions below
 257       also apply to read/write variants.
 258     </para>
 259    </sect1>
 260
 261     <sect1 id="usercontextlocking">
 262      <title>Locking Only In User Context</title>
 263
 264      <para>
 265        If you have a data structure which is only ever accessed from
 266        user context, then you can use a simple semaphore
 267        (<filename>linux/asm/semaphore.h</filename>) to protect it.  This
 268        is the most trivial case: you initialize the semaphore to the number
 269        of resources available (usually 1), and call
 270        <function>down_interruptible()</function> to grab the semaphore, and
 271        <function>up()</function> to release it.  There is also a
 272        <function>down()</function>, which should be avoided, because it
 273        will not return if a signal is received.
 274      </para>
 275
 276      <para>
 277        Example: <filename>linux/net/core/netfilter.c</filename> allows
 278        registration of new <function>setsockopt()</function> and
 279        <function>getsockopt()</function> calls, with
 280        <function>nf_register_sockopt()</function>.  Registration and
 281        de-registration are only done on module load and unload (and boot
 282        time, where there is no concurrency), and the list of registrations
 283        is only consulted for an unknown <function>setsockopt()</function>
 284        or <function>getsockopt()</function> system call.  The
 285        <varname>nf_sockopt_mutex</varname> is perfect to protect this,
 286        especially since the setsockopt and getsockopt calls may well
 287        sleep.
 288      </para>
 289    </sect1>
 290
 291    <sect1 id="lock-user-bh">
 292     <title>Locking Between User Context and BHs</title>
 293
 294     <para>
 295       If a <firstterm linkend="gloss-bh">bottom half</firstterm> shares
 296       data with user context, you have two problems.  Firstly, the current
 297       user context can be interrupted by a bottom half, and secondly, the
 298       critical region could be entered from another CPU.  This is where
 299       <function>spin_lock_bh()</function>
 300       (<filename class=headerfile>include/linux/spinlock.h</filename>) is
 301       used.  It disables bottom halves on that CPU, then grabs the lock.
 302       <function>spin_unlock_bh()</function> does the reverse.
 303     </para>
 304
 305     <para>
 306       This works perfectly for <firstterm linkend="gloss-up"><acronym>UP
 307       </acronym></firstterm> as well: the spin lock vanishes, and this macro
 308       simply becomes <function>local_bh_disable()</function>
 309       (<filename class=headerfile>include/asm/softirq.h</filename>), which
 310       protects you from the bottom half being run.
 311     </para>
 312    </sect1>
 313
 314    <sect1 id="lock-user-tasklet">
 315     <title>Locking Between User Context and Tasklets/Soft IRQs</title>
 316
 317     <para>
 318       This is exactly the same as above, because
 319       <function>local_bh_disable()</function> actually disables all
 320       softirqs and <firstterm linkend="gloss-tasklet">tasklets</firstterm>
 321       on that CPU as well.  It should really be
 322       called `local_softirq_disable()', but the name has been preserved
 323       for historical reasons.  Similarly,
 324       <function>spin_lock_bh()</function> would now be called
 325       spin_lock_softirq() in a perfect world.
 326     </para>
 327    </sect1>
 328
 329    <sect1 id="lock-bh">
 330     <title>Locking Between Bottom Halves</title>
 331
 332     <para>
 333       Sometimes a bottom half might want to share data with
 334       another bottom half (especially remember that timers are run
 335       off a bottom half).
 336     </para>
 337
 338     <sect2 id="lock-bh-same">
 339      <title>The Same BH</title>
 340
 341      <para>
 342        Since a bottom half is never run on two CPUs at once, you
 343        don't need to worry about your bottom half being run twice
 344        at once, even on SMP.
 345      </para>
 346     </sect2>
 347
 348     <sect2 id="lock-bh-different">
 349      <title>Different BHs</title>
 350
 351      <para>
 352        Since only one bottom half ever runs at a time once, you
 353        don't need to worry about race conditions with other bottom
 354        halves.  Beware that things might change under you, however,
 355        if someone changes your bottom half to a tasklet.  If you
 356        want to make your code future-proof, pretend you're already
 357        running from a tasklet (see below), and doing the extra
 358        locking.  Of course, if it's five years before that happens,
 359        you're gonna look like a damn fool.
 360      </para>
 361     </sect2>
 362    </sect1>
 363
 364    <sect1 id="lock-tasklets">
 365     <title>Locking Between Tasklets</title>
 366
 367     <para>
 368       Sometimes a tasklet might want to share data with another
 369       tasklet, or a bottom half.
 370     </para>
 371
 372     <sect2 id="lock-tasklets-same">
 373      <title>The Same Tasklet</title>
 374      <para>
 375        Since a tasklet is never run on two CPUs at once, you don't
 376        need to worry about your tasklet being reentrant (running
 377        twice at once), even on SMP.
 378      </para>
 379     </sect2>
 380
 381     <sect2 id="lock-tasklets-different">
 382      <title>Different Tasklets</title>
 383      <para>
 384        If another tasklet (or bottom half, such as a timer) wants
 385        to share data with your tasklet, you will both need to use
 386        <function>spin_lock()</function> and
 387        <function>spin_unlock()</function> calls.
 388        <function>spin_lock_bh()</function> is
 389        unnecessary here, as you are already in a a tasklet, and
 390        none will be run on the same CPU.
 391      </para>
 392     </sect2>
 393    </sect1>
 394
 395    <sect1 id="lock-softirqs">
 396     <title>Locking Between Softirqs</title>
 397
 398     <para>
 399       Often a <firstterm linkend="gloss-softirq">softirq</firstterm> might
 400       want to share data with itself, a tasklet, or a bottom half.
 401     </para>
 402
 403     <sect2 id="lock-softirqs-same">
 404      <title>The Same Softirq</title>
 405
 406      <para>
 407        The same softirq can run on the other CPUs: you can use a
 408        per-CPU array (see <xref linkend="per-cpu">) for better
 409        performance.  If you're going so far as to use a softirq,
 410        you probably care about scalable performance enough
 411        to justify the extra complexity.
 412      </para>
 413
 414      <para>
 415        You'll need to use <function>spin_lock()</function> and
 416        <function>spin_unlock()</function> for shared data.
 417      </para>
 418     </sect2>
 419
 420     <sect2 id="lock-softirqs-different">
 421      <title>Different Softirqs</title>
 422
 423      <para>
 424        You'll need to use <function>spin_lock()</function> and
 425        <function>spin_unlock()</function> for shared data, whether it
 426        be a timer (which can be running on a different CPU), bottom half,
 427        tasklet or the same or another softirq.
 428      </para>
 429     </sect2>
 430    </sect1>
 431   </chapter>
 432
 433   <chapter id="hardirq-context">
 434    <title>Hard IRQ Context</title>
 435
 436    <para>
 437      Hardware interrupts usually communicate with a bottom half,
 438      tasklet or softirq.  Frequently this involves putting work in a
 439      queue, which the BH/softirq will take out.
 440    </para>
 441
 442    <sect1 id="hardirq-softirq">
 443     <title>Locking Between Hard IRQ and Softirqs/Tasklets/BHs</title>
 444
 445     <para>
 446       If a hardware irq handler shares data with a softirq, you have
 447       two concerns.  Firstly, the softirq processing can be
 448       interrupted by a hardware interrupt, and secondly, the
 449       critical region could be entered by a hardware interrupt on
 450       another CPU.  This is where <function>spin_lock_irq()</function> is
 451       used.  It is defined to disable interrupts on that cpu, then grab
 452       the lock. <function>spin_unlock_irq()</function> does the reverse.
 453     </para>
 454
 455     <para>
 456       This works perfectly for UP as well: the spin lock vanishes,
 457       and this macro simply becomes <function>local_irq_disable()</function>
 458       (<filename class=headerfile>include/asm/smp.h</filename>), which
 459       protects you from the softirq/tasklet/BH being run.
 460     </para>
 461
 462     <para>
 463       <function>spin_lock_irqsave()</function>
 464       (<filename>include/linux/spinlock.h</filename>) is a variant
 465       which saves whether interrupts were on or off in a flags word,
 466       which is passed to <function>spin_lock_irqrestore()</function>.  This
 467       means that the same code can be used inside an hard irq handler (where
 468       interrupts are already off) and in softirqs (where the irq
 469       disabling is required).
 470     </para>
 471    </sect1>
 472   </chapter>
 473
 474   <chapter id="common-techniques">
 475    <title>Common Techniques</title>
 476
 477    <para>
 478      This section lists some common dilemmas and the standard
 479      solutions used in the Linux kernel code.  If you use these,
 480      people will find your code simpler to understand.
 481    </para>
 482
 483    <para>
 484      If I could give you one piece of advice: never sleep with anyone
 485      crazier than yourself.  But if I had to give you advice on
 486      locking: <emphasis>keep it simple</emphasis>.
 487    </para>
 488
 489    <para>
 490      Lock data, not code.
 491    </para>
 492
 493    <para>
 494      Be reluctant to introduce new locks.
 495    </para>
 496
 497    <para>
 498      Strangely enough, this is the exact reverse of my advice when
 499      you <emphasis>have</emphasis> slept with someone crazier than yourself.
 500    </para>
 501
 502    <sect1 id="techniques-no-writers">
 503     <title>No Writers in Interrupt Context</title>
 504
 505     <para>
 506       There is a fairly common case where an interrupt handler needs
 507       access to a critical region, but does not need write access.
 508       In this case, you do not need to use
 509       <function>read_lock_irq()</function>, but only
 510       <function>read_lock()</function> everywhere (since if an interrupt
 511       occurs, the irq handler will only try to grab a read lock, which
 512       won't deadlock).  You will still need to use
 513       <function>write_lock_irq()</function>.
 514     </para>
 515
 516     <para>
 517       Similar logic applies to locking between softirqs/tasklets/BHs
 518       which never need a write lock, and user context:
 519       <function>read_lock()</function> and
 520       <function>write_lock_bh()</function>.
 521     </para>
 522    </sect1>
 523
 524    <sect1 id="techniques-deadlocks">
 525     <title>Deadlock: Simple and Advanced</title>
 526
 527     <para>
 528       There is a coding bug where a piece of code tries to grab a
 529       spinlock twice: it will spin forever, waiting for the lock to
 530       be released (spinlocks, rwlocks and semaphores are not
 531       recursive in Linux).  This is trivial to diagnose: not a
 532       stay-up-five-nights-talk-to-fluffy-code-bunnies kind of
 533       problem.
 534     </para>
 535
 536     <para>
 537       For a slightly more complex case, imagine you have a region
 538       shared by a bottom half and user context.  If you use a
 539       <function>spin_lock()</function> call to protect it, it is
 540       possible that the user context will be interrupted by the bottom
 541       half while it holds the lock, and the bottom half will then spin
 542       forever trying to get the same lock.
 543     </para>
 544
 545     <para>
 546       Both of these are called deadlock, and as shown above, it can
 547       occur even with a single CPU (although not on UP compiles,
 548       since spinlocks vanish on kernel compiles with
 549       <symbol>CONFIG_SMP</symbol>=n. You'll still get data corruption
 550       in the second example).
 551     </para>
 552
 553     <para>
 554       This complete lockup is easy to diagnose: on SMP boxes the
 555       watchdog timer or compiling with <symbol>DEBUG_SPINLOCKS</symbol> set
 556       (<filename>include/linux/spinlock.h</filename>) will show this up
 557       immediately when it happens.
 558     </para>
 559
 560     <para>
 561       A more complex problem is the so-called `deadly embrace',
 562       involving two or more locks.  Say you have a hash table: each
 563       entry in the table is a spinlock, and a chain of hashed
 564       objects.  Inside a softirq handler, you sometimes want to
 565       alter an object from one place in the hash to another: you
 566       grab the spinlock of the old hash chain and the spinlock of
 567       the new hash chain, and delete the object from the old one,
 568       and insert it in the new one.
 569     </para>
 570
 571     <para>
 572       There are two problems here.  First, if your code ever
 573       tries to move the object to the same chain, it will deadlock
 574       with itself as it tries to lock it twice.  Secondly, if the
 575       same softirq on another CPU is trying to move another object
 576       in the reverse direction, the following could happen:
 577     </para>
 578
 579     <table>
 580      <title>Consequences</title>
 581
 582      <tgroup cols=2 align=left>
 583
 584       <thead>
 585        <row>
 586         <entry>CPU 1</entry>
 587         <entry>CPU 2</entry>
 588        </row>
 589       </thead>
 590
 591       <tbody>
 592        <row>
 593         <entry>Grab lock A -&gt; OK</entry>
 594         <entry>Grab lock B -&gt; OK</entry>
 595        </row>
 596        <row>
 597         <entry>Grab lock B -&gt; spin</entry>
 598         <entry>Grab lock A -&gt; spin</entry>
 599        </row>
 600       </tbody>
 601      </tgroup>
 602     </table>
 603
 604     <para>
 605       The two CPUs will spin forever, waiting for the other to give up
 606       their lock.  It will look, smell, and feel like a crash.
 607     </para>
 608
 609     <sect2 id="techs-deadlock-prevent">
 610      <title>Preventing Deadlock</title>
 611
 612      <para>
 613        Textbooks will tell you that if you always lock in the same
 614        order, you will never get this kind of deadlock.  Practice
 615        will tell you that this approach doesn't scale: when I
 616        create a new lock, I don't understand enough of the kernel
 617        to figure out where in the 5000 lock hierarchy it will fit.
 618      </para>
 619
 620      <para>
 621        The best locks are encapsulated: they never get exposed in
 622        headers, and are never held around calls to non-trivial
 623        functions outside the same file.  You can read through this
 624        code and see that it will never deadlock, because it never
 625        tries to grab another lock while it has that one.  People
 626        using your code don't even need to know you are using a
 627        lock.
 628      </para>
 629
 630      <para>
 631        A classic problem here is when you provide callbacks or
 632        hooks: if you call these with the lock held, you risk simple
 633        deadlock, or a deadly embrace (who knows what the callback
 634        will do?).  Remember, the other programmers are out to get
 635        you, so don't do this.
 636      </para>
 637     </sect2>
 638
 639     <sect2 id="techs-deadlock-overprevent">
 640      <title>Overzealous Prevention Of Deadlocks</title>
 641
 642      <para>
 643        Deadlocks are problematic, but not as bad as data
 644        corruption.  Code which grabs a read lock, searches a list,
 645        fails to find what it wants, drops the read lock, grabs a
 646        write lock and inserts the object has a race condition.
 647      </para>
 648
 649      <para>
 650        If you don't see why, please stay the fuck away from my code.
 651      </para>
 652     </sect2>
 653    </sect1>
 654
 655    <sect1 id="per-cpu">
 656     <title>Per-CPU Data</title>
 657
 658     <para>
 659       A great technique for avoiding locking which is used fairly
 660       widely is to duplicate information for each CPU.  For example,
 661       if you wanted to keep a count of a common condition, you could
 662       use a spin lock and a single counter.  Nice and simple.
 663     </para>
 664
 665     <para>
 666       If that was too slow [it's probably not], you could instead
 667       use a counter for each CPU [don't], then none of them need an
 668       exclusive lock [you're wasting your time here].  To make sure
 669       the CPUs don't have to synchronize caches all the time, align
 670       the counters to cache boundaries by appending
 671       `__cacheline_aligned' to the declaration
 672       (<filename class=headerfile>include/linux/cache.h</filename>).
 673       [Can't you think of anything better to do?]
 674     </para>
 675
 676     <para>
 677       They will need a read lock to access their own counters,
 678       however.  That way you can use a write lock to grant exclusive
 679       access to all of them at once, to tally them up.
 680     </para>
 681    </sect1>
 682
 683    <sect1 id="brlock">
 684     <title>Big Reader Locks</title>
 685
 686     <para>
 687       A classic example of per-CPU information is Ingo's `big
 688       reader' locks
 689       (<filename class=headerfile>linux/include/brlock.h</filename>).  These
 690       use the Per-CPU Data techniques described above to create a lock which
 691       is very fast to get a read lock, but agonizingly slow for a write
 692       lock.
 693     </para>
 694
 695     <para>
 696       Fortunately, there are a limited number of these locks
 697       available: you have to go through a strict interview process
 698       to get one.
 699     </para>
 700    </sect1>
 701
 702    <sect1 id="lock-avoidance-rw">
 703     <title>Avoiding Locks: Read And Write Ordering</title>
 704
 705     <para>
 706       Sometimes it is possible to avoid locking.  Consider the
 707       following case from the 2.2 firewall code, which inserted an
 708       element into a single linked list in user context:
 709     </para>
 710
 711     <programlisting>
 712         new-&gt;next = i-&gt;next;
 713         i-&gt;next = new;
 714     </programlisting>
 715
 716     <para>
 717       Here the author (Alan Cox, who knows what he's doing) assumes
 718       that the pointer assignments are atomic.  This is important,
 719       because networking packets would traverse this list on bottom
 720       halves without a lock.  Depending on their exact timing, they
 721       would either see the new element in the list with a valid
 722       <structfield>next</structfield> pointer, or it would not be in the
 723       list yet.
 724     </para>
 725
 726     <para>
 727       Of course, the writes <emphasis>must</emphasis> be in this
 728       order, otherwise the new element appears in the list with an
 729       invalid <structfield>next</structfield> pointer, and any other
 730       CPU iterating at the wrong time will jump through it into garbage.
 731       Because modern CPUs reorder, Alan's code actually read as follows:
 732     </para>
 733
 734     <programlisting>
 735         new-&gt;next = i-&gt;next;
 736         wmb();
 737         i-&gt;next = new;
 738     </programlisting>
 739
 740     <para>
 741       The <function>wmb()</function> is a write memory barrier
 742       (<filename class=headerfile>include/asm/system.h</filename>): neither
 743       the compiler nor the CPU will allow any writes to memory after the
 744       <function>wmb()</function> to be visible to other hardware
 745       before any of the writes before the <function>wmb()</function>.
 746     </para>
 747
 748     <para>
 749       As i386 does not do write reordering, this bug would never
 750       show up on that platform.  On other SMP platforms, however, it
 751       will.
 752     </para>
 753
 754     <para>
 755       There is also <function>rmb()</function> for read ordering: to ensure
 756       any previous variable reads occur before following reads.  The simple
 757       <function>mb()</function> macro combines both
 758       <function>rmb()</function> and <function>wmb()</function>.
 759     </para>
 760
 761     <para>
 762       Any atomic operation is defined to act as a memory barrier
 763       (ie. as per the <function>mb()</function> macro).  Also,
 764       spinlock operations act as partial barriers: operations after
 765       gaining a spinlock will never be moved to precede the
 766       <function>spin_lock()</function> call, and operations before
 767       releasing a spinlock will never be moved after the
 768       <function>spin_unlock()</function> call.
 769       <!-- Manfred Spraul <manfreds@colorfullife.com>
 770            24 May 2000 2.3.99-pre9 -->
 771     </para>
 772    </sect1>
 773
 774    <sect1 id="lock-avoidance-atomic-ops">
 775     <title>Avoiding Locks: Atomic Operations</title>
 776
 777     <para>
 778       There are a number of atomic operations defined in
 779       <filename class=headerfile>include/asm/atomic.h</filename>: these
 780       are guaranteed to be seen atomically from all CPUs in the system, thus
 781       avoiding races. If your shared data consists of a single counter, say,
 782       these operations might be simpler than using spinlocks (although for
 783       anything non-trivial using spinlocks is clearer).
 784     </para>
 785
 786     <para>
 787       Note that the atomic operations are defined to act as both
 788       read and write barriers on all platforms.
 789     </para>
 790    </sect1>
 791
 792    <sect1 id="ref-counts">
 793     <title>Protecting A Collection of Objects: Reference Counts</title>
 794
 795     <para>
 796       Locking a collection of objects is fairly easy: you get a
 797       single spinlock, and you make sure you grab it before
 798       searching, adding or deleting an object.
 799     </para>
 800
 801     <para>
 802       The purpose of this lock is not to protect the individual
 803       objects: you might have a separate lock inside each one for
 804       that.  It is to protect the <emphasis>data structure
 805       containing the objects</emphasis> from race conditions.  Often
 806       the same lock is used to protect the contents of all the
 807       objects as well, for simplicity, but they are inherently
 808       orthogonal (and many other big words designed to confuse).
 809     </para>
 810
 811     <para>
 812       Changing this to a read-write lock will often help markedly if
 813       reads are far more common that writes.  If not, there is
 814       another approach you can use to reduce the time the lock is
 815       held: reference counts.
 816     </para>
 817
 818     <para>
 819       In this approach, an object has an owner, who sets the
 820       reference count to one.  Whenever you get a pointer to the
 821       object, you increment the reference count (a `get' operation).
 822       Whenever you relinquish a pointer, you decrement the reference
 823       count (a `put' operation).  When the owner wants to destroy
 824       it, they mark it dead, and do a put.
 825     </para>
 826
 827     <para>
 828       Whoever drops the reference count to zero (usually implemented
 829       with <function>atomic_dec_and_test()</function>) actually cleans
 830       up and frees the object.
 831     </para>
 832
 833     <para>
 834       This means that you are guaranteed that the object won't
 835       vanish underneath you, even though you no longer have a lock
 836       for the collection.
 837     </para>
 838
 839     <para>
 840       Here's some skeleton code:
 841     </para>
 842
 843     <programlisting>
 844         void create_foo(struct foo *x)
 845         {
 846                 atomic_set(&amp;x-&gt;use, 1);
 847                 spin_lock_bh(&amp;list_lock);
 848                 ... insert in list ...
 849                 spin_unlock_bh(&amp;list_lock);
 850         }
 851
 852         struct foo *get_foo(int desc)
 853         {
 854                 struct foo *ret;
 855
 856                 spin_lock_bh(&amp;list_lock);
 857                 ... find in list ...
 858                 if (ret) atomic_inc(&amp;ret-&gt;use);
 859                 spin_unlock_bh(&amp;list_lock);
 860
 861                 return ret;
 862         }
 863
 864         void put_foo(struct foo *x)
 865         {
 866                 if (atomic_dec_and_test(&amp;x-&gt;use))
 867                         kfree(foo);
 868         }
 869
 870         void destroy_foo(struct foo *x)
 871         {
 872                 spin_lock_bh(&amp;list_lock);
 873                 ... remove from list ...
 874                 spin_unlock_bh(&amp;list_lock);
 875
 876                 put_foo(x);
 877         }
 878     </programlisting>
 879
 880     <sect2 id="helpful-macros">
 881      <title>Macros To Help You</title>
 882      <para>
 883        There are a set of debugging macros tucked inside
 884        <filename class=headerfile>include/linux/netfilter_ipv4/lockhelp.h</filename>
 885        and <filename class=headerfile>listhelp.h</filename>: these are very
 886        useful for ensuring that locks are held in the right places to protect
 887        infrastructure.
 888      </para>
 889     </sect2>
 890    </sect1>
 891
 892    <sect1 id="sleeping-things">
 893     <title>Things Which Sleep</title>
 894
 895     <para>
 896       You can never call the following routines while holding a
 897       spinlock, as they may sleep.  This also means you need to be in
 898       user context.
 899     </para>
 900
 901     <itemizedlist>
 902      <listitem>
 903       <para>
 904         Accesses to
 905         <firstterm linkend="gloss-userspace">userspace</firstterm>:
 906       </para>
 907       <itemizedlist>
 908        <listitem>
 909         <para>
 910           <function>copy_from_user()</function>
 911         </para>
 912        </listitem>
 913        <listitem>
 914         <para>
 915           <function>copy_to_user()</function>
 916         </para>
 917        </listitem>
 918        <listitem>
 919         <para>
 920           <function>get_user()</function>
 921         </para>
 922        </listitem>
 923        <listitem>
 924         <para>
 925           <function> put_user()</function>
 926         </para>
 927        </listitem>
 928       </itemizedlist>
 929      </listitem>
 930
 931      <listitem>
 932       <para>
 933         <function>kmalloc(GFP_KERNEL)</function>
 934       </para>
 935      </listitem>
 936
 937      <listitem>
 938       <para>
 939       <function>down_interruptible()</function> and
 940       <function>down()</function>
 941       </para>
 942       <para>
 943        There is a <function>down_trylock()</function> which can be
 944        used inside interrupt context, as it will not sleep.
 945        <function>up()</function> will also never sleep.
 946       </para>
 947      </listitem>
 948     </itemizedlist>
 949
 950     <para>
 951      <function>printk()</function> can be called in
 952      <emphasis>any</emphasis> context, interestingly enough.
 953     </para>
 954    </sect1>
 955
 956    <sect1 id="sparc">
 957     <title>The Fucked Up Sparc</title>
 958
 959     <para>
 960       Alan Cox says <quote>the irq disable/enable is in the register
 961       window on a sparc</quote>.  Andi Kleen says <quote>when you do
 962       restore_flags in a different function you mess up all the
 963       register windows</quote>.
 964     </para>
 965
 966     <para>
 967       So never pass the flags word set by
 968       <function>spin_lock_irqsave()</function> and brethren to another
 969       function (unless it's declared <type>inline</type>.  Usually no-one
 970       does this, but now you've been warned.  Dave Miller can never do
 971       anything in a straightforward manner (I can say that, because I have
 972       pictures of him and a certain PowerPC maintainer in a compromising
 973       position).
 974     </para>
 975    </sect1>
 976
 977    <sect1 id="racing-timers">
 978     <title>Racing Timers: A Kernel Pastime</title>
 979
 980     <para>
 981       Timers can produce their own special problems with races.
 982       Consider a collection of objects (list, hash, etc) where each
 983       object has a timer which is due to destroy it.
 984     </para>
 985
 986     <para>
 987       If you want to destroy the entire collection (say on module
 988       removal), you might do the following:
 989     </para>
 990
 991     <programlisting>
 992         /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE
 993            HUNGARIAN NOTATION */
 994         spin_lock_bh(&amp;list_lock);
 995
 996         while (list) {
 997                 struct foo *next = list-&gt;next;
 998                 del_timer(&amp;list-&gt;timer);
 999                 kfree(list);
1000                 list = next;
1001         }
1002
1003         spin_unlock_bh(&amp;list_lock);
1004     </programlisting>
1005
1006     <para>
1007       Sooner or later, this will crash on SMP, because a timer can
1008       have just gone off before the <function>spin_lock_bh()</function>,
1009       and it will only get the lock after we
1010       <function>spin_unlock_bh()</function>, and then try to free
1011       the element (which has already been freed!).
1012     </para>
1013
1014     <para>
1015       This can be avoided by checking the result of
1016       <function>del_timer()</function>: if it returns
1017       <returnvalue>1</returnvalue>, the timer has been deleted.
1018       If <returnvalue>0</returnvalue>, it means (in this
1019       case) that it is currently running, so we can do:
1020     </para>
1021
1022     <programlisting>
1023         retry:
1024                 spin_lock_bh(&amp;list_lock);
1025
1026                 while (list) {
1027                         struct foo *next = list-&gt;next;
1028                         if (!del_timer(&amp;list-&gt;timer)) {
1029                                 /* Give timer a chance to delete this */
1030                                 spin_unlock_bh(&amp;list_lock);
1031                                 goto retry;
1032                         }
1033                         kfree(list);
1034                         list = next;
1035                 }
1036
1037                 spin_unlock_bh(&amp;list_lock);
1038     </programlisting>
1039
1040     <para>
1041       Another common problem is deleting timers which restart
1042       themselves (by calling <function>add_timer()</function> at the end
1043       of their timer function).  Because this is a fairly common case
1044       which is prone to races, you can put a call to
1045       <function>timer_exit()</function> at the very end of your timer function,
1046       and user <function>del_timer_sync()</function>
1047       (<filename class=headerfile>include/linux/timer.h</filename>)
1048       to handle this case.  It returns the number of times the timer
1049       had to be deleted before we finally stopped it from adding itself back
1050       in.
1051     </para>
1052    </sect1>
1053   </chapter>
1054
1055   <chapter id="references">
1056    <title>Further reading</title>
1057
1058    <itemizedlist>
1059     <listitem>
1060      <para>
1061        <filename>Documentation/spinlocks.txt</filename>:
1062        Linus Torvalds' spinlocking tutorial in the kernel sources.
1063      </para>
1064     </listitem>
1065
1066     <listitem>
1067      <para>
1068        Unix Systems for Modern Architectures: Symmetric
1069        Multiprocessing and Caching for Kernel Programmers:
1070      </para>
1071
1072      <para>
1073        Curt Schimmel's very good introduction to kernel level
1074        locking (not written for Linux, but nearly everything
1075        applies).  The book is expensive, but really worth every
1076        penny to understand SMP locking. [ISBN: 0201633388]
1077      </para>
1078     </listitem>
1079    </itemizedlist>
1080   </chapter>
1081
1082   <chapter id="thanks">
1083     <title>Thanks</title>
1084
1085     <para>
1086       Thanks to Telsa Gwynne for DocBooking, neatening and adding
1087       style.
1088     </para>
1089
1090     <para>
1091       Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul
1092       Mackerras, Ruedi Aschwanden, Alan Cox, Manfred Spraul and Tim
1093       Waugh for proofreading, correcting, flaming, commenting.
1094     </para>
1095
1096     <para>
1097       Thanks to the cabal for having no influence on this document.
1098     </para>
1099   </chapter>
1100
1101   <glossary id="glossary">
1102    <title>Glossary</title>
1103
1104    <glossentry id="gloss-bh">
1105     <glossterm>bh</glossterm>
1106      <glossdef>
1107       <para>
1108         Bottom Half: for historical reasons, functions with
1109         `_bh' in them often now refer to any software interrupt, e.g.
1110         <function>spin_lock_bh()</function> blocks any software interrupt
1111         on the current CPU.  Bottom halves are deprecated, and will
1112         eventually be replaced by tasklets.  Only one bottom half will be
1113         running at any time.
1114      </para>
1115     </glossdef>
1116    </glossentry>
1117
1118    <glossentry id="gloss-hwinterrupt">
1119     <glossterm>Hardware Interrupt / Hardware IRQ</glossterm>
1120     <glossdef>
1121      <para>
1122        Hardware interrupt request.  <function>in_irq()</function> returns
1123        <returnvalue>true</returnvalue> in a hardware interrupt handler (it
1124        also returns true when interrupts are blocked).
1125      </para>
1126     </glossdef>
1127    </glossentry>
1128
1129    <glossentry id="gloss-interruptcontext">
1130     <glossterm>Interrupt Context</glossterm>
1131     <glossdef>
1132      <para>
1133        Not user context: processing a hardware irq or software irq.
1134        Indicated by the <function>in_interrupt()</function> macro
1135        returning <returnvalue>true</returnvalue> (although it also
1136        returns true when interrupts or BHs are blocked).
1137      </para>
1138     </glossdef>
1139    </glossentry>
1140
1141    <glossentry id="gloss-smp">
1142     <glossterm><acronym>SMP</acronym></glossterm>
1143     <glossdef>
1144      <para>
1145        Symmetric Multi-Processor: kernels compiled for multiple-CPU
1146        machines.  (CONFIG_SMP=y).
1147      </para>
1148     </glossdef>
1149    </glossentry>
1150
1151    <glossentry id="gloss-softirq">
1152     <glossterm>softirq</glossterm>
1153     <glossdef>
1154      <para>
1155        Strictly speaking, one of up to 32 enumerated software
1156        interrupts which can run on multiple CPUs at once.
1157        Sometimes used to refer to tasklets and bottom halves as
1158        well (ie. all software interrupts).
1159      </para>
1160     </glossdef>
1161    </glossentry>
1162
1163    <glossentry id="gloss-swinterrupt">
1164     <glossterm>Software Interrupt / Software IRQ</glossterm>
1165     <glossdef>
1166      <para>
1167        Software interrupt handler.  <function>in_irq()</function> returns
1168        <returnvalue>false</returnvalue>; <function>in_softirq()</function>
1169        returns <returnvalue>true</returnvalue>.  Tasklets, softirqs and
1170        bottom halves all fall into the category of `software interrupts'.
1171      </para>
1172     </glossdef>
1173    </glossentry>
1174
1175    <glossentry id="gloss-tasklet">
1176     <glossterm>tasklet</glossterm>
1177     <glossdef>
1178      <para>
1179        A dynamically-registrable software interrupt,
1180        which is guaranteed to only run on one CPU at a time.
1181      </para>
1182     </glossdef>
1183    </glossentry>
1184
1185    <glossentry id="gloss-up">
1186     <glossterm><acronym>UP</acronym></glossterm>
1187     <glossdef>
1188      <para>
1189        Uni-Processor: Non-SMP.  (CONFIG_SMP=n).
1190      </para>
1191     </glossdef>
1192    </glossentry>
1193
1194    <glossentry id="gloss-usercontext">
1195     <glossterm>User Context</glossterm>
1196     <glossdef>
1197      <para>
1198        The kernel executing on behalf of a particular
1199        process or kernel thread (given by the <function>current()</function>
1200        macro.)  Not to be confused with userspace.  Can be interrupted by
1201        software  or hardware interrupts.
1202      </para>
1203     </glossdef>
1204    </glossentry>
1205
1206    <glossentry id="gloss-userspace">
1207     <glossterm>Userspace</glossterm>
1208     <glossdef>
1209      <para>
1210        A process executing its own code outside the kernel.
1211      </para>
1212     </glossdef>
1213    </glossentry>
1214
1215   </glossary>
1216 </book>
1217