db2/mutex/README

# @(#)README    10.2 (Sleepycat) 11/25/97

Resource locking routines: lock based on a db_mutex_t.  All this gunk
(including trying to make assembly code portable) is necessary because
System V semaphores require system calls for uncontested locks and we
don't want to make two system calls per resource lock.

First, this is how it works.  The db_mutex_t structure contains a resource
test-and-set lock (tsl), a file offset, a pid for debugging, and statistics
information.

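As a rough picture of that structure, here is a minimal sketch.  The type
name, the field names, the field types and the split of the statistics
into two counters are all assumptions made for illustration; they are not
taken from the real header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical layout; every name here is illustrative only. */
typedef struct {
    volatile unsigned char tsl_resource; /* resource test-and-set lock */
    off_t    off;          /* byte offset used when locking in the kernel */
    pid_t    pid;          /* pid of the current holder, for debugging */
    uint32_t set_wait;     /* statistics: locks granted after waiting */
    uint32_t set_nowait;   /* statistics: locks granted without waiting */
} sketch_mutex_t;

/* Put a mutex into the unlocked state, remembering its file offset. */
static void sketch_mutex_init(sketch_mutex_t *m, off_t off)
{
    assert(m != NULL);
    memset(m, 0, sizeof(*m));
    m->off = off;
}
```
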
If HAVE_SPINLOCKS is defined (i.e. we know how to do test-and-sets for
this compiler/architecture combination), we try to lock the resource tsl
__db_tsl_spins times.  If we can't acquire the lock that way, we use a
system call to sleep for 10ms, 20ms, 40ms, etc.  (The time is bounded at
1 second, just in case.)  Using the timer backoff relies on two
assumptions: that locks are held for brief periods (never over system
calls or I/O) and that locks are not hotly contested.

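The spin-then-sleep loop can be sketched roughly as follows.  This is not
the real code: it assumes C11 <stdatomic.h> in place of per-architecture
assembly and POSIX nanosleep() for the sleep, and the sketch_* names and
the spin count of 50 (standing in for __db_tsl_spins) are made up.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <time.h>

#define SKETCH_TSL_SPINS 50    /* assumed value; stands in for __db_tsl_spins */
#define SKETCH_MS_FIRST  10    /* first sleep, in milliseconds */
#define SKETCH_MS_MAX    1000  /* sleeps are bounded at one second */

/* Next sleep interval: double the last one, capped at SKETCH_MS_MAX. */
static unsigned sketch_next_backoff(unsigned ms)
{
    ms *= 2;
    return ms > SKETCH_MS_MAX ? SKETCH_MS_MAX : ms;
}

/* Sleep for roughly ms milliseconds using a system call. */
static void sketch_sleep_ms(unsigned ms)
{
    struct timespec ts;

    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (long)(ms % 1000) * 1000000L;
    (void)nanosleep(&ts, NULL);
}

/* Spin on the tsl; if that fails, sleep with backoff and spin again. */
static void sketch_lock(atomic_flag *tsl)
{
    unsigned ms = SKETCH_MS_FIRST;
    int i;

    assert(tsl != NULL);
    for (;;) {
        for (i = 0; i < SKETCH_TSL_SPINS; i++)
            if (!atomic_flag_test_and_set_explicit(tsl,
                memory_order_acquire))
                return;         /* flag was clear: we own the lock */
        sketch_sleep_ms(ms);
        ms = sketch_next_backoff(ms);
    }
}

/* Release the tsl; no system call is needed. */
static void sketch_unlock(atomic_flag *tsl)
{
    assert(tsl != NULL);
    atomic_flag_clear_explicit(tsl, memory_order_release);
}
```

Note that the uncontested path never enters the kernel, which is the whole
point.  A waiter can sleep up to a second after the lock has been freed,
which is why the two assumptions above matter.
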
If HAVE_SPINLOCKS is not defined, i.e. we can't do test-and-sets, we use
a file descriptor to do byte locking on a file at a specified offset.  In
this case, ALL of the locking is done in the kernel.  Because file
descriptors are allocated per process, we have to provide the file
descriptor as part of the lock/unlock call.  We still have to do timer
backoff because we need to be able to block ourselves, i.e. the lock
manager causes processes to wait by having the process acquire a mutex
and then attempt to re-acquire the mutex.  There's no way to use kernel
locking to block yourself, i.e. if you hold a lock and attempt to
re-acquire it, the attempt will succeed.

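The "can't block yourself" behavior is easy to see with fcntl(2).  The
following is a small sketch, not code from this directory: the sketch_*
names are invented, and it locks a byte in a scratch file from tmpfile()
rather than in a real region file.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Try to write-lock one byte at offset off; 0 on success, -1 on failure. */
static int sketch_byte_lock(int fd, off_t off)
{
    struct flock fl = {0};

    assert(fd >= 0);
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = off;
    fl.l_len = 1;
    return fcntl(fd, F_SETLK, &fl);     /* non-blocking request */
}

/*
 * Lock the same byte twice from the same process.
 * Returns 1 if both requests succeed, 0 if not, -1 on setup error.
 */
static int sketch_relock_succeeds(void)
{
    FILE *fp = tmpfile();
    int fd, first, second;

    if (fp == NULL)
        return -1;
    fd = fileno(fp);
    first = sketch_byte_lock(fd, 0);
    second = sketch_byte_lock(fd, 0);   /* we already hold it; no conflict */
    fclose(fp);                         /* closing the file drops the lock */
    return first == 0 && second == 0;
}
```

Since the second request does not conflict with the first, the kernel has
nothing to make the process wait on, so the waiting has to be done with
the timer backoff instead.
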
Next, let's talk about why it doesn't work the way a reasonable person
would think it should work.

Ideally, we'd have the ability to try to lock the resource tsl, and if
that fails, increment a counter of waiting processes, then block in the
kernel until the tsl is released.  The process holding the resource tsl
would see the wait counter when it went to release the resource tsl, and
would wake any waiting processes up after releasing the lock.  This would
actually require both another tsl (call it the mutex tsl) and
synchronization between the call that blocks in the kernel and the actual
resource tsl.  The mutex tsl would be used to protect accesses to the
db_mutex_t itself.  Locking the mutex tsl would be done by a busy loop,
which is safe because processes would never block holding that tsl (all
they would do is try to obtain the resource tsl and set/check the wait
count).  The problem in this model is that the blocking call into the
kernel requires a blocking semaphore, i.e. one whose normal state is
locked.

The only portable forms of locking under UNIX are fcntl(2) on a file
descriptor/offset, and System V semaphores.  Neither of these locking
methods is sufficient to solve the problem.

The problem with fcntl locking is that only the process that obtained the
lock can release it.  Remember, we want the normal state of the kernel
semaphore to be locked.  So, suppose the creator of the db_mutex_t
initializes the lock to "locked", a second process then locks the resource
tsl, and a third process needs to block, waiting for the resource tsl.
When the second process wants to wake up the third process, it can't,
because it's not the holder of the lock!  For the second process to be
the holder of the lock, we would have to make a system call per
uncontested lock, which is what we were trying to get away from in the
first place.

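To make the ownership rule concrete, here is a sketch in which a parent
takes an fcntl lock and a forked child tries to release it.  The names
are invented, and the parent/child pair merely stands in for the
"creator" and "second process" above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Describe a one-byte lock request of the given type at offset 0. */
static void sketch_fill(struct flock *fl, short type)
{
    fl->l_type = type;
    fl->l_whence = SEEK_SET;
    fl->l_start = 0;
    fl->l_len = 1;
}

/*
 * Returns 1 if the parent's lock is still in place after the child
 * tries to unlock it, 0 if it is gone, -1 on setup error.
 */
static int sketch_lock_survives_foreign_unlock(void)
{
    FILE *fp = tmpfile();
    struct flock fl = {0};
    pid_t pid;
    int fd, status;

    if (fp == NULL)
        return -1;
    fd = fileno(fp);
    assert(fd >= 0);

    sketch_fill(&fl, F_WRLCK);
    if (fcntl(fd, F_SETLK, &fl) == -1) {        /* parent takes the lock */
        fclose(fp);
        return -1;
    }
    if ((pid = fork()) == -1) {
        fclose(fp);
        return -1;
    }
    if (pid == 0) {                             /* child: not the holder */
        struct flock cf = {0};

        sketch_fill(&cf, F_UNLCK);
        (void)fcntl(fd, F_SETLK, &cf);          /* only affects our locks */
        sketch_fill(&cf, F_WRLCK);
        if (fcntl(fd, F_GETLK, &cf) == -1)      /* would a lock conflict? */
            _exit(2);
        _exit(cf.l_type == F_UNLCK ? 0 : 1);    /* 1: parent still holds */
    }
    if (waitpid(pid, &status, 0) == -1) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    if (!WIFEXITED(status) || WEXITSTATUS(status) > 1)
        return -1;
    return WEXITSTATUS(status);
}
```

The child's unlock request is not even an error; it simply has no effect
on a lock that belongs to another process.
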
There are some hybrid schemes, such as signaling the holder of the lock,
or using a different blocking offset depending on which process is
holding the lock, but it gets complicated fairly quickly.  I'm open to
suggestions, but I'm not holding my breath.

Regardless, we use fcntl locking when HAVE_SPINLOCKS is not defined
(i.e. we're locking in the kernel), because it doesn't have the
limitations found in System V semaphores, and because the normal state of
the kernel object in that case is unlocked, so the process releasing the
lock is also the holder of the lock.

The System V semaphore design has a number of other limitations that make
it inappropriate for this task.  Namely:

First, the semaphore key name space is separate from the file system name
space (although there exist methods for using file names to create
semaphore keys).  If we use a well-known key, there's no reason to believe
that any particular key will not already be in use, either by another
instance of the DB application or some other application, in which case
the DB application will fail.  If we create a key, then we have to use a
file system name to rendezvous and pass around the key.

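For illustration, here is a sketch of the key problem using ftok(3) and
semget(2).  Nothing here comes from the DB code: the function name and
the project id 'D' are arbitrary, and the caller supplies the path of an
existing file to derive the key from.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/*
 * Derive a key from an existing file name, create a semaphore set under
 * it, then try to create it again as a second application would.
 * Returns 1 if the second create fails with EEXIST, 0 if it does not,
 * -1 if the set could not be created at all (key already taken, limits
 * reached, or no System V IPC available).
 */
static int sketch_key_collision(const char *path)
{
    key_t key;
    int first, second, collided;

    assert(path != NULL);
    key = ftok(path, 'D');              /* file name -> semaphore key */
    if (key == (key_t)-1)
        return -1;

    first = semget(key, 1, IPC_CREAT | IPC_EXCL | 0600);
    if (first == -1)
        return -1;

    /* Someone else picking the same key now finds it in use. */
    second = semget(key, 1, IPC_CREAT | IPC_EXCL | 0600);
    collided = (second == -1 && errno == EEXIST);

    (void)semctl(first, 0, IPC_RMID);   /* remove the set we created */
    return collided;
}
```

Calling sketch_key_collision(".") is enough to try it.  The -1 return
covers the case where the first create itself fails, which is the same
failure seen from the other side.
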
Second, System V semaphores traditionally have compile-time, system-wide
limits on the number of semaphore keys that you can have.  Typically, that
number is far too low for any practical purpose.  Since the semaphores
permit more than a single slot per semaphore key, we could try to get
around that limit by using multiple slots, but that means that the file
that we're using for rendezvous is going to have to contain slot
information as well as semaphore key information, and we're going to be
reading/writing it on every db_mutex_t init or destroy operation.  Anyhow,
similar compile-time, system-wide limits on the number of slots per
semaphore key kick in, and you're right back where you started.

My fantasy is that once POSIX.1 standard mutexes are in widespread use,
we can switch to them.  My guess is that it won't happen, because the
POSIX mutexes are only required to work for threads within a process,
and not independent processes.

Note: there are races in the statistics code, but since it's just
statistics, I didn't bother fixing them.  (The fix requires a mutex tsl,
so when/if this code is fixed to do rational locking (see above), change
the statistics update code to acquire/release the mutex tsl.)