Started Oct 1999 by Kanoj Sarcar <kanoj@sgi.com>

The intent of this file is to have an up-to-date, running commentary
from different people about how locking and synchronization is done
in the Linux vm code.

vmlist_access_lock/vmlist_modify_lock
--------------------------------------

Page stealers pick processes out of the process pool and scan for
the best process to steal pages from. To guarantee the existence
of the victim mm, a mm_count inc and a mmdrop are done in swap_out().
Page stealers hold kernel_lock to protect against a bunch of races.
The vma list of the victim mm is also scanned by the stealer,
and the vmlist_lock is used to preserve list sanity against the
process adding/deleting to the list. This also guarantees existence
of the vma. Vma existence is not guaranteed once try_to_swap_out()
drops the vmlist lock. To guarantee the existence of the underlying
file structure, a get_file is done before the swapout() method is
invoked. The page passed into swapout() is guaranteed not to be reused
for a different purpose because the page reference count due to being
present in the user's pte is not released till after swapout() returns.

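A minimal sketch of that pinning sequence, under the assumptions that
swapout() takes (struct page *, struct file *) and that the vmlist lock
macros take the mm, with all error/NULL checks omitted (this is an
illustration, not the actual swap_out()/try_to_swap_out() source):

        /* Sketch only: signatures and helper names are assumptions. */
        static int steal_from(struct mm_struct *mm, unsigned long address,
                              struct page *page)
        {
                int (*swapout)(struct page *, struct file *);
                struct vm_area_struct *vma;
                struct file *file;
                int error;

                atomic_inc(&mm->mm_count);      /* pin the victim mm; paired with mmdrop() */
                lock_kernel();                  /* page stealers hold kernel_lock */
                vmlist_access_lock(mm);         /* keep the vma chain stable while scanning */
                vma = find_vma(mm, address);
                swapout = vma->vm_ops->swapout;
                file = vma->vm_file;
                get_file(file);                 /* pin the file before swapout() is invoked */
                vmlist_access_unlock(mm);       /* the vma may disappear from here on */
                error = swapout(page, file);    /* the pte reference keeps the page pinned */
                fput(file);
                unlock_kernel();
                mmdrop(mm);                     /* release the victim mm */
                return error;
        }
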
Any code that modifies the vmlist, or the vm_start/vm_end/
vm_flags:VM_LOCKED/vm_next of any vma *in the list*, must prevent
kswapd from looking at the chain. This does not include driver mmap()
methods, for example, since the vma is still not in the list.

The rules are:
1. To modify the vmlist (add/delete or change fields in an element),
you must hold mmap_sem to guard against clones doing mmap/munmap/faults
(ie all vm system calls and faults), and from ptrace, swapin due to
swap deletion, etc.
2. To modify the vmlist (add/delete or change fields in an element),
you must also hold vmlist_modify_lock, to guard against page stealers
scanning the list (see the sketch after these rules).
3. To scan the vmlist (find_vma()), you must either
        a. grab mmap_sem, which should be done by all cases except the
           page stealer,
or
        b. grab vmlist_access_lock, only done by the page stealer.
4. While holding the vmlist_modify_lock, you must be able to guarantee
that no code path will lead to page stealing. A better guarantee is
to claim non-sleepability, which ensures that you are not sleeping
for a lock whose holder might in turn be doing page stealing.
5. You must be able to guarantee that while holding vmlist_modify_lock
or vmlist_access_lock of mm A, you will not try to get either lock
for mm B.

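The modify side (rules 1 and 2), for a field change on a vma that is
already in the list, comes down to a bracket like the following sketch,
assuming mmap_sem is taken with down()/up() and that the vmlist lock
macros take the mm (assumptions for illustration; error handling omitted):

        down(&mm->mmap_sem);            /* rule 1: vm syscalls, faults, ptrace, swapins */
        vma = find_vma(mm, start);
        vmlist_modify_lock(mm);         /* rule 2: keep page stealers off the chain */
        vma->vm_flags |= VM_LOCKED;     /* e.g. an mlock-style field change */
        vmlist_modify_unlock(mm);
        up(&mm->mmap_sem);
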
The caveats are:
1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
The update of mmap_cache is racy (the page stealer can race with other code
that invokes find_vma with mmap_sem held), but that is okay, since it
is a hint. This can be fixed, if desired, by having find_vma grab the
vmlist lock.

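The two scan cases of rule 3 then look roughly like this (same
assumptions as the sketches above):

        /* rule 3a: everybody except the page stealer */
        down(&mm->mmap_sem);
        vma = find_vma(mm, address);    /* may update the racy mmap_cache hint */
        up(&mm->mmap_sem);

        /* rule 3b: the page stealer */
        vmlist_access_lock(mm);
        vma = find_vma(mm, address);
        vmlist_access_unlock(mm);
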
Code that adds/deletes elements from the vmlist chain:
1. callers of insert_vm_struct
2. callers of merge_segments
3. callers of avl_remove

Code that changes vm_start/vm_end/vm_flags:VM_LOCKED of vma's on
the list:
1. mprotect
2. mlock
3. mremap

It is advisable that changes to vm_start/vm_end be protected, although
in some cases it is not really needed. For example, vm_start is modified
by expand_stack(), and it is hard to come up with a destructive scenario
there even without the vmlist protection.

The vmlist lock nests with the inode i_shared_lock and the kmem cache
c_spinlock spinlocks. This is okay, since code that holds i_shared_lock
never asks for memory, and the kmem code asks for pages after dropping
c_spinlock. The vmlist lock also nests with the pagecache_lock and
pagemap_lru_lock spinlocks, and no code asks for memory with these locks
held.

The vmlist lock is grabbed while holding the kernel_lock spinning monitor.

The vmlist lock can be a sleeping or spin lock. In either case, care
must be taken that it is not held on entry to the driver methods, since
those methods might sleep or ask for memory, causing deadlocks.

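For example, a do_mmap()-style caller would call the driver's mmap()
method before taking the vmlist lock, since the new vma is not yet in
the list at that point (sketch only, with the same lock assumptions as
above and allocation/error handling omitted):

        down(&mm->mmap_sem);
        /* allocate and fill in a new vma; it is not in the list yet */
        error = file->f_op->mmap(file, vma);    /* may sleep or allocate: no vmlist lock held */
        vmlist_modify_lock(mm);
        insert_vm_struct(mm, vma);              /* only now does kswapd see the vma */
        vmlist_modify_unlock(mm);
        up(&mm->mmap_sem);
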
The current implementation of the vmlist lock uses the page_table_lock,
which is also the spinlock that page stealers use to protect changes to
the victim process' ptes. Thus we have a reduction in the total number
of locks.

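In other words, the lock macros are expected to reduce to the mm's
page_table_lock, roughly as in this sketch (the exact macro bodies are
an assumption here):

        #define vmlist_access_lock(mm)          spin_lock(&(mm)->page_table_lock)
        #define vmlist_access_unlock(mm)        spin_unlock(&(mm)->page_table_lock)
        #define vmlist_modify_lock(mm)          vmlist_access_lock(mm)
        #define vmlist_modify_unlock(mm)        vmlist_access_unlock(mm)
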
swap_list_lock/swap_device_lock
-------------------------------

The swap devices are chained in priority order from the "swap_list" header.
The "swap_list" is used for the round-robin swaphandle allocation strategy.
The number of free swaphandles is maintained in "nr_swap_pages". These two
together are protected by the swap_list_lock.

The swap_device_lock, which is per swap device, protects the reference
counts on the corresponding swaphandles, maintained in the "swap_map"
array, and the "highest_bit" and "lowest_bit" fields.

Both of these are spinlocks, and are never acquired from interrupt level.
The locking hierarchy is swap_list_lock -> swap_device_lock.

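A sketch of a swaphandle allocation path following that ordering,
assuming both locks are plain spinlocks and using hypothetical helpers
(pick_next_swap_device(), find_free_handle()) for the parts the text
does not spell out:

        spin_lock(&swap_list_lock);             /* guards swap_list and nr_swap_pages */
        p = pick_next_swap_device();            /* hypothetical round-robin pick */
        spin_lock(&p->swap_device_lock);        /* guards swap_map[], highest_bit, lowest_bit */
        offset = find_free_handle(p);           /* hypothetical scan of swap_map for a free slot */
        p->swap_map[offset] = 1;                /* first reference on the new handle */
        nr_swap_pages--;
        spin_unlock(&p->swap_device_lock);
        spin_unlock(&swap_list_lock);
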
To prevent races between swap space deletion or async readahead swapins
deciding whether a swap handle is being used (ie worthy of being read in
from disk) and an unmap -> swap_free making the handle unused, the swap
delete and readahead code grabs a temporary reference on the swaphandle to
prevent warning messages from swap_duplicate <- read_swap_cache_async.

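That temporary reference looks roughly like this (sketch; the return
convention of swap_duplicate() and the argument list of
read_swap_cache_async() are assumptions):

        if (!swap_duplicate(entry))             /* temp ref: the handle stays "in use" */
                return;                         /* already freed by an unmap -> swap_free */
        read_swap_cache_async(entry, 0);        /* its internal swap_duplicate cannot warn now */
        swap_free(entry);                       /* drop the temp ref */
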
Swap cache locking
------------------
Pages are added into the swap cache with kernel_lock held, to make sure
that multiple pages are not being added (and hence lost) by associating
all of them with the same swaphandle.

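That is, an adder serializes on the big kernel lock around the insertion
(sketch; the add_to_swap_cache() argument order is an assumption):

        lock_kernel();                  /* only one adder per swaphandle at a time */
        add_to_swap_cache(page, entry); /* associate this page with the swaphandle */
        unlock_kernel();
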
Pages are guaranteed not to be removed from the scache if the page is
"shared": ie, other processes hold a reference on the page or the associated
swap handle. The only code that does not follow this rule is shrink_mmap,
which deletes pages from the swap cache if no process has a reference on
the page (multiple processes might have references on the corresponding
swap handle though). lookup_swap_cache() races with shrink_mmap when
establishing a reference on a scache page, so it must check whether the
page it located is still in the swapcache, or whether shrink_mmap deleted it.
(This race is due to the fact that shrink_mmap looks at the page ref
count with pagecache_lock held, but then drops pagecache_lock before deleting
the page from the scache.)

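The recheck pattern inside a lookup_swap_cache()-style function is
roughly the following (sketch; the lookup helper and the reference
calls are assumptions):

        page = find_page_in_swap_cache(entry);  /* hypothetical lookup under pagecache_lock */
        if (!page)
                return NULL;
        get_page(page);                         /* establish our reference */
        if (!PageSwapCache(page)) {             /* shrink_mmap deleted it in the window */
                put_page(page);
                return NULL;                    /* treat it as a cache miss */
        }
        return page;
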
do_wp_page and do_swap_page have MP races in them while trying to figure
out whether a page is "shared", by looking at the page_count + swap_count.
To preserve the sum of the counts, the page lock _must_ be acquired before
calling is_page_shared (else processes might switch their swap_count refs
to the page count refs, after the page count ref has been snapshotted).

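In other words, the check is expected to be bracketed by the page lock
(sketch; lock_page()/UnlockPage() as the lock helpers is an assumption):

        lock_page(page);                /* freeze page_count <-> swap_count transfers */
        shared = is_page_shared(page);  /* the combined count snapshot is now stable */
        UnlockPage(page);
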
Swap device deletion code currently breaks all the scache assumptions,
since it grabs neither mmap_sem nor page_table_lock.