.\" share/man/man8/swapcache.8
.\"
.\" swapcache - Cache clean filesystem data & meta-data on SSD-based swap
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.Dd February 7, 2010
.Dt SWAPCACHE 8
.Os
.Sh NAME
.Nm swapcache
.Nd a mechanism which allows the system to use fast swap to cache filesystem data and meta-data
.Sh SYNOPSIS (defaults shown)
.Cd sysctl vm.swapcache.accrate=100000
.Cd sysctl vm.swapcache.maxfilesize=0
.Cd sysctl vm.swapcache.maxburst=2000000000
.Cd sysctl vm.swapcache.curburst=4000000000
40TB (40 terabyte) write endurance, but see later
notes on this, it is more a minimum value.
Limiting the long term average bandwidth to 100KB/sec leads to no more
than ~9G/day of writing, which works out to an endurance of
approximately 12 years.
Endurance scales linearly with size.  The 80G version of this SSD
will have a write endurance of approximately 80TB.
.Pp
MLC SSDs have a 1000-10000x write endurance, while the lower density
higher-cost SLC SSDs have an approximately 10000-100000x write endurance.
MLC SSDs can be used for the swapcache (and swap) as long as the system
manager is cognizant of their limitations.
.Pp
.It Cd vm.swapcache.meta_enable
Turning on just
.Cd meta_enable
causes only filesystem meta-data to be cached and will result
in very fast directory operations even over millions of inodes
and even in the face of other invasive operations being run
by other processes.
.Pp
.It Cd vm.swapcache.data_enable
Turning on
.Cd data_enable
(with or without other features) allows bulk file data to be
cached.
This feature is very useful for web server operation when the
operational data set fits in swap.
The usefulness is somewhat mitigated by the maximum number
of vnodes supported by the system via
.Cd kern.maxvnodes ,
because the bulk data in the cache is lost when the related
vnode is recycled.  In this case it might be desirable to
take the plunge into running a 64-bit kernel which can support
far more vnodes.  32-bit kernels have limited kernel virtual
memory (KVM) and cannot reliably support more than around
100,000 active vnodes.  64-bit kernels can support 300,000+
active vnodes.
.Pp
Data caching is definitely more wasteful of SSD write bandwidth
than meta-data caching.  It doesn't hurt performance per se,
but may cause
.Nm
to exhaust its burst and smack against the long term average
bandwidth limit, causing the SSD to wear out at the maximum rate you
programmed.  Data caching is far less wasteful and more efficient
if (on a 64-bit system only) you provide a sufficiently large SSD and
increase
.Cd kern.maxvnodes
to cover the entire directory topology being served.
Each vnode requires about 1K of physical ram.
.Pp
When data caching is turned on you generally want to use
.Xr chflags 1
with the
.Cm cache
flag to enable data caching on a directory.
This flag is tracked by the namecache and does not need to be
recursively set in the directory tree.
Simply setting the flag in a top level directory is sufficient.
A typical setup is something like this:
.Pp
.Dl chflags cache /etc /sbin /bin /usr /home
.Dl chflags noscache /usr/obj
.Pp
Alternatively, if you have NFS mounts where chflags does not work, you
can enable caching in some parent directory, then selectively disable
it.
.Pp
.Dl chflags cache /
.Dl chflags noscache /usr/obj /tmp /var/tmp
.Pp
If that doesn't work you can turn off
.Cd vm.swapcache.use_chflags
entirely and not bother with any chflagging.
.Pp
.It Cd vm.swapcache.maxfilesize
This may be used to reduce cache thrashing when a focus on a small
potentially fragmented filespace is desired, leaving the
larger files alone.
.Pp
.It Cd vm.swapcache.minburst
This controls hysteresis and prevents nickel-and-dime write bursting.
Once
.Cd curburst
drops to zero, writing to the swapcache ceases until it has recovered
past
.Cd minburst .
The idea here is to avoid creating a heavily fragmented swapcache where
reading data from a file must alternate between the cache and the primary
filesystem.  Doing so does not save disk seeks on the primary filesystem
so we want to avoid doing small bursts.  This parameter allows us to do
larger bursts.
The larger bursts also tend to improve SSD performance as the SSD itself
can do a better job write-combining and erasing blocks.
.Pp
.It Cd vm.swapcache.maxswappct
This controls the maximum amount of swapspace
.Nm
may use, in percentage terms.
.El
.Pp
It is important to note that you should always use
.Xr disklabel64 8
to label your SSD.  Disklabel64 will properly align the base of the
partition space relative to the physical drive regardless of how badly
aligned the fdisk slice is.
This will significantly reduce write amplification and write combining
inefficiencies on the SSD.
.Pp
Finally, interleaved swap (multiple SSDs) may be used to increase
performance even further.  A single SATA SSD is typically capable of
reading 120-220MB/sec.  Configuring two SSDs for your swap will
improve aggregate swapcache read performance by 1.5x to 1.8x.
In tests with two Intel 40G SSDs 300MB/sec was easily achieved.
.Pp
At this point you will be configuring more swap space than a 32-bit
.Dx
kernel can handle (due to KVM limitations).  By default, 32-bit
.Dx
systems only support 32G of configured swap and while this limit
can be increased somewhat in
.Pa /boot/loader.conf
you should really be using a 64-bit
.Dx
kernel instead.  64-bit systems support up to 512G of swap by default
and can be boosted to up to 8TB if you are really crazy and have enough ram.
Each 1GB of swap requires around 1MB of physical memory to manage it so
the practical limit is more around 1TB of swap.
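.Pp
The memory overheads quoted in this manual page (about 1K of ram per
vnode and around 1MB of ram per 1GB of configured swap) can be turned
into a quick budget.
The following is only a rough Python sketch.
The function and constant names are made up for illustration, and the
constants are the approximations given in the text, not values read
from a running kernel.
.Bd -literal -offset indent
```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Approximations quoted in this manual page, not measured values.
RAM_PER_VNODE = 1 * KIB        # about 1K of physical ram per vnode
RAM_PER_GIB_SWAP = 1 * MIB     # around 1MB of ram per 1GB of swap


def vnode_ram_bytes(maxvnodes):
    """Estimated ram used if maxvnodes vnodes are all active."""
    return maxvnodes * RAM_PER_VNODE


def swap_mgmt_ram_bytes(swap_bytes):
    """Estimated ram needed to manage swap_bytes of configured swap."""
    return swap_bytes * RAM_PER_GIB_SWAP // GIB


if __name__ == "__main__":
    print("300,000 vnodes:", vnode_ram_bytes(300000) // MIB, "MB of ram")
    print("1TB of swap:", swap_mgmt_ram_bytes(1024 * GIB) // MIB, "MB of ram")
```
.Ed
.Pp
Under these assumptions 1TB of swap costs roughly 1GB of ram and 8TB
roughly 8GB, which is why the practical limit sits well below the
hard limit.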
.Pp
Of course, a 1TB SSD is something on the order of $3000+ as of this writing.
Even though a 1TB configuration might not be cost effective, storage levels
more in the 100-200G range certainly are.  If the machine has only a 1GigE
ethernet (100MB/s) there's no point configuring it for more SSD bandwidth.
A single SSD of the desired size would be sufficient.
.Sh INITIAL BURSTING & REPEATED BURSTING
Even though the average write bandwidth is limited it is desirable
to have a large initial burst after boot to load the cache.
.Cd curburst
is initialized to 4GB by default and you can force rebursting
by adjusting it with a sysctl.
Remember that
.Cd curburst
dynamically tracks burst and will go up and down with write activity.
.Pp
In addition there will be periods of time where the system is in
steady state and not writing to the swapcache.  During these periods
.Cd curburst
will inch back up but will not exceed
.Cd maxburst .
Thus the
.Cd maxburst
value controls how large a repeated burst can be.
.Pp
A second bursting parameter called
.Cd vm.swapcache.minburst
controls bursting when the maximum write bandwidth has been reached.
When
.Cd curburst
reaches zero write activity ceases and
.Cd curburst
is allowed to recover up to
.Cd minburst
before write activity resumes.  The recommended range for the
.Cd minburst
parameter is 1MB to 50MB.  This parameter has a relationship to
how fragmented the swapcache gets when not in a steady state.
Large bursts reduce fragmentation and reduce incidences of
excessive seeking on the hard drive.  If set too low the
swapcache will become fragmented within a single regular file
and the constant back-and-forth between the swapcache and the
hard drive will result in excessive seeking on the hard drive.
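.Pp
The interaction between
.Cd accrate ,
.Cd curburst ,
.Cd minburst
and
.Cd maxburst
can be easier to follow as a toy simulation.
The Python sketch below is only a model of the behaviour described in
this section, stepping in whole seconds.
The class name, the one second tick and the exact order of operations
are assumptions made for illustration and are not taken from the kernel
source.
.Bd -literal -offset indent
```python
class BurstGovernor:
    """Toy model of the curburst/minburst/maxburst description."""

    def __init__(self, accrate, minburst, maxburst, curburst):
        self.accrate = accrate      # bytes credited per second
        self.minburst = minburst    # level at which writing resumes
        self.maxburst = maxburst    # ceiling for recovery
        self.curburst = curburst    # current write allowance
        self.writing = curburst > 0

    def tick(self, wanted):
        """Advance one second; return the bytes actually written."""
        # Recover at accrate but never climb past maxburst.  An initial
        # curburst above maxburst is left alone until it is spent.
        if self.curburst < self.maxburst:
            self.curburst = min(self.maxburst, self.curburst + self.accrate)
        # Hysteresis: once stopped, stay stopped until minburst is reached.
        if not self.writing and self.curburst >= self.minburst:
            self.writing = True
        if not self.writing:
            return 0
        written = min(wanted, self.curburst)
        self.curburst -= written
        if self.curburst <= 0:
            self.writing = False
        return written


if __name__ == "__main__":
    gov = BurstGovernor(accrate=100, minburst=500, maxburst=1000,
                        curburst=300)
    print([gov.tick(1000) for _ in range(6)])
```
.Ed
.Pp
With a constant demand the model writes one burst, stays silent while
.Cd curburst
recovers, and then writes a single
.Cd minburst
sized burst rather than a trickle of small writes.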
.Sh SWAPCACHE SIZE & MANAGEMENT
The swapcache feature will use up to 75% of configured swap space
by default.
The remaining 25% is reserved for normal paging operation.
The system operator should configure swap space of at least 4 times
main memory and no less than 8G.
If a 40G SSD is used the recommendation is to configure 16G to 32G of
swap (note: 32-bit is limited to 32G of swap by default, for 64-bit
it is 512G of swap), and to leave the remainder unwritten and unused.
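.Pp
The sizing rules above are simple enough to write down directly.
The following Python sketch is illustrative only.
The function names are hypothetical and the numbers are the defaults
and recommendations quoted in this section.
.Bd -literal -offset indent
```python
GIB = 1024 ** 3

# Default limits on configured swap quoted in this manual page.
DEFAULT_SWAP_LIMIT = {32: 32 * GIB, 64: 512 * GIB}


def min_recommended_swap(ram_bytes):
    """At least 4 times main memory and never less than 8G."""
    return max(4 * ram_bytes, 8 * GIB)


def swapcache_share(swap_bytes, maxswappct=75):
    """Bytes swapcache may occupy; the rest is left for normal paging."""
    return swap_bytes * maxswappct // 100


def fits_default_limit(swap_bytes, kernel_bits):
    """True if swap_bytes is within the default limit for the kernel."""
    return swap_bytes <= DEFAULT_SWAP_LIMIT[kernel_bits]


if __name__ == "__main__":
    swap = min_recommended_swap(4 * GIB)
    print("swap:", swap // GIB, "G")
    print("swapcache may use:", swapcache_share(swap) // GIB, "G")
```
.Ed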
.Pp
The
.Cd vm.swapcache.maxswappct
sysctl may be used to change the default.
You may have to change this default if you also use
.Xr tmpfs 5 ,
.Xr vn 4 ,
or if you have not allocated enough swap for reasonable normal paging
activity to occur (in which case you probably shouldn't be using
.Nm
anyway).
.Pp
If swapcache reaches the 75% limit it will begin tearing down swap
in linear bursts by iterating through available VM objects, until
swap space use drops to 70%.  The tear-down is limited by the rate at
which new data is written and this rate in turn is often limited
by
.Cd vm.swapcache.accrate ,
resulting in an orderly replacement of cached data and meta-data.
The limit is typically only reached when doing full data+meta-data
caching with no file size limitations and serving primarily large
files, or (on a 64-bit system) bumping
.Cd kern.maxvnodes
up to very high values.
.Sh NORMAL SWAP PAGING ACTIVITY WITH SSD SWAP
This is not a function of
.Nm
per se but instead a normal function of the system.  Most systems have
sufficient memory that they do not need to page memory to swap.  These
types of systems are the ones best suited for MLC SSD configured swap
running with a
.Nm
configuration.
Systems which modestly page to swap, in the range of a few hundred
megabytes a day worth of writing, are also well suited for MLC SSD
configured swap.  Desktops usually fall into this category even if they
page out a bit more because swap activity is governed by the actions of
a single person.
.Pp
Systems which page anonymous memory heavily when
.Nm
would otherwise be turned off are not usually well suited for MLC SSD
configured swap.  Heavy paging activity is not governed by
.Nm
bandwidth control parameters and can lead to excessive uncontrolled
writing to the MLC SSD, causing premature wearout.  You would have to
use the lower density, more expensive SLC SSD technology (which has 10x
the durability).  This isn't to say that
.Nm
would be ineffective, just that the aggregate write bandwidth required
to support the system would be too large for MLC flash technologies.
.Pp
With this caveat in mind, SSD based paging on systems with insufficient
ram can be extremely effective in extending the useful life of the system.
For example, a system with a measly 192MB of ram and SSD swap can run
a -j 8 parallel build world in a little less than twice the time it
would take if the system had 2G of ram, whereas it would take 5x to 10x
as long with normal HD based swap.
.Sh WARNINGS
I am going to repeat and expand a bit on SSD wear.
Wear on SSDs is a function of the write durability of the cells,
whether the SSD implements static or dynamic wear leveling, and
write amplification effects based on the type of write activity.
Write amplification occurs due to wasted space when the SSD must
erase and rewrite the underlying flash blocks.  For example, MLC flash
uses 128KB erase/write blocks.
.Pp
.Nm
parameters should be carefully chosen to avoid early wearout.
For example, the Intel X25V 40G SSD has a minimum write durability
of 40TB and an actual durability that can be quite a bit higher.
Generally speaking, you want to select parameters that will give you
at least 10 years of service life.
The most important parameter to control this is
.Cd vm.swapcache.accrate .
.Nm
uses a very conservative 100KB/sec default but even a small X25V
can probably handle 300KB/sec of continuous writing and still last
10 years.
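.Pp
The relationship between write durability,
.Cd vm.swapcache.accrate
and service life is plain arithmetic.
The Python sketch below assumes decimal units (1KB = 1000 bytes,
1TB = 10^12 bytes), a 365 day year and continuous writing at the full
rate, which is the worst case.
The function names are hypothetical.
.Bd -literal -offset indent
```python
KB = 10 ** 3
TB = 10 ** 12
SECONDS_PER_YEAR = 365 * 24 * 3600


def service_life_years(durability_bytes, accrate):
    """Years until durability_bytes is consumed at accrate bytes/sec."""
    return durability_bytes / (accrate * SECONDS_PER_YEAR)


def max_accrate(durability_bytes, years):
    """Highest sustained bytes/sec that still lasts the given years."""
    return durability_bytes / (years * SECONDS_PER_YEAR)


def total_written(accrate, years):
    """Bytes written after the given years at accrate bytes/sec."""
    return accrate * years * SECONDS_PER_YEAR


if __name__ == "__main__":
    print("40TB at 100KB/sec: %.1f years"
          % service_life_years(40 * TB, 100 * KB))
    print("40TB over 10 years: %.0f KB/sec"
          % (max_accrate(40 * TB, 10) / KB))
    print("300KB/sec for 10 years: %.1f TB"
          % (total_written(300 * KB, 10) / TB))
```
.Ed
.Pp
Under these assumptions the 100KB/sec default consumes the 40TB minimum
in a little over 12 years, while 300KB/sec for 10 years writes a bit
under 95TB and so relies on the actual durability being well above the
vendor minimum.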
.Pp
Depending on the wear leveling algorithm the drive uses, durability
and performance can sometimes be improved by configuring less
space (in a manufacturer-fresh drive) than the drive's probed capacity,
for example by only using 32G of a 40G SSD.
SSDs typically implement 10% more storage than advertised and
use this storage to improve wear leveling.  As cells begin to fail
this overallotment slowly becomes part of the primary storage
until it has been exhausted.  After that the SSD has basically failed.
Keep in mind that if you use a larger portion of the SSD's advertised
storage the SSD will not know if/when you decide to use less unless
appropriate TRIM commands are sent (if supported), or a low level
factory erase is issued.
.Pp
The swapcache is designed for use with SSDs configured as swap and
will generally not improve performance when a normal hard drive is used
for swap.
.Pp
.Nm smartctl
(from pkgsrc's sysutils/smartmontools) may be used to retrieve
the wear indicator from the drive.
One usually runs something like 'smartctl -d sat -a /dev/daXX'
(for AHCI/SILI/SCSI), or 'smartctl -a /dev/adXX' for NATA.  Some SSDs
(particularly the Intels) will brick the SATA port when smart operations
are done while the drive is busy with normal activity, so the tool should
only be run when the SSD is idle.
.Pp
ID 232 (0xe8) in the SMART data dump indicates available reserved
space and ID 233 (0xe9) is the wear-out meter.  Reserved space
typically starts at 100 and decrements to 10, after which the SSD
is considered to operate in a degraded mode.  The wear-out meter
typically starts at 99 and decrements to 0, after which the SSD
has failed.
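.Pp
The two thresholds can be expressed as a small classifier.
This Python sketch assumes the two attribute values have already been
copied out of the SMART dump by hand into a dictionary; it does not
parse smartctl output, and the function name and the state strings are
made up for illustration.
Treating a reserved space value of exactly 10 as degraded is one
reading of the description above.
.Bd -literal -offset indent
```python
RESERVED_SPACE_ID = 232   # 0xe8, available reserved space
WEAROUT_ID = 233          # 0xe9, wear-out meter


def ssd_state(attrs):
    """Classify an SSD from a dict of SMART id -> normalized value."""
    if attrs[WEAROUT_ID] <= 0:
        return "failed"
    if attrs[RESERVED_SPACE_ID] <= 10:
        return "degraded"
    return "ok"


if __name__ == "__main__":
    print(ssd_state({RESERVED_SPACE_ID: 100, WEAROUT_ID: 99}))
```
.Ed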
.Pp
.Nm
tends to use large 64K writes and tends to cluster multiple writes
linearly.  The SSD is able to take significant advantage of this
and write amplification effects are greatly reduced.  If we
take a 40G Intel X25V as an example the vendor specifies a write
durability of approximately 40TB, but
.Nm
should be able to squeeze out upwards of 200TB due to the fairly optimal
write clustering it does.
The theoretical limit for the Intel X25V is 400TB (10,000 erase cycles
per MLC cell, 40G drive), but the firmware doesn't do perfect static
wear leveling so the actual durability is less.
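.Pp
The figures in the previous paragraph relate to each other as follows.
This Python sketch uses decimal units and hypothetical function names.
The ratio it computes lumps write amplification and imperfect wear
leveling together, since the text does not separate the two.
.Bd -literal -offset indent
```python
GB = 10 ** 9
TB = 10 ** 12


def theoretical_durability(capacity_bytes, erase_cycles):
    """Bytes writable if every cell is worn evenly and nothing is wasted."""
    return capacity_bytes * erase_cycles


def wear_overhead(theoretical_bytes, achieved_bytes):
    """How many times more flash wear occurs than host data written."""
    return theoretical_bytes / achieved_bytes


if __name__ == "__main__":
    limit = theoretical_durability(40 * GB, 10000)
    print("theoretical limit:", limit // TB, "TB")
    print("vendor figure overhead:", wear_overhead(limit, 40 * TB))
    print("swapcache estimate overhead:", wear_overhead(limit, 200 * TB))
```
.Ed
.Pp
Read this way, the vendor figure allows for roughly 10x overhead while
the clustered writes done by
.Nm
are estimated to incur only about 2x.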
.Pp
In contrast, most filesystems directly stored on a SSD have
fairly severe write amplification effects and will have durabilities
ranging closer to the vendor-specified limit.
Power-on hours, power cycles, and read operations do not really affect
wear.
.Pp
SSDs with MLC-based flash technology are high-density, low-cost solutions
with limited write durability.  SLC-based flash technology is a low-density,
higher-cost solution with 10x the write durability of MLC.  The durability
also scales with the amount of flash storage.  SLC based flash is typically
twice as expensive per gigabyte.  From a cost perspective, SLC based flash
is at least 5x more cost effective in situations where high write
bandwidths are required (because it lasts 10x longer).  MLC is at least
2x more cost effective in situations where high write bandwidth is not
required.
When wear calculations are in years, these differences become huge, but
often the quantity of storage needed trumps the wear life so we expect most
people will be using MLC.
.Nm
is usable with both technologies.
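.Pp
The 5x and 2x cost figures above follow from two ratios: SLC costs
about twice as much per gigabyte and lasts about 10 times longer.
The Python sketch below works in relative units with MLC as the
baseline.
The function names and the normalization are assumptions for
illustration and no real prices are involved.
.Bd -literal -offset indent
```python
# Relative figures from the text, with MLC as the baseline of 1.0.
MLC = {"price_per_gb": 1.0, "durability": 1.0}
SLC = {"price_per_gb": 2.0, "durability": 10.0}


def cost_per_byte_written(flash):
    """Relative cost when the drive is replaced as soon as it wears out."""
    return flash["price_per_gb"] / flash["durability"]


def cost_per_byte_stored(flash):
    """Relative cost when wear is not the limiting factor."""
    return flash["price_per_gb"]


if __name__ == "__main__":
    write_bound = cost_per_byte_written(MLC) / cost_per_byte_written(SLC)
    capacity_bound = cost_per_byte_stored(SLC) / cost_per_byte_stored(MLC)
    print("write-bound workload, SLC advantage: %.1fx" % write_bound)
    print("capacity-bound workload, MLC advantage: %.1fx" % capacity_bound)
```
.Ed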
.Sh SEE ALSO
.Xr fstab 5 ,
.Xr disklabel64 8 ,
.Xr swapon 8
.Sh HISTORY
.Nm
first appeared in
.Dx 2.5 .
.Sh AUTHORS
.An Matthew Dillon