.\"
.\" swapcache - Cache clean filesystem data & meta-data on SSD-based swap
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.Dd February 7, 2010
.Dt SWAPCACHE 8
.Os
.Sh NAME
.Nm swapcache
.Nd a mechanism which allows the system to use fast swap to cache filesystem data and meta-data
.Sh SYNOPSIS (defaults shown)
.Cd sysctl vm.swapcache.accrate=100000
.Cd sysctl vm.swapcache.maxfilesize=0
.Cd sysctl vm.swapcache.maxburst=2000000000
.Cd sysctl vm.swapcache.curburst=4000000000
.Cd sysctl vm.swapcache.minburst=10000000
.Cd sysctl vm.swapcache.read_enable=0
.Cd sysctl vm.swapcache.meta_enable=0
.Cd sysctl vm.swapcache.data_enable=0
.Cd sysctl vm.swapcache.maxlaunder=256
.Sh DESCRIPTION
.Nm
is a system capability which allows a solid state disk (SSD) in a swap
space configuration to be used to cache clean filesystem data and meta-data
in addition to its normal function of backing anonymous memory.
.Pp
Sysctls are used to manage operational parameters and can be adjusted at
any time.  Typically a large initial burst is desired after system boot,
controlled by the initial
.Cd vm.swapcache.curburst
parameter.
This parameter is reduced as data is written to swap by the swapcache
and increased at a rate specified by
.Cd vm.swapcache.accrate .
Once this parameter reaches zero, write activity ceases until it has
recovered sufficiently for write activity to resume.
.Pp
.Cd vm.swapcache.meta_enable
enables the writing of filesystem meta-data to the swapcache.  Filesystem
meta-data is any data which the filesystem accesses via the disk device
using the buffer cache.
.Pp
.Cd vm.swapcache.data_enable
enables the writing of filesystem file-data to the swapcache.  Filesystem
file-data is any data which the filesystem accesses via a regular file.
In technical terms, this is the case when the buffer cache is used to
access a regular file through its vnode.
Please do not blindly turn on this option; see the
.Sx PERFORMANCE TUNING
section for more information.
.Pp
.Cd vm.swapcache.read_enable
enables reading from the swapcache and should be set to 1 for normal
operation.
.Pp
.Cd vm.swapcache.maxfilesize
controls which files are to be cached based on their size.
If set to a non-zero value, only files smaller than the specified size
will be cached.  Larger files will not be cached.
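.Pp
As a minimal sketch of a conservative starting point, the fragment below
enables reads and meta-data caching only.
It assumes the settings are made persistent through
.Pa /etc/sysctl.conf ;
the choice of file and of which features to enable first are illustrative
assumptions rather than recommendations taken from this page.
.Bd -literal -offset indent
# Illustrative values only; adjust for the workload.
# Allow cached pages to be read back from the swapcache.
vm.swapcache.read_enable=1
# Cache filesystem meta-data.
vm.swapcache.meta_enable=1
# Leave bulk file data uncached until the tuning notes have been read.
vm.swapcache.data_enable=0
.Ed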
.Sh PERFORMANCE TUNING
Best operation is achieved when the active data set fits within the
swapcache.
.Pp
.Bl -tag -width 4n -compact
.It Cd vm.swapcache.accrate
This specifies the burst accumulation rate in bytes per second and
ultimately controls the write bandwidth to swap averaged over a long
period of time.
This parameter must be carefully chosen to manage the write endurance of
the SSD in order to avoid wearing it out too quickly.
Even though SSDs have limited write endurance, there is a massive
cost/performance benefit to using one in a swapcache configuration.
.Pp
Let's use the Intel X25V 40G MLC SATA SSD as an example.  This device
has approximately a 40TB (40 terabyte) write endurance.
Limiting the long term average bandwidth to 100K/sec leads to no more
than ~9G/day of writing, which works out to approximately a 12 year
endurance.
Endurance scales linearly with size.  The 80G version of this SSD
will have a write endurance of approximately 80TB.
.Pp
MLC SSDs have a write endurance of approximately 1000x their capacity,
while the lower density, higher-cost SLC SSDs have a write endurance of
approximately 100000x their capacity.
MLC SSDs can be used for the swapcache (and swap)
as long as the system manager is cognizant of their limitations.
.Pp
.It Cd vm.swapcache.meta_enable
Turning on just
.Cd meta_enable
causes only filesystem meta-data to be cached and will result
in very fast directory operations even over millions of inodes.
.Pp
.It Cd vm.swapcache.data_enable
Turning on
.Cd data_enable
(with or without other features) allows bulk file data to be
cached.
This feature is very useful for web server operation when the
operational data set fits in swap.
The usefulness is somewhat mitigated by the maximum number
of vnodes supported by the system via
.Cd kern.maxvnodes ,
because the bulk data in the cache is lost when the related
vnode is recycled.  In this case it might be desirable to
take the plunge into running a 64-bit kernel which can support
far more vnodes.  32-bit kernels have limited kernel virtual
memory (KVM) and cannot reliably support more than around
100,000 active vnodes.  64-bit kernels can support 300,000+
active vnodes.
.Pp
Data caching is definitely more wasteful of SSD write bandwidth
than meta-data caching.  It doesn't hurt performance per se,
but may cause
.Nm
to exhaust its burst and smack against the long term average
bandwidth limit, causing the SSD to wear out at the maximum rate you
programmed.  Data caching is far less wasteful and more efficient
if (on a 64-bit system only) you provide a sufficiently large SSD and
increase
.Cd kern.maxvnodes
to cover the entire directory topology being served.
Each vnode requires about 1K of physical RAM.
.Pp
.It Cd vm.swapcache.maxfilesize
This may be used to reduce cache thrashing when a focus on a small,
potentially fragmented filespace is desired, leaving the
larger files alone.
.Pp
.It Cd vm.swapcache.minburst
This controls hysteresis and prevents nickel-and-dime write bursting.
Once
.Cd curburst
drops to zero, writing to the swapcache ceases until it has recovered
past
.Cd minburst .
The idea here is to avoid creating a heavily fragmented swapcache where
reading data from a file must alternate between the cache and the primary
filesystem.  Doing so does not save disk seeks on the primary filesystem,
so we want to avoid doing small bursts.  This parameter allows us to do
larger bursts.
The larger bursts also tend to improve SSD performance as the SSD itself
can do a better job write-combining and erasing blocks.
.El
.Pp
Finally, interleaved swap (multiple SSDs) may be used to increase
performance even further.  A single SATA SSD is typically capable of
reading 120-220MB/sec.  Configuring two SSDs for your swap will
improve aggregate swapcache read performance by 1.5x to 1.8x.
In tests with two Intel 40G SSDs, 300MB/sec was easily achieved.
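.Pp
A two-SSD swap configuration might be sketched in
.Xr fstab 5
as follows.
The device names are hypothetical and must be replaced with the actual swap
partitions on each SSD; the sketch assumes the system interleaves across all
active swap devices.
.Bd -literal -offset indent
# Device        Mountpoint  FStype  Options  Dump  Pass#
/dev/da0s1b     none        swap    sw       0     0
/dev/da1s1b     none        swap    sw       0     0
.Ed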
.Pp
At this point you will be configuring more swap space than a 32-bit
.Dx
kernel can handle (due to KVM limitations).  By default, 32-bit
.Dx
systems only support 32G of configured swap and while this limit
can be increased somewhat in
.Pa /boot/loader.conf
you should really be using a 64-bit
.Dx
kernel instead.  64-bit systems support up to 512G of swap by default
and can be boosted to up to 8TB if you are really crazy and have enough RAM.
Each 1GB of swap requires around 1MB of physical memory to manage it, so
the practical 'reasonable' limit is closer to 1TB of swap.
.Pp
Of course, a 1TB SSD is something on the order of $3000+ as of this writing.
Even though these quantities might not be cost effective, storage levels
more in the 100-200G range certainly are.  If the machine has only a 1GigE
ethernet (100MB/s), there's no point configuring it for more SSD bandwidth.
A single SSD of the desired size would be sufficient.
.Sh INITIAL BURSTING & REPEATED BURSTING
Even though the average write bandwidth is limited, it is desirable
to have a large initial burst after boot to load the cache.
.Cd curburst
is initialized to 4GB by default and you can force rebursting
by adjusting it with a sysctl.
Remember that
.Cd curburst
dynamically tracks burst: it goes down as data is written to the
swapcache and back up at the rate set by
.Cd accrate .
.Pp
In addition there will be periods of time where the system is in
steady state and not writing to the swapcache.  During these periods
.Cd curburst
will inch back up but will not exceed
.Cd maxburst .
Thus the
.Cd maxburst
value controls how large a repeated burst can be.
.Pp
A second bursting parameter called
.Cd vm.swapcache.minburst
controls bursting when the maximum write bandwidth has been reached.
When
.Cd curburst
reaches zero, write activity ceases and
.Cd curburst
is allowed to recover up to
.Cd minburst
before write activity resumes.  The recommended range for the
.Cd minburst
parameter is 1MB to 50MB.  This parameter has a relationship to
how fragmented the swapcache gets when not in a steady state.
Large bursts reduce fragmentation and reduce incidences of
excessive seeking on the hard drive.  If set too low, the
swapcache will become fragmented within a single regular file
and the constant back-and-forth between the swapcache and the
hard drive will result in excessive seeking on the hard drive.
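.Pp
As a rough worked example, the recovery times implied by the default values
shown in the SYNOPSIS can be estimated as follows.
This assumes no swapcache writes occur while
.Cd curburst
is recovering; actual behavior depends on the workload.
.Bd -literal -offset indent
time to resume writing after curburst hits zero:
    minburst / accrate =   10000000 / 100000 =   100 seconds

time for curburst to climb from zero to maxburst:
    maxburst / accrate = 2000000000 / 100000 = 20000 seconds
                                             (about 5.6 hours)
.Ed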
.Sh SWAPCACHE SIZE & MANAGEMENT
The swapcache feature will use up to 75% of configured swap space.
The remaining 25% is reserved for normal paging operation.
The system operator should configure swap space of at least 4 times
main memory, and no less than 8G.
If a 40G SSD is used the recommendation is to configure 16G to 32G of
swap (note: 32-bit is limited to 32G of swap by default, for 64-bit
it is 512G of swap).
.Pp
If swapcache reaches the 75% limit it will begin tearing down swap
in linear bursts by iterating through available VM objects, until
swap space use drops to 70%.  The tear-down is limited by the rate at
which new data is written and this rate in turn is often limited
by
.Cd vm.swapcache.accrate ,
resulting in an orderly replacement of cached data and meta-data.
The limit is typically only reached when doing full data+meta-data
caching with no file size limitations and serving primarily large
files, or (on a 64-bit system) bumping
.Cd kern.maxvnodes
up to very high values.
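.Pp
To make these figures concrete, consider a hypothetical machine with 8G of
main memory and 32G of swap configured on an SSD.
The machine and its sizes are assumptions chosen purely for illustration.
.Bd -literal -offset indent
minimum recommended swap:  4 x 8G      = 32G
swapcache limit:           75% of 32G  = 24G
tear-down target:          70% of 32G  = 22.4G
memory to manage swap:     32 x ~1MB   = ~32MB
.Ed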
.Sh NORMAL SWAP PAGING ACTIVITY WITH SSD SWAP
This is not a function of
.Nm
per se but instead a normal function of the system.  Most systems have
sufficient memory that they do not need to page memory to swap.  These
types of systems are the ones best suited for MLC SSD configured swap
running with a
.Nm
configuration.
Systems which modestly page to swap, in the range of a few hundred
megabytes a day worth of writing, are also well suited for MLC SSD
configured swap.  Desktops usually fall into this category even if they
page out a bit more because swap activity is governed by the actions of
a single person.
.Pp
Systems which page anonymous memory heavily when
.Nm
would otherwise be turned off are not usually well suited for MLC SSD
configured swap.  Heavy paging activity is not governed by
.Nm
bandwidth control parameters and can lead to excessive uncontrolled
writing to the MLC SSD, causing premature wearout.  You would have to
use the lower density, more expensive SLC SSD technology (which has 100x
the durability per GB).  This isn't to say that
.Nm
would be ineffective, just that the aggregate write bandwidth required
to support the system would be too large for MLC flash technologies.
.Pp
With these caveats in mind, SSD-based paging on systems with insufficient
RAM can be extremely effective in extending the useful life of the system.
For example, a system with a measly 192MB of RAM can run a -j 8 parallel
build world in a little less than twice the time it would take if the system
had 2G of RAM when SSD swap is configured, whereas it would take 5x to 10x
as long with normal HD-based swap.
.Sh WARNING
SSDs have limited durability and
.Nm
parameters should be carefully chosen to avoid early wearout.
For example, the Intel X25V 40G SSD has a nominal 40TB (terabyte)
write durability.
Generally speaking you want to select parameters that will give you
at least 5 years of service life.  10 years is a good compromise.
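.Pp
As a rough worked example, the service life implied by a given
.Cd vm.swapcache.accrate
can be estimated by dividing the write durability by the rate.
The figures below assume 1TB = 10^12 bytes and that the swapcache writes
continuously at the full rate (the worst case); they ignore normal paging
and any write amplification inside the drive.
.Bd -literal -offset indent
service life = write durability / accrate

default accrate on a 40TB device:
    40TB / 100000 bytes/sec = 400000000 sec
                            = ~4630 days = ~12.7 years

accrate for a 10 year target:
    40TB / (10 x 365 x 86400 sec) = ~126800 bytes/sec

accrate for a 5 year target:
    40TB / ( 5 x 365 x 86400 sec) = ~253700 bytes/sec
.Ed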
.Pp
Durability typically scales with size and also depends on the
wear-leveling algorithm used by the device.  Durability can often
be improved by configuring less space (in a manufacturer-fresh drive)
than the drive's capacity, for example by only using 32G of a 40G
SSD.
.Pp
The swapcache is designed for use with SSDs configured as swap and
will generally not improve performance when a normal hard drive is used
for swap.
.Pp
.Nm smartctl
(from pkgsrc's smartmontools) may be used to retrieve the wear indicator
from the drive.
One usually runs something like 'smartctl -d sat -a /dev/daXX'
(for AHCI/SILI/SCSI), or 'smartctl -a /dev/adXX' for NATA.  Many SSDs
will brick the SATA port when SMART operations are done while the drive
is busy with normal activity, so the tool should only be run when the
SSD is idle.
.Pp
The wear-out meter is entry 233 (0xe9) in the list.
It usually starts at 99 and decrements over time until it reaches 0, at
which point writes to the SSD will begin failing.
Wear on SSDs is a function only of the amount of data written (the write
durability).
Power-on hours, power cycles, and read operations do not affect wear.
.Pp
SSDs with MLC-based flash technology are high-density, low-cost solutions
with limited write durability.  SLC-based flash technology is a low-density,
higher-cost solution with 100x the write durability of MLC.  The durability
also scales with the amount of flash storage, with SLC-based flash typically
twice as expensive per gigabyte.  From a cost perspective, SLC-based flash
is 50x more cost effective in situations where high write bandwidths are
required.  MLC is at least 2x more cost effective in situations where
high write bandwidths are not required.
.Nm
is usable with both technologies.
.Sh SEE ALSO
.Xr fstab 5 ,
.Xr swapon 8
.Sh HISTORY
.Nm
first appeared in
.Dx 2.5 .
.Sh AUTHORS
.An Matthew Dillon