share/man/man8/swapcache.8

.\"
.\" swapcache - Cache clean filesystem data & meta-data on SSD-based swap
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.Dd February 7, 2010
.Dt SWAPCACHE 8
.Os
.Sh NAME
.Nm swapcache
.Nd a mechanism to use fast swap to cache filesystem data and meta-data
.Sh SYNOPSIS
.Cd sysctl vm.swapcache.accrate=100000
.Cd sysctl vm.swapcache.maxfilesize=0
.Cd sysctl vm.swapcache.maxburst=2000000000
.Cd sysctl vm.swapcache.curburst=4000000000
.Cd sysctl vm.swapcache.minburst=10000000
.Cd sysctl vm.swapcache.read_enable=0
.Cd sysctl vm.swapcache.meta_enable=0
.Cd sysctl vm.swapcache.data_enable=0
.Cd sysctl vm.swapcache.use_chflags=1
.Cd sysctl vm.swapcache.maxlaunder=256
.Cd sysctl vm.swapcache.hysteresis=(vm.stats.vm.v_inactive_target/2)
.Sh DESCRIPTION
.Nm
is a system capability which allows a solid state disk (SSD) in a swap
space configuration to be used to cache clean filesystem data and meta-data
in addition to its normal function of backing anonymous memory.
.Pp
Sysctls are used to manage operational parameters and can be adjusted at
any time.
Typically a large initial burst is desired after system boot,
controlled by the initial
.Va vm.swapcache.curburst
parameter.
This parameter is reduced as data is written to swap by the swapcache
and increased at a rate specified by
.Va vm.swapcache.accrate .
Once this parameter reaches zero, write activity ceases until it has
recovered sufficiently for write activity to resume.
.Pp
.Va vm.swapcache.meta_enable
enables the writing of filesystem meta-data to the swapcache.
Filesystem
meta-data is any data which the filesystem accesses via the disk device
using the buffer cache.
Meta-data is cached globally regardless of file or directory flags.
.Pp
.Va vm.swapcache.data_enable
enables the writing of clean filesystem file-data to the swapcache.
Filesystem file-data is any data which the filesystem accesses via a
regular file.
In technical terms, this is any data accessed when the buffer cache is
used to access a regular file through its vnode.
Do not blindly turn on this option; see the
.Sx PERFORMANCE TUNING
section for more information.
.Pp
.Va vm.swapcache.use_chflags
enables the use of the
.Va cache
and
.Va noscache
.Xr chflags 1
flags to control which files will be data-cached.
If this sysctl is disabled and
.Va data_enable
is enabled, the system will ignore file flags and attempt to
swapcache all regular files.
.Pp
.Va vm.swapcache.read_enable
enables reading from the swapcache and should be set to 1 for normal
operation.
.Pp
.Va vm.swapcache.maxfilesize
controls which files are to be cached based on their size.
If set to a non-zero value, only files smaller than the specified size
will be cached.
Larger files will not be cached.
.Pp
.Va vm.swapcache.maxlaunder
controls the maximum number of clean VM pages which will be added to
the swap cache and written out to swap on each poll.
Swapcache polls ten times a second.
.Pp
.Va vm.swapcache.hysteresis
controls how many pages swapcache waits to be added to the inactive page
queue before continuing its scan.
Once it decides to scan it continues subject to the above limitations
until it reaches the end of the inactive page queue.
This parameter is designed to make swapcache generate bulkier bursts
to swap, which helps SSDs reduce write amplification effects.
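.Pp
As a minimal sketch of how these sysctls might be combined, the commands
below enable reads from the swapcache and meta-data caching only, leaving
data caching off.
The choice of which features to enable is an assumption made for
illustration, not a recommended configuration.
.Bd -literal -offset indent
# Allow cached pages to be read back from swap.
sysctl vm.swapcache.read_enable=1
# Cache filesystem meta-data only; data_enable stays at 0.
sysctl vm.swapcache.meta_enable=1
# Inspect the current burst budget.
sysctl vm.swapcache.curburst
.Ed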
.Sh PERFORMANCE TUNING
Best operation is achieved when the active data set fits within the
swapcache.
.Pp
.Bl -tag -width 4n -compact
.It Va vm.swapcache.accrate
This specifies the burst accumulation rate in bytes per second and
ultimately controls the write bandwidth to swap averaged over a long
period of time.
This parameter must be carefully chosen to manage the write endurance of
the SSD in order to avoid wearing it out too quickly.
Even though SSDs have limited write endurance, there is massive
cost/performance benefit to using one in a swapcache configuration.
.Pp
Let's use the old Intel X25V 40GB MLC SATA SSD as an example.
This device has a write endurance of approximately 40TB (40 terabytes),
though as noted later this is more of a minimum value.
Limiting the long term average bandwidth to 100KB/sec leads to no more
than ~9GB/day of writing, which works out to an endurance of
approximately 12 years.
Endurance scales linearly with size.
The 80GB version of this SSD
will have a write endurance of approximately 80TB.
.Pp
MLC SSDs have a 1000-10000x write endurance, while the lower density
higher-cost SLC SSDs have a 10000-100000x write endurance, approximately.
MLC SSDs can be used for the swapcache (and swap) as long as the system
manager is cognizant of their limitations.
However, over the years tests have shown that SLC SSDs do not really live
up to their hype and are no more reliable than MLC SSDs.  Instead of
worrying about SLC vs MLC, just use MLC (or TLC or whatever), leave
more space unpartitioned which the SSD can utilize to improve durability,
and be cognizant of the SSD's rate of wear.
.Pp
.It Va vm.swapcache.meta_enable
Turning on just
.Va meta_enable
causes only filesystem meta-data to be cached and will result
in very fast directory operations even over millions of inodes
and even in the face of other invasive operations being run
by other processes.
.Pp
For
.Nm HAMMER
filesystems meta-data includes the B-Tree, directory entries,
and data related to tiny files.
Approximately 6 GB of swapcache is needed
for every 14 million or so inodes cached, effectively giving one the
ability to cache all the meta-data in a multi-terabyte filesystem using
a fairly small SSD.
.Pp
.It Va vm.swapcache.data_enable
Turning on
.Va data_enable
(with or without other features) allows bulk file data to be cached.
This feature is very useful for web server operation when the
operational data set fits in swap.
However, care must be taken to avoid thrashing the swapcache.
In almost all cases you will want to leave chflags mode enabled
and use 'chflags cache' on governing directories to control which
directory subtrees file data should be cached for.
.Pp
DragonFly uses generously large kern.maxvnodes values,
typically in excess of 400K vnodes, but large numbers
of small files can still cause problems for swapcache.
When operating on a filesystem containing a large number of
small files, vnode recycling by the kernel will cause related
swapcache data to be lost and also cause the swapcache to
potentially thrash.
Cache thrashing due to vnode recycling can occur whether chflags
mode is used or not.
.Pp
To solve the thrashing problem you can turn on HAMMER's
double buffering feature via
.Va vfs.hammer.double_buffer .
This causes HAMMER to cache file data via its block device.
HAMMER cannot avoid also caching file data via individual vnodes
but will try to expire the second copy more quickly (hence
why it is called double buffer mode).
The key point here is
that
.Nm
will only cache the data blocks via the block device when
double_buffer mode is used and since the block device is associated
with the mount, vnode recycling will not mess with it.
This allows the data for any number (potentially millions) of files to
be swapcached.
You still should use chflags mode to control the size of the dataset
being cached to remain under 75% of configured swap space.
.Pp
Data caching is definitely more wasteful of the SSD's write durability
than meta-data caching.
If not carefully managed the swapcache may exhaust its burst and smack
against the long term average bandwidth limit, causing the SSD to wear
out at the maximum rate you programmed.
Data caching is far less wasteful and more efficient
if you provide a sufficiently large SSD.
.Pp
When caching large data sets you may want to use a medium-sized SSD
with good write performance instead of a small SSD to accommodate
the higher burst write rate data caching incurs and to reduce
interference between reading and writing.
Write durability also tends to scale with larger SSDs, but keep in mind
that newer flash technologies use smaller feature sizes on-chip
which reduce the write durability of the chips, so pay careful attention
to the type of flash employed by the SSD when making durability
assumptions.
For example, an Intel X25-V only has 40MB/s in write performance
and burst writing by swapcache will seriously interfere with
concurrent read operation on the SSD.
The 80GB X25-M on the other hand has double the write performance.
Higher-capacity and larger form-factor SSDs tend to have better
write-performance.
But the Intel 310 series SSDs use flash chips with a smaller feature
size so an 80G 310 series SSD will wind up with a durability relatively
close to the older 40G X25-V.
.Pp
When data caching is turned on you can fine-tune what gets swapcached
by also turning on swapcache's chflags mode and using
.Xr chflags 1
with the
.Va cache
flag to enable data caching on a directory-tree (recursive) basis.
This flag is tracked by the namecache and does not need to be
recursively set in the directory tree.
Simply setting the flag in a top level directory or mount point
is usually sufficient.
However, the flag does not track across mount points.
A typical setup is something like this:
.Pp
.Dl chflags cache /etc /sbin /bin /usr /home
.Dl chflags noscache /usr/obj
.Pp
It is possible to tell
.Nm
to ignore the cache flag by setting
.Va vm.swapcache.use_chflags
to zero.
In many situations it is convenient to simply not use chflags mode, but
if you have numerous mixed SSDs and HDDs you may want to use this flag
to enable swapcache on the HDDs and disable it on the SSDs even if
you do not care about fine-grained control.
.Pp
Filesystems such as NFS which do not support flags generally
have a
.Va cache
mount option which enables swapcache operation on the mount.
.Pp
.It Va vm.swapcache.maxfilesize
This may be used to reduce cache thrashing when a focus on a small
potentially fragmented filespace is desired, leaving the
larger (more linearly accessed) files alone.
.Pp
.It Va vm.swapcache.minburst
This controls hysteresis and prevents nickel-and-dime write bursting.
Once
.Va curburst
drops to zero, writing to the swapcache ceases until it has recovered past
.Va minburst .
The idea here is to avoid creating a heavily fragmented swapcache where
reading data from a file must alternate between the cache and the primary
filesystem.
Doing so does not save disk seeks on the primary filesystem
so we want to avoid doing small bursts.
This parameter allows us to do larger bursts.
The larger bursts also tend to improve SSD performance as the SSD itself
can do a better job write-combining and erasing blocks.
.Pp
.It Va vm.swapcache.maxswappct
This controls the maximum amount of swap space
.Nm
may use, in percentage terms.
The default is 75%, leaving the remaining 25% of swap available for normal
paging operations.
.El
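.Pp
As a hedged sketch of the data-caching setup described for
.Va vm.swapcache.data_enable
above, the commands below combine HAMMER double buffering with chflags mode.
The value 1 for
.Va vfs.hammer.double_buffer
and the choice of
.Pa /home
as the cached subtree are assumptions made for illustration.
.Bd -literal -offset indent
# Cache file data via the block device so vnode recycling does not evict it.
sysctl vfs.hammer.double_buffer=1
# Enable data caching, restricted to flagged directory trees.
sysctl vm.swapcache.use_chflags=1
sysctl vm.swapcache.data_enable=1
sysctl vm.swapcache.read_enable=1
chflags cache /home
.Ed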
.Pp
It is important to ensure that your swap partition is nicely aligned.
The standard DragonFly
.Xr disklabel 8
program guarantees high alignment (~1MB) automatically.
Swap on HDDs benefits because HDDs tend to use a larger physical sector size
than 512 bytes, and proper alignment for SSDs will reduce write amplification
and write-combining inefficiencies.
.Pp
Finally, interleaved swap (multiple SSDs) may be used to increase
swap and swapcache performance even further.
A single SATA-II SSD is typically capable of reading 120-220MB/sec.
Configuring two SSDs for your swap will
improve aggregate swapcache read performance by 1.5x to 1.8x.
In tests with two Intel 40GB SSDs 300MB/sec was easily achieved.
With two SATA-III SSDs it is possible to achieve 600MB/sec or better
and well over 400MB/sec random-read performance (versus the ~3MB/sec
random read performance a hard drive gives you).
Faster SATA interfaces or newer NVMe technologies have significantly
more read bandwidth (3GB/sec+ for NVMe), but may still lag on the
write bandwidth.
With newer technologies, one swap device is usually plenty.
.Pp
.Dx
defaults to a maximum of 512G of configured swap.
Keep in mind that each 1GB of actually configured swap requires
approximately 1MB of wired RAM to manage.
.Pp
In addition there will be periods of time where the system is in
steady state and not writing to the swapcache.
During these periods
.Va curburst
will inch back up but will not exceed
.Va maxburst .
Thus the
.Va maxburst
value controls how large a repeated burst can be.
Remember that
.Va curburst
dynamically tracks burst and will go up and down depending on write activity.
.Pp
A second bursting parameter called
.Va vm.swapcache.minburst
controls bursting when the maximum write bandwidth has been reached.
When
.Va curburst
reaches zero, write activity ceases and
.Va curburst
is allowed to recover up to
.Va minburst
before write activity resumes.
The recommended range for the
.Va minburst
parameter is 1MB to 50MB.
This parameter has a relationship to
how fragmented the swapcache gets when not in a steady state.
Large bursts reduce fragmentation and reduce incidences of
excessive seeking on the hard drive.
If set too low the
swapcache will become fragmented within a single regular file
and the constant back-and-forth between the swapcache and the
hard drive will result in excessive seeking on the hard drive.
.Sh SWAPCACHE SIZE & MANAGEMENT
The swapcache feature will use up to 75% of configured swap space
by default.
The remaining 25% is reserved for normal paging operations.
The system operator should configure swap space of at least 4 times
main memory and no less than 8GB.
A typical 128GB SSD might use 64GB for boot + base and 56GB for
swap, with 8GB left unpartitioned.  The system might then have a large
additional hard drive for bulk data.
Even with many packages installed, 64GB is comfortable for
boot + base.
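.Pp
As a worked illustration of the example layout above, using only the
default 75% limit, the 4x guideline, and the roughly 1MB of wired RAM
per 1GB of swap mentioned earlier (all figures are approximate):
.Bd -literal -offset indent
swap configured                 = 56GB
usable by swapcache (75%)       = 56GB * 0.75 = 42GB
reserved for paging (25%)       = 56GB * 0.25 = 14GB
main memory covered by 4x rule  = 56GB / 4    = 14GB
wired RAM to manage the swap    = 56 * ~1MB   = ~56MB
.Ed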
.Pp
When configuring a SSD that will be used for swap or swapcache
it is a good idea to leave around 10% unpartitioned to improve
the SSD's durability.
.Pp
You do not need to use swapcache if you have no hard drives in the
system, though in fact swapcache can help if you use NFS heavily
as a client.
.Pp
The
.Va vm.swapcache.maxswappct
sysctl may be used to change the default 75% limit.
You may have to change this default if you also use
.Xr tmpfs 5 ,
.Xr vn 4 ,
or if you have not allocated enough swap for reasonable normal paging
activity to occur (in which case you probably shouldn't be using
.Nm
anyway).
.Pp
If swapcache reaches the 75% limit it will begin tearing down swap
in linear bursts by iterating through available VM objects, until
swap space use drops to 70%.
The tear-down is limited by the rate at
which new data is written and this rate in turn is often limited by
.Va vm.swapcache.accrate ,
resulting in an orderly replacement of cached data and meta-data.
The limit is typically only reached when doing full data+meta-data
caching with no file size limitations and serving primarily large
files, or bumping
.Va kern.maxvnodes
up to very high values.
.Sh NORMAL SWAP PAGING ACTIVITY WITH SSD SWAP
This is not a function of
.Nm
per se but instead a normal function of the system.
Most systems have
sufficient memory that they do not need to page memory to swap.
These types of systems are the ones best suited for MLC SSD
configured swap running with a
.Nm
configuration.
Systems which modestly page to swap, in the range of a few hundred
megabytes a day worth of writing, are also well suited for MLC SSD
configured swap.
Desktops usually fall into this category even if they
page out a bit more because swap activity is governed by the actions of
a single person.
.Pp
Systems which page anonymous memory heavily when
.Nm
would otherwise be turned off are not usually well suited for MLC SSD
configured swap.
Heavy paging activity is not governed by
.Nm
bandwidth control parameters and can lead to excessive uncontrolled
writing to the SSD, causing premature wearout.
This isn't to say that
.Nm
would be ineffective, just that the aggregate write bandwidth required
to support the system might be too large to be cost-effective for a SSD.
.Pp
With this caveat in mind, SSD based paging on systems with insufficient
RAM can be extremely effective in extending the useful life of the system.
For example, a system with a measly 192MB of RAM and SSD swap can run
a -j 8 parallel build world in a little less than twice the time it
would take if the system had 2GB of RAM, whereas it would take 5x to 10x
as long with normal HDD based swap.
.Sh USING SWAPCACHE WITH NORMAL HARD DRIVES
Although
.Nm
is designed to work with SSD-based storage it can also be used with
HD-based storage as an aid for offloading the primary storage system.
Here we need to make a distinction between using RAID for fanning out
storage versus using RAID for redundancy.  There are numerous situations
where RAID-based redundancy does not make sense.
.Pp
A good example would be in an environment where the servers themselves
are redundant and can suffer a total failure without affecting
ongoing operations.  When the primary storage requirements easily fit onto
a single large-capacity drive it doesn't make a whole lot of sense to
use RAID if your only desire is to improve performance.  If you had a farm
of, say, 20 servers supporting the same facility, adding RAID to each one
would not accomplish anything other than to bloat your deployment and
maintenance costs.
.Pp
In these sorts of situations it may be desirable and convenient to have
the primary filesystem for each machine on a single large drive and then
use the
.Nm
facility to offload the drive and make the machine more effective without
actually distributing the filesystem itself across multiple drives.
For the purposes of offloading, while a SSD would be the most effective
from a performance standpoint, a second medium sized HD with its much lower
cost and higher capacity might actually be more cost effective.
.Sh EXPLANATION OF STATIC VS DYNAMIC WEAR LEVELING, AND WRITE-COMBINING
Modern SSDs keep track of space that has never been written to.
This would also include space freed up via TRIM, but simply not
touching a bit of storage in a factory fresh SSD works just as well.
Once you touch (write to) the storage all bets are off, even if
you reformat/repartition later.  It takes sending the SSD a
whole-device TRIM command or special format command to take it back
to its factory-fresh condition (sans wear already present).
.Pp
SSDs have wear leveling algorithms which are responsible for trying
to even out the erase/write cycles across all flash cells in the
storage.  The better a job the SSD can do the longer the SSD will
remain usable.
.Pp
The more unused storage there is from the SSD's point of view the
easier a time the SSD has running its wear leveling algorithms.
Basically the wear leveling algorithm in a modern SSD (say Intel or OCZ)
uses a combination of static and dynamic leveling.  Static is the
best, allowing the SSD to reuse flash cells that have not been
erased very much by moving static (unchanging) data out of them and
into other cells that have more wear.  Dynamic wear leveling involves
writing data to available flash cells and then marking the cells containing
the previous copy of the data as being free/reusable.  Dynamic wear leveling
is the worst kind but the easiest to implement.  Modern SSDs use a combination
of both algorithms and also do write-combining.
.Pp
USB sticks often use only dynamic wear leveling and have short life spans
because of that.
.Pp
In any case, any unused space in the SSD effectively makes the dynamic
wear leveling the SSD does more efficient by giving the SSD more 'unused'
space above and beyond the physical space it reserves beyond its stated
storage capacity to cycle data through, so the SSD lasts longer in theory.
.Pp
Write-combining is a feature whereby the SSD is able to reduce write
amplification effects by combining OS writes of smaller, discrete,
non-contiguous logical sectors into a single contiguous 128KB physical
flash block.
.Pp
On the flip side write-combining also results in more complex lookup tables
which can become fragmented over time and reduce the SSD's read performance.
Fragmentation can also occur when write-combined blocks are rewritten
piecemeal.
Modern SSDs can regain the lost performance by de-combining previously
write-combined areas as part of their static wear leveling algorithm, but
at the cost of extra write/erase cycles which slightly increase write
amplification effects.
Operating systems can also help maintain the SSD's performance by utilizing
larger blocks.
Write-combining results in a net-reduction
of write-amplification effects but due to having to de-combine later and
other fragmentary effects it isn't 100%.
From testing with Intel devices write-amplification can be well controlled
in the 2x-4x range with the OS doing 16K writes, versus a worst-case
8x write-amplification with 16K blocks, 32x with 4K blocks, and a truly
horrid worst-case with 512 byte blocks.
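.Pp
The worst-case figures above follow from dividing the 128KB physical
flash block by the size of each OS write, assuming every small write
forces a whole block to be rewritten.
The 512 byte figure is derived the same way and is not a measured value:
.Bd -literal -offset indent
128KB / 16KB  =   8x
128KB /  4KB  =  32x
128KB / 512B  = 256x
.Ed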
.Pp
The
.Dx
.Nm
feature utilizes 64K-128K writes and is specifically designed to minimize
write amplification and write-combining stresses.
In terms of placing an actual filesystem on the SSD, the
.Dx
.Xr hammer 8
filesystem utilizes 16K blocks and is well behaved as long as you limit
reblocking operations.
For UFS you should create the filesystem with at least a 4K fragment
size, versus the default 2K.
Modern Windows filesystems use 4K clusters but it is unclear how SSD-friendly
NTFS is.
.Sh EXPLANATION OF FLASH CHIP FEATURE SIZE VS ERASE/REWRITE CYCLE DURABILITY
Manufacturers continue to produce flash chips with smaller feature sizes.
Smaller flash cells mean reduced erase/rewrite cycle durability which in
turn reduces the durability of the SSD.
.Pp
The older 34nm flash typically had a durability of 10,000 erase cycles
per cell while the newer 25nm flash is closer to 1000.  The newer flash
uses larger ECCs and more
sensitive voltage comparators on-chip to increase the durability closer to
3000 cycles.  Generally speaking you should assume a durability of around
1/3 for the same storage capacity using the new chips versus the older
chips.  If you can squeeze out a 400TB durability from an older 40GB X25-V
using 34nm technology then you should assume around a 400TB durability from
a newer 120GB 310 series SSD using 25nm technology.
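.Pp
The comparison above can be checked with a rough calculation, assuming
ideal wear leveling and no write amplification, so that total writes
equal capacity times erase cycles per cell:
.Bd -literal -offset indent
40GB  X25-V,      34nm:  40GB * 10,000 cycles = 400TB
120GB 310 series, 25nm: 120GB *  3,000 cycles = 360TB (roughly 400TB)
.Ed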
.Sh WARNINGS
I am going to repeat and expand a bit on SSD wear.
Wear on SSDs is a function of the write durability of the cells,
whether the SSD implements static or dynamic wear leveling (or both),
write amplification effects when the OS does not issue write-aligned 128KB
ops or when the SSD is unable to write-combine adjacent logical sectors,
or if the SSD has a poor write-combining algorithm for non-adjacent sectors.
In addition some additional erase/rewrite activity occurs from cleanup
operations the SSD performs as part of its static wear leveling algorithms
and its write-decombining algorithms (necessary to maintain performance over
time).  MLC flash uses 128KB physical write/erase blocks while SLC flash
typically uses 64KB physical write/erase blocks.
.Pp
The algorithms the SSD implements in its firmware are probably the most
important part of the device and a major differentiator between e.g. SATA
and USB-based SSDs.  SATA form factor drives will universally be far superior
to USB storage sticks.
SSDs can also have wildly different wearout rates and wildly different
performance curves over time.
For example the performance of a SSD which does not implement
write-decombining can seriously degrade over time as its lookup
tables become severely fragmented.
For the purposes of this manual page we are primarily using Intel and OCZ
drives when describing performance and wear issues.
.Pp
.Nm
parameters should be carefully chosen to avoid early wearout.
For example, the Intel X25V 40GB SSD has a minimum write durability
of 40TB and an actual durability that can be quite a bit higher.
Generally speaking, you want to select parameters that will give you
at least 10 years of service life.
The most important parameter to control this is
.Va vm.swapcache.accrate .
.Nm
uses a very conservative 100KB/sec default but even a small X25V
can probably handle 300KB/sec of continuous writing and still last 10 years.
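.Pp
The arithmetic behind these figures can be sketched as follows, using
decimal units and assuming the write rate is sustained continuously:
.Bd -literal -offset indent
100KB/sec * 86400 sec/day = ~8.6GB/day
40TB / 8.6GB/day          = ~4600 days = ~12.7 years
300KB/sec * 86400 sec/day = ~25.9GB/day
25.9GB/day * 3650 days    = ~95TB over 10 years
.Ed
.Pp
Note that ~95TB exceeds the 40TB minimum durability, so the 300KB/sec
figure relies on the actual durability being well above the vendor minimum.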
.Pp
Depending on the wear leveling algorithm the drive uses, durability
and performance can sometimes be improved by configuring less
space (in a manufacturer-fresh drive) than the drive's probed capacity.
For example, one might use only 32GB of a 40GB SSD.
SSDs typically implement 10% more storage than advertised and
use this storage to improve wear leveling.
As cells begin to fail
this overallotment slowly becomes part of the primary storage
until it has been exhausted.
After that the SSD has basically failed.
Keep in mind that if you use a larger portion of the SSD's advertised
storage the SSD will not know if/when you decide to use less unless
appropriate TRIM commands are sent (if supported), or a low level
factory erase is issued.
.Pp
.Nm smartctl
(from
.Xr dports 7 Ap s
.Pa sysutils/smartmontools )
may be used to retrieve the wear indicator from the drive.
One usually runs something like
.Ql smartctl -d sat -a /dev/daXX
(for AHCI/SILI/SCSI), or
.Ql smartctl -a /dev/adXX
for NATA.
Some SSDs
(particularly the Intels) will brick the SATA port when SMART operations
are done while the drive is busy with normal activity, so the tool should
only be run when the SSD is idle.
.Pp
ID 232 (0xe8) in the SMART data dump indicates available reserved
space and ID 233 (0xe9) is the wear-out meter.
Reserved space
typically starts at 100 and decrements to 10, after which the SSD
is considered to operate in a degraded mode.
The wear-out meter typically starts at 99 and decrements to 0,
after which the SSD has failed.
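.Pp
As a small sketch, the two attributes can be picked out of the dump with
.Xr grep 1 .
The device name is a placeholder, and the pattern assumes the attribute ID
is the first field of each attribute row, which may vary between
smartmontools versions:
.Bd -literal -offset indent
# Run only while the SSD is idle, as noted above.
smartctl -d sat -a /dev/daXX | grep -E '^ *(232|233) '
.Ed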
.Pp
.Nm
tends to use large 64KB writes and tends to cluster multiple writes
linearly.
The SSD is able to take significant advantage of this
and write amplification effects are greatly reduced.
If we take a 40GB Intel X25V as an example the vendor specifies a write
durability of approximately 40TB, but
.Nm
should be able to squeeze out upwards of 200TB due to the fairly optimal
write clustering it does.
The theoretical limit for the Intel X25V is 400TB (10,000 erase cycles
per MLC cell, 40GB drive, with 34nm technology), but the firmware doesn't
do perfect static wear leveling so the actual durability is less.
In tests over several hundred days we have validated a write endurance
greater than 200TB on the 40G Intel X25V using
.Nm .
.Pp
In contrast, filesystems directly stored on a SSD could have
fairly severe write amplification effects and will have durabilities
ranging closer to the vendor-specified limit.
.Pp
Tests have shown that power cycling (with proper shutdown) and read
operations do not adversely affect a SSD.  Writing within the wearout
constraints provided by the vendor also does not make a powered SSD any
less reliable over time.  Time itself seems to be a factor as the SSD
encounters defects and weak cells in the flash chips.  Writes to a SSD
will affect cold durability (a typical flash chip has 10 years of cold
data retention when fresh and less than 1 year of cold data retention near
the end of its wear life).  Keeping a SSD cool improves its data retention.
.Pp
Beware the standard comparison between SLC, MLC, and TLC-based flash
in terms of wearout and durability.  Over the years, tests have shown
that SLC is not actually any more reliable than MLC, despite having a
significantly larger theoretical durability.  Cell and chip failures seem
to trump theoretical wear limitations in terms of device reliability.
With that in mind, we do not recommend using SLC for anything any more.
Instead we recommend that the flash simply be over-provisioned to provide
the needed durability.
This is already done in numerous NVMe solutions for the vendor to be able
to provide certain minimum wear guarantees.
Durability scales with the amount of flash storage (but the fab process
typically scales the opposite... smaller feature sizes for flash cells
greatly reduce their durability).
When wear calculations are in years, these differences become huge, but
often the quantity of storage needed trumps the wear life so we expect most
people will be using MLC.
.Pp
Beware the huge difference between larger (e.g. 2.5") form-factor SSDs
and smaller SSDs such as USB sticks or very small M.2 storage.  Smaller
form-factor devices have fewer flash chips, much lower write bandwidths,
and less RAM for caching and write-combining, and USB sticks in particular will
usually have unsophisticated wear-leveling algorithms compared to a 2.5"
SSD.  It is generally not a good idea to make a USB stick your primary
storage.  Long-form-factor NGFF/M.2 devices will be better, and 2.5"
form factor devices even better.  The read-bandwidth for a SATA SSD caps
out more quickly than the read-bandwidth for a NVMe SSD, but the larger
form factor of a 2.5" SATA SSD will often have superior write performance
to a NGFF NVMe device.  There are 2.5" NVMe devices as well, requiring a
special connector or PCIe adapter, which give you the best of both worlds.
.Sh SEE ALSO
.Xr chflags 1 ,
.Xr fstab 5 ,
.Xr disklabel64 8 ,
.Xr hammer 8 ,
.Xr swapon 8
.Sh HISTORY
.Nm
first appeared in
.Dx 2.5 .
.Sh AUTHORS
.An Matthew Dillon