move-ext4.txt-into-its-own-directory

   1 ext4: move ext4.txt into its own directory
   2
   3 From: Darrick J. Wong <darrick.wong@oracle.com>
   4
   5 Move Documentation/filesystems/ext4.txt into
   6 Documentation/filesystems/ext4/ext4.rst in preparation for adding more
   7 ext4 documentation.
   8
   9 Note that the documentation isn't in rst format yet, but as it's not
  10 linked from anywhere it won't cause build errors.
  11
  12 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  13 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  14 ---
  15  Documentation/filesystems/ext4.txt      |  627 -------------------------------
  16  Documentation/filesystems/ext4/ext4.rst |  627 +++++++++++++++++++++++++++++++
  17  2 files changed, 627 insertions(+), 627 deletions(-)
  18  delete mode 100644 Documentation/filesystems/ext4.txt
  19  create mode 100644 Documentation/filesystems/ext4/ext4.rst
  20
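Note on the moved text: the auto_da_alloc entry describes applications that replace a file via open("foo.new")/write/close/rename("foo.new", "foo") without calling fsync(). Below is a minimal sketch of the same replace-via-rename pattern with the fsync() added, so that it does not depend on auto_da_alloc. It assumes Python's standard os module on Linux; the helper name replace_file_durably is hypothetical and does not come from the document.

```python
import os


def replace_file_durably(path, data):
    """Replace `path` with `data` (bytes): write a new file, fsync it, rename it.

    Hypothetical helper for illustration only.
    """
    tmp_path = path + ".new"  # mirrors the "foo.new" naming used in the text
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Force the new file's data blocks to disk before the rename.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Atomically replace the old name with the fully written new file.
    os.replace(tmp_path, path)

    # Sync the containing directory so the rename itself is persisted.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Calling replace_file_durably("foo", b"new contents") leaves either the old or the new contents under the name "foo" after a crash, assuming the underlying storage honours fsync().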
  21
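Note on the moved text: the stripe=n entry says that for RAID5/6 the value should be the number of data disks times the RAID chunk size in filesystem blocks. The arithmetic can be sketched as follows. The function name stripe_blocks and the 4096-byte default block size are assumptions made for illustration, not values taken from the document.

```python
def stripe_blocks(data_disks, chunk_size_bytes, fs_block_size_bytes=4096):
    """Return a value for stripe=n: data disks * chunk size in fs blocks.

    Hypothetical helper; the caller supplies the array geometry.
    """
    if data_disks < 1:
        raise ValueError("need at least one data disk")
    if chunk_size_bytes % fs_block_size_bytes != 0:
        raise ValueError("chunk size must be a multiple of the block size")
    chunk_blocks = chunk_size_bytes // fs_block_size_bytes
    return data_disks * chunk_blocks


# A 4-disk RAID5 array has 3 data disks; with 64 KiB chunks and 4 KiB
# blocks each chunk is 16 blocks, so the stripe is 3 * 16 = 48 blocks.
print(stripe_blocks(3, 64 * 1024))
```

The result would then be passed as "-o stripe=48" for that example geometry.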
  22 diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
  23 deleted file mode 100644
  24 index 7f628b9f7c4b..000000000000
  25 --- a/Documentation/filesystems/ext4.txt
  26 +++ /dev/null
  27 @@ -1,627 +0,0 @@
  28 -
  29 -Ext4 Filesystem
  30 -===============
  31 -
  32 -Ext4 is an advanced level of the ext3 filesystem which incorporates
  33 -scalability and reliability enhancements for supporting large filesystems
  34 -(64 bit) in keeping with increasing disk capacities and state-of-the-art
  35 -feature requirements.
  36 -
  37 -Mailing list:  linux-ext4@vger.kernel.org
  38 -Web site:      http://ext4.wiki.kernel.org
  39 -
  40 -
  41 -1. Quick usage instructions:
  42 -===========================
  43 -
  44 -Note: More extensive information for getting started with ext4 can be
  45 -      found at the ext4 wiki site at the URL:
  46 -      http://ext4.wiki.kernel.org/index.php/Ext4_Howto
  47 -
  48 -  - Compile and install the latest version of e2fsprogs (as of this
  49 -    writing version 1.41.3) from:
  50 -
  51 -    http://sourceforge.net/project/showfiles.php?group_id=2406
  52 -
  53 -       or
  54 -
  55 -    https://www.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
  56 -
  57 -       or grab the latest git repository from:
  58 -
  59 -    git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
  60 -
  61 -  - Note that it is highly important to install the mke2fs.conf file
  62 -    that comes with the e2fsprogs 1.41.x sources in /etc/mke2fs.conf. If
  63 -    you have edited the /etc/mke2fs.conf file installed on your system,
  64 -    you will need to merge your changes with the version from e2fsprogs
  65 -    1.41.x.
  66 -
  67 -  - Create a new filesystem using the ext4 filesystem type:
  68 -
  69 -       # mke2fs -t ext4 /dev/hda1
  70 -
  71 -    Or to configure an existing ext3 filesystem to support extents:
  72 -
  73 -       # tune2fs -O extents /dev/hda1
  74 -
  75 -    If the filesystem was created with 128 byte inodes, it can be
  76 -    converted to use 256 byte for greater efficiency via:
  77 -
  78 -        # tune2fs -I 256 /dev/hda1
  79 -
  80 -    (Note: we currently do not have tools to convert an ext4
  81 -    filesystem back to ext3; so please do not try this on production
  82 -    filesystems.)
  83 -
  84 -  - Mounting:
  85 -
  86 -       # mount -t ext4 /dev/hda1 /wherever
  87 -
  88 -  - When comparing performance with other filesystems, it's always
  89 -    important to try multiple workloads; very often a subtle change in a
  90 -    workload parameter can completely change the ranking of which
  91 -    filesystems do well compared to others.  When comparing versus ext3,
  92 -    note that ext4 enables write barriers by default, while ext3 does
  93 -    not enable write barriers by default.  So it is useful to
  94 -    explicitly specify whether barriers are enabled or not via the
  95 -    '-o barrier=[0|1]' mount option for both ext3 and ext4 filesystems
  96 -    for a fair comparison.  When tuning ext3 for best benchmark numbers,
  97 -    it is often worthwhile to try changing the data journaling mode; '-o
  98 -    data=writeback' can be faster for some workloads.  (Note however that
  99 -    running mounted with data=writeback can potentially leave stale data
 100 -    exposed in recently written files in case of an unclean shutdown,
 101 -    which could be a security exposure in some situations.)  Configuring
 102 -    the filesystem with a large journal can also be helpful for
 103 -    metadata-intensive workloads.
 104 -
 105 -2. Features
 106 -===========
 107 -
 108 -2.1 Currently available
 109 -
 110 -* ability to use filesystems > 16TB (e2fsprogs support not available yet)
 111 -* extent format reduces metadata overhead (RAM, IO for access, transactions)
 112 -* extent format more robust in face of on-disk corruption due to magics,
 113 -  internal redundancy in tree
 114 -* improved file allocation (multi-block alloc)
 115 -* lift 32000 subdirectory limit imposed by i_links_count[1]
 116 -* nsec timestamps for mtime, atime, ctime, create time
 117 -* inode version field on disk (NFSv4, Lustre)
 118 -* reduced e2fsck time via uninit_bg feature
 119 -* journal checksumming for robustness, performance
 120 -* persistent file preallocation (e.g. for streaming media, databases)
 121 -* ability to pack bitmaps and inode tables into larger virtual groups via the
 122 -  flex_bg feature
 123 -* large file support
 124 -* inode allocation using large virtual block groups via flex_bg
 125 -* delayed allocation
 126 -* large block (up to pagesize) support
 127 -* efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force
 128 -  the ordering)
 129 -
 130 -[1] Filesystems with a block size of 1k may see a limit imposed by the
 131 -directory hash tree having a maximum depth of two.
 132 -
 133 -2.2 Candidate features for future inclusion
 134 -
 135 -* online defrag (patches available but not well tested)
 136 -* reduced mke2fs time via lazy itable initialization in conjunction with
 137 -  the uninit_bg feature (capability to do this is available in e2fsprogs
 138 -  but a kernel thread to do lazy zeroing of unused inode table blocks
 139 -  after filesystem is first mounted is required for safety)
 140 -
 141 -There are several others under discussion, whether they all make it in is
 142 -partly a function of how much time everyone has to work on them. Features like
 143 -metadata checksumming have been discussed and planned for a bit but no patches
 144 -exist yet so I'm not sure they're in the near-term roadmap.
 145 -
 146 -The big performance win will come with mballoc, delalloc and flex_bg
 147 -grouping of bitmaps and inode tables.  Some test results available here:
 148 -
 149 - - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-write-2.6.27-rc1.html
 150 - - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-readwrite-2.6.27-rc1.html
 151 -
 152 -3. Options
 153 -==========
 154 -
 155 -When mounting an ext4 filesystem, the following options are accepted:
 156 -(*) == default
 157 -
 158 -ro                     Mount filesystem read only. Note that ext4 will
 159 -                       replay the journal (and thus write to the
 160 -                       partition) even when mounted "read only". The
 161 -                       mount options "ro,noload" can be used to prevent
 162 -                       writes to the filesystem.
 163 -
 164 -journal_checksum       Enable checksumming of the journal transactions.
 165 -                       This will allow the recovery code in e2fsck and the
 166 -                       kernel to detect corruption in the journal.  It is a
 167 -                       compatible change and will be ignored by older kernels.
 168 -
 169 -journal_async_commit   Commit block can be written to disk without waiting
 170 -                       for descriptor blocks. If enabled older kernels cannot
 171 -                       mount the device. This will enable 'journal_checksum'
 172 -                       internally.
 173 -
 174 -journal_path=path
 175 -journal_dev=devnum     When the external journal device's major/minor numbers
 176 -                       have changed, these options allow the user to specify
 177 -                       the new journal location.  The journal device is
 178 -                       identified through either its new major/minor numbers
 179 -                       encoded in devnum, or via a path to the device.
 180 -
 181 -norecovery             Don't load the journal on mounting.  Note that
 182 -noload                 if the filesystem was not unmounted cleanly,
 183 -                       skipping the journal replay will lead to the
 184 -                       filesystem containing inconsistencies that can
 185 -                       lead to any number of problems.
 186 -
 187 -data=journal           All data are committed into the journal prior to being
 188 -                       written into the main file system.  Enabling
 189 -                       this mode will disable delayed allocation and
 190 -                       O_DIRECT support.
 191 -
 192 -data=ordered   (*)     All data are forced directly out to the main file
 193 -                       system prior to its metadata being committed to the
 194 -                       journal.
 195 -
 196 -data=writeback         Data ordering is not preserved, data may be written
 197 -                       into the main file system after its metadata has been
 198 -                       committed to the journal.
 199 -
 200 -commit=nrsec   (*)     Ext4 can be told to sync all its data and metadata
 201 -                       every 'nrsec' seconds. The default value is 5 seconds.
 202 -                       This means that if you lose your power, you will lose
 203 -                       as much as the latest 5 seconds of work (your
 204 -                       filesystem will not be damaged though, thanks to the
 205 -                       journaling).  This default value (or any low value)
 206 -                       will hurt performance, but it's good for data-safety.
 207 -                       Setting it to 0 will have the same effect as leaving
 208 -                       it at the default (5 seconds).
 209 -                       Setting it to very large values will improve
 210 -                       performance.
 211 -
 212 -barrier=<0|1(*)>       This enables/disables the use of write barriers in
 213 -barrier(*)             the jbd code.  barrier=0 disables, barrier=1 enables.
 214 -nobarrier              This also requires an IO stack which can support
 215 -                       barriers, and if jbd gets an error on a barrier
 216 -                       write, it will disable again with a warning.
 217 -                       Write barriers enforce proper on-disk ordering
 218 -                       of journal commits, making volatile disk write caches
 219 -                       safe to use, at some performance penalty.  If
 220 -                       your disks are battery-backed in one way or another,
 221 -                       disabling barriers may safely improve performance.
 222 -                       The mount options "barrier" and "nobarrier" can
 223 -                       also be used to enable or disable barriers, for
 224 -                       consistency with other ext4 mount options.
 225 -
 226 -inode_readahead_blks=n This tuning parameter controls the maximum
 227 -                       number of inode table blocks that ext4's inode
 228 -                       table readahead algorithm will pre-read into
 229 -                       the buffer cache.  The default value is 32 blocks.
 230 -
 231 -nouser_xattr           Disables Extended User Attributes.  See the
 232 -                       attr(5) manual page for more information about
 233 -                       extended attributes.
 234 -
 235 -noacl                  This option disables POSIX Access Control List
 236 -                       support. If ACL support is enabled in the kernel
 237 -                       configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL is
 238 -                       enabled by default on mount. See the acl(5) manual
 239 -                       page for more information about acl.
 240 -
 241 -bsddf          (*)     Make 'df' act like BSD.
 242 -minixdf                Make 'df' act like Minix.
 243 -
 244 -debug                  Extra debugging information is sent to syslog.
 245 -
 246 -abort                  Simulate the effects of calling ext4_abort() for
 247 -                       debugging purposes.  This is normally used while
 248 -                       remounting a filesystem which is already mounted.
 249 -
 250 -errors=remount-ro      Remount the filesystem read-only on an error.
 251 -errors=continue        Keep going on a filesystem error.
 252 -errors=panic           Panic and halt the machine if an error occurs.
 253 -                       (These mount options override the errors behavior
 254 -                       specified in the superblock, which can be configured
 255 -                       using tune2fs)
 256 -
 257 -data_err=ignore(*)     Just print an error message if an error occurs
 258 -                       in a file data buffer in ordered mode.
 259 -data_err=abort         Abort the journal if an error occurs in a file
 260 -                       data buffer in ordered mode.
 261 -
 262 -grpid                  New objects have the group ID of their parent.
 263 -bsdgroups
 264 -
 265 -nogrpid        (*)     New objects have the group ID of their creator.
 266 -sysvgroups
 267 -
 268 -resgid=n               The group ID which may use the reserved blocks.
 269 -
 270 -resuid=n               The user ID which may use the reserved blocks.
 271 -
 272 -sb=n                   Use alternate superblock at this location.
 273 -
 274 -quota                  These options are ignored by the filesystem. They
 275 -noquota                are used only by quota tools to recognize volumes
 276 -grpquota               where quota should be turned on. See documentation
 277 -usrquota               in the quota-tools package for more details
 278 -                       (http://sourceforge.net/projects/linuxquota).
 279 -
 280 -jqfmt=<quota type>     These options tell filesystem details about quota
 281 -usrjquota=<file>       so that quota information can be properly updated
 282 -grpjquota=<file>       during journal replay. They replace the above
 283 -                       quota options. See documentation in the quota-tools
 284 -                       package for more details
 285 -                       (http://sourceforge.net/projects/linuxquota).
 286 -
 287 -stripe=n               Number of filesystem blocks that mballoc will try
 288 -                       to use for allocation size and alignment. For RAID5/6
 289 -                       systems this should be the number of data
 290 -                       disks *  RAID chunk size in file system blocks.
 291 -
 292 -delalloc       (*)     Defer block allocation until just before ext4
 293 -                       writes out the block(s) in question.  This
 294 -                       allows ext4 to make better allocation decisions
 295 -                       more efficiently.
 296 -nodelalloc             Disable delayed allocation.  Blocks are allocated
 297 -                       when the data is copied from userspace to the
 298 -                       page cache, either via the write(2) system call
 299 -                       or when an mmap'ed page which was previously
 300 -                       unallocated is written for the first time.
 301 -
 302 -max_batch_time=usec    Maximum amount of time ext4 should wait for
 303 -                       additional filesystem operations to be batched
 304 -                       together with a synchronous write operation.
 305 -                       Since a synchronous write operation is going to
 306 -                       force a commit and then wait for the I/O to
 307 -                       complete, it doesn't cost much, and can be a
 308 -                       huge throughput win, to wait for a small amount
 309 -                       of time to see if any other transactions can
 310 -                       piggyback on the synchronous write.   The
 311 -                       algorithm used is designed to automatically tune
 312 -                       for the speed of the disk, by measuring the
 313 -                       amount of time (on average) that it takes to
 314 -                       finish committing a transaction.  Call this time
 315 -                       the "commit time".  If the time that the
 316 -                       transaction has been running is less than the
 317 -                       commit time, ext4 will try sleeping for the
 318 -                       commit time to see if other operations will join
 319 -                       the transaction.   The commit time is capped by
 320 -                       the max_batch_time, which defaults to 15000us
 321 -                       (15ms).   This optimization can be turned off
 322 -                       entirely by setting max_batch_time to 0.
 323 -
 324 -min_batch_time=usec    This parameter sets the commit time (as
 325 -                       described above) to be at least min_batch_time.
 326 -                       It defaults to zero microseconds.  Increasing
 327 -                       this parameter may improve the throughput of
 328 -                       multi-threaded, synchronous workloads on very
 329 -                       fast disks, at the cost of increasing latency.
 330 -
 331 -journal_ioprio=prio    The I/O priority (from 0 to 7, where 0 is the
 332 -                       highest priority) which should be used for I/O
 333 -                       operations submitted by kjournald2 during a
 334 -                       commit operation.  This defaults to 3, which is
 335 -                       a slightly higher priority than the default I/O
 336 -                       priority.
 337 -
 338 -auto_da_alloc(*)       Many broken applications don't use fsync() when
 339 -noauto_da_alloc        replacing existing files via patterns such as
 340 -                       fd = open("foo.new")/write(fd,..)/close(fd)/
 341 -                       rename("foo.new", "foo"), or worse yet,
 342 -                       fd = open("foo", O_TRUNC)/write(fd,..)/close(fd).
 343 -                       If auto_da_alloc is enabled, ext4 will detect
 344 -                       the replace-via-rename and replace-via-truncate
 345 -                       patterns and force that any delayed allocation
 346 -                       blocks are allocated such that at the next
 347 -                       journal commit, in the default data=ordered
 348 -                       mode, the data blocks of the new file are forced
 349 -                       to disk before the rename() operation is
 350 -                       committed.  This provides roughly the same level
 351 -                       of guarantees as ext3, and avoids the
 352 -                       "zero-length" problem that can happen when a
 353 -                       system crashes before the delayed allocation
 354 -                       blocks are forced to disk.
 355 -
 356 -noinit_itable          Do not initialize any uninitialized inode table
 357 -                       blocks in the background.  This feature may be
 358 -                       used by installation CD's so that the install
 359 -                       process can complete as quickly as possible; the
 360 -                       inode table initialization process would then be
 361 -                       deferred until the next time the  file system
 362 -                       is unmounted.
 363 -
 364 -init_itable=n          The lazy itable init code will wait n times the
 365 -                       number of milliseconds it took to zero out the
 366 -                       previous block group's inode table.  This
 367 -                       minimizes the impact on the system performance
 368 -                       while the file system's inode table is being initialized.
 369 -
 370 -discard                Controls whether ext4 should issue discard/TRIM
 371 -nodiscard(*)           commands to the underlying block device when
 372 -                       blocks are freed.  This is useful for SSD devices
 373 -                       and sparse/thinly-provisioned LUNs, but it is off
 374 -                       by default until sufficient testing has been done.
 375 -
 376 -nouid32                Disables 32-bit UIDs and GIDs.  This is for
 377 -                       interoperability with older kernels which only
 378 -                       store and expect 16-bit values.
 379 -
 380 -block_validity(*)      These options enable or disable the in-kernel
 381 -noblock_validity       facility for tracking filesystem metadata blocks
 382 -                       within internal data structures.  This allows multi-
 383 -                       block allocator and other routines to notice
 384 -                       bugs or corrupted allocation bitmaps which cause
 385 -                       blocks to be allocated which overlap with
 386 -                       filesystem metadata blocks.
 387 -
 388 -dioread_lock           Controls whether or not ext4 should use the DIO read
 389 -dioread_nolock         locking. If the dioread_nolock option is specified
 390 -                       ext4 will allocate uninitialized extent before buffer
 391 -                       write and convert the extent to initialized after IO
 392 -                       completes. This approach allows ext4 code to avoid
 393 -                       using inode mutex, which improves scalability on high
 394 -                       speed storages. However this does not work with
 395 -                       data journaling and dioread_nolock option will be
 396 -                       ignored with kernel warning. Note that dioread_nolock
 397 -                       code path is only used for extent-based files.
 398 -                       Because of the restrictions this option comprises,
 399 -                       it is off by default (i.e. dioread_lock).
 400 -
 401 -max_dir_size_kb=n      This limits the size of directories so that any
 402 -                       attempt to expand them beyond the specified
 403 -                       limit in kilobytes will cause an ENOSPC error.
 404 -                       This is useful in memory constrained
 405 -                       environments, where a very large directory can
 406 -                       cause severe performance problems or even
 407 -                       provoke the Out Of Memory killer.  (For example,
 408 -                       if there is only 512mb memory available, a 176mb
 409 -                       directory may seriously cramp the system's style.)
 410 -
 411 -i_version              Enable 64-bit inode version support. This option is
 412 -                       off by default.
 413 -
 414 -dax                    Use direct access (no page cache).  See
 415 -                       Documentation/filesystems/dax.txt.  Note that
 416 -                       this option is incompatible with data=journal.
 417 -
 418 -Data Mode
 419 -=========
 420 -There are 3 different data modes:
 421 -
 422 -* writeback mode
 423 -In data=writeback mode, ext4 does not journal data at all.  This mode provides
 424 -a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
 425 -mode - metadata journaling.  A crash+recovery can cause incorrect data to
 426 -appear in files which were written shortly before the crash.  This mode will
 427 -typically provide the best ext4 performance.
 428 -
 429 -* ordered mode
 430 -In data=ordered mode, ext4 only officially journals metadata, but it logically
 431 -groups metadata information related to data changes with the data blocks into a
 432 -single unit called a transaction.  When it's time to write the new metadata
 433 -out to disk, the associated data blocks are written first.  In general,
 434 -this mode performs slightly slower than writeback but significantly faster than journal mode.
 435 -
 436 -* journal mode
 437 -data=journal mode provides full data and metadata journaling.  All new data is
 438 -written to the journal first, and then to its final location.
 439 -In the event of a crash, the journal can be replayed, bringing both data and
 440 -metadata into a consistent state.  This mode is the slowest except when data
 441 -needs to be read from and written to disk at the same time where it
 442 -outperforms all other modes.  Enabling this mode will disable delayed
 443 -allocation and O_DIRECT support.
 444 -
 445 -/proc entries
 446 -=============
 447 -
 448 -Information about mounted ext4 file systems can be found in
 449 -/proc/fs/ext4.  Each mounted filesystem will have a directory in
 450 -/proc/fs/ext4 based on its device name (e.g., /proc/fs/ext4/hdc or
 451 -/proc/fs/ext4/dm-0).   The files in each per-device directory are shown
 452 -in the table below.
 453 -
 454 -Files in /proc/fs/ext4/<devname>
 455 -..............................................................................
 456 - File            Content
 457 - mb_groups       details of multiblock allocator buddy cache of free blocks
 458 -..............................................................................
 459 -
 460 -/sys entries
 461 -============
 462 -
 463 -Information about mounted ext4 file systems can be found in
 464 -/sys/fs/ext4.  Each mounted filesystem will have a directory in
 465 -/sys/fs/ext4 based on its device name (e.g., /sys/fs/ext4/hdc or
 466 -/sys/fs/ext4/dm-0).   The files in each per-device directory are shown
 467 -in the table below.
 468 -
 469 -Files in /sys/fs/ext4/<devname>
 470 -(see also Documentation/ABI/testing/sysfs-fs-ext4)
 471 -..............................................................................
 472 - File                         Content
 473 -
 474 - delayed_allocation_blocks    This file is read-only and shows the number of
 475 -                              blocks that are dirty in the page cache, but
 476 -                              which do not have their location in the
 477 -                              filesystem allocated yet.
 478 -
 479 - inode_goal                   Tuning parameter which (if non-zero) controls
 480 -                              the goal inode used by the inode allocator in
 481 -                              preference to all other allocation heuristics.
 482 -                              This is intended for debugging use only, and
 483 -                              should be 0 on production systems.
 484 -
 485 - inode_readahead_blks         Tuning parameter which controls the maximum
 486 -                              number of inode table blocks that ext4's inode
 487 -                              table readahead algorithm will pre-read into
 488 -                              the buffer cache
 489 -
 490 - lifetime_write_kbytes        This file is read-only and shows the number of
 491 -                              kilobytes of data that have been written to this
 492 -                              filesystem since it was created.
 493 -
 494 - max_writeback_mb_bump        The maximum number of megabytes the writeback
 495 -                              code will try to write out before moving on to
 496 -                              another inode.
 497 -
 498 - mb_group_prealloc            The multiblock allocator will round up allocation
 499 -                              requests to a multiple of this tuning parameter if
 500 -                              the stripe size is not set in the ext4 superblock
 501 -
 502 - mb_max_to_scan               The maximum number of extents the multiblock
 503 -                              allocator will search to find the best extent
 504 -
 505 - mb_min_to_scan               The minimum number of extents the multiblock
 506 -                              allocator will search to find the best extent
 507 -
 508 - mb_order2_req                Tuning parameter which controls the minimum size
 509 -                              for requests (as a power of 2) where the buddy
 510 -                              cache is used
 511 -
 512 - mb_stats                     Controls whether the multiblock allocator should
 513 -                              collect statistics, which are shown during the
 514 -                              unmount. 1 means to collect statistics, 0 means
 515 -                              not to collect statistics
 516 -
 517 - mb_stream_req                Files which have fewer blocks than this tunable
 518 -                              parameter will have their blocks allocated out
 519 -                              of a block group specific preallocation pool, so
 520 -                              that small files are packed closely together.
 521 -                              Each large file will have its blocks allocated
 522 -                              out of its own unique preallocation pool.
 523 -
 524 - session_write_kbytes         This file is read-only and shows the number of
 525 -                              kilobytes of data that have been written to this
 526 -                              filesystem since it was mounted.
 527 -
 528 - reserved_clusters            This is a RW file and contains the number of
 529 -                              reserved clusters in the file system which will
 530 -                              be used in specific situations to avoid costly
 531 -                              zeroout, unexpected ENOSPC, or possible data
 532 -                              loss. The default is 2% or 4096 clusters,
 533 -                              whichever is smaller, and this can be changed;
 534 -                              however it can never exceed the number of clusters
 535 -                              in the file system. If there is not enough space
 536 -                              for the reserved space when mounting the file
 537 -                              system, the mount will _not_ fail.
 538 -..............................................................................
 539 -
 540 -Ioctls
 541 -======
 542 -
 543 -There is some Ext4 specific functionality which can be accessed by applications
 544 -through the system call interfaces. The list of all Ext4 specific ioctls is
 545 -shown in the table below.
 546 -
 547 -Table of Ext4 specific ioctls
 548 -..............................................................................
 549 - Ioctl                       Description
 550 - EXT4_IOC_GETFLAGS           Get additional attributes associated with inode.
 551 -                             The ioctl argument is an integer bitfield, with
 552 -                             bit values described in ext4.h. This ioctl is an
 553 -                             alias for FS_IOC_GETFLAGS.
 554 -
 555 - EXT4_IOC_SETFLAGS           Set additional attributes associated with inode.
 556 -                             The ioctl argument is an integer bitfield, with
 557 -                             bit values described in ext4.h. This ioctl is an
 558 -                             alias for FS_IOC_SETFLAGS.
 559 -
 560 - EXT4_IOC_GETVERSION
 561 - EXT4_IOC_GETVERSION_OLD
 562 -                             Get the inode i_generation number stored for
 563 -                             each inode. The i_generation number is normally
 564 -                             changed only when a new inode is created and it is
 565 -                             particularly useful for network filesystems. The
 566 -                             '_OLD' version of this ioctl is an alias for
 567 -                             FS_IOC_GETVERSION.
 568 -
 569 - EXT4_IOC_SETVERSION
 570 - EXT4_IOC_SETVERSION_OLD
 571 -                             Set the inode i_generation number stored for
 572 -                             each inode. The '_OLD' version of this ioctl
 573 -                             is an alias for FS_IOC_SETVERSION.
 574 -
 575 - EXT4_IOC_GROUP_EXTEND       This ioctl has the same purpose as the resize
 576 -                             mount option. It allows to resize filesystem
 577 -                             to the end of the last existing block group,
 578 -                             further resize has to be done with resize2fs,
 579 -                             either online, or offline. The argument points
 580 -                             to the unsigned logn number representing the
 581 -                             filesystem new block count.
 582 -
 583 - EXT4_IOC_MOVE_EXT           Move the block extents from orig_fd (the one
 584 -                             this ioctl is pointing to) to the donor_fd (the
 585 -                             one specified in move_extent structure passed
 586 -                             as an argument to this ioctl). Then, exchange
 587 -                             inode metadata between orig_fd and donor_fd.
 588 -                             This is especially useful for online
 589 -                             defragmentation, because the allocator has the
 590 -                             opportunity to allocate moved blocks better,
 591 -                             ideally into one contiguous extent.
 592 -
 593 - EXT4_IOC_GROUP_ADD          Add a new group descriptor to an existing or
 594 -                             new group descriptor block. The new group
 595 -                             descriptor is described by ext4_new_group_input
 596 -                             structure, which is passed as an argument to
 597 -                             this ioctl. This is especially useful in
 598 -                             conjunction with EXT4_IOC_GROUP_EXTEND,
 599 -                             which allows online resize of the filesystem
 600 -                             to the end of the last existing block group.
 601 -                             Those two ioctls combined is used in userspace
 602 -                             online resize tool (e.g. resize2fs).
 603 -
 604 - EXT4_IOC_MIGRATE            This ioctl operates on the filesystem itself.
 605 -                             It converts (migrates) ext3 indirect block mapped
 606 -                             inode to ext4 extent mapped inode by walking
 607 -                             through indirect block mapping of the original
 608 -                             inode and converting contiguous block ranges
 609 -                             into ext4 extents of the temporary inode. Then,
 610 -                             inodes are swapped. This ioctl might help, when
 611 -                             migrating from ext3 to ext4 filesystem, however
 612 -                             suggestion is to create fresh ext4 filesystem
 613 -                             and copy data from the backup. Note, that
 614 -                             filesystem has to support extents for this ioctl
 615 -                             to work.
 616 -
 617 - EXT4_IOC_ALLOC_DA_BLKS              Force all of the delay allocated blocks to be
 618 -                             allocated to preserve application-expected ext3
 619 -                             behaviour. Note that this will also start
 620 -                             triggering a write of the data blocks, but this
 621 -                             behaviour may change in the future as it is
 622 -                             not necessary and has been done this way only
 623 -                             for sake of simplicity.
 624 -
 625 - EXT4_IOC_RESIZE_FS          Resize the filesystem to a new size.  The number
 626 -                             of blocks of resized filesystem is passed in via
 627 -                             64 bit integer argument.  The kernel allocates
 628 -                             bitmaps and inode table, the userspace tool thus
 629 -                             just passes the new number of blocks.
 630 -
 631 - EXT4_IOC_SWAP_BOOT          Swap i_blocks and associated attributes
 632 -                             (like i_blocks, i_size, i_flags, ...) from
 633 -                             the specified inode with inode
 634 -                             EXT4_BOOT_LOADER_INO (#5). This is typically
 635 -                             used to store a boot loader in a secure part of
 636 -                             the filesystem, where it can't be changed by a
 637 -                             normal user by accident.
 638 -                             The data blocks of the previous boot loader
 639 -                             will be associated with the given inode.
 640 -
 641 -..............................................................................
 642 -
 643 -References
 644 -==========
 645 -
 646 -kernel source: <file:fs/ext4/>
 647 -               <file:fs/jbd2/>
 648 -
 649 -programs:      http://e2fsprogs.sourceforge.net/
 650 -
 651 -useful links:  http://fedoraproject.org/wiki/ext3-devel
 652 -               http://www.bullopensource.org/ext4/
 653 -               http://ext4.wiki.kernel.org/index.php/Main_Page
 654 -               http://fedoraproject.org/wiki/Features/Ext4
 655 diff --git a/Documentation/filesystems/ext4/ext4.rst b/Documentation/filesystems/ext4/ext4.rst
 656 new file mode 100644
 657 index 000000000000..7f628b9f7c4b
 658 --- /dev/null
 659 +++ b/Documentation/filesystems/ext4/ext4.rst
 660 @@ -0,0 +1,627 @@
 661 +
 662 +Ext4 Filesystem
 663 +===============
 664 +
 665 +Ext4 is an advanced level of the ext3 filesystem which incorporates
 666 +scalability and reliability enhancements for supporting large filesystems
 667 +(64 bit) in keeping with increasing disk capacities and state-of-the-art
 668 +feature requirements.
 669 +
 670 +Mailing list:  linux-ext4@vger.kernel.org
 671 +Web site:      http://ext4.wiki.kernel.org
 672 +
 673 +
 674 +1. Quick usage instructions:
 675 +===========================
 676 +
 677 +Note: More extensive information for getting started with ext4 can be
 678 +      found at the ext4 wiki site at the URL:
 679 +      http://ext4.wiki.kernel.org/index.php/Ext4_Howto
 680 +
 681 +  - Compile and install the latest version of e2fsprogs (as of this
 682 +    writing version 1.41.3) from:
 683 +
 684 +    http://sourceforge.net/project/showfiles.php?group_id=2406
 685 +
 686 +       or
 687 +
 688 +    https://www.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
 689 +
 690 +       or grab the latest git repository from:
 691 +
 692 +    git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
 693 +
 694 +  - Note that it is highly important to install the mke2fs.conf file
 695 +    that comes with the e2fsprogs 1.41.x sources in /etc/mke2fs.conf. If
 696 +    you have edited the /etc/mke2fs.conf file installed on your system,
 697 +    you will need to merge your changes with the version from e2fsprogs
 698 +    1.41.x.
 699 +
 700 +  - Create a new filesystem using the ext4 filesystem type:
 701 +
 702 +       # mke2fs -t ext4 /dev/hda1
 703 +
 704 +    Or to configure an existing ext3 filesystem to support extents:
 705 +
 706 +       # tune2fs -O extents /dev/hda1
 707 +
 708 +    If the filesystem was created with 128 byte inodes, it can be
 709 +    converted to use 256 byte for greater efficiency via:
 710 +
 711 +        # tune2fs -I 256 /dev/hda1
 712 +
 713 +    (Note: we currently do not have tools to convert an ext4
 714 +    filesystem back to ext3; so please do not try this on production
 715 +    filesystems.)
 716 +
 717 +  - Mounting:
 718 +
 719 +       # mount -t ext4 /dev/hda1 /wherever
 720 +
 721 +  - When comparing performance with other filesystems, it's always
 722 +    important to try multiple workloads; very often a subtle change in a
 723 +    workload parameter can completely change the ranking of which
 724 +    filesystems do well compared to others.  When comparing versus ext3,
 725 +    note that ext4 enables write barriers by default, while ext3 does
 726 +    not enable write barriers by default.  So it is useful to
 727 +    explicitly specify whether barriers are enabled or not via the
 728 +    '-o barrier=[0|1]' mount option for both ext3 and ext4 filesystems
 729 +    for a fair comparison.  When tuning ext3 for best benchmark numbers,
 730 +    it is often worthwhile to try changing the data journaling mode; '-o
 731 +    data=writeback' can be faster for some workloads.  (Note however that
 732 +    running mounted with data=writeback can potentially leave stale data
 733 +    exposed in recently written files in case of an unclean shutdown,
 734 +    which could be a security exposure in some situations.)  Configuring
 735 +    the filesystem with a large journal can also be helpful for
 736 +    metadata-intensive workloads.
 737 +
 738 +2. Features
 739 +===========
 740 +
 741 +2.1 Currently available
 742 +
 743 +* ability to use filesystems > 16TB (e2fsprogs support not available yet)
 744 +* extent format reduces metadata overhead (RAM, IO for access, transactions)
 745 +* extent format more robust in face of on-disk corruption due to magics,
 746 +  internal redundancy in tree
 747 +* improved file allocation (multi-block alloc)
 748 +* lift 32000 subdirectory limit imposed by i_links_count[1]
 749 +* nsec timestamps for mtime, atime, ctime, create time
 750 +* inode version field on disk (NFSv4, Lustre)
 751 +* reduced e2fsck time via uninit_bg feature
 752 +* journal checksumming for robustness, performance
 753 +* persistent file preallocation (e.g. for streaming media, databases)
 754 +* ability to pack bitmaps and inode tables into larger virtual groups via the
 755 +  flex_bg feature
 756 +* large file support
 757 +* inode allocation using large virtual block groups via flex_bg
 758 +* delayed allocation
 759 +* large block (up to pagesize) support
 760 +* efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force
 761 +  the ordering)
 762 +
 763 +[1] Filesystems with a block size of 1k may see a limit imposed by the
 764 +directory hash tree having a maximum depth of two.
 765 +
 766 +2.2 Candidate features for future inclusion
 767 +
 768 +* online defrag (patches available but not well tested)
 769 +* reduced mke2fs time via lazy itable initialization in conjunction with
 770 +  the uninit_bg feature (capability to do this is available in e2fsprogs
 771 +  but a kernel thread to do lazy zeroing of unused inode table blocks
 772 +  after filesystem is first mounted is required for safety)
 773 +
 774 +There are several others under discussion, whether they all make it in is
 775 +partly a function of how much time everyone has to work on them. Features like
 776 +metadata checksumming have been discussed and planned for a bit but no patches
 777 +exist yet so I'm not sure they're in the near-term roadmap.
 778 +
 779 +The big performance win will come with mballoc, delalloc and flex_bg
 780 +grouping of bitmaps and inode tables.  Some test results available here:
 781 +
 782 + - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-write-2.6.27-rc1.html
 783 + - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-readwrite-2.6.27-rc1.html
 784 +
 785 +3. Options
 786 +==========
 787 +
 788 +When mounting an ext4 filesystem, the following options are accepted:
 789 +(*) == default
 790 +
 791 +ro                     Mount filesystem read only. Note that ext4 will
 792 +                       replay the journal (and thus write to the
 793 +                       partition) even when mounted "read only". The
 794 +                       mount options "ro,noload" can be used to prevent
 795 +                       writes to the filesystem.
 796 +
 797 +journal_checksum       Enable checksumming of the journal transactions.
 798 +                       This will allow the recovery code in e2fsck and the
 799 +                       kernel to detect corruption in the journal.  It is a
 800 +                       compatible change and will be ignored by older kernels.
 801 +
 802 +journal_async_commit   Commit block can be written to disk without waiting
 803 +                       for descriptor blocks. If enabled older kernels cannot
 804 +                       mount the device. This will enable 'journal_checksum'
 805 +                       internally.
 806 +
 807 +journal_path=path
 808 +journal_dev=devnum     When the external journal device's major/minor numbers
 809 +                       have changed, these options allow the user to specify
 810 +                       the new journal location.  The journal device is
 811 +                       identified through either its new major/minor numbers
 812 +                       encoded in devnum, or via a path to the device.
 813 +
 814 +norecovery             Don't load the journal on mounting.  Note that
 815 +noload                 if the filesystem was not unmounted cleanly,
 816 +                       skipping the journal replay will lead to the
 817 +                       filesystem containing inconsistencies that can
 818 +                       lead to any number of problems.
 819 +
 820 +data=journal           All data are committed into the journal prior to being
 821 +                       written into the main file system.  Enabling
 822 +                       this mode will disable delayed allocation and
 823 +                       O_DIRECT support.
 824 +
 825 +data=ordered   (*)     All data are forced directly out to the main file
 826 +                       system prior to its metadata being committed to the
 827 +                       journal.
 828 +
 829 +data=writeback         Data ordering is not preserved, data may be written
 830 +                       into the main file system after its metadata has been
 831 +                       committed to the journal.
 832 +
 833 +commit=nrsec   (*)     Ext4 can be told to sync all its data and metadata
 834 +                       every 'nrsec' seconds. The default value is 5 seconds.
 835 +                       This means that if you lose your power, you will lose
 836 +                       as much as the latest 5 seconds of work (your
 837 +                       filesystem will not be damaged though, thanks to the
 838 +                       journaling).  This default value (or any low value)
 839 +                       will hurt performance, but it's good for data-safety.
 840 +                       Setting it to 0 will have the same effect as leaving
 841 +                       it at the default (5 seconds).
 842 +                       Setting it to very large values will improve
 843 +                       performance.
 844 +
 845 +barrier=<0|1(*)>       This enables/disables the use of write barriers in
 846 +barrier(*)             the jbd code.  barrier=0 disables, barrier=1 enables.
 847 +nobarrier              This also requires an IO stack which can support
 848 +                       barriers, and if jbd gets an error on a barrier
 849 +                       write, it will disable again with a warning.
 850 +                       Write barriers enforce proper on-disk ordering
 851 +                       of journal commits, making volatile disk write caches
 852 +                       safe to use, at some performance penalty.  If
 853 +                       your disks are battery-backed in one way or another,
 854 +                       disabling barriers may safely improve performance.
 855 +                       The mount options "barrier" and "nobarrier" can
 856 +                       also be used to enable or disable barriers, for
 857 +                       consistency with other ext4 mount options.
 858 +
 859 +inode_readahead_blks=n This tuning parameter controls the maximum
 860 +                       number of inode table blocks that ext4's inode
 861 +                       table readahead algorithm will pre-read into
 862 +                       the buffer cache.  The default value is 32 blocks.
 863 +
 864 +nouser_xattr           Disables Extended User Attributes.  See the
 865 +                       attr(5) manual page for more information about
 866 +                       extended attributes.
 867 +
 868 +noacl                  This option disables POSIX Access Control List
 869 +                       support. If ACL support is enabled in the kernel
 870 +                       configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL is
 871 +                       enabled by default on mount. See the acl(5) manual
 872 +                       page for more information about acl.
 873 +
 874 +bsddf          (*)     Make 'df' act like BSD.
 875 +minixdf                Make 'df' act like Minix.
 876 +
 877 +debug                  Extra debugging information is sent to syslog.
 878 +
 879 +abort                  Simulate the effects of calling ext4_abort() for
 880 +                       debugging purposes.  This is normally used while
 881 +                       remounting a filesystem which is already mounted.
 882 +
 883 +errors=remount-ro      Remount the filesystem read-only on an error.
 884 +errors=continue        Keep going on a filesystem error.
 885 +errors=panic           Panic and halt the machine if an error occurs.
 886 +                        (These mount options override the errors behavior
 887 +                        specified in the superblock, which can be configured
 888 +                        using tune2fs)
 889 +
 890 +data_err=ignore(*)     Just print an error message if an error occurs
 891 +                       in a file data buffer in ordered mode.
 892 +data_err=abort         Abort the journal if an error occurs in a file
 893 +                       data buffer in ordered mode.
 894 +
 895 +grpid                  New objects have the group ID of their parent.
 896 +bsdgroups
 897 +
 898 +nogrpid        (*)     New objects have the group ID of their creator.
 899 +sysvgroups
 900 +
 901 +resgid=n               The group ID which may use the reserved blocks.
 902 +
 903 +resuid=n               The user ID which may use the reserved blocks.
 904 +
 905 +sb=n                   Use alternate superblock at this location.
 906 +
 907 +quota                  These options are ignored by the filesystem. They
 908 +noquota                are used only by quota tools to recognize volumes
 909 +grpquota               where quota should be turned on. See documentation
 910 +usrquota               in the quota-tools package for more details
 911 +                       (http://sourceforge.net/projects/linuxquota).
 912 +
 913 +jqfmt=<quota type>     These options tell filesystem details about quota
 914 +usrjquota=<file>       so that quota information can be properly updated
 915 +grpjquota=<file>       during journal replay. They replace the above
 916 +                       quota options. See documentation in the quota-tools
 917 +                       package for more details
 918 +                       (http://sourceforge.net/projects/linuxquota).
 919 +
 920 +stripe=n               Number of filesystem blocks that mballoc will try
 921 +                       to use for allocation size and alignment. For RAID5/6
 922 +                       systems this should be the number of data
 923 +                       disks *  RAID chunk size in file system blocks.
 924 +
 925 +delalloc       (*)     Defer block allocation until just before ext4
 926 +                       writes out the block(s) in question.  This
 927 +                       allows ext4 to make better allocation decisions
 928 +                       more efficiently.
 929 +nodelalloc             Disable delayed allocation.  Blocks are allocated
 930 +                       when the data is copied from userspace to the
 931 +                       page cache, either via the write(2) system call
 932 +                       or when an mmap'ed page which was previously
 933 +                       unallocated is written for the first time.
 934 +
 935 +max_batch_time=usec    Maximum amount of time ext4 should wait for
 936 +                       additional filesystem operations to be batched
 937 +                       together with a synchronous write operation.
 938 +                       Since a synchronous write operation is going to
 939 +                       force a commit and then a wait for the I/O to
 940 +                       complete, it doesn't cost much, and can be a
 941 +                       huge throughput win, we wait for a small amount
 942 +                       of time to see if any other transactions can
 943 +                       piggyback on the synchronous write.   The
 944 +                       algorithm used is designed to automatically tune
 945 +                       for the speed of the disk, by measuring the
 946 +                       amount of time (on average) that it takes to
 947 +                       finish committing a transaction.  Call this time
 948 +                       the "commit time".  If the time that the
 949 +                       transaction has been running is less than the
 950 +                       commit time, ext4 will try sleeping for the
 951 +                       commit time to see if other operations will join
 952 +                       the transaction.   The commit time is capped by
 953 +                       the max_batch_time, which defaults to 15000us
 954 +                       (15ms).   This optimization can be turned off
 955 +                       entirely by setting max_batch_time to 0.
 956 +
 957 +min_batch_time=usec    This parameter sets the commit time (as
 958 +                       described above) to be at least min_batch_time.
 959 +                       It defaults to zero microseconds.  Increasing
 960 +                       this parameter may improve the throughput of
 961 +                       multi-threaded, synchronous workloads on very
 962 +                       fast disks, at the cost of increasing latency.
 963 +
 964 +journal_ioprio=prio    The I/O priority (from 0 to 7, where 0 is the
 965 +                       highest priority) which should be used for I/O
 966 +                       operations submitted by kjournald2 during a
 967 +                       commit operation.  This defaults to 3, which is
 968 +                       a slightly higher priority than the default I/O
 969 +                       priority.
 970 +
 971 +auto_da_alloc(*)       Many broken applications don't use fsync() when
 972 +noauto_da_alloc        replacing existing files via patterns such as
 973 +                       fd = open("foo.new")/write(fd,..)/close(fd)/
 974 +                       rename("foo.new", "foo"), or worse yet,
 975 +                       fd = open("foo", O_TRUNC)/write(fd,..)/close(fd).
 976 +                       If auto_da_alloc is enabled, ext4 will detect
 977 +                       the replace-via-rename and replace-via-truncate
 978 +                       patterns and force that any delayed allocation
 979 +                       blocks are allocated such that at the next
 980 +                       journal commit, in the default data=ordered
 981 +                       mode, the data blocks of the new file are forced
 982 +                       to disk before the rename() operation is
 983 +                       committed.  This provides roughly the same level
 984 +                       of guarantees as ext3, and avoids the
 985 +                       "zero-length" problem that can happen when a
 986 +                       system crashes before the delayed allocation
 987 +                       blocks are forced to disk.
 988 +
 989 +noinit_itable          Do not initialize any uninitialized inode table
 990 +                       blocks in the background.  This feature may be
 991 +                       used by installation CD's so that the install
 992 +                       process can complete as quickly as possible; the
 993 +                       inode table initialization process would then be
 994 +                       deferred until the next time the file system
 995 +                       is unmounted.
 996 +
 997 +init_itable=n          The lazy itable init code will wait n times the
 998 +                       number of milliseconds it took to zero out the
 999 +                       previous block group's inode table.  This
1000 +                       minimizes the impact on the system performance
1001 +                       while file system's inode table is being initialized.
1002 +
1003 +discard                Controls whether ext4 should issue discard/TRIM
1004 +nodiscard(*)           commands to the underlying block device when
1005 +                       blocks are freed.  This is useful for SSD devices
1006 +                       and sparse/thinly-provisioned LUNs, but it is off
1007 +                       by default until sufficient testing has been done.
1008 +
1009 +nouid32                Disables 32-bit UIDs and GIDs.  This is for
1010 +                       interoperability with older kernels which only
1011 +                       store and expect 16-bit values.
1012 +
1013 +block_validity(*)      These options enable or disable the in-kernel
1014 +noblock_validity       facility for tracking filesystem metadata blocks
1015 +                       within internal data structures.  This allows the
1016 +                       multiblock allocator and other routines to notice
1017 +                       bugs or corrupted allocation bitmaps which cause
1018 +                       blocks to be allocated which overlap with
1019 +                       filesystem metadata blocks.
1020 +
1021 +dioread_lock           Controls whether or not ext4 should use DIO read
1022 +dioread_nolock         locking. If the dioread_nolock option is specified,
1023 +                       ext4 will allocate an uninitialized extent before a
1024 +                       buffer write and convert the extent to initialized
1025 +                       after IO completes. This approach allows ext4 code to
1026 +                       avoid using the inode mutex, which improves scalability
1027 +                       on high-speed storage. However, this does not work
1028 +                       with data journaling, and the dioread_nolock option
1029 +                       will be ignored with a kernel warning. Note that the
1030 +                       dioread_nolock code path is only used for extent-based
1031 +                       files. Because of the restrictions this option imposes,
1032 +                       it is off by default (i.e. dioread_lock).
1033 +
1034 +max_dir_size_kb=n      This limits the size of directories so that any
1035 +                       attempt to expand them beyond the specified
1036 +                       limit in kilobytes will cause an ENOSPC error.
1037 +                       This is useful in memory constrained
1038 +                       environments, where a very large directory can
1039 +                       cause severe performance problems or even
1040 +                       provoke the Out Of Memory killer.  (For example,
1041 +                       if there is only 512mb memory available, a 176mb
1042 +                       directory may seriously cramp the system's style.)
1043 +
1044 +i_version              Enable 64-bit inode version support. This option is
1045 +                       off by default.
1046 +
1047 +dax                    Use direct access (no page cache).  See
1048 +                       Documentation/filesystems/dax.txt.  Note that
1049 +                       this option is incompatible with data=journal.
1050 +
1051 +Data Mode
1052 +=========
1053 +There are 3 different data modes:
1054 +
1055 +* writeback mode
1056 +In data=writeback mode, ext4 does not journal data at all.  This mode provides
1057 +a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
1058 +mode - metadata journaling.  A crash+recovery can cause incorrect data to
1059 +appear in files which were written shortly before the crash.  This mode will
1060 +typically provide the best ext4 performance.
1061 +
1062 +* ordered mode
1063 +In data=ordered mode, ext4 only officially journals metadata, but it logically
1064 +groups metadata information related to data changes with the data blocks into a
1065 +single unit called a transaction.  When it's time to write the new metadata
1066 +out to disk, the associated data blocks are written first.  In general,
1067 +this mode performs slightly slower than writeback but significantly faster than journal mode.
1068 +
1069 +* journal mode
1070 +data=journal mode provides full data and metadata journaling.  All new data is
1071 +written to the journal first, and then to its final location.
1072 +In the event of a crash, the journal can be replayed, bringing both data and
1073 +metadata into a consistent state.  This mode is the slowest except when data
1074 +needs to be read from and written to disk at the same time where it
1075 +outperforms all other modes.  Enabling this mode will disable delayed
1076 +allocation and O_DIRECT support.
1077 +
1078 +/proc entries
1079 +=============
1080 +
1081 +Information about mounted ext4 file systems can be found in
1082 +/proc/fs/ext4.  Each mounted filesystem will have a directory in
1083 +/proc/fs/ext4 based on its device name (e.g., /proc/fs/ext4/hdc or
1084 +/proc/fs/ext4/dm-0).   The files in each per-device directory are shown
1085 +in the table below.
1086 +
1087 +Files in /proc/fs/ext4/<devname>
1088 +..............................................................................
1089 + File            Content
1090 + mb_groups       details of multiblock allocator buddy cache of free blocks
1091 +..............................................................................
1092 +
1093 +/sys entries
1094 +============
1095 +
1096 +Information about mounted ext4 file systems can be found in
1097 +/sys/fs/ext4.  Each mounted filesystem will have a directory in
1098 +/sys/fs/ext4 based on its device name (e.g., /sys/fs/ext4/hdc or
1099 +/sys/fs/ext4/dm-0).   The files in each per-device directory are shown
1100 +in the table below.
1101 +
1102 +Files in /sys/fs/ext4/<devname>
1103 +(see also Documentation/ABI/testing/sysfs-fs-ext4)
1104 +..............................................................................
1105 + File                         Content
1106 +
1107 + delayed_allocation_blocks    This file is read-only and shows the number of
1108 +                              blocks that are dirty in the page cache, but
1109 +                              which do not have their location in the
1110 +                              filesystem allocated yet.
1111 +
1112 + inode_goal                   Tuning parameter which (if non-zero) controls
1113 +                              the goal inode used by the inode allocator in
1114 +                              preference to all other allocation heuristics.
1115 +                              This is intended for debugging use only, and
1116 +                              should be 0 on production systems.
1117 +
1118 + inode_readahead_blks         Tuning parameter which controls the maximum
1119 +                              number of inode table blocks that ext4's inode
1120 +                              table readahead algorithm will pre-read into
1121 +                              the buffer cache.
1122 +
1123 + lifetime_write_kbytes        This file is read-only and shows the number of
1124 +                              kilobytes of data that have been written to this
1125 +                              filesystem since it was created.
1126 +
1127 + max_writeback_mb_bump        The maximum number of megabytes the writeback
1128 +                              code will try to write out before moving on to
1129 +                              another inode.
1130 +
1131 + mb_group_prealloc            The multiblock allocator will round up allocation
1132 +                              requests to a multiple of this tuning parameter if
1133 +                              the stripe size is not set in the ext4 superblock.
1134 +
1135 + mb_max_to_scan               The maximum number of extents the multiblock
1136 +                              allocator will search to find the best extent.
1137 +
1138 + mb_min_to_scan               The minimum number of extents the multiblock
1139 +                              allocator will search to find the best extent.
1140 +
1141 + mb_order2_req                Tuning parameter which controls the minimum size
1142 +                              for requests (as a power of 2) where the buddy
1143 +                              cache is used.
1144 +
1145 + mb_stats                     Controls whether the multiblock allocator should
1146 +                              collect statistics, which are shown during
1147 +                              unmount. 1 means to collect statistics, 0 means
1148 +                              not to collect statistics.
1149 +
1150 + mb_stream_req                Files which have fewer blocks than this tunable
1151 +                              parameter will have their blocks allocated out
1152 +                              of a block group specific preallocation pool, so
1153 +                              that small files are packed closely together.
1154 +                              Each large file will have its blocks allocated
1155 +                              out of its own unique preallocation pool.
1156 +
1157 + session_write_kbytes         This file is read-only and shows the number of
1158 +                              kilobytes of data that have been written to this
1159 +                              filesystem since it was mounted.
1160 +
1161 + reserved_clusters            This file is read-write and contains the number
1162 +                              of reserved clusters in the file system, which
1163 +                              are used in specific situations to avoid costly
1164 +                              zeroout, unexpected ENOSPC, or possible data
1165 +                              loss. The default is 2% or 4096 clusters,
1166 +                              whichever is smaller. This can be changed, but
1167 +                              it can never exceed the number of clusters in
1168 +                              the file system. If there is not enough space
1169 +                              for the reserved clusters when mounting the file
1170 +                              system, the mount will _not_ fail.
1171 +..............................................................................
1172 +
1173 +Ioctls
1174 +======
1175 +
1176 +There is some Ext4 specific functionality which can be accessed by applications
1177 +through the system call interfaces. The list of all Ext4 specific ioctls is
1178 +shown in the table below.
1179 +
1180 +Table of Ext4 specific ioctls
1181 +..............................................................................
1182 + Ioctl                       Description
1183 + EXT4_IOC_GETFLAGS           Get additional attributes associated with inode.
1184 +                             The ioctl argument is an integer bitfield, with
1185 +                             bit values described in ext4.h. This ioctl is an
1186 +                             alias for FS_IOC_GETFLAGS.
1187 +
1188 + EXT4_IOC_SETFLAGS           Set additional attributes associated with inode.
1189 +                             The ioctl argument is an integer bitfield, with
1190 +                             bit values described in ext4.h. This ioctl is an
1191 +                             alias for FS_IOC_SETFLAGS.
1192 +
1193 + EXT4_IOC_GETVERSION
1194 + EXT4_IOC_GETVERSION_OLD
1195 +                             Get the inode i_generation number stored for
1196 +                             each inode. The i_generation number is normally
1197 +                             changed only when a new inode is created, and it
1198 +                             is particularly useful for network filesystems.
1199 +                             The '_OLD' version of this ioctl is an alias for
1200 +                             FS_IOC_GETVERSION.
1201 +
1202 + EXT4_IOC_SETVERSION
1203 + EXT4_IOC_SETVERSION_OLD
1204 +                             Set the inode i_generation number stored for
1205 +                             each inode. The '_OLD' version of this ioctl
1206 +                             is an alias for FS_IOC_SETVERSION.
1207 +
1208 + EXT4_IOC_GROUP_EXTEND       This ioctl has the same purpose as the resize
1209 +                             mount option. It allows resizing the filesystem
1210 +                             to the end of the last existing block group;
1211 +                             further resizing has to be done with resize2fs,
1212 +                             either online or offline. The argument points
1213 +                             to an unsigned long number representing the
1214 +                             filesystem's new block count.
1215 +
1216 + EXT4_IOC_MOVE_EXT           Move the block extents from orig_fd (the one
1217 +                             this ioctl is pointing to) to the donor_fd (the
1218 +                             one specified in the move_extent structure passed
1219 +                             as an argument to this ioctl). Then, exchange
1220 +                             inode metadata between orig_fd and donor_fd.
1221 +                             This is especially useful for online
1222 +                             defragmentation, because the allocator has the
1223 +                             opportunity to allocate moved blocks better,
1224 +                             ideally into one contiguous extent.
1225 +
1226 + EXT4_IOC_GROUP_ADD          Add a new group descriptor to an existing or
1227 +                             new group descriptor block. The new group
1228 +                             descriptor is described by the
1229 +                             ext4_new_group_input structure, which is passed
1230 +                             as an argument to this ioctl. This is especially
1231 +                             useful in conjunction with EXT4_IOC_GROUP_EXTEND,
1232 +                             which allows online resize of the filesystem
1233 +                             to the end of the last existing block group.
1234 +                             Together, these two ioctls are used by userspace
1235 +                             online resize tools (e.g. resize2fs).
1236 +
1237 + EXT4_IOC_MIGRATE            This ioctl operates on the filesystem itself.
1238 +                             It converts (migrates) an ext3 indirect block
1239 +                             mapped inode to an ext4 extent mapped inode by
1240 +                             walking through the indirect block mapping of the
1241 +                             original inode and converting contiguous block
1242 +                             ranges into ext4 extents of a temporary inode.
1243 +                             Then, the inodes are swapped. This ioctl might
1244 +                             help when migrating from an ext3 to an ext4
1245 +                             filesystem; however, the suggestion is to create
1246 +                             a fresh ext4 filesystem and copy the data from
1247 +                             the backup. Note that the filesystem has to
1248 +                             support extents for this ioctl to work.
1249 +
1250 + EXT4_IOC_ALLOC_DA_BLKS      Force all of the delay allocated blocks to be
1251 +                             allocated to preserve application-expected ext3
1252 +                             behaviour. Note that this will also start
1253 +                             triggering a write of the data blocks, but this
1254 +                             behaviour may change in the future as it is
1255 +                             not necessary and has been done this way only
1256 +                             for the sake of simplicity.
1257 +
1258 + EXT4_IOC_RESIZE_FS          Resize the filesystem to a new size.  The number
1259 +                             of blocks of the resized filesystem is passed in
1260 +                             via a 64 bit integer argument.  The kernel
1261 +                             allocates the bitmaps and inode table, so the
1262 +                             userspace tool just passes the new block count.
1263 +
1264 + EXT4_IOC_SWAP_BOOT          Swap i_data and associated attributes
1265 +                             (like i_blocks, i_size, i_flags, ...) from
1266 +                             the specified inode with inode
1267 +                             EXT4_BOOT_LOADER_INO (#5). This is typically
1268 +                             used to store a boot loader in a secure part of
1269 +                             the filesystem, where it can't be changed by a
1270 +                             normal user by accident.
1271 +                             The data blocks of the previous boot loader
1272 +                             will be associated with the given inode.
1273 +
1274 +..............................................................................
1275 +
1276 +References
1277 +==========
1278 +
1279 +kernel source: <file:fs/ext4/>
1280 +               <file:fs/jbd2/>
1281 +
1282 +programs:      http://e2fsprogs.sourceforge.net/
1283 +
1284 +useful links:  http://fedoraproject.org/wiki/ext3-devel
1285 +               http://www.bullopensource.org/ext4/
1286 +               http://ext4.wiki.kernel.org/index.php/Main_Page
1287 +               http://fedoraproject.org/wiki/Features/Ext4
1288
1289