import-high-level-design-chapter-from-wiki-page

   1 ext4: import high level design chapter from wiki page
   2
   3 From: Darrick J. Wong <darrick.wong@oracle.com>
   4
   5 Import the chapter about high level design from the on-disk format wiki
   6 page into the kernel documentation.
   7
   8 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
   9 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  10 ---
  11  .../filesystems/ext4/ondisk/allocators.rst         |   56 ++++++++
  12  Documentation/filesystems/ext4/ondisk/bigalloc.rst |   22 +++
  13  .../filesystems/ext4/ondisk/blockgroup.rst         |  135 +++++++++++++++++++
  14  Documentation/filesystems/ext4/ondisk/blocks.rst   |  142 ++++++++++++++++++++
  15  .../filesystems/ext4/ondisk/checksums.rst          |   73 ++++++++++
  16  Documentation/filesystems/ext4/ondisk/eainode.rst  |   18 +++
  17  Documentation/filesystems/ext4/ondisk/index.rst    |    1
  18  .../filesystems/ext4/ondisk/inlinedata.rst         |   37 +++++
  19  Documentation/filesystems/ext4/ondisk/overview.rst |   26 ++++
  20  .../filesystems/ext4/ondisk/special_inodes.rst     |   38 +++++
  21  10 files changed, 548 insertions(+)
  22  create mode 100644 Documentation/filesystems/ext4/ondisk/allocators.rst
  23  create mode 100644 Documentation/filesystems/ext4/ondisk/bigalloc.rst
  24  create mode 100644 Documentation/filesystems/ext4/ondisk/blockgroup.rst
  25  create mode 100644 Documentation/filesystems/ext4/ondisk/blocks.rst
  26  create mode 100644 Documentation/filesystems/ext4/ondisk/checksums.rst
  27  create mode 100644 Documentation/filesystems/ext4/ondisk/eainode.rst
  28  create mode 100644 Documentation/filesystems/ext4/ondisk/inlinedata.rst
  29  create mode 100644 Documentation/filesystems/ext4/ondisk/overview.rst
  30  create mode 100644 Documentation/filesystems/ext4/ondisk/special_inodes.rst
  31
  32
  33 diff --git a/Documentation/filesystems/ext4/ondisk/allocators.rst b/Documentation/filesystems/ext4/ondisk/allocators.rst
  34 new file mode 100644
  35 index 000000000000..7aa85152ace3
  36 --- /dev/null
  37 +++ b/Documentation/filesystems/ext4/ondisk/allocators.rst
  38 @@ -0,0 +1,56 @@
  39 +.. SPDX-License-Identifier: GPL-2.0
  40 +
  41 +Block and Inode Allocation Policy
  42 +---------------------------------
  43 +
  44 +ext4 recognizes (better than ext3, anyway) that data locality is
  45 +generally a desirable quality of a filesystem. On a spinning disk,
  46 +keeping related blocks near each other reduces the amount of movement
  47 +that the head actuator and disk must perform to access a data block,
  48 +thus speeding up disk IO. On an SSD there of course are no moving parts,
  49 +but locality can increase the size of each transfer request while
  50 +reducing the total number of requests. This locality may also have the
  51 +effect of concentrating writes on a single erase block, which can speed
  52 +up file rewrites significantly. Therefore, it is useful to reduce
  53 +fragmentation whenever possible.
  54 +
  55 +The first tool that ext4 uses to combat fragmentation is the multi-block
  56 +allocator. When a file is first created, the block allocator
  57 +speculatively allocates 8KiB of disk space to the file on the assumption
  58 +that the space will get written soon. When the file is closed, the
  59 +unused speculative allocations are of course freed, but if the
  60 +speculation is correct (typically the case for full writes of small
  61 +files) then the file data gets written out in a single multi-block
  62 +extent. A second related trick that ext4 uses is delayed allocation.
  63 +Under this scheme, when a file needs more blocks to absorb file writes,
  64 +the filesystem defers deciding the exact placement on the disk until all
  65 +the dirty buffers are being written out to disk. By not committing to a
  66 +particular placement until it's absolutely necessary (the commit timeout
  67 +is hit, or sync() is called, or the kernel runs out of memory), the hope
  68 +is that the filesystem can make better location decisions.
  69 +
  70 +The third trick that ext4 (and ext3) uses is that it tries to keep a
  71 +file's data blocks in the same block group as its inode. This cuts down
  72 +on the seek penalty when the filesystem first has to read a file's inode
  73 +to learn where the file's data blocks live and then seek over to the
  74 +file's data blocks to begin I/O operations.
  75 +
  76 +The fourth trick is that all the inodes in a directory are placed in the
  77 +same block group as the directory, when feasible. The working assumption
  78 +here is that all the files in a directory might be related, therefore it
  79 +is useful to try to keep them all together.
  80 +
  81 +The fifth trick is that the disk volume is cut up into 128MiB block
  82 +groups; these mini-containers are used as outlined above to try to
  83 +maintain data locality. However, there is a deliberate quirk -- when a
  84 +directory is created in the root directory, the inode allocator scans
  85 +the block groups and puts that directory into the least heavily loaded
  86 +block group that it can find. This encourages directories to spread out
  87 +over a disk; as the top-level directory/file blobs fill up one block
  88 +group, the allocators simply move on to the next block group. Allegedly
  89 +this scheme evens out the loading on the block groups, though the author
  90 +suspects that the directories which are so unlucky as to land towards
  91 +the end of a spinning drive get a raw deal performance-wise.
  92 +
  93 +Of course if all of these mechanisms fail, one can always use e4defrag
  94 +to defragment files.
  95 diff --git a/Documentation/filesystems/ext4/ondisk/bigalloc.rst b/Documentation/filesystems/ext4/ondisk/bigalloc.rst
  96 new file mode 100644
  97 index 000000000000..c6d88557553c
  98 --- /dev/null
  99 +++ b/Documentation/filesystems/ext4/ondisk/bigalloc.rst
 100 @@ -0,0 +1,22 @@
 101 +.. SPDX-License-Identifier: GPL-2.0
 102 +
 103 +Bigalloc
 104 +--------
 105 +
 106 +At the moment, the default size of a block is 4KiB, which is a commonly
 107 +supported page size on most MMU-capable hardware. This is fortunate, as
 108 +ext4 code is not prepared to handle the case where the block size
 109 +exceeds the page size. However, for a filesystem of mostly huge files,
 110 +it is desirable to be able to allocate disk blocks in units of multiple
 111 +blocks to reduce both fragmentation and metadata overhead. The
 112 +`bigalloc <Bigalloc>`__ feature provides exactly this ability. The
 113 +administrator can set a block cluster size at mkfs time (which is stored
 114 +in the s\_log\_cluster\_size field in the superblock); from then on, the
 115 +block bitmaps track clusters, not individual blocks. This means that
 116 +block groups can be several gigabytes in size (instead of just 128MiB);
 117 +however, the minimum allocation unit becomes a cluster, not a block,
 118 +even for directories. TaoBao had a patchset to extend the “use units of
 119 +clusters instead of blocks” to the extent tree, though it is not clear
 120 +where those patches went-- they eventually morphed into “extent tree v2”
 121 +but that code has not landed as of May 2015.
 122 +
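A rough sketch of the bigalloc sizing arithmetic, assuming each group still has a single bitmap block whose bits now track clusters. The function name and the `blocks_per_cluster` parameter are hypothetical; how the cluster size is encoded in s\_log\_cluster\_size is not modelled here.

```python
def bigalloc_group_size_bytes(block_size: int, blocks_per_cluster: int = 1) -> int:
    """Bytes covered by one block group when the bitmap tracks clusters."""
    # One bitmap block holds 8 bits per byte; with bigalloc each bit
    # stands for a whole cluster rather than a single block.
    clusters_per_group = 8 * block_size
    return clusters_per_group * blocks_per_cluster * block_size


MIB = 2**20
GIB = 2**30
no_bigalloc = bigalloc_group_size_bytes(4096)        # plain 4KiB blocks
with_bigalloc = bigalloc_group_size_bytes(4096, 16)  # 64KiB clusters (illustrative)
```

With 4KiB blocks and an illustrative 16 blocks per cluster, the group grows from 128MiB to 2GiB, which matches the "several gigabytes" remark above.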
 123 diff --git a/Documentation/filesystems/ext4/ondisk/blockgroup.rst b/Documentation/filesystems/ext4/ondisk/blockgroup.rst
 124 new file mode 100644
 125 index 000000000000..baf888e4c06a
 126 --- /dev/null
 127 +++ b/Documentation/filesystems/ext4/ondisk/blockgroup.rst
 128 @@ -0,0 +1,135 @@
 129 +.. SPDX-License-Identifier: GPL-2.0
 130 +
 131 +Layout
 132 +------
 133 +
 134 +The layout of a standard block group is approximately as follows (each
 135 +of these fields is discussed in a separate section below):
 136 +
 137 +.. list-table::
 138 +   :widths: 1 1 1 1 1 1 1 1
 139 +   :header-rows: 1
 140 +
 141 +   * - Group 0 Padding
 142 +     - ext4 Super Block
 143 +     - Group Descriptors
 144 +     - Reserved GDT Blocks
 145 +     - Data Block Bitmap
 146 +     - inode Bitmap
 147 +     - inode Table
 148 +     - Data Blocks
 149 +   * - 1024 bytes
 150 +     - 1 block
 151 +     - many blocks
 152 +     - many blocks
 153 +     - 1 block
 154 +     - 1 block
 155 +     - many blocks
 156 +     - many more blocks
 157 +
 158 +For the special case of block group 0, the first 1024 bytes are unused,
 159 +to allow for the installation of x86 boot sectors and other oddities.
 160 +The superblock will start at offset 1024 bytes, whichever block that
 161 +happens to be (usually 0). However, if for some reason the block size =
 162 +1024, then block 0 is marked in use and the superblock goes in block 1.
 163 +For all other block groups, there is no padding.
 164 +
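The superblock placement rule can be sketched as follows. This is only an illustration of the paragraph above; the helper names are hypothetical.

```python
SUPERBLOCK_OFFSET = 1024  # bytes of padding at the start of group 0


def superblock_block(block_size: int) -> int:
    """Block number that holds the primary superblock."""
    return SUPERBLOCK_OFFSET // block_size


def superblock_offset_in_block(block_size: int) -> int:
    """Byte offset of the superblock inside that block."""
    return SUPERBLOCK_OFFSET % block_size
```

For a 4KiB block size this gives block 0 at byte offset 1024; for a 1KiB block size it gives block 1 at offset 0, as described above.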
 165 +The ext4 driver primarily works with the superblock and the group
 166 +descriptors that are found in block group 0. Redundant copies of the
 167 +superblock and group descriptors are written to some of the block groups
 168 +across the disk in case the beginning of the disk gets trashed, though
 169 +not all block groups necessarily host a redundant copy (see following
 170 +paragraph for more details). If the group does not have a redundant
 171 +copy, the block group begins with the data block bitmap. Note also that
 172 +when the filesystem is freshly formatted, mkfs will allocate “reserve
 173 +GDT block” space after the block group descriptors and before the start
 174 +of the block bitmaps to allow for future expansion of the filesystem. By
 175 +default, a filesystem is allowed to increase in size by a factor of
 176 +1024x over the original filesystem size.
 177 +
 178 +The location of the inode table is given by ``grp.bg_inode_table_*``. It
 179 +is a continuous range of blocks large enough to contain
 180 +``sb.s_inodes_per_group * sb.s_inode_size`` bytes.
 181 +
 182 +As for the ordering of items in a block group, it is generally
 183 +established that the super block and the group descriptor table, if
 184 +present, will be at the beginning of the block group. The bitmaps and
 185 +the inode table can be anywhere, and it is quite possible for the
 186 +bitmaps to come after the inode table, or for both to be in different
 187 +groups (flex\_bg). Leftover space is used for file data blocks, indirect
 188 +block maps, extent tree blocks, and extended attributes.
 189 +
 190 +Flexible Block Groups
 191 +---------------------
 192 +
 193 +Starting in ext4, there is a new feature called flexible block groups
 194 +(flex\_bg). In a flex\_bg, several block groups are tied together as one
 195 +logical block group; the bitmap spaces and the inode table space in the
 196 +first block group of the flex\_bg are expanded to include the bitmaps
 197 +and inode tables of all other block groups in the flex\_bg. For example,
 198 +if the flex\_bg size is 4, then group 0 will contain (in order) the
 199 +superblock, group descriptors, data block bitmaps for groups 0-3, inode
 200 +bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining
 201 +space in group 0 is for file data. The effect of this is to group the
 202 +block metadata close together for faster loading, and to enable large
 203 +files to be continuous on disk. Backup copies of the superblock and
 204 +group descriptors are always at the beginning of block groups, even if
 205 +flex\_bg is enabled. The number of block groups that make up a flex\_bg
 206 +is given by 2 ^ ``sb.s_log_groups_per_flex``.
 207 +
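A small sketch of the flex\_bg arithmetic. It assumes that a flex\_bg is an aligned run of consecutive block groups, which the text implies but does not state outright; the function names are hypothetical.

```python
def groups_per_flex(s_log_groups_per_flex: int) -> int:
    """Number of block groups in one flex_bg: 2 ^ s_log_groups_per_flex."""
    return 1 << s_log_groups_per_flex


def flex_group_of(group: int, s_log_groups_per_flex: int) -> int:
    """Index of the flex_bg containing a block group (assumed aligned runs)."""
    return group >> s_log_groups_per_flex


def first_group_in_flex(group: int, s_log_groups_per_flex: int) -> int:
    """First block group of the flex_bg, where the packed metadata lives."""
    return flex_group_of(group, s_log_groups_per_flex) << s_log_groups_per_flex
```

With a flex\_bg size of 4 (a log value of 2), groups 4 through 7 share one flex\_bg and their bitmaps and inode tables would sit in group 4.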
 208 +Meta Block Groups
 209 +-----------------
 210 +
 211 +Without the option META\_BG, for safety concerns, all block group
 212 +descriptor copies are kept in the first block group. Given the default
 213 +128MiB (2^27 bytes) block group size and 64-byte group descriptors, ext4
 214 +can have at most 2^27/64 = 2^21 block groups. This limits the entire
 215 +filesystem size to 2^21 ∗ 2^27 = 2^48 bytes or 256TiB.
 216 +
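The limit above can be reproduced with a few lines of arithmetic. This is a sketch only; the function names are hypothetical and the defaults are the 4KiB block size and 64-byte descriptors used in the text.

```python
def group_bytes(block_size: int = 4096) -> int:
    """Size of one block group: 8 * block_size blocks of block_size bytes."""
    return 8 * block_size * block_size


def max_groups_without_meta_bg(block_size: int = 4096, desc_size: int = 64) -> int:
    """All descriptors must fit inside the first block group."""
    return group_bytes(block_size) // desc_size


def max_fs_bytes_without_meta_bg(block_size: int = 4096, desc_size: int = 64) -> int:
    return max_groups_without_meta_bg(block_size, desc_size) * group_bytes(block_size)
```

The defaults give 2^21 groups and 2^48 bytes (256TiB), matching the figures above.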
 217 +The solution to this problem is to use the metablock group feature
 218 +(META\_BG), which is already in ext3 for all 2.6 releases. With the
 219 +META\_BG feature, ext4 filesystems are partitioned into many metablock
 220 +groups. Each metablock group is a cluster of block groups whose group
 221 +descriptor structures can be stored in a single disk block. For ext4
 222 +filesystems with 4 KB block size, a single metablock group partition
 223 +includes 64 block groups, or 8 GiB of disk space. The metablock group
 224 +feature moves the location of the group descriptors from the congested
 225 +first block group of the whole filesystem into the first group of each
 226 +metablock group itself. The backups are in the second and last group of
 227 +each metablock group. This increases the 2^21 maximum block groups limit
 228 +to the hard limit 2^32, allowing support for a 512PiB filesystem.
 229 +
 230 +The change in the filesystem format replaces the current scheme where
 231 +the superblock is followed by a variable-length set of block group
 232 +descriptors. Instead, the superblock and a single block group descriptor
 233 +block are placed at the beginning of the first, second, and last block
 234 +groups in a meta-block group. A meta-block group is a collection of
 235 +block groups which can be described by a single block group descriptor
 236 +block. Since the size of the block group descriptor structure is 32
 237 +bytes, a meta-block group contains 32 block groups for filesystems with
 238 +a 1KB block size, and 128 block groups for filesystems with a 4KB
 239 +blocksize. Filesystems can either be created using this new block group
 240 +descriptor layout, or existing filesystems can be resized on-line, and
 241 +the field s\_first\_meta\_bg in the superblock will indicate the first
 242 +block group using this new layout.
 243 +
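The meta-block group sizes quoted in the two paragraphs above follow from dividing the block size by the descriptor size. The sketch below also lists the groups that hold the descriptor block and its backups (first, second, and last group of each meta-block group), as the text describes. The function names are hypothetical.

```python
def groups_per_meta_bg(block_size: int, desc_size: int) -> int:
    """Block groups whose descriptors fit in a single disk block."""
    return block_size // desc_size


def descriptor_copy_groups(meta_bg: int, block_size: int, desc_size: int) -> tuple:
    """Groups holding the descriptor block and its two backups."""
    n = groups_per_meta_bg(block_size, desc_size)
    first = meta_bg * n
    return (first, first + 1, first + n - 1)
```

With 4KiB blocks this yields 64 groups for 64-byte descriptors and 128 groups for 32-byte descriptors; with 1KiB blocks and 32-byte descriptors it yields 32 groups.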
 244 +Please see an important note about ``BLOCK_UNINIT`` in the section about
 245 +block and inode bitmaps.
 246 +
 247 +Lazy Block Group Initialization
 248 +-------------------------------
 249 +
 250 +New in ext4 are three block group descriptor flags that
 251 +enable mkfs to skip initializing other parts of the block group
 252 +metadata. Specifically, the INODE\_UNINIT and BLOCK\_UNINIT flags mean
 253 +that the inode and block bitmaps for that group can be calculated and
 254 +therefore the on-disk bitmap blocks are not initialized. This is
 255 +generally the case for an empty block group or a block group containing
 256 +only fixed-location block group metadata. The INODE\_ZEROED flag means
 257 +that the inode table has been initialized; mkfs will unset this flag and
 258 +rely on the kernel to initialize the inode tables in the background.
 259 +
 260 +By not writing zeroes to the bitmaps and inode table, mkfs time is
 261 +reduced considerably. Note the feature flag is RO\_COMPAT\_GDT\_CSUM,
 262 +but the dumpe2fs output prints this as “uninit\_bg”. They are the same
 263 +thing.
 264 diff --git a/Documentation/filesystems/ext4/ondisk/blocks.rst b/Documentation/filesystems/ext4/ondisk/blocks.rst
 265 new file mode 100644
 266 index 000000000000..73d4dc0f7bda
 267 --- /dev/null
 268 +++ b/Documentation/filesystems/ext4/ondisk/blocks.rst
 269 @@ -0,0 +1,142 @@
 270 +.. SPDX-License-Identifier: GPL-2.0
 271 +
 272 +Blocks
 273 +------
 274 +
 275 +ext4 allocates storage space in units of “blocks”. A block is a group of
 276 +sectors between 1KiB and 64KiB, and the number of sectors must be an
 277 +integral power of 2. Blocks are in turn grouped into larger units called
 278 +block groups. Block size is specified at mkfs time and typically is
 279 +4KiB. You may experience mounting problems if block size is greater than
 280 +page size (e.g. 64KiB blocks on an i386 which only has 4KiB memory
 281 +pages). By default a filesystem can contain 2^32 blocks; if the '64bit'
 282 +feature is enabled, then a filesystem can have 2^64 blocks.
 283 +
 284 +For 32-bit filesystems, limits are as follows:
 285 +
 286 +.. list-table::
 287 +   :widths: 1 1 1 1 1
 288 +   :header-rows: 1
 289 +
 290 +   * - Item
 291 +     - 1KiB
 292 +     - 2KiB
 293 +     - 4KiB
 294 +     - 64KiB
 295 +   * - Blocks
 296 +     - 2^32
 297 +     - 2^32
 298 +     - 2^32
 299 +     - 2^32
 300 +   * - Inodes
 301 +     - 2^32
 302 +     - 2^32
 303 +     - 2^32
 304 +     - 2^32
 305 +   * - File System Size
 306 +     - 4TiB
 307 +     - 8TiB
 308 +     - 16TiB
 309 +     - 256TiB
 310 +   * - Blocks Per Block Group
 311 +     - 8,192
 312 +     - 16,384
 313 +     - 32,768
 314 +     - 524,288
 315 +   * - Inodes Per Block Group
 316 +     - 8,192
 317 +     - 16,384
 318 +     - 32,768
 319 +     - 524,288
 320 +   * - Block Group Size
 321 +     - 8MiB
 322 +     - 32MiB
 323 +     - 128MiB
 324 +     - 32GiB
 325 +   * - Blocks Per File, Extents
 326 +     - 2^32
 327 +     - 2^32
 328 +     - 2^32
 329 +     - 2^32
 330 +   * - Blocks Per File, Block Maps
 331 +     - 16,843,020
 332 +     - 134,480,396
 333 +     - 1,074,791,436
 334 +     - 4,398,314,962,956 (really 2^32 due to field size limitations)
 335 +   * - File Size, Extents
 336 +     - 4TiB
 337 +     - 8TiB
 338 +     - 16TiB
 339 +     - 256TiB
 340 +   * - File Size, Block Maps
 341 +     - 16GiB
 342 +     - 256GiB
 343 +     - 4TiB
 344 +     - 256TiB
 345 +
 346 +For 64-bit filesystems, limits are as follows:
 347 +
 348 +.. list-table::
 349 +   :widths: 1 1 1 1 1
 350 +   :header-rows: 1
 351 +
 352 +   * - Item
 353 +     - 1KiB
 354 +     - 2KiB
 355 +     - 4KiB
 356 +     - 64KiB
 357 +   * - Blocks
 358 +     - 2^64
 359 +     - 2^64
 360 +     - 2^64
 361 +     - 2^64
 362 +   * - Inodes
 363 +     - 2^32
 364 +     - 2^32
 365 +     - 2^32
 366 +     - 2^32
 367 +   * - File System Size
 368 +     - 16ZiB
 369 +     - 32ZiB
 370 +     - 64ZiB
 371 +     - 1YiB
 372 +   * - Blocks Per Block Group
 373 +     - 8,192
 374 +     - 16,384
 375 +     - 32,768
 376 +     - 524,288
 377 +   * - Inodes Per Block Group
 378 +     - 8,192
 379 +     - 16,384
 380 +     - 32,768
 381 +     - 524,288
 382 +   * - Block Group Size
 383 +     - 8MiB
 384 +     - 32MiB
 385 +     - 128MiB
 386 +     - 32GiB
 387 +   * - Blocks Per File, Extents
 388 +     - 2^32
 389 +     - 2^32
 390 +     - 2^32
 391 +     - 2^32
 392 +   * - Blocks Per File, Block Maps
 393 +     - 16,843,020
 394 +     - 134,480,396
 395 +     - 1,074,791,436
 396 +     - 4,398,314,962,956 (really 2^32 due to field size limitations)
 397 +   * - File Size, Extents
 398 +     - 4TiB
 399 +     - 8TiB
 400 +     - 16TiB
 401 +     - 256TiB
 402 +   * - File Size, Block Maps
 403 +     - 16GiB
 404 +     - 256GiB
 405 +     - 4TiB
 406 +     - 256TiB
 407 +
 408 +Note: Files not using extents (i.e. files using block maps) must be
 409 +placed within the first 2^32 blocks of a filesystem. Files with extents
 410 +must be placed within the first 2^48 blocks of a filesystem. It's not
 411 +clear what happens with larger filesystems.
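The "Blocks Per File, Block Maps" rows in the tables above can be reproduced with the sketch below. It assumes the classic block map layout, which this chapter does not spell out: 12 direct pointers plus one single, one double, and one triple indirect block, with 4-byte block numbers. The function names are hypothetical.

```python
DIRECT_POINTERS = 12   # assumption: direct block pointers in the inode
POINTER_SIZE = 4       # assumption: bytes per 32-bit block number


def max_blocks_block_map(block_size: int) -> int:
    """Blocks addressable through direct and indirect block maps."""
    ptrs = block_size // POINTER_SIZE  # pointers per indirect block
    return DIRECT_POINTERS + ptrs + ptrs**2 + ptrs**3


def effective_max_blocks(block_size: int) -> int:
    """Apply the 2^32 cap that the table attributes to field size limits."""
    return min(max_blocks_block_map(block_size), 2**32)
```

Only the 64KiB column exceeds 2^32, which is why the table notes that the practical limit there is 2^32 blocks.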
 412 diff --git a/Documentation/filesystems/ext4/ondisk/checksums.rst b/Documentation/filesystems/ext4/ondisk/checksums.rst
 413 new file mode 100644
 414 index 000000000000..9d6a793b2e03
 415 --- /dev/null
 416 +++ b/Documentation/filesystems/ext4/ondisk/checksums.rst
 417 @@ -0,0 +1,73 @@
 418 +.. SPDX-License-Identifier: GPL-2.0
 419 +
 420 +Checksums
 421 +---------
 422 +
 423 +Starting in early 2012, metadata checksums were added to all major ext4
 424 +and jbd2 data structures. The associated feature flag is metadata\_csum.
 425 +The desired checksum algorithm is indicated in the superblock, though as
 426 +of October 2012 the only supported algorithm is crc32c. Some data
 427 +structures did not have space to fit a full 32-bit checksum, so only the
 428 +lower 16 bits are stored. Enabling the 64bit feature increases the data
 429 +structure size so that full 32-bit checksums can be stored for many data
 430 +structures. However, existing 32-bit filesystems cannot be extended to
 431 +enable 64bit mode, at least not without the experimental resize2fs
 432 +patches to do so.
 433 +
 434 +Existing filesystems can have checksumming added by running
 435 +``tune2fs -O metadata_csum`` against the underlying device. If tune2fs
 436 +encounters directory blocks that lack sufficient empty space to add a
 437 +checksum, it will request that you run ``e2fsck -D`` to have the
 438 +directories rebuilt with checksums. This has the added benefit of
 439 +removing slack space from the directory files and rebalancing the htree
 440 +indexes. If you *ignore* this step, your directories will not be
 441 +protected by a checksum!
 442 +
 443 +The following table describes the data elements that go into each type
 444 +of checksum. The checksum function is whatever the superblock describes
 445 +(crc32c as of October 2013) unless noted otherwise.
 446 +
 447 +.. list-table::
 448 +   :widths: 1 1 4
 449 +   :header-rows: 1
 450 +
 451 +   * - Metadata
 452 +     - Length
 453 +     - Ingredients
 454 +   * - Superblock
 455 +     - \_\_le32
 456 +     - The entire superblock up to the checksum field. The UUID lives inside
 457 +       the superblock.
 458 +   * - MMP
 459 +     - \_\_le32
 460 +     - UUID + the entire MMP block up to the checksum field.
 461 +   * - Extended Attributes
 462 +     - \_\_le32
 463 +     - UUID + the entire extended attribute block. The checksum field is set to
 464 +       zero.
 465 +   * - Directory Entries
 466 +     - \_\_le32
 467 +     - UUID + inode number + inode generation + the directory block up to the
 468 +       fake entry enclosing the checksum field.
 469 +   * - HTREE Nodes
 470 +     - \_\_le32
 471 +     - UUID + inode number + inode generation + all valid extents + HTREE tail.
 472 +       The checksum field is set to zero.
 473 +   * - Extents
 474 +     - \_\_le32
 475 +     - UUID + inode number + inode generation + the entire extent block up to
 476 +       the checksum field.
 477 +   * - Bitmaps
 478 +     - \_\_le32 or \_\_le16
 479 +     - UUID + the entire bitmap. Checksums are stored in the group descriptor,
 480 +       and truncated if the group descriptor size is 32 bytes (i.e. ^64bit)
 481 +   * - Inodes
 482 +     - \_\_le32
 483 +     - UUID + inode number + inode generation + the entire inode. The checksum
 484 +       field is set to zero. Each inode has its own checksum.
 485 +   * - Group Descriptors
 486 +     - \_\_le16
 487 +     - If metadata\_csum, then UUID + group number + the entire descriptor;
 488 +       else if gdt\_csum, then crc16(UUID + group number + the entire
 489 +       descriptor). In all cases, only the lower 16 bits are stored.
 490 +
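For readers unfamiliar with crc32c, here is a bit-by-bit sketch of the standard CRC-32C (Castagnoli) function, followed by the "lower 16 bits" truncation mentioned for group descriptors. It uses the conventional initial value and final XOR of 0xFFFFFFFF; the kernel's own seeding conventions and the exact byte ranges in the table are not reproduced here, so treat this as an illustration of the algorithm rather than a way to verify on-disk checksums.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected


def crc32c(data: bytes) -> int:
    """Standard CRC-32C, computed one bit at a time (slow but simple)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


def low16(checksum: int) -> int:
    """Keep only the lower 16 bits, as done when a field is too small."""
    return checksum & 0xFFFF


check = crc32c(b"123456789")  # the usual CRC check string
```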
 491 diff --git a/Documentation/filesystems/ext4/ondisk/eainode.rst b/Documentation/filesystems/ext4/ondisk/eainode.rst
 492 new file mode 100644
 493 index 000000000000..ecc0d01a0a72
 494 --- /dev/null
 495 +++ b/Documentation/filesystems/ext4/ondisk/eainode.rst
 496 @@ -0,0 +1,18 @@
 497 +.. SPDX-License-Identifier: GPL-2.0
 498 +
 499 +Large Extended Attribute Values
 500 +-------------------------------
 501 +
 502 +To enable ext4 to store extended attribute values that do not fit in the
 503 +inode or in the single extended attribute block attached to an inode,
 504 +the EA\_INODE feature allows us to store the value in the data blocks of
 505 +a regular file inode. This “EA inode” is linked only from the extended
 506 +attribute name index and must not appear in a directory entry. The
 507 +inode's i\_atime field is used to store a checksum of the xattr value;
 508 +and i\_ctime/i\_version store a 64-bit reference count, which enables
 509 +sharing of large xattr values between multiple owning inodes. For
 510 +backward compatibility with older versions of this feature, the
 511 +i\_mtime/i\_generation *may* store a back-reference to the inode number
 512 +and i\_generation of the **one** owning inode (in cases where the EA
 513 +inode is not referenced by multiple inodes) to verify that the EA inode
 514 +is the correct one being accessed.
 515 diff --git a/Documentation/filesystems/ext4/ondisk/index.rst b/Documentation/filesystems/ext4/ondisk/index.rst
 516 index 98cde12ee8cb..282ba197b6b2 100644
 517 --- a/Documentation/filesystems/ext4/ondisk/index.rst
 518 +++ b/Documentation/filesystems/ext4/ondisk/index.rst
 519 @@ -4,3 +4,4 @@
 520  Data Structures and Algorithms
 521  ==============================
 522  .. include:: about.rst
 523 +.. include:: overview.rst
 524 diff --git a/Documentation/filesystems/ext4/ondisk/inlinedata.rst b/Documentation/filesystems/ext4/ondisk/inlinedata.rst
 525 new file mode 100644
 526 index 000000000000..d1075178ce0b
 527 --- /dev/null
 528 +++ b/Documentation/filesystems/ext4/ondisk/inlinedata.rst
 529 @@ -0,0 +1,37 @@
 530 +.. SPDX-License-Identifier: GPL-2.0
 531 +
 532 +Inline Data
 533 +-----------
 534 +
 535 +The inline data feature was designed to handle the case that a file's
 536 +data is so tiny that it readily fits inside the inode, which
 537 +(theoretically) reduces disk block consumption and reduces seeks. If the
 538 +file is smaller than 60 bytes, then the data are stored inline in
 539 +``inode.i_block``. If the rest of the file would fit inside the extended
 540 +attribute space, then it might be found as an extended attribute
 541 +“system.data” within the inode body (“ibody EA”). This of course
 542 +constrains the amount of extended attributes one can attach to an inode.
 543 +If the data size increases beyond i\_block + ibody EA, a regular block
 544 +is allocated and the contents moved to that block.
 545 +
 546 +Pending a change to compact the extended attribute key used to store
 547 +inline data, one ought to be able to store 160 bytes of data in a
 548 +256-byte inode (as of June 2015, when i\_extra\_isize is 28). Prior to
 549 +that, the limit was 156 bytes due to inefficient use of inode space.
 550 +
 551 +The inline data feature requires the presence of an extended attribute
 552 +for “system.data”, even if the attribute value is zero length.
 553 +
 554 +Inline Directories
 555 +~~~~~~~~~~~~~~~~~~
 556 +
 557 +The first four bytes of i\_block are the inode number of the parent
 558 +directory. Following that is a 56-byte space for an array of directory
 559 +entries; see ``struct ext4_dir_entry``. If there is a “system.data”
 560 +attribute in the inode body, the EA value is an array of
 561 +``struct ext4_dir_entry`` as well. Note that for inline directories, the
 562 +i\_block and EA space are treated as separate dirent blocks; directory
 563 +entries cannot span the two.
 564 +
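A minimal sketch of how the 60-byte i\_block area of an inline directory splits into the parent inode number and the dirent space. It assumes the parent inode number is little-endian like other ext4 fields; the function name is hypothetical and the directory entries themselves are not parsed.

```python
import struct

I_BLOCK_SIZE = 60  # bytes available in inode.i_block


def split_inline_dir(i_block: bytes):
    """Return (parent inode number, 56-byte dirent area)."""
    if len(i_block) != I_BLOCK_SIZE:
        raise ValueError("i_block must be exactly 60 bytes")
    (parent_ino,) = struct.unpack_from("<I", i_block, 0)
    return parent_ino, i_block[4:]


# Synthetic example: a directory whose parent is inode 2 (the root).
sample = struct.pack("<I", 2) + bytes(56)
parent, dirents = split_inline_dir(sample)
```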
 565 +Inline directory entries are not checksummed, as the inode checksum
 566 +should protect all inline data contents.
 567 diff --git a/Documentation/filesystems/ext4/ondisk/overview.rst b/Documentation/filesystems/ext4/ondisk/overview.rst
 568 new file mode 100644
 569 index 000000000000..cbab18baba12
 570 --- /dev/null
 571 +++ b/Documentation/filesystems/ext4/ondisk/overview.rst
 572 @@ -0,0 +1,26 @@
 573 +.. SPDX-License-Identifier: GPL-2.0
 574 +
 575 +High Level Design
 576 +=================
 577 +
 578 +An ext4 file system is split into a series of block groups. To reduce
 579 +performance difficulties due to fragmentation, the block allocator tries
 580 +very hard to keep each file's blocks within the same group, thereby
 581 +reducing seek times. The size of a block group is specified in
 582 +``sb.s_blocks_per_group`` blocks, though it can also be calculated as 8 \*
 583 +``block_size_in_bytes``. With the default block size of 4KiB, each group
 584 +will contain 32,768 blocks, for a length of 128MiB. The number of block
 585 +groups is the size of the device divided by the size of a block group.
 586 +
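The block group arithmetic above can be sketched as follows. The function names are hypothetical, and rounding up for a partial final group is an assumption that goes beyond the text.

```python
def blocks_per_group(block_size: int) -> int:
    """One bitmap block has 8 bits per byte, each tracking one block."""
    return 8 * block_size


def group_size_bytes(block_size: int) -> int:
    return blocks_per_group(block_size) * block_size


def group_count(device_bytes: int, block_size: int) -> int:
    # Assumption: a trailing partial group still counts, so round up.
    size = group_size_bytes(block_size)
    return -(-device_bytes // size)
```

For a 4KiB block size this gives 32,768 blocks per group and 128MiB per group, so a 1TiB device would hold 8,192 block groups.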
 587 +All fields in ext4 are written to disk in little-endian order. HOWEVER,
 588 +all fields in jbd2 (the journal) are written to disk in big-endian
 589 +order.
 590 +
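To make the byte-order difference concrete, the sketch below packs and unpacks two magic numbers with Python's `struct` module. The magic values (0xEF53 for the ext4 superblock and 0xC03B3998 for jbd2 block headers) come from the wider on-disk format rather than from this chapter, and the helper names are hypothetical.

```python
import struct

EXT4_MAGIC = 0xEF53       # 16-bit superblock magic, little-endian on disk
JBD2_MAGIC = 0xC03B3998   # 32-bit journal header magic, big-endian on disk

ext4_bytes = struct.pack("<H", EXT4_MAGIC)  # "<" means little-endian
jbd2_bytes = struct.pack(">I", JBD2_MAGIC)  # ">" means big-endian


def read_le16(buf: bytes, offset: int = 0) -> int:
    """Decode an ext4-style 16-bit little-endian field."""
    return struct.unpack_from("<H", buf, offset)[0]


def read_be32(buf: bytes, offset: int = 0) -> int:
    """Decode a jbd2-style 32-bit big-endian field."""
    return struct.unpack_from(">I", buf, offset)[0]
```

Reading a jbd2 field with the little-endian helper (or the reverse) silently yields a byte-swapped value, which is the mistake the warning above is meant to prevent.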
 591 +.. include:: blocks.rst
 592 +.. include:: blockgroup.rst
 593 +.. include:: special_inodes.rst
 594 +.. include:: allocators.rst
 595 +.. include:: checksums.rst
 596 +.. include:: bigalloc.rst
 597 +.. include:: inlinedata.rst
 598 +.. include:: eainode.rst
 599 diff --git a/Documentation/filesystems/ext4/ondisk/special_inodes.rst b/Documentation/filesystems/ext4/ondisk/special_inodes.rst
 600 new file mode 100644
 601 index 000000000000..a82f70c9baeb
 602 --- /dev/null
 603 +++ b/Documentation/filesystems/ext4/ondisk/special_inodes.rst
 604 @@ -0,0 +1,38 @@
 605 +.. SPDX-License-Identifier: GPL-2.0
 606 +
 607 +Special inodes
 608 +--------------
 609 +
 610 +ext4 reserves some inodes for special features, as follows:
 611 +
 612 +.. list-table::
 613 +   :widths: 1 79
 614 +   :header-rows: 1
 615 +
 616 +   * - inode Number
 617 +     - Purpose
 618 +   * - 0
 619 +     - Doesn't exist; there is no inode 0.
 620 +   * - 1
 621 +     - List of defective blocks.
 622 +   * - 2
 623 +     - Root directory.
 624 +   * - 3
 625 +     - User quota.
 626 +   * - 4
 627 +     - Group quota.
 628 +   * - 5
 629 +     - Boot loader.
 630 +   * - 6
 631 +     - Undelete directory.
 632 +   * - 7
 633 +     - Reserved group descriptors inode. (“resize inode”)
 634 +   * - 8
 635 +     - Journal inode.
 636 +   * - 9
 637 +     - The “exclude” inode, for snapshots(?)
 638 +   * - 10
 639 +     - Replica inode, used for some non-upstream feature?
 640 +   * - 11
 641 +     - Traditional first non-reserved inode. Usually this is the lost+found directory. See s\_first\_ino in the superblock.
 642 +
 643
 644