1 ext4: import directory layout chapter from wiki page
3 From: Darrick J. Wong <darrick.wong@oracle.com>
5 Import the chapter about directory layout from the on-disk format wiki
6 page into the kernel documentation.
8 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
9 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
11 .../filesystems/ext4/ondisk/directory.rst | 426 ++++++++++++++++++++
12 Documentation/filesystems/ext4/ondisk/dynamic.rst | 1
13 2 files changed, 427 insertions(+)
14 create mode 100644 Documentation/filesystems/ext4/ondisk/directory.rst
17 diff --git a/Documentation/filesystems/ext4/ondisk/directory.rst b/Documentation/filesystems/ext4/ondisk/directory.rst
19 index 000000000000..8fcba68c2884
21 +++ b/Documentation/filesystems/ext4/ondisk/directory.rst
23 +.. SPDX-License-Identifier: GPL-2.0
28 +In an ext4 filesystem, a directory is more or less a flat file that maps
29 +an arbitrary byte string (usually ASCII) to an inode number on the
30 +filesystem. There can be many directory entries across the filesystem
31 +that reference the same inode number--these are known as hard links, and
32 +that is why hard links cannot reference files on other filesystems. As
33 +such, directory entries are found by reading the data block(s)
34 +associated with a directory file for the particular directory entry that is desired.
37 +Linear (Classic) Directories
38 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~
40 +By default, each directory lists its entries in an “almost-linear”
41 +array. I write “almost” because it is not a linear array in the memory
42 +sense: directory entries are never split across filesystem blocks.
43 +Therefore, it is more accurate to say that a directory is a series of
44 +data blocks and that each block contains a linear array of directory
45 +entries. The end of each per-block array is signified by reaching the
46 +end of the block; the last entry in the block has a record length that
47 +takes it all the way to the end of the block. The end of the entire
48 +directory is of course signified by reaching the end of the file. Unused
49 +directory entries are signified by inode = 0. By default the filesystem
50 +uses ``struct ext4_dir_entry_2`` for directory entries unless the
51 +“filetype” feature flag is not set, in which case it uses
52 +``struct ext4_dir_entry``.
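The block-by-block walk described above can be sketched in a few lines. This is a minimal illustration, not the kernel's implementation: the helper names ``iter_dir_block`` and ``make_entry`` are hypothetical, the ``filetype`` feature is assumed to be set (so the ``ext4_dir_entry_2`` layout applies), and block sizes are assumed to be below 64KiB so that ``rec_len`` can be read as a plain 16-bit value. On-disk fields are little-endian.

```python
import struct

# Assumed layout of the fixed ext4_dir_entry_2 header (little-endian):
# inode (u32), rec_len (u16), name_len (u8), file_type (u8).
DIRENT_HEADER = struct.Struct("<IHBB")


def iter_dir_block(block):
    """Yield (inode, file_type, name) for each in-use entry in one block."""
    offset = 0
    while offset + DIRENT_HEADER.size <= len(block):
        inode, rec_len, name_len, file_type = DIRENT_HEADER.unpack_from(block, offset)
        # rec_len must be a multiple of 4, must cover the header and the
        # name, and must never cross the end of the block.
        if rec_len % 4 or rec_len < DIRENT_HEADER.size + name_len:
            raise ValueError("corrupt rec_len at offset %d" % offset)
        if offset + rec_len > len(block):
            raise ValueError("entry at offset %d crosses the block end" % offset)
        if inode != 0:  # inode == 0 marks an unused entry
            start = offset + DIRENT_HEADER.size
            yield inode, file_type, bytes(block[start:start + name_len])
        offset += rec_len  # the last entry's rec_len reaches the block end


def make_entry(inode, name, file_type, rec_len):
    """Hypothetical helper that builds one padded entry for experiments."""
    header = DIRENT_HEADER.pack(inode, rec_len, len(name), file_type)
    return (header + name).ljust(rec_len, b"\0")
```

Feeding every data block of a directory file through ``iter_dir_block`` in order gives the classic linear scan; the loop never needs an explicit terminator because the record lengths in a well-formed block sum to the block size.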
54 +The original directory entry format is ``struct ext4_dir_entry``, which
55 +is at most 263 bytes long, though on disk you'll need to reference
56 +``dirent.rec_len`` to know for sure.
69 + - Number of the inode that this directory entry points to.
73 + - Length of this directory entry. Must be a multiple of 4.
77 + - Length of the file name.
80 + - name[EXT4\_NAME\_LEN]
83 +Since file names cannot be longer than 255 bytes, the new directory
84 +entry format shortens the rec\_len field and uses the space for a file
85 +type flag, probably to avoid having to load every inode during directory
86 +tree traversal. This format is ``ext4_dir_entry_2``, which is at most
87 +263 bytes long, though on disk you'll need to reference
88 +``dirent.rec_len`` to know for sure.
101 + - Number of the inode that this directory entry points to.
105 + - Length of this directory entry.
109 + - Length of the file name.
113 + - File type code, see ftype_ table below.
116 + - name[EXT4\_NAME\_LEN]
121 +The directory file type is one of the following values:
136 + - Character device file.
138 + - Block device file.
146 +In order to add checksums to these classic directory blocks, a phony
147 +``struct ext4_dir_entry`` is placed at the end of each leaf block to
148 +hold the checksum. The directory entry is 12 bytes long. The inode
149 +number and name\_len fields are set to zero to fool old software into
150 +ignoring an apparently empty directory entry, and the checksum is stored
151 +in the place where the name normally goes. The structure is
152 +``struct ext4_dir_entry_tail``:
164 + - det\_reserved\_zero1
165 + - Inode number, which must be zero.
169 + - Length of this directory entry, which must be 12.
172 + - det\_reserved\_zero2
173 + - Length of the file name, which must be zero.
176 + - det\_reserved\_ft
177 + - File type, which must be 0xDE.
181 + - Directory leaf block checksum.
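As a rough sketch of how a reader might recognize the tail entry described in the table above, the following checks the last 12 bytes of a leaf block for the reserved values and returns the stored checksum. The function name ``parse_dir_tail`` is hypothetical, and the sketch only locates the checksum field; it does not compute or verify the checksum itself.

```python
import struct

# Assumed layout of ext4_dir_entry_tail (little-endian): inode (u32, zero),
# rec_len (u16, 12), name_len (u8, zero), file type (u8, 0xDE), checksum (u32).
DIR_TAIL = struct.Struct("<IHBBI")


def parse_dir_tail(block):
    """Return the stored checksum, or None if the block has no tail entry."""
    if len(block) < DIR_TAIL.size:
        return None
    inode, rec_len, name_len, file_type, checksum = DIR_TAIL.unpack_from(
        block, len(block) - DIR_TAIL.size)
    if inode == 0 and rec_len == 12 and name_len == 0 and file_type == 0xDE:
        return checksum
    return None
```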
183 +The leaf directory block checksum is calculated against the FS UUID, the
184 +directory's inode number, the directory's inode generation number, and
185 +the entire directory entry block up to (but not including) the fake directory entry.
188 +Hash Tree Directories
189 +~~~~~~~~~~~~~~~~~~~~~
191 +A linear array of directory entries isn't great for performance, so a
192 +new feature was added to ext3 to provide a faster (but peculiar)
193 +balanced tree keyed off a hash of the directory entry name. If the
194 +EXT4\_INDEX\_FL (0x1000) flag is set in the inode, this directory uses a
195 +hashed btree (htree) to organize and find directory entries. For
196 +backwards read-only compatibility with ext2, this tree is actually
197 +hidden inside the directory file, masquerading as “empty” directory data
198 +blocks! It was stated previously that unused directory entries are
199 +signified by inode = 0 and that a record length can extend to the end of
200 +the block; this is (ab)used to fool the old linear-scan algorithm into
201 +thinking that the rest of the directory block is empty so that it moves on.
203 +The root of the tree always lives in the first data block of the
204 +directory. By ext2 custom, the '.' and '..' entries must appear at the
205 +beginning of this first block, so they are put here as two
206 +``struct ext4_dir_entry_2``\ s and not stored in the tree. The rest of
207 +the root node contains metadata about the tree and finally a hash->block
208 +map to find nodes that are lower in the htree. If
209 +``dx_root.info.indirect_levels`` is non-zero then the htree has two
210 +levels; the data block pointed to by the root node's map is an interior
211 +node, which is indexed by a minor hash. Interior nodes in this tree
212 +contain a zeroed out ``struct ext4_dir_entry_2`` followed by a
213 +minor\_hash->block map to find leaf nodes. Leaf nodes contain a linear
214 +array of all ``struct ext4_dir_entry_2``; all of these entries
215 +(presumably) hash to the same value. If there is an overflow, the
216 +entries simply overflow into the next leaf node, and the
217 +least-significant bit of the hash (in the interior node map) that gets
218 +us to this next leaf node is set.
220 +To traverse the directory as an htree, the code calculates the hash of
221 +the desired file name and uses it to find the corresponding block
222 +number. If the tree is flat, the block is a linear array of directory
223 +entries that can be searched; otherwise, the minor hash of the file name
224 +is computed and used against this second block to find the corresponding
225 +third block number. That third block will be a linear array of directory entries.
228 +To traverse the directory as a linear array (such as the old code does),
229 +the code simply reads every data block in the directory. The blocks used
230 +for the htree will appear to have no entries (aside from '.' and '..')
231 +and so only the leaf nodes will appear to have any interesting content.
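The hash->block lookup used during htree traversal can be sketched as a search over a sorted map. This is an illustrative sketch only: ``parse_dx_map`` and ``dx_lookup`` are hypothetical names, the name hash is assumed to have been computed already, and the continuation bit mentioned above (the least-significant bit of a stored hash) is not handled. The map is assumed to start with a header of ``limit`` (u16), ``count`` (u16) and the block number for hash=0 (u32), followed by 8-byte hash/block pairs, at offset 32 in a ``dx_root`` (after the 12-byte '.' entry, the 12-byte '..' header and name, and the 8-byte tree information) and at offset 8 in a ``dx_node``.

```python
import bisect
import struct


def parse_dx_map(buf, offset):
    """Return (limit, entries), where entries is a list of (hash, block).

    The header itself acts as entry 0: it carries the block for hash=0,
    and 'count' includes that header slot.
    """
    limit, count, block0 = struct.unpack_from("<HHI", buf, offset)
    entries = [(0, block0)]
    for i in range(1, count):
        entries.append(struct.unpack_from("<II", buf, offset + 8 * i))
    return limit, entries


def dx_lookup(entries, name_hash):
    """Pick the block whose hash range covers name_hash.

    Entries are sorted by hash, so the wanted slot is the last one whose
    hash is less than or equal to the target.
    """
    hashes = [h for h, _ in entries]
    index = bisect.bisect_right(hashes, name_hash) - 1
    return entries[index][1]
```

Applying ``dx_lookup`` once on the root map gives either a leaf block or, in a deeper tree, an interior node whose own map is searched the same way.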
233 +The root of the htree is in ``struct dx_root``, which is the full length of a data block:
247 + - inode number of this directory.
251 + - Length of this record, 12.
255 + - Length of the name, 1.
259 + - File type of this entry, 0x2 (directory) (if the feature flag is set).
267 + - inode number of parent directory.
271 + - block\_size - 12. The record length is long enough to cover all htree
276 + - Length of the name, 2.
279 + - dotdot.file\_type
280 + - File type of this entry, 0x2 (directory) (if the feature flag is set).
287 + - struct dx\_root\_info.reserved\_zero
291 + - struct dx\_root\_info.hash\_version
292 + - Hash type, see dirhash_ table below.
295 + - struct dx\_root\_info.info\_length
296 + - Length of the tree information, 0x8.
299 + - struct dx\_root\_info.indirect\_levels
300 + - Depth of the htree. Cannot be larger than 3 if the INCOMPAT\_LARGEDIR
301 + feature is set; cannot be larger than 2 otherwise.
304 + - struct dx\_root\_info.unused\_flags
309 + - Maximum number of dx\_entries that can follow this header, plus 1 for
314 + - Actual number of dx\_entries that follow this header, plus 1 for the
319 + - The block number (within the directory file) that goes with hash=0.
323 + - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.
327 +The directory hash is one of the following values:
342 + - Legacy, unsigned.
344 + - Half MD4, unsigned.
348 +Interior nodes of an htree are recorded as ``struct dx_node``, which is
349 +also the full length of a data block:
362 + - Zero, to make it look like this entry is not in use.
366 + - The size of the block, in order to hide all of the dx\_node data.
370 + - Zero. There is no name for this “unused” directory entry.
374 + - Zero. There is no file type for this “unused” directory entry.
378 + - Maximum number of dx\_entries that can follow this header, plus 1 for
383 + - Actual number of dx\_entries that follow this header, plus 1 for the
388 + - The block number (within the directory file) that goes with the lowest
389 + hash value of this block. This value is stored in the parent block.
393 + - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.
395 +The hash maps that exist in both ``struct dx_root`` and
396 +``struct dx_node`` are recorded as ``struct dx_entry``, which is 8 bytes long:
414 + - Block number (within the directory file, not filesystem blocks) of the
415 + next node in the htree.
417 +(If you think this is all quite clever and peculiar, so does the author.)
420 +If metadata checksums are enabled, the last 8 bytes of the directory
421 +block (precisely the length of one dx\_entry) are used to store a
422 +``struct dx_tail``, which contains the checksum. The ``limit`` and
423 +``count`` entries in the dx\_root/dx\_node structures are adjusted as
424 +necessary to fit the dx\_tail into the block. If there is no space for
425 +the dx\_tail, the user is notified to run e2fsck -D to rebuild the
426 +directory index (which will ensure that there's space for the checksum).
427 +The dx\_tail structure is 8 bytes long and looks like this:
444 + - Checksum of the htree directory block.
446 +The checksum is calculated against the FS UUID, the htree index header
447 +(dx\_root or dx\_node), all of the htree indices (dx\_entry) that are in
448 +use, and the tail block (dx\_tail).
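To illustrate how reserving room for the dx\_tail affects ``limit``, the arithmetic can be sketched from the sizes given in this chapter. The function name ``dx_limit`` is hypothetical, and the header sizes are assumptions derived from the layouts above: 32 bytes precede the map in a ``dx_root`` (12 for '.', 12 for '..', 8 for the tree information) and 8 bytes precede it in a ``dx_node`` (the fake directory entry).

```python
DX_ENTRY_SIZE = 8  # one hash/block pair
DX_TAIL_SIZE = 8   # reserved area plus checksum


def dx_limit(block_size, is_root, has_checksum):
    """Number of 8-byte map slots, counting the limit/count header slot."""
    header = 12 + 12 + 8 if is_root else 8
    space = block_size - header
    if has_checksum:
        space -= DX_TAIL_SIZE  # give up one slot to hold the dx_tail
    return space // DX_ENTRY_SIZE
```

Under these assumptions a 4096-byte root block holds 508 slots without checksums and 507 with them, which is why an index built to the old limit may have no room for the tail until it is rebuilt.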
449 diff --git a/Documentation/filesystems/ext4/ondisk/dynamic.rst b/Documentation/filesystems/ext4/ondisk/dynamic.rst
450 index f090de8dd1c1..f2f14822b0f5 100644
451 --- a/Documentation/filesystems/ext4/ondisk/dynamic.rst
452 +++ b/Documentation/filesystems/ext4/ondisk/dynamic.rst
453 @@ -8,3 +8,4 @@ allocated to files.
455 .. include:: inodes.rst
456 .. include:: ifork.rst
457 +.. include:: directory.rst