ext4: move ext4.txt into its own directory

From: Darrick J. Wong <darrick.wong@oracle.com>

Move Documentation/filesystems/ext4.txt into
Documentation/filesystems/ext4/ext4.rst in preparation for adding more

Note that the documentation isn't in rst format yet, but as it's not
linked from anywhere it won't cause build errors.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Documentation/filesystems/ext4.txt | 627 -------------------------------
Documentation/filesystems/ext4/ext4.rst | 627 +++++++++++++++++++++++++++++++
2 files changed, 627 insertions(+), 627 deletions(-)
delete mode 100644 Documentation/filesystems/ext4.txt
create mode 100644 Documentation/filesystems/ext4/ext4.rst
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
deleted file mode 100644
index 7f628b9f7c4b..000000000000
--- a/Documentation/filesystems/ext4.txt
-Ext4 is an advanced level of the ext3 filesystem which incorporates
-scalability and reliability enhancements for supporting large filesystems
-(64 bit) in keeping with increasing disk capacities and state-of-the-art
-feature requirements.
-Mailing list: linux-ext4@vger.kernel.org
-Web site: http://ext4.wiki.kernel.org
-1. Quick usage instructions:
-===========================
-Note: More extensive information for getting started with ext4 can be
- found at the ext4 wiki site at the URL:
- http://ext4.wiki.kernel.org/index.php/Ext4_Howto
- - Compile and install the latest version of e2fsprogs (as of this
- writing version 1.41.3) from:
- http://sourceforge.net/project/showfiles.php?group_id=2406
- https://www.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
- or grab the latest git repository from:
- git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
- - Note that it is highly important to install the mke2fs.conf file
- that comes with the e2fsprogs 1.41.x sources in /etc/mke2fs.conf. If
- you have edited the /etc/mke2fs.conf file installed on your system,
- you will need to merge your changes with the version from e2fsprogs
- - Create a new filesystem using the ext4 filesystem type:
- # mke2fs -t ext4 /dev/hda1
- Or to configure an existing ext3 filesystem to support extents:
- # tune2fs -O extents /dev/hda1
- If the filesystem was created with 128 byte inodes, it can be
- converted to use 256 byte for greater efficiency via:
- # tune2fs -I 256 /dev/hda1
- (Note: we currently do not have tools to convert an ext4
- filesystem back to ext3; so please do not do try this on production
- # mount -t ext4 /dev/hda1 /wherever
- - When comparing performance with other filesystems, it's always
- important to try multiple workloads; very often a subtle change in a
- workload parameter can completely change the ranking of which
- filesystems do well compared to others. When comparing versus ext3,
- note that ext4 enables write barriers by default, while ext3 does
- not enable write barriers by default. So it is useful to use
- explicitly specify whether barriers are enabled or not when via the
- '-o barriers=[0|1]' mount option for both ext3 and ext4 filesystems
- for a fair comparison. When tuning ext3 for best benchmark numbers,
- it is often worthwhile to try changing the data journaling mode; '-o
- data=writeback' can be faster for some workloads. (Note however that
- running mounted with data=writeback can potentially leave stale data
- exposed in recently written files in case of an unclean shutdown,
- which could be a security exposure in some situations.) Configuring
- the filesystem with a large journal can also be helpful for
- metadata-intensive workloads.
-2.1 Currently available
-* ability to use filesystems > 16TB (e2fsprogs support not available yet)
-* extent format reduces metadata overhead (RAM, IO for access, transactions)
-* extent format more robust in face of on-disk corruption due to magics,
-* internal redundancy in tree
-* improved file allocation (multi-block alloc)
-* lift 32000 subdirectory limit imposed by i_links_count[1]
-* nsec timestamps for mtime, atime, ctime, create time
-* inode version field on disk (NFSv4, Lustre)
-* reduced e2fsck time via uninit_bg feature
-* journal checksumming for robustness, performance
-* persistent file preallocation (e.g for streaming media, databases)
-* ability to pack bitmaps and inode tables into larger virtual groups via the
-* large file support
-* inode allocation using large virtual block groups via flex_bg
-* delayed allocation
-* large block (up to pagesize) support
-* efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force
-[1] Filesystems with a block size of 1k may see a limit imposed by the
-directory hash tree having a maximum depth of two.
-2.2 Candidate features for future inclusion
-* online defrag (patches available but not well tested)
-* reduced mke2fs time via lazy itable initialization in conjunction with
- the uninit_bg feature (capability to do this is available in e2fsprogs
- but a kernel thread to do lazy zeroing of unused inode table blocks
- after filesystem is first mounted is required for safety)
-There are several others under discussion, whether they all make it in is
-partly a function of how much time everyone has to work on them. Features like
-metadata checksumming have been discussed and planned for a bit but no patches
-exist yet so I'm not sure they're in the near-term roadmap.
-The big performance win will come with mballoc, delalloc and flex_bg
-grouping of bitmaps and inode tables. Some test results available here:
- - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-write-2.6.27-rc1.html
- - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-readwrite-2.6.27-rc1.html
-When mounting an ext4 filesystem, the following option are accepted:
-ro Mount filesystem read only. Note that ext4 will
- replay the journal (and thus write to the
- partition) even when mounted "read only". The
- mount options "ro,noload" can be used to prevent
- writes to the filesystem.
-journal_checksum Enable checksumming of the journal transactions.
- This will allow the recovery code in e2fsck and the
- kernel to detect corruption in the kernel. It is a
- compatible change and will be ignored by older kernels.
-journal_async_commit Commit block can be written to disk without waiting
- for descriptor blocks. If enabled older kernels cannot
- mount the device. This will enable 'journal_checksum'
-journal_dev=devnum When the external journal device's major/minor numbers
- have changed, these options allow the user to specify
- the new journal location. The journal device is
- identified through either its new major/minor numbers
- encoded in devnum, or via a path to the device.
-norecovery Don't load the journal on mounting. Note that
-noload if the filesystem was not unmounted cleanly,
- skipping the journal replay will lead to the
- filesystem containing inconsistencies that can
- lead to any number of problems.
-data=journal All data are committed into the journal prior to being
- written into the main file system. Enabling
- this mode will disable delayed allocation and
-data=ordered (*) All data are forced directly out to the main file
- system prior to its metadata being committed to the
-data=writeback Data ordering is not preserved, data may be written
- into the main file system after its metadata has been
- committed to the journal.
-commit=nrsec (*) Ext4 can be told to sync all its data and metadata
- every 'nrsec' seconds. The default value is 5 seconds.
- This means that if you lose your power, you will lose
- as much as the latest 5 seconds of work (your
- filesystem will not be damaged though, thanks to the
- journaling). This default value (or any low value)
- will hurt performance, but it's good for data-safety.
- Setting it to 0 will have the same effect as leaving
- it at the default (5 seconds).
- Setting it to very large values will improve
-barrier=<0|1(*)> This enables/disables the use of write barriers in
-barrier(*) the jbd code. barrier=0 disables, barrier=1 enables.
-nobarrier This also requires an IO stack which can support
- barriers, and if jbd gets an error on a barrier
- write, it will disable again with a warning.
- Write barriers enforce proper on-disk ordering
- of journal commits, making volatile disk write caches
- safe to use, at some performance penalty. If
- your disks are battery-backed in one way or another,
- disabling barriers may safely improve performance.
- The mount options "barrier" and "nobarrier" can
- also be used to enable or disable barriers, for
- consistency with other ext4 mount options.
-inode_readahead_blks=n This tuning parameter controls the maximum
- number of inode table blocks that ext4's inode
- table readahead algorithm will pre-read into
- the buffer cache. The default value is 32 blocks.
-nouser_xattr Disables Extended User Attributes. See the
- attr(5) manual page for more information about
- extended attributes.
-noacl This option disables POSIX Access Control List
- support. If ACL support is enabled in the kernel
- configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL is
- enabled by default on mount. See the acl(5) manual
- page for more information about acl.
-bsddf (*) Make 'df' act like BSD.
-minixdf Make 'df' act like Minix.
-debug Extra debugging information is sent to syslog.
-abort Simulate the effects of calling ext4_abort() for
- debugging purposes. This is normally used while
- remounting a filesystem which is already mounted.
-errors=remount-ro Remount the filesystem read-only on an error.
-errors=continue Keep going on a filesystem error.
-errors=panic Panic and halt the machine if an error occurs.
- (These mount options override the errors behavior
- specified in the superblock, which can be configured
-data_err=ignore(*) Just print an error message if an error occurs
- in a file data buffer in ordered mode.
-data_err=abort Abort the journal if an error occurs in a file
- data buffer in ordered mode.
-grpid New objects have the group ID of their parent.
-nogrpid (*) New objects have the group ID of their creator.
-resgid=n The group ID which may use the reserved blocks.
-resuid=n The user ID which may use the reserved blocks.
-sb=n Use alternate superblock at this location.
-quota These options are ignored by the filesystem. They
-noquota are used only by quota tools to recognize volumes
-grpquota where quota should be turned on. See documentation
-usrquota in the quota-tools package for more details
- (http://sourceforge.net/projects/linuxquota).
-jqfmt=<quota type> These options tell filesystem details about quota
-usrjquota=<file> so that quota information can be properly updated
-grpjquota=<file> during journal replay. They replace the above
- quota options. See documentation in the quota-tools
- package for more details
- (http://sourceforge.net/projects/linuxquota).
-stripe=n Number of filesystem blocks that mballoc will try
- to use for allocation size and alignment. For RAID5/6
- systems this should be the number of data
- disks * RAID chunk size in file system blocks.
-delalloc (*) Defer block allocation until just before ext4
- writes out the block(s) in question. This
- allows ext4 to better allocation decisions
-nodelalloc Disable delayed allocation. Blocks are allocated
- when the data is copied from userspace to the
- page cache, either via the write(2) system call
- or when an mmap'ed page which was previously
- unallocated is written for the first time.
-max_batch_time=usec Maximum amount of time ext4 should wait for
- additional filesystem operations to be batch
- together with a synchronous write operation.
- Since a synchronous write operation is going to
- force a commit and then a wait for the I/O
- complete, it doesn't cost much, and can be a
- huge throughput win, we wait for a small amount
- of time to see if any other transactions can
- piggyback on the synchronous write. The
- algorithm used is designed to automatically tune
- for the speed of the disk, by measuring the
- amount of time (on average) that it takes to
- finish committing a transaction. Call this time
- the "commit time". If the time that the
- transaction has been running is less than the
- commit time, ext4 will try sleeping for the
- commit time to see if other operations will join
- the transaction. The commit time is capped by
- the max_batch_time, which defaults to 15000us
- (15ms). This optimization can be turned off
- entirely by setting max_batch_time to 0.
-min_batch_time=usec This parameter sets the commit time (as
- described above) to be at least min_batch_time.
- It defaults to zero microseconds. Increasing
- this parameter may improve the throughput of
- multi-threaded, synchronous workloads on very
- fast disks, at the cost of increasing latency.
-journal_ioprio=prio The I/O priority (from 0 to 7, where 0 is the
- highest priority) which should be used for I/O
- operations submitted by kjournald2 during a
- commit operation. This defaults to 3, which is
- a slightly higher priority than the default I/O
-auto_da_alloc(*) Many broken applications don't use fsync() when
-noauto_da_alloc replacing existing files via patterns such as
- fd = open("foo.new")/write(fd,..)/close(fd)/
- rename("foo.new", "foo"), or worse yet,
- fd = open("foo", O_TRUNC)/write(fd,..)/close(fd).
- If auto_da_alloc is enabled, ext4 will detect
- the replace-via-rename and replace-via-truncate
- patterns and force that any delayed allocation
- blocks are allocated such that at the next
- journal commit, in the default data=ordered
- mode, the data blocks of the new file are forced
- to disk before the rename() operation is
- committed. This provides roughly the same level
- of guarantees as ext3, and avoids the
- "zero-length" problem that can happen when a
- system crashes before the delayed allocation
- blocks are forced to disk.
-noinit_itable Do not initialize any uninitialized inode table
- blocks in the background. This feature may be
- used by installation CD's so that the install
- process can complete as quickly as possible; the
- inode table initialization process would then be
- deferred until the next time the file system
-init_itable=n The lazy itable init code will wait n times the
- number of milliseconds it took to zero out the
- previous block group's inode table. This
- minimizes the impact on the system performance
- while file system's inode table is being initialized.
-discard Controls whether ext4 should issue discard/TRIM
-nodiscard(*) commands to the underlying block device when
- blocks are freed. This is useful for SSD devices
- and sparse/thinly-provisioned LUNs, but it is off
- by default until sufficient testing has been done.
-nouid32 Disables 32-bit UIDs and GIDs. This is for
- interoperability with older kernels which only
- store and expect 16-bit values.
-block_validity(*) These options enable or disable the in-kernel
-noblock_validity facility for tracking filesystem metadata blocks
- within internal data structures. This allows multi-
- block allocator and other routines to notice
- bugs or corrupted allocation bitmaps which cause
- blocks to be allocated which overlap with
- filesystem metadata blocks.
-dioread_lock Controls whether or not ext4 should use the DIO read
-dioread_nolock locking. If the dioread_nolock option is specified
- ext4 will allocate uninitialized extent before buffer
- write and convert the extent to initialized after IO
- completes. This approach allows ext4 code to avoid
- using inode mutex, which improves scalability on high
- speed storages. However this does not work with
- data journaling and dioread_nolock option will be
- ignored with kernel warning. Note that dioread_nolock
- code path is only used for extent-based files.
- Because of the restrictions this options comprises
- it is off by default (e.g. dioread_lock).
-max_dir_size_kb=n This limits the size of directories so that any
- attempt to expand them beyond the specified
- limit in kilobytes will cause an ENOSPC error.
- This is useful in memory constrained
- environments, where a very large directory can
- cause severe performance problems or even
- provoke the Out Of Memory killer. (For example,
- if there is only 512mb memory available, a 176mb
- directory may seriously cramp the system's style.)
-i_version Enable 64-bit inode version support. This option is
-dax Use direct access (no page cache). See
- Documentation/filesystems/dax.txt. Note that
- this option is incompatible with data=journal.
-There are 3 different data modes:
-In data=writeback mode, ext4 does not journal data at all. This mode provides
-a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
-mode - metadata journaling. A crash+recovery can cause incorrect data to
-appear in files which were written shortly before the crash. This mode will
-typically provide the best ext4 performance.
-In data=ordered mode, ext4 only officially journals metadata, but it logically
-groups metadata information related to data changes with the data blocks into a
-single unit called a transaction. When it's time to write the new metadata
-out to disk, the associated data blocks are written first. In general,
-this mode performs slightly slower than writeback but significantly faster than journal mode.
-data=journal mode provides full data and metadata journaling. All new data is
-written to the journal first, and then to its final location.
-In the event of a crash, the journal can be replayed, bringing both data and
-metadata into a consistent state. This mode is the slowest except when data
-needs to be read from and written to disk at the same time where it
-outperforms all others modes. Enabling this mode will disable delayed
-allocation and O_DIRECT support.
-Information about mounted ext4 file systems can be found in
-/proc/fs/ext4. Each mounted filesystem will have a directory in
-/proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or
-/proc/fs/ext4/dm-0). The files in each per-device directory are shown
-Files in /proc/fs/ext4/<devname>
-..............................................................................
- mb_groups details of multiblock allocator buddy cache of free blocks
-..............................................................................
-Information about mounted ext4 file systems can be found in
-/sys/fs/ext4. Each mounted filesystem will have a directory in
-/sys/fs/ext4 based on its device name (i.e., /sys/fs/ext4/hdc or
-/sys/fs/ext4/dm-0). The files in each per-device directory are shown
-Files in /sys/fs/ext4/<devname>
-(see also Documentation/ABI/testing/sysfs-fs-ext4)
-..............................................................................
- delayed_allocation_blocks This file is read-only and shows the number of
- blocks that are dirty in the page cache, but
- which do not have their location in the
- filesystem allocated yet.
- inode_goal Tuning parameter which (if non-zero) controls
- the goal inode used by the inode allocator in
- preference to all other allocation heuristics.
- This is intended for debugging use only, and
- should be 0 on production systems.
- inode_readahead_blks Tuning parameter which controls the maximum
- number of inode table blocks that ext4's inode
- table readahead algorithm will pre-read into
- lifetime_write_kbytes This file is read-only and shows the number of
- kilobytes of data that have been written to this
- filesystem since it was created.
- max_writeback_mb_bump The maximum number of megabytes the writeback
- code will try to write out before move on to
- mb_group_prealloc The multiblock allocator will round up allocation
- requests to a multiple of this tuning parameter if
- the stripe size is not set in the ext4 superblock
- mb_max_to_scan The maximum number of extents the multiblock
- allocator will search to find the best extent
- mb_min_to_scan The minimum number of extents the multiblock
- allocator will search to find the best extent
- mb_order2_req Tuning parameter which controls the minimum size
- for requests (as a power of 2) where the buddy
- mb_stats Controls whether the multiblock allocator should
- collect statistics, which are shown during the
- unmount. 1 means to collect statistics, 0 means
- not to collect statistics
- mb_stream_req Files which have fewer blocks than this tunable
- parameter will have their blocks allocated out
- of a block group specific preallocation pool, so
- that small files are packed closely together.
- Each large file will have its blocks allocated
- out of its own unique preallocation pool.
- session_write_kbytes This file is read-only and shows the number of
- kilobytes of data that have been written to this
- filesystem since it was mounted.
- reserved_clusters This is RW file and contains number of reserved
- clusters in the file system which will be used
- in the specific situations to avoid costly
- zeroout, unexpected ENOSPC, or possible data
- loss. The default is 2% or 4096 clusters,
- whichever is smaller and this can be changed
- however it can never exceed number of clusters
- in the file system. If there is not enough space
- for the reserved space when mounting the file
- mount will _not_ fail.
-..............................................................................
-There is some Ext4 specific functionality which can be accessed by applications
-through the system call interfaces. The list of all Ext4 specific ioctls are
-shown in the table below.
-Table of Ext4 specific ioctls
-..............................................................................
- EXT4_IOC_GETFLAGS Get additional attributes associated with inode.
- The ioctl argument is an integer bitfield, with
- bit values described in ext4.h. This ioctl is an
- alias for FS_IOC_GETFLAGS.
- EXT4_IOC_SETFLAGS Set additional attributes associated with inode.
- The ioctl argument is an integer bitfield, with
- bit values described in ext4.h. This ioctl is an
- alias for FS_IOC_SETFLAGS.
- EXT4_IOC_GETVERSION
- EXT4_IOC_GETVERSION_OLD
- Get the inode i_generation number stored for
- each inode. The i_generation number is normally
- changed only when new inode is created and it is
- particularly useful for network filesystems. The
- '_OLD' version of this ioctl is an alias for
- EXT4_IOC_SETVERSION
- EXT4_IOC_SETVERSION_OLD
- Set the inode i_generation number stored for
- each inode. The '_OLD' version of this ioctl
- is an alias for FS_IOC_SETVERSION.
- EXT4_IOC_GROUP_EXTEND This ioctl has the same purpose as the resize
- mount option. It allows to resize filesystem
- to the end of the last existing block group,
- further resize has to be done with resize2fs,
- either online, or offline. The argument points
- to the unsigned logn number representing the
- filesystem new block count.
- EXT4_IOC_MOVE_EXT Move the block extents from orig_fd (the one
- this ioctl is pointing to) to the donor_fd (the
- one specified in move_extent structure passed
- as an argument to this ioctl). Then, exchange
- inode metadata between orig_fd and donor_fd.
- This is especially useful for online
- defragmentation, because the allocator has the
- opportunity to allocate moved blocks better,
- ideally into one contiguous extent.
- EXT4_IOC_GROUP_ADD Add a new group descriptor to an existing or
- new group descriptor block. The new group
- descriptor is described by ext4_new_group_input
- structure, which is passed as an argument to
- this ioctl. This is especially useful in
- conjunction with EXT4_IOC_GROUP_EXTEND,
- which allows online resize of the filesystem
- to the end of the last existing block group.
- Those two ioctls combined is used in userspace
- online resize tool (e.g. resize2fs).
- EXT4_IOC_MIGRATE This ioctl operates on the filesystem itself.
- It converts (migrates) ext3 indirect block mapped
- inode to ext4 extent mapped inode by walking
- through indirect block mapping of the original
- inode and converting contiguous block ranges
- into ext4 extents of the temporary inode. Then,
- inodes are swapped. This ioctl might help, when
- migrating from ext3 to ext4 filesystem, however
- suggestion is to create fresh ext4 filesystem
- and copy data from the backup. Note, that
- filesystem has to support extents for this ioctl
- EXT4_IOC_ALLOC_DA_BLKS Force all of the delay allocated blocks to be
- allocated to preserve application-expected ext3
- behaviour. Note that this will also start
- triggering a write of the data blocks, but this
- behaviour may change in the future as it is
- not necessary and has been done this way only
- for sake of simplicity.
- EXT4_IOC_RESIZE_FS Resize the filesystem to a new size. The number
- of blocks of resized filesystem is passed in via
- 64 bit integer argument. The kernel allocates
- bitmaps and inode table, the userspace tool thus
- just passes the new number of blocks.
- EXT4_IOC_SWAP_BOOT Swap i_blocks and associated attributes
- (like i_blocks, i_size, i_flags, ...) from
- the specified inode with inode
- EXT4_BOOT_LOADER_INO (#5). This is typically
- used to store a boot loader in a secure part of
- the filesystem, where it can't be changed by a
- normal user by accident.
- The data blocks of the previous boot loader
- will be associated with the given inode.
-..............................................................................
-kernel source: <file:fs/ext4/>
-programs: http://e2fsprogs.sourceforge.net/
-useful links: http://fedoraproject.org/wiki/ext3-devel
- http://www.bullopensource.org/ext4/
- http://ext4.wiki.kernel.org/index.php/Main_Page
- http://fedoraproject.org/wiki/Features/Ext4
diff --git a/Documentation/filesystems/ext4/ext4.rst b/Documentation/filesystems/ext4/ext4.rst
index 000000000000..7f628b9f7c4b
+++ b/Documentation/filesystems/ext4/ext4.rst
+Ext4 is an advanced level of the ext3 filesystem which incorporates
+scalability and reliability enhancements for supporting large filesystems
+(64 bit) in keeping with increasing disk capacities and state-of-the-art
+feature requirements.
+Mailing list: linux-ext4@vger.kernel.org
+Web site: http://ext4.wiki.kernel.org
+1. Quick usage instructions:
+===========================
+Note: More extensive information for getting started with ext4 can be
+ found at the ext4 wiki site at the URL:
+ http://ext4.wiki.kernel.org/index.php/Ext4_Howto
+ - Compile and install the latest version of e2fsprogs (as of this
+ writing version 1.41.3) from:
+ http://sourceforge.net/project/showfiles.php?group_id=2406
+ https://www.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
+ or grab the latest git repository from:
+ git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
+ - Note that it is highly important to install the mke2fs.conf file
+ that comes with the e2fsprogs 1.41.x sources in /etc/mke2fs.conf. If
+ you have edited the /etc/mke2fs.conf file installed on your system,
+ you will need to merge your changes with the version from e2fsprogs
+ - Create a new filesystem using the ext4 filesystem type:
+ # mke2fs -t ext4 /dev/hda1
+ Or to configure an existing ext3 filesystem to support extents:
+ # tune2fs -O extents /dev/hda1
+ If the filesystem was created with 128 byte inodes, it can be
+ converted to use 256 byte for greater efficiency via:
+ # tune2fs -I 256 /dev/hda1
+ (Note: we currently do not have tools to convert an ext4
+ filesystem back to ext3; so please do not do try this on production
+ # mount -t ext4 /dev/hda1 /wherever
+ - When comparing performance with other filesystems, it's always
+ important to try multiple workloads; very often a subtle change in a
+ workload parameter can completely change the ranking of which
+ filesystems do well compared to others. When comparing versus ext3,
+ note that ext4 enables write barriers by default, while ext3 does
+ not enable write barriers by default. So it is useful to use
+ explicitly specify whether barriers are enabled or not when via the
+ '-o barriers=[0|1]' mount option for both ext3 and ext4 filesystems
+ for a fair comparison. When tuning ext3 for best benchmark numbers,
+ it is often worthwhile to try changing the data journaling mode; '-o
+ data=writeback' can be faster for some workloads. (Note however that
+ running mounted with data=writeback can potentially leave stale data
+ exposed in recently written files in case of an unclean shutdown,
+ which could be a security exposure in some situations.) Configuring
+ the filesystem with a large journal can also be helpful for
+ metadata-intensive workloads.
741 +2.1 Currently available
743 +* ability to use filesystems > 16TB (e2fsprogs support not available yet)
744 +* extent format reduces metadata overhead (RAM, IO for access, transactions)
745 +* extent format more robust in face of on-disk corruption due to magics,
746 +* internal redundancy in tree
747 +* improved file allocation (multi-block alloc)
748 +* lift 32000 subdirectory limit imposed by i_links_count[1]
749 +* nsec timestamps for mtime, atime, ctime, create time
750 +* inode version field on disk (NFSv4, Lustre)
751 +* reduced e2fsck time via uninit_bg feature
752 +* journal checksumming for robustness, performance
753 +* persistent file preallocation (e.g for streaming media, databases)
754 +* ability to pack bitmaps and inode tables into larger virtual groups via the
756 +* large file support
757 +* inode allocation using large virtual block groups via flex_bg
758 +* delayed allocation
759 +* large block (up to pagesize) support
760 +* efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force
763 +[1] Filesystems with a block size of 1k may see a limit imposed by the
764 +directory hash tree having a maximum depth of two.
766 +2.2 Candidate features for future inclusion
768 +* online defrag (patches available but not well tested)
769 +* reduced mke2fs time via lazy itable initialization in conjunction with
770 + the uninit_bg feature (capability to do this is available in e2fsprogs
771 + but a kernel thread to do lazy zeroing of unused inode table blocks
772 + after filesystem is first mounted is required for safety)
774 +There are several others under discussion; whether they all make it in is
775 +partly a function of how much time everyone has to work on them. Features like
776 +metadata checksumming have been discussed and planned for a while, but no patches
777 +exist yet, so I'm not sure they're on the near-term roadmap.
779 +The big performance win will come with mballoc, delalloc and flex_bg
780 +grouping of bitmaps and inode tables. Some test results available here:
782 + - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-write-2.6.27-rc1.html
783 + - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-readwrite-2.6.27-rc1.html
788 +When mounting an ext4 filesystem, the following options are accepted:
791 +ro Mount filesystem read only. Note that ext4 will
792 + replay the journal (and thus write to the
793 + partition) even when mounted "read only". The
794 + mount options "ro,noload" can be used to prevent
795 + writes to the filesystem.
797 +journal_checksum Enable checksumming of the journal transactions.
798 + This will allow the recovery code in e2fsck and the
799 + kernel to detect corruption in the journal. It is a
800 + compatible change and will be ignored by older kernels.
802 +journal_async_commit Commit block can be written to disk without waiting
803 + for descriptor blocks. If enabled, older kernels cannot
804 + mount the device. This will enable 'journal_checksum'
808 +journal_dev=devnum When the external journal device's major/minor numbers
809 + have changed, these options allow the user to specify
810 + the new journal location. The journal device is
811 + identified through either its new major/minor numbers
812 + encoded in devnum, or via a path to the device.
814 +norecovery Don't load the journal on mounting. Note that
815 +noload if the filesystem was not unmounted cleanly,
816 + skipping the journal replay will lead to the
817 + filesystem containing inconsistencies that can
818 + lead to any number of problems.
820 +data=journal All data are committed into the journal prior to being
821 + written into the main file system. Enabling
822 + this mode will disable delayed allocation and
825 +data=ordered (*) All data are forced directly out to the main file
826 + system prior to its metadata being committed to the
829 +data=writeback Data ordering is not preserved, data may be written
830 + into the main file system after its metadata has been
831 + committed to the journal.
833 +commit=nrsec (*) Ext4 can be told to sync all its data and metadata
834 + every 'nrsec' seconds. The default value is 5 seconds.
835 + This means that if you lose your power, you will lose
836 + as much as the latest 5 seconds of work (your
837 + filesystem will not be damaged though, thanks to the
838 + journaling). This default value (or any low value)
839 + will hurt performance, but it's good for data-safety.
840 + Setting it to 0 will have the same effect as leaving
841 + it at the default (5 seconds).
842 + Setting it to very large values will improve
845 +barrier=<0|1(*)> This enables/disables the use of write barriers in
846 +barrier(*) the jbd code. barrier=0 disables, barrier=1 enables.
847 +nobarrier This also requires an IO stack which can support
848 + barriers, and if jbd gets an error on a barrier
849 + write, it will disable again with a warning.
850 + Write barriers enforce proper on-disk ordering
851 + of journal commits, making volatile disk write caches
852 + safe to use, at some performance penalty. If
853 + your disks are battery-backed in one way or another,
854 + disabling barriers may safely improve performance.
855 + The mount options "barrier" and "nobarrier" can
856 + also be used to enable or disable barriers, for
857 + consistency with other ext4 mount options.
859 +inode_readahead_blks=n This tuning parameter controls the maximum
860 + number of inode table blocks that ext4's inode
861 + table readahead algorithm will pre-read into
862 + the buffer cache. The default value is 32 blocks.
864 +nouser_xattr Disables Extended User Attributes. See the
865 + attr(5) manual page for more information about
866 + extended attributes.
868 +noacl This option disables POSIX Access Control List
869 + support. If ACL support is enabled in the kernel
870 + configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL is
871 + enabled by default on mount. See the acl(5) manual
872 + page for more information about acl.
874 +bsddf (*) Make 'df' act like BSD.
875 +minixdf Make 'df' act like Minix.
877 +debug Extra debugging information is sent to syslog.
879 +abort Simulate the effects of calling ext4_abort() for
880 + debugging purposes. This is normally used while
881 + remounting a filesystem which is already mounted.
883 +errors=remount-ro Remount the filesystem read-only on an error.
884 +errors=continue Keep going on a filesystem error.
885 +errors=panic Panic and halt the machine if an error occurs.
886 + (These mount options override the errors behavior
887 + specified in the superblock, which can be configured
890 +data_err=ignore(*) Just print an error message if an error occurs
891 + in a file data buffer in ordered mode.
892 +data_err=abort Abort the journal if an error occurs in a file
893 + data buffer in ordered mode.
895 +grpid New objects have the group ID of their parent.
898 +nogrpid (*) New objects have the group ID of their creator.
901 +resgid=n The group ID which may use the reserved blocks.
903 +resuid=n The user ID which may use the reserved blocks.
905 +sb=n Use alternate superblock at this location.
907 +quota These options are ignored by the filesystem. They
908 +noquota are used only by quota tools to recognize volumes
909 +grpquota where quota should be turned on. See documentation
910 +usrquota in the quota-tools package for more details
911 + (http://sourceforge.net/projects/linuxquota).
913 +jqfmt=<quota type> These options tell filesystem details about quota
914 +usrjquota=<file> so that quota information can be properly updated
915 +grpjquota=<file> during journal replay. They replace the above
916 + quota options. See documentation in the quota-tools
917 + package for more details
918 + (http://sourceforge.net/projects/linuxquota).
920 +stripe=n Number of filesystem blocks that mballoc will try
921 + to use for allocation size and alignment. For RAID5/6
922 + systems this should be the number of data
923 + disks * RAID chunk size in file system blocks.
925 +delalloc (*) Defer block allocation until just before ext4
926 + writes out the block(s) in question. This
927 + allows ext4 to make better allocation decisions
929 +nodelalloc Disable delayed allocation. Blocks are allocated
930 + when the data is copied from userspace to the
931 + page cache, either via the write(2) system call
932 + or when an mmap'ed page which was previously
933 + unallocated is written for the first time.
935 +max_batch_time=usec Maximum amount of time ext4 should wait for
936 + additional filesystem operations to be batched
937 + together with a synchronous write operation.
938 + Since a synchronous write operation is going to
939 + force a commit and then wait for the I/O to
940 + complete, waiting doesn't cost much and can be a
941 + huge throughput win, so we wait for a small amount
942 + of time to see if any other transactions can
943 + piggyback on the synchronous write. The
944 + algorithm used is designed to automatically tune
945 + for the speed of the disk, by measuring the
946 + amount of time (on average) that it takes to
947 + finish committing a transaction. Call this time
948 + the "commit time". If the time that the
949 + transaction has been running is less than the
950 + commit time, ext4 will try sleeping for the
951 + commit time to see if other operations will join
952 + the transaction. The commit time is capped by
953 + the max_batch_time, which defaults to 15000us
954 + (15ms). This optimization can be turned off
955 + entirely by setting max_batch_time to 0.
957 +min_batch_time=usec This parameter sets the commit time (as
958 + described above) to be at least min_batch_time.
959 + It defaults to zero microseconds. Increasing
960 + this parameter may improve the throughput of
961 + multi-threaded, synchronous workloads on very
962 + fast disks, at the cost of increasing latency.
964 +journal_ioprio=prio The I/O priority (from 0 to 7, where 0 is the
965 + highest priority) which should be used for I/O
966 + operations submitted by kjournald2 during a
967 + commit operation. This defaults to 3, which is
968 + a slightly higher priority than the default I/O
971 +auto_da_alloc(*) Many broken applications don't use fsync() when
972 +noauto_da_alloc replacing existing files via patterns such as
973 + fd = open("foo.new")/write(fd,..)/close(fd)/
974 + rename("foo.new", "foo"), or worse yet,
975 + fd = open("foo", O_TRUNC)/write(fd,..)/close(fd).
976 + If auto_da_alloc is enabled, ext4 will detect
977 + the replace-via-rename and replace-via-truncate
978 + patterns and force any delayed allocation
979 + blocks to be allocated such that at the next
980 + journal commit, in the default data=ordered
981 + mode, the data blocks of the new file are forced
982 + to disk before the rename() operation is
983 + committed. This provides roughly the same level
984 + of guarantees as ext3, and avoids the
985 + "zero-length" problem that can happen when a
986 + system crashes before the delayed allocation
987 + blocks are forced to disk.
989 +noinit_itable Do not initialize any uninitialized inode table
990 + blocks in the background. This feature may be
991 + used by installation CDs so that the install
992 + process can complete as quickly as possible; the
993 + inode table initialization process would then be
994 + deferred until the next time the file system
997 +init_itable=n The lazy itable init code will wait n times the
998 + number of milliseconds it took to zero out the
999 + previous block group's inode table. This
1000 + minimizes the impact on the system performance
1001 + while the file system's inode table is being initialized.
1003 +discard Controls whether ext4 should issue discard/TRIM
1004 +nodiscard(*) commands to the underlying block device when
1005 + blocks are freed. This is useful for SSD devices
1006 + and sparse/thinly-provisioned LUNs, but it is off
1007 + by default until sufficient testing has been done.
1009 +nouid32 Disables 32-bit UIDs and GIDs. This is for
1010 + interoperability with older kernels which only
1011 + store and expect 16-bit values.
1013 +block_validity(*) These options enable or disable the in-kernel
1014 +noblock_validity facility for tracking filesystem metadata blocks
1015 + within internal data structures. This allows the
1016 + multi-block allocator and other routines to notice
1017 + bugs or corrupted allocation bitmaps which cause
1018 + blocks to be allocated which overlap with
1019 + filesystem metadata blocks.
1021 +dioread_lock Controls whether or not ext4 should use the DIO read
1022 +dioread_nolock locking. If the dioread_nolock option is specified,
1023 + ext4 will allocate an uninitialized extent before a buffered
1024 + write and convert the extent to initialized after the IO
1025 + completes. This approach allows ext4 code to avoid
1026 + using the inode mutex, which improves scalability on
1027 + high-speed storage. However, this does not work with
1028 + data journaling, and the dioread_nolock option will be
1029 + ignored with a kernel warning. Note that the dioread_nolock
1030 + code path is only used for extent-based files.
1031 + Because of the restrictions this option imposes,
1032 + it is off by default (i.e. dioread_lock).
1034 +max_dir_size_kb=n This limits the size of directories so that any
1035 + attempt to expand them beyond the specified
1036 + limit in kilobytes will cause an ENOSPC error.
1037 + This is useful in memory constrained
1038 + environments, where a very large directory can
1039 + cause severe performance problems or even
1040 + provoke the Out Of Memory killer. (For example,
1041 + if there is only 512MB memory available, a 176MB
1042 + directory may seriously cramp the system's style.)
1044 +i_version Enable 64-bit inode version support. This option is
1047 +dax Use direct access (no page cache). See
1048 + Documentation/filesystems/dax.txt. Note that
1049 + this option is incompatible with data=journal.
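The auto_da_alloc entry above describes applications that replace files through open/write/close/rename without calling fsync(). As an illustration only, the following Python sketch shows one way an application might perform the replace-via-rename pattern with explicit fsync() calls, so that it does not depend on auto_da_alloc. The function name replace_file_durably is hypothetical and not part of ext4 or any library; the temporary-file naming and the directory fsync are assumptions of this sketch.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace `path` with `data` using write-to-temp, fsync, rename.

    Hypothetical helper for illustration; not an ext4 interface.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename
    # stays within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Force the new data blocks to disk before the rename.
            os.fsync(tmp.fileno())
        # Atomically switch the name over to the new contents.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, remove the temporary file and leave `path` alone.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Flush the directory so the rename itself reaches disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that tempfile.mkstemp() creates the file with mode 0600, so a real application would probably also want to copy the permissions of the file being replaced.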
1053 +There are 3 different data modes:
1056 +In data=writeback mode, ext4 does not journal data at all. This mode provides
1057 +a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
1058 +mode - metadata journaling. A crash+recovery can cause incorrect data to
1059 +appear in files which were written shortly before the crash. This mode will
1060 +typically provide the best ext4 performance.
1063 +In data=ordered mode, ext4 only officially journals metadata, but it logically
1064 +groups metadata information related to data changes with the data blocks into a
1065 +single unit called a transaction. When it's time to write the new metadata
1066 +out to disk, the associated data blocks are written first. In general,
1067 +this mode performs slightly slower than writeback but significantly faster than journal mode.
1070 +data=journal mode provides full data and metadata journaling. All new data is
1071 +written to the journal first, and then to its final location.
1072 +In the event of a crash, the journal can be replayed, bringing both data and
1073 +metadata into a consistent state. This mode is the slowest except when data
1074 +needs to be read from and written to disk at the same time, where it
1075 +outperforms all other modes. Enabling this mode will disable delayed
1076 +allocation and O_DIRECT support.
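To check which of the three data modes a mounted filesystem is using, one option is to look at its mount options. The Python sketch below is a minimal example under stated assumptions: it reads /proc/mounts (whitespace-separated device, mount point, type, options), and it treats a missing data= option as data=ordered, the default marked (*) in the option list. The function names are hypothetical.

```python
def data_mode(mount_options):
    """Return the data mode named in a comma-separated option string."""
    mode = "ordered"  # assumed default when no data= option is present
    for opt in mount_options.split(","):
        key, _, value = opt.strip().partition("=")
        if key == "data" and value in ("journal", "ordered", "writeback"):
            mode = value  # a later option overrides an earlier one
    return mode


def ext4_data_modes(mounts_path="/proc/mounts"):
    """Map each ext4 mount point to its data mode."""
    modes = {}
    with open(mounts_path) as f:
        for line in f:
            fields = line.split()
            # fields: device, mount point, filesystem type, options, ...
            if len(fields) >= 4 and fields[2] == "ext4":
                modes[fields[1]] = data_mode(fields[3])
    return modes
```

Calling ext4_data_modes() with no arguments inspects the running system; passing another path is mainly useful for testing.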
1081 +Information about mounted ext4 file systems can be found in
1082 +/proc/fs/ext4. Each mounted filesystem will have a directory in
1083 +/proc/fs/ext4 based on its device name (e.g., /proc/fs/ext4/hdc or
1084 +/proc/fs/ext4/dm-0). The files in each per-device directory are shown
1087 +Files in /proc/fs/ext4/<devname>
1088 +..............................................................................
1090 + mb_groups details of multiblock allocator buddy cache of free blocks
1091 +..............................................................................
1096 +Information about mounted ext4 file systems can be found in
1097 +/sys/fs/ext4. Each mounted filesystem will have a directory in
1098 +/sys/fs/ext4 based on its device name (e.g., /sys/fs/ext4/hdc or
1099 +/sys/fs/ext4/dm-0). The files in each per-device directory are shown
1102 +Files in /sys/fs/ext4/<devname>
1103 +(see also Documentation/ABI/testing/sysfs-fs-ext4)
1104 +..............................................................................
1107 + delayed_allocation_blocks This file is read-only and shows the number of
1108 + blocks that are dirty in the page cache, but
1109 + which do not have their location in the
1110 + filesystem allocated yet.
1112 + inode_goal Tuning parameter which (if non-zero) controls
1113 + the goal inode used by the inode allocator in
1114 + preference to all other allocation heuristics.
1115 + This is intended for debugging use only, and
1116 + should be 0 on production systems.
1118 + inode_readahead_blks Tuning parameter which controls the maximum
1119 + number of inode table blocks that ext4's inode
1120 + table readahead algorithm will pre-read into
1123 + lifetime_write_kbytes This file is read-only and shows the number of
1124 + kilobytes of data that have been written to this
1125 + filesystem since it was created.
1127 + max_writeback_mb_bump The maximum number of megabytes the writeback
1128 + code will try to write out before moving on to
1131 + mb_group_prealloc The multiblock allocator will round up allocation
1132 + requests to a multiple of this tuning parameter if
1133 + the stripe size is not set in the ext4 superblock
1135 + mb_max_to_scan The maximum number of extents the multiblock
1136 + allocator will search to find the best extent
1138 + mb_min_to_scan The minimum number of extents the multiblock
1139 + allocator will search to find the best extent
1141 + mb_order2_req Tuning parameter which controls the minimum size
1142 + for requests (as a power of 2) where the buddy
1145 + mb_stats Controls whether the multiblock allocator should
1146 + collect statistics, which are shown during the
1147 + unmount. 1 means to collect statistics, 0 means
1148 + not to collect statistics
1150 + mb_stream_req Files which have fewer blocks than this tunable
1151 + parameter will have their blocks allocated out
1152 + of a block group specific preallocation pool, so
1153 + that small files are packed closely together.
1154 + Each large file will have its blocks allocated
1155 + out of its own unique preallocation pool.
1157 + session_write_kbytes This file is read-only and shows the number of
1158 + kilobytes of data that have been written to this
1159 + filesystem since it was mounted.
1161 + reserved_clusters This is a read-write file containing the number of
1162 + reserved clusters in the file system, which are used
1163 + in specific situations to avoid costly
1164 + zeroout, unexpected ENOSPC, or possible data
1165 + loss. The default is 2% or 4096 clusters,
1166 + whichever is smaller. This can be changed, but
1167 + it can never exceed the number of clusters
1168 + in the file system. If there is not enough space
1169 + for the reserved space when mounting the file
1170 + system, the mount will _not_ fail.
1171 +..............................................................................
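The per-device files listed above are ordinary text files, so they can be read with any tool. As a rough sketch, the Python function below collects every readable attribute for one device into a dictionary. The function name read_ext4_tunables and the choice to skip unreadable entries are assumptions of this example; the base directory is a parameter so the code can be tried against a copy of the tree.

```python
import os


def read_ext4_tunables(devname, base="/sys/fs/ext4"):
    """Return {attribute name: contents} for /sys/fs/ext4/<devname>."""
    directory = os.path.join(base, devname)
    values = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path) as f:
                values[name] = f.read().strip()
        except OSError:
            # Some attributes may be write-only or need privileges.
            continue
    return values
```

For example, read_ext4_tunables("dm-0").get("mb_stats") would return the current mb_stats setting as a string, assuming a filesystem is mounted from dm-0. Changing a tunable means writing the new value to the same file, which normally requires root.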
1176 +There is some Ext4 specific functionality which can be accessed by applications
1177 +through the system call interfaces. The list of all Ext4 specific ioctls is
1178 +shown in the table below.
1180 +Table of Ext4 specific ioctls
1181 +..............................................................................
1183 + EXT4_IOC_GETFLAGS Get additional attributes associated with inode.
1184 + The ioctl argument is an integer bitfield, with
1185 + bit values described in ext4.h. This ioctl is an
1186 + alias for FS_IOC_GETFLAGS.
1188 + EXT4_IOC_SETFLAGS Set additional attributes associated with inode.
1189 + The ioctl argument is an integer bitfield, with
1190 + bit values described in ext4.h. This ioctl is an
1191 + alias for FS_IOC_SETFLAGS.
1193 + EXT4_IOC_GETVERSION
1194 + EXT4_IOC_GETVERSION_OLD
1195 + Get the inode i_generation number stored for
1196 + each inode. The i_generation number is normally
1197 + changed only when a new inode is created and it is
1198 + particularly useful for network filesystems. The
1199 + '_OLD' version of this ioctl is an alias for
1200 + FS_IOC_GETVERSION.
1202 + EXT4_IOC_SETVERSION
1203 + EXT4_IOC_SETVERSION_OLD
1204 + Set the inode i_generation number stored for
1205 + each inode. The '_OLD' version of this ioctl
1206 + is an alias for FS_IOC_SETVERSION.
1208 + EXT4_IOC_GROUP_EXTEND This ioctl has the same purpose as the resize
1209 + mount option. It allows resizing the filesystem
1210 + to the end of the last existing block group;
1211 + further resizing has to be done with resize2fs,
1212 + either online or offline. The argument points
1213 + to the unsigned long number representing the
1214 + filesystem's new block count.
1216 + EXT4_IOC_MOVE_EXT Move the block extents from orig_fd (the one
1217 + this ioctl is pointing to) to the donor_fd (the
1218 + one specified in move_extent structure passed
1219 + as an argument to this ioctl). Then, exchange
1220 + inode metadata between orig_fd and donor_fd.
1221 + This is especially useful for online
1222 + defragmentation, because the allocator has the
1223 + opportunity to allocate moved blocks better,
1224 + ideally into one contiguous extent.
1226 + EXT4_IOC_GROUP_ADD Add a new group descriptor to an existing or
1227 + new group descriptor block. The new group
1228 + descriptor is described by ext4_new_group_input
1229 + structure, which is passed as an argument to
1230 + this ioctl. This is especially useful in
1231 + conjunction with EXT4_IOC_GROUP_EXTEND,
1232 + which allows online resize of the filesystem
1233 + to the end of the last existing block group.
1234 + Those two ioctls combined are used in userspace
1235 + online resize tools (e.g. resize2fs).
1237 + EXT4_IOC_MIGRATE This ioctl operates on the filesystem itself.
1238 + It converts (migrates) an ext3 indirect block mapped
1239 + inode to an ext4 extent mapped inode by walking
1240 + through the indirect block mapping of the original
1241 + inode and converting contiguous block ranges
1242 + into ext4 extents of a temporary inode. Then, the
1243 + inodes are swapped. This ioctl might help when
1244 + migrating from an ext3 to an ext4 filesystem; however,
1245 + the suggestion is to create a fresh ext4 filesystem
1246 + and copy data from the backup. Note that the
1247 + filesystem has to support extents for this ioctl
1250 + EXT4_IOC_ALLOC_DA_BLKS Force all of the delayed allocation blocks to be
1251 + allocated to preserve application-expected ext3
1252 + behaviour. Note that this will also start
1253 + triggering a write of the data blocks, but this
1254 + behaviour may change in the future as it is
1255 + not necessary and has been done this way only
1256 + for the sake of simplicity.
1258 + EXT4_IOC_RESIZE_FS Resize the filesystem to a new size. The number
1259 + of blocks of resized filesystem is passed in via
1260 + 64 bit integer argument. The kernel allocates
1261 + bitmaps and inode table, the userspace tool thus
1262 + just passes the new number of blocks.
1264 + EXT4_IOC_SWAP_BOOT Swap i_blocks and associated attributes
1265 + (like i_blocks, i_size, i_flags, ...) from
1266 + the specified inode with inode
1267 + EXT4_BOOT_LOADER_INO (#5). This is typically
1268 + used to store a boot loader in a secure part of
1269 + the filesystem, where it can't be changed by a
1270 + normal user by accident.
1271 + The data blocks of the previous boot loader
1272 + will be associated with the given inode.
1274 +..............................................................................
1279 +kernel source: <file:fs/ext4/>
1282 +programs: http://e2fsprogs.sourceforge.net/
1284 +useful links: http://fedoraproject.org/wiki/ext3-devel
1285 + http://www.bullopensource.org/ext4/
1286 + http://ext4.wiki.kernel.org/index.php/Main_Page
1287 + http://fedoraproject.org/wiki/Features/Ext4