doc/intern.texi

   1 @c This is part of the paxutils manual.
   2 @c Copyright (C) 2006--2024 Free Software Foundation, Inc.
   3 @c This file is distributed under GFDL 1.1 or any later version
   4 @c published by the Free Software Foundation.
   5
   6 @menu
   7 * Standard::           Basic Tar Format
   8 * Extensions::         @acronym{GNU} Extensions to the Archive Format
   9 * Sparse Formats::     Storing Sparse Files
  10 * Snapshot Files::
  11 * Dumpdir::
  12 @end menu
  13
  14 @node Standard
  15 @unnumberedsec Basic Tar Format
  16 @UNREVISED{}
  17
  18 While an archive may contain many files, the archive itself is a
  19 single ordinary file.  Like any other file, an archive file can be
  20 written to a storage device such as a tape or disk, sent through a
  21 pipe or over a network, saved on the active file system, or even
  22 stored in another archive.  An archive file is not easy to read or
  23 manipulate without using the @command{tar} utility or Tar mode in
  24 @acronym{GNU} Emacs.
  25
  26 Physically, an archive consists of a series of file entries terminated
  27 by an end-of-archive entry, which consists of two 512-byte blocks of zero
  28 bytes.  A file
  29 entry usually describes one of the files in the archive (an
  30 @dfn{archive member}), and consists of a file header and the contents
  31 of the file.  File headers contain file names and statistics, checksum
  32 information which @command{tar} uses to detect file corruption, and
  33 information about file types.
  34
  35 Archives are permitted to have more than one member with the same
  36 member name.  One way this situation can occur is if more than one
  37 version of a file has been stored in the archive.  For information
  38 about adding new versions of a file to an archive, see @ref{update}.
  39
  40 In addition to entries describing archive members, an archive may
  41 contain entries which @command{tar} itself uses to store information.
  42 @xref{label}, for an example of such an archive entry.
  43
  44 A @command{tar} archive file contains a series of blocks.  Each block
  45 contains @code{BLOCKSIZE} bytes.  Although this format may be thought
  46 of as being on magnetic tape, other media are often used.
  47
  48 Each file archived is represented by a header block which describes
  49 the file, followed by zero or more blocks which give the contents
  50 of the file.  At the end of the archive file there are two 512-byte blocks
  51 filled with binary zeros as an end-of-file marker.  A reasonable system
  52 should write such an end-of-file marker at the end of an archive, but
  53 must not assume that such a marker exists when reading an archive.  In
  54 particular, @GNUTAR{} does not treat a missing end-of-file marker as an
  55 error and silently ignores the fact.  You can instruct it to issue
  56 a warning, however, by using the @option{--warning=missing-zero-blocks}
  57 option (@pxref{General Warnings, missing-zero-blocks}).
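
The reading rule above can be illustrated with a small Python sketch
that walks the header blocks of an in-memory archive image and reports
whether the two zero blocks are present.  The function names are
invented for this example, and the position of the @code{size} field
(bytes 124 through 135 of the header) is assumed from the standard
layout; this is not how @GNUTAR{} itself is implemented.

```python
BLOCKSIZE = 512


def is_zero_block(block: bytes) -> bool:
    """Return True if block is a full block made up only of zero bytes."""
    return len(block) == BLOCKSIZE and block.count(0) == BLOCKSIZE


def has_end_marker(archive: bytes) -> bool:
    """Walk the member headers and report whether two zero blocks follow them.

    Assumption: the size field is an ASCII octal string stored in
    bytes 124..135 of each header block (standard layout).
    """
    pos = 0
    while pos + BLOCKSIZE <= len(archive):
        header = archive[pos:pos + BLOCKSIZE]
        if is_zero_block(header):
            # The marker is complete only if a second zero block follows.
            return is_zero_block(archive[pos + BLOCKSIZE:pos + 2 * BLOCKSIZE])
        size = int(header[124:136].rstrip(b" \0") or b"0", 8)
        data_blocks = (size + BLOCKSIZE - 1) // BLOCKSIZE
        pos += BLOCKSIZE * (1 + data_blocks)
    return False  # ran off the end of the data without seeing a marker
```

A reader written along these lines would treat a @code{False} result
as a condition worth a warning at most, never as a fatal error.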
  58
  59 The blocks may be @dfn{blocked} for physical I/O operations.
  60 Each record of @var{n} blocks (where @var{n} is set by the
  61 @option{--blocking-factor=@var{512-size}} (@option{-b @var{512-size}}) option to @command{tar}) is written with a single
  62 @w{@samp{write ()}} operation.  On magnetic tapes, the result of
  63 such a write is a single record.  When writing an archive,
  64 the last record of blocks should be written at the full size, with
  65 blocks after the zero block containing all zeros.  When reading
  66 an archive, a reasonable system should properly handle an archive
  67 whose last record is shorter than the rest, or which contains garbage
  68 records after a zero block.
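
As a minimal sketch of the writing rule above, the following Python
function pads archive data with zero bytes so that the last record is
written at full size.  The name @code{pad_to_record} is hypothetical,
and the default blocking factor of 20 is an assumption made for the
example.

```python
BLOCKSIZE = 512


def pad_to_record(archive: bytes, blocking_factor: int = 20) -> bytes:
    """Pad archive data with zero bytes so that it fills whole records.

    A record is blocking_factor blocks of BLOCKSIZE bytes each.
    """
    record_size = blocking_factor * BLOCKSIZE
    remainder = len(archive) % record_size
    if remainder == 0:
        return archive
    return archive + bytes(record_size - remainder)
```

With a blocking factor of 20, three blocks of archive data (1536
bytes) are padded out to one record of 10240 bytes.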
  69
  70 The header block is defined in C as follows.  In the @GNUTAR{}
  71 distribution, this is part of file @file{src/tar.h}:
  72
  73 @smallexample
  74 @include header.texi
  75 @end smallexample
  76
  77 All characters in header blocks are represented by using 8-bit
  78 characters in the local variant of ASCII.  Each field within the
  79 structure is contiguous; that is, there is no padding used within
  80 the structure.  Each character on the archive medium is stored
  81 contiguously.
  82
  83 Bytes representing the contents of files (after the header block
  84 of each file) are not translated in any way and are not constrained
  85 to represent characters in any character set.  The @command{tar} format
  86 does not distinguish text files from binary files, and no translation
  87 of file contents is performed.
  88
  89 The @code{name}, @code{linkname}, @code{magic}, @code{uname}, and
  90 @code{gname} fields are null-terminated character strings.  All other fields
  91 are zero-filled octal numbers in ASCII.  Each numeric field of width
  92 @var{w} contains @var{w} minus 1 digits, and a null.
  93 (In the extended @acronym{GNU} format, the numeric fields can take
  94 other forms.)
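
The basic numeric encoding can be sketched in Python as follows.
Both function names are invented for this illustration.  The decoder
also tolerates a trailing space, which is an assumption about what
other archivers may write rather than something stated above.

```python
def format_octal_field(value: int, width: int) -> bytes:
    """Encode value as width-1 zero-filled octal digits followed by a null."""
    if value < 0:
        raise ValueError("negative values have no basic octal form")
    text = format(value, "o").rjust(width - 1, "0")
    if len(text) > width - 1:
        raise ValueError("value does not fit in the field")
    return text.encode("ascii") + b"\0"


def parse_octal_field(field: bytes) -> int:
    """Decode a zero-filled ASCII octal field; an empty field counts as 0."""
    digits = field.rstrip(b" \0")
    return int(digits, 8) if digits else 0
```

For instance, a mode of octal 644 in a field of width 8 is stored as
the seven digits @samp{0000644} followed by a null.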
  95
  96 The @code{name} field is the file name of the file, with directory names
  97 (if any) preceding the file name, separated by slashes.
  98
  99 @FIXME{how big a name before field overflows?}
 100
 101 The @code{mode} field provides nine bits specifying file permissions
 102 and three bits to specify the Set @acronym{UID}, Set @acronym{GID}, and Save Text
 103 (@dfn{sticky}) modes.  Values for these bits are defined above.
 104 When special permissions are required to create a file with a given
 105 mode, and the user restoring files from the archive does not hold such
 106 permissions, the mode bit(s) specifying those special permissions
 107 are ignored.  Modes which are not supported by the operating system
 108 restoring files from the archive will be ignored.  Unsupported modes
 109 should be faked up when creating or updating an archive; e.g., the
 110 group permission could be copied from the @emph{other} permission.
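
The following Python sketch splits a decoded @code{mode} value into
the three special bits and the nine permission bits.  It assumes that
the bit values used in the archive match the POSIX constants in
Python's @code{stat} module (04000, 02000 and 01000 for the special
bits); the function name @code{describe_mode} is hypothetical.

```python
import stat


def describe_mode(mode: int) -> dict:
    """Split a mode value into its special bits and permission string."""
    return {
        "setuid": bool(mode & stat.S_ISUID),
        "setgid": bool(mode & stat.S_ISGID),
        "sticky": bool(mode & stat.S_ISVTX),
        # filemode() prefixes a file-type character, dropped here.
        "permissions": stat.filemode(stat.S_IFREG | (mode & 0o7777))[1:],
    }
```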
 111
 112 The @code{uid} and @code{gid} fields are the numeric user and group
 113 @acronym{ID} of the file owners, respectively.  If the operating system does
 114 not support numeric user or group @acronym{ID}s, these fields should
 115 be ignored.
 116
 117 The @code{size} field is the size of the file in bytes; for archive
 118 members that are symbolic or hard links to another file, this field
 119 is specified as zero.
 120
 121 The @code{mtime} field represents the data modification time of the file at
 122 the time it was archived.  It represents the integer number of
 123 seconds since January 1, 1970, 00:00 Coordinated Universal Time.
 124
 125 The @code{chksum} field represents
 126 the simple sum of all bytes in the header block.  Each 8-bit
 127 byte in the header is added to an unsigned integer, initialized to
 128 zero, the precision of which shall be no less than seventeen bits.
 129 When calculating the checksum, the @code{chksum} field is treated as
 130 if it were filled with spaces (ASCII 32).
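
A minimal Python sketch of this calculation follows.  The offset of
the @code{chksum} field (byte 148, length 8) is assumed from the
standard header layout, and the function name is invented for the
example.  Seventeen bits are enough because the largest possible sum,
512 times 255, is 130560.

```python
CHKSUM_OFFSET = 148  # assumed position of chksum in the standard layout
CHKSUM_LENGTH = 8


def header_checksum(header: bytes) -> int:
    """Sum all header bytes, treating the chksum field as eight spaces."""
    if len(header) != 512:
        raise ValueError("a header block is exactly 512 bytes")
    end = CHKSUM_OFFSET + CHKSUM_LENGTH
    blanked = header[:CHKSUM_OFFSET] + b" " * CHKSUM_LENGTH + header[end:]
    return sum(blanked)
```

Because the field is blanked before summing, the result does not
depend on whatever checksum value is already stored in the header.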
 131
 132 The @code{typeflag} field specifies the type of file archived.  If a
 133 particular implementation does not recognize or permit the specified
 134 type, the file will be extracted as if it were a regular file.  As this
 135 action occurs, @command{tar} issues a warning to the standard error.
 136
 137 The @code{atime} and @code{ctime} fields are used in making incremental
 138 backups; they store, respectively, the particular file's access and
 139 status change times.
 140
 141 The @code{offset} field is used by the @option{--multi-volume} (@option{-M}) option when
 142 making a multi-volume archive.  It holds the number of bytes into
 143 the file at which archiving must resume in order to continue the file on the next
 144 tape; that is, it records the position at which a continued file
 145 picks up again.
 146
 147 The following fields were added to deal with sparse files.  A file
 148 is @dfn{sparse} if it contains unallocated blocks which end up being
 149 represented as zeros, i.e., no useful data.  A test to see if a file
 150 is sparse is to look at the number of blocks allocated for it versus the
 151 number of characters in the file; if there are fewer blocks allocated
 152 for the file than would normally be allocated for a file of that
 153 size, then the file is sparse.  This is the method @command{tar} uses to
 154 detect a sparse file, and once such a file is detected, it is treated
 155 differently from non-sparse files.
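
The comparison described above can be sketched in Python as shown
below.  This is a simplified form of the test, with invented function
names; it assumes a POSIX system on which @code{st_blocks} is reported
in 512-byte units, and it is not a copy of the check @GNUTAR{}
actually performs.

```python
import os

ST_BLOCK_UNIT = 512  # st_blocks is counted in 512-byte units on POSIX systems


def is_sparse(size_bytes: int, allocated_blocks: int) -> bool:
    """Return True if fewer bytes are allocated than the file size implies."""
    return allocated_blocks * ST_BLOCK_UNIT < size_bytes


def file_is_sparse(path: str) -> bool:
    """Apply the test to a file on disk (st_blocks is unavailable on Windows)."""
    st = os.stat(path)
    return is_sparse(st.st_size, st.st_blocks)
```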
 156
 157 Sparse files are often @code{dbm} files, or other database-type files
 158 which have data at some points and emptiness in the greater part of
 159 the file.  Such files can appear to be very large when an @samp{ls
 160 -l} is done on them, when in truth, there may be a very small amount
 161 of important data contained in the file.  It is thus undesirable
 162 to have @command{tar} think that it must back up this entire file, as
 163 great quantities of room are wasted on empty blocks, which can lead
 164 to running out of room on a tape far earlier than is necessary.
 165 Thus, sparse files are dealt with so that these empty blocks are
 166 not written to the tape.  Instead, what is written to the tape is a
 167 description, of sorts, of the sparse file: where the holes are, how
 168 big the holes are, and how much data is found at the end of the hole.
 169 This way, the file takes up potentially far less room on the tape,
 170 and when the file is extracted later on, it will look exactly the way
 171 it looked beforehand.  The following is a description of the fields
 172 used to handle a sparse file:
 173
 174 The @code{sp} field is an array of @code{struct sparse}.  Each @code{struct
 175 sparse} contains two 12-character strings which represent an offset
 176 into the file and a number of bytes to be written at that offset.
 177 The offset is absolute, and not relative to the offset in the preceding
 178 array element.
 179
 180 The header can hold four of these @code{struct sparse} at the moment;
 181 if more are needed, they are not stored in the header.
 182
 183 The @code{isextended} flag is set when an @code{extended_header}
 184 is needed to deal with a file.  Note that this means that this flag
 185 can only be set when dealing with a sparse file, and it is only set
 186 in the event that the description of the file will not fit in the
 187 allotted room for sparse structures in the header.  In other words,
 188 an @code{extended_header} is needed.
 189
 190 The @code{extended_header} structure is used for sparse files which
 191 need more sparse structures than can fit in the header.  The header can
 192 fit 4 such structures; if more are needed, the flag @code{isextended}
 193 gets set and the next block is an @code{extended_header}.
 194
 195 Each @code{extended_header} structure contains an array of 21
 196 sparse structures, along with an @code{isextended} flag similar to
 197 the one in the header.  There can be an indeterminate number of such
 198 @code{extended_header}s to describe a sparse file.
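
To make the sparse map concrete, here is a hedged Python sketch.  It
assumes that the two 12-character strings in each @code{struct sparse}
use the same ASCII octal encoding as the other numeric fields, and
that the original file size is known to the caller (the
@code{real_size} parameter is hypothetical).  All function names are
invented for the example.

```python
def parse_sparse_entry(raw: bytes):
    """Split one 24-byte struct sparse into (offset, numbytes)."""
    if len(raw) != 24:
        raise ValueError("a struct sparse is two 12-character strings")
    offset = int(raw[:12].rstrip(b" \0") or b"0", 8)
    numbytes = int(raw[12:].rstrip(b" \0") or b"0", 8)
    return offset, numbytes


def expand_sparse(entries, stored: bytes, real_size: int) -> bytes:
    """Rebuild a file: holes become zeros, data lands at absolute offsets."""
    out = bytearray(real_size)  # starts out as all zeros
    pos = 0
    for offset, numbytes in entries:
        out[offset:offset + numbytes] = stored[pos:pos + numbytes]
        pos += numbytes
    return bytes(out)


def extended_headers_needed(entry_count: int) -> int:
    """Count extended_header blocks: 4 entries fit in the header, 21 in each."""
    extra = max(0, entry_count - 4)
    return (extra + 20) // 21
```

Under these assumptions, a file described by 26 sparse structures
needs two @code{extended_header} blocks: 4 entries fit in the header,
21 in the first extended header, and the remaining one in the second.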
 199
 200 @table @asis
 201
 202 @item @code{REGTYPE}
 203 @itemx @code{AREGTYPE}
 204 These flags represent a regular file.  In order to be compatible
 205 with older versions of @command{tar}, a @code{typeflag} value of
 206 @code{AREGTYPE} should be silently recognized as a regular file.
 207 New archives should be created using @code{REGTYPE}.  Also, for
 208 backward compatibility, @command{tar} treats a regular file whose name
 209 ends with a slash as a directory.
 210
 211 @item @code{LNKTYPE}
 212 This flag represents a file linked to another file, of any type,
 213 previously archived.  Such files are identified in Unix by each
 214 file having the same device and inode number.  The linked-to name is
 215 specified in the @code{linkname} field with a trailing null.
 216
 217 @item @code{SYMTYPE}
 218 This represents a symbolic link to another file.  The linked-to name
 219 is specified in the @code{linkname} field with a trailing null.
 220
 221 @item @code{CHRTYPE}
 222 @itemx @code{BLKTYPE}
 223 These represent character special files and block special files
 224 respectively.  In this case the @code{devmajor} and @code{devminor}
 225 fields will contain the major and minor device numbers respectively.
 226 Operating systems may map the device specifications to their own
 227 local specification, or may ignore the entry.
 228
 229 @item @code{DIRTYPE}
 230 This flag specifies a directory or sub-directory.  The directory
 231 name in the @code{name} field should end with a slash.  On systems where
 232 disk allocation is performed on a directory basis, the @code{size} field
 233 will contain the maximum number of bytes (which may be rounded to
 234 the nearest disk block allocation unit) which the directory may
 235 hold.  A @code{size} field of zero indicates no such limiting.  Systems
 236 which do not support limiting in this manner should ignore the
 237 @code{size} field.
 238
 239 @item @code{FIFOTYPE}
 240 This specifies a FIFO special file.  Note that the archiving of a
 241 FIFO file archives the existence of this file and not its contents.
 242
 243 @item @code{CONTTYPE}
 244 This specifies a contiguous file, which is the same as a normal
 245 file except that, in operating systems which support it, all its
 246 space is allocated contiguously on the disk.  Operating systems
 247 which do not allow contiguous allocation should silently treat this
 248 type as a normal file.
 249
 250 @item @code{A} @dots{} @code{Z}
 251 These are reserved for custom implementations.  Some of these are
 252 used in the @acronym{GNU} modified format, as described below.
 253
 254 @end table
 255
 256 Other values are reserved for specification in future revisions of
 257 the P1003 standard, and should not be used by any @command{tar} program.
 258
 259 The @code{magic} field indicates that this archive was output in
 260 the P1003 archive format.  If this field contains @code{TMAGIC},
 261 the @code{uname} and @code{gname} fields will contain the ASCII
 262 representation of the owner and group of the file respectively.
 263 If these names are found on the restoring system, the corresponding user and group @acronym{ID}s are used rather than the values in
 264 the @code{uid} and @code{gid} fields.
 265
 266 For references, see ISO/IEC 9945-1:1990 or IEEE Std 1003.1-1990, pages
 267 169-173 (section 10.1) for @cite{Archive/Interchange File Format}; and
 268 IEEE Std 1003.2-1992, pages 380-388 (section 4.48) and pages 936-940
 269 (section E.4.48) for @cite{pax - Portable archive interchange}.
 270
 271 @node Extensions
 272 @unnumberedsec @acronym{GNU} Extensions to the Archive Format
 273 @UNREVISED{}
 274
 275 The @acronym{GNU} format uses additional file types to describe new types of
 276 files in an archive.  These are listed below.
 277
 278 @table @code
 279 @item GNUTYPE_DUMPDIR
 280 @itemx 'D'
 281 This represents a directory and a list of files created by the
 282 @option{--incremental} (@option{-G}) option.  The @code{size} field gives the total
 283 size of the associated list of files.  Each file name is preceded by
 284 either a @samp{Y} (the file should be in this archive) or an @samp{N}
 285 (the file is a directory, or is not stored in the archive).  Each file
 286 name is terminated by a null.  There is an additional null after the
 287 last file name.
 288
 289 @item GNUTYPE_MULTIVOL
 290 @itemx 'M'
 291 This represents a file continued from another volume of a multi-volume
 292 archive created with the @option{--multi-volume} (@option{-M}) option.  The original
 293 type of the file is not given here.  The @code{size} field gives the
 294 maximum size of this piece of the file (assuming the volume does
 295 not end before the file is written out).  The @code{offset} field
 296 gives the offset from the beginning of the file where this part of
 297 the file begins.  Thus @code{size} plus @code{offset} should equal
 298 the original size of the file.
 299
 300 @item GNUTYPE_SPARSE
 301 @itemx 'S'
 302 This flag indicates that we are dealing with a sparse file.  Note
 303 that archiving a sparse file requires special operations to find
 304 holes in the file and to record the positions of these holes, along
 305 with the number of bytes of data found after each hole.
 306
 307 @item GNUTYPE_VOLHDR
 308 @itemx 'V'
 309 This file type is used to mark the volume header that was given with
 310 the @option{--label=@var{archive-label}} (@option{-V @var{archive-label}}) option when the archive was created.  The @code{name}
 311 field contains the name given after the @option{--label=@var{archive-label}} (@option{-V @var{archive-label}}) option.
 312 The @code{size} field is zero.  Only the first file in each volume
 313 of an archive should have this type.
 314
 315 @end table
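
The file list carried by a @code{GNUTYPE_DUMPDIR} entry, as described
in the table above, can be decoded with a short Python sketch.  The
function name is hypothetical, and the sketch covers only the
@samp{Y} and @samp{N} prefixes mentioned here, decoding names as
UTF-8 by assumption.

```python
def parse_dumpdir(data: bytes) -> list:
    """Return (flag, name) pairs from a null-separated dumpdir listing."""
    entries = []
    for item in data.split(b"\0"):
        if not item:
            break  # the additional null after the last name ends the list
        flag = chr(item[0])
        name = item[1:].decode("utf-8", "surrogateescape")
        entries.append((flag, name))
    return entries
```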
 316
 317 For fields containing numbers or timestamps that are out of range for
 318 the basic format, the @acronym{GNU} format uses a base-256
 319 representation instead of an ASCII octal number.  If the leading byte
 320 is 0xff (255), all the bytes of the field (including the leading byte)
 321 are concatenated in big-endian order, with the result being a negative
 322 number expressed in two's complement form.  If the leading byte is
 323 0x80 (128), the non-leading bytes of the field are concatenated in
 324 big-endian order, with the result being a positive number expressed in
 325 binary form.  Leading bytes other than 0xff, 0x80 and ASCII octal
 326 digits are reserved for future use, as are base-256 representations of
 327 values that would be in range for the basic format.
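
The rules above can be sketched in Python as a decoder that accepts
both the basic octal form and the base-256 form.  The function name
is invented for this illustration, and leading bytes that the text
reserves for future use are not given any special handling here.

```python
def decode_numeric_field(field: bytes) -> int:
    """Decode a numeric header field in octal or GNU base-256 form."""
    if not field:
        raise ValueError("empty field")
    lead = field[0]
    if lead == 0xFF:
        # Negative: every byte, leading byte included, is part of a
        # big-endian two's complement number.
        return int.from_bytes(field, "big", signed=True)
    if lead == 0x80:
        # Positive: only the non-leading bytes carry the value.
        return int.from_bytes(field[1:], "big")
    digits = field.rstrip(b" \0")
    return int(digits, 8) if digits else 0
```

For a 12-byte field, the octal form holds at most eleven digits, so a
size of 2 to the power 33 bytes has to be stored in base-256 form.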
 328
 329 You may have trouble reading a @acronym{GNU} format archive on a
 330 non-@acronym{GNU} system if the options @option{--incremental} (@option{-G}),
 331 @option{--multi-volume} (@option{-M}), @option{--sparse} (@option{-S}), or @option{--label=@var{archive-label}} (@option{-V @var{archive-label}}) were
 332 used when writing the archive.  In general, if @command{tar} does not
 333 use the @acronym{GNU}-added fields of the header, other versions of
 334 @command{tar} should be able to read the archive.  Otherwise, the
 335 @command{tar} program will give an error, the most likely one being a
 336 checksum error.
 337
 338 @node Sparse Formats
 339 @unnumberedsec Storing Sparse Files
 340 @include sparse.texi
 341
 342 @node Snapshot Files
 343 @unnumberedsec Format of the Incremental Snapshot Files
 344 @include snapshot.texi
 345
 346 @node Dumpdir
 347 @unnumberedsec Dumpdir
 348 @include dumpdir.texi