doc/xz-file-format.txt

   1
   2 The .xz File Format
   3 ===================
   4
   5 Version 1.2.0 (2024-01-19)
   6
   7
   8         0. Preface
   9            0.1. Notices and Acknowledgements
  10            0.2. Getting the Latest Version
  11            0.3. Version History
  12         1. Conventions
  13            1.1. Byte and Its Representation
  14            1.2. Multibyte Integers
  15         2. Overall Structure of .xz File
  16            2.1. Stream
  17                 2.1.1. Stream Header
  18                        2.1.1.1. Header Magic Bytes
  19                        2.1.1.2. Stream Flags
  20                        2.1.1.3. CRC32
  21                 2.1.2. Stream Footer
  22                        2.1.2.1. CRC32
  23                        2.1.2.2. Backward Size
  24                        2.1.2.3. Stream Flags
  25                        2.1.2.4. Footer Magic Bytes
  26            2.2. Stream Padding
  27         3. Block
  28            3.1. Block Header
  29                 3.1.1. Block Header Size
  30                 3.1.2. Block Flags
  31                 3.1.3. Compressed Size
  32                 3.1.4. Uncompressed Size
  33                 3.1.5. List of Filter Flags
  34                 3.1.6. Header Padding
  35                 3.1.7. CRC32
  36            3.2. Compressed Data
  37            3.3. Block Padding
  38            3.4. Check
  39         4. Index
  40            4.1. Index Indicator
  41            4.2. Number of Records
  42            4.3. List of Records
  43                 4.3.1. Unpadded Size
  44                 4.3.2. Uncompressed Size
  45            4.4. Index Padding
  46            4.5. CRC32
  47         5. Filter Chains
  48            5.1. Alignment
  49            5.2. Security
  50            5.3. Filters
  51                 5.3.1. LZMA2
  52                 5.3.2. Branch/Call/Jump Filters for Executables
  53                 5.3.3. Delta
  54                        5.3.3.1. Format of the Encoded Output
  55            5.4. Custom Filter IDs
  56                 5.4.1. Reserved Custom Filter ID Ranges
  57         6. Cyclic Redundancy Checks
  58         7. References
  59
  60
  61 0. Preface
  62
  63         This document describes the .xz file format (filename suffix
  64         ".xz", MIME type "application/x-xz"). It is intended that this
  65         this format replace the old .lzma format used by LZMA SDK and
  66         LZMA Utils.
  67
  68
  69 0.1. Notices and Acknowledgements
  70
  71         This file format was designed by Lasse Collin
  72         <lasse.collin@tukaani.org> and Igor Pavlov.
  73
  74         Special thanks for helping with this document goes to
  75         Ville Koskinen. Thanks for helping with this document goes to
  76         Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
  77
  78         This document has been put into the public domain.
  79
  80
  81 0.2. Getting the Latest Version
  82
  83         The latest official version of this document can be downloaded
  84         from <https://xz.tukaani.org/format/xz-file-format.txt>.
  85
  86         Specific versions of this document have a filename
  87         xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
  88         For example, the version 1.0.0 of this document is available
  89         at <https://xz.tukaani.org/format/xz-file-format-1.0.0.txt>.
  90
  91
  92 0.3. Version History
  93
  94         Version   Date          Description
  95
  96         1.2.0     2024-01-19    Added RISC-V filter and updated URLs in
  97                                 Sections 0.2 and 7. The URL of this
  98                                 specification was changed.
  99
 100         1.1.0     2022-12-11    Added ARM64 filter and clarified 32-bit
 101                                 ARM endianness in Section 5.3.2,
 102                                 language improvements in Section 5.4
 103
 104         1.0.4     2009-08-27    Language improvements in Sections 1.2,
 105                                 2.1.1.2, 3.1.1, 3.1.2, and 5.3.1
 106
 107         1.0.3     2009-06-05    Spelling fixes in Sections 5.1 and 5.4
 108
 109         1.0.2     2009-06-04    Typo fixes in Sections 4 and 5.3.1
 110
 111         1.0.1     2009-06-01    Typo fix in Section 0.3 and minor
 112                                 clarifications to Sections 2, 2.2,
 113                                 3.3, 4.4, and 5.3.2
 114
 115         1.0.0     2009-01-14    The first official version
 116
 117
 118 1. Conventions
 119
 120         The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
 121         "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
 122         document are to be interpreted as described in [RFC-2119].
 123
 124         Indicating a warning means displaying a message, returning
 125         appropriate exit status, or doing something else to let the
 126         user know that something worth warning occurred. The operation
 127         SHOULD still finish if a warning is indicated.
 128
 129         Indicating an error means displaying a message, returning
 130         appropriate exit status, or doing something else to let the
 131         user know that something prevented successfully finishing the
 132         operation. The operation MUST be aborted once an error has
 133         been indicated.
 134
 135
 136 1.1. Byte and Its Representation
 137
 138         In this document, byte is always 8 bits.
 139
 140         A "null byte" has all bits unset. That is, the value of a null
 141         byte is 0x00.
 142
 143         To represent byte blocks, this document uses notation that
 144         is similar to the notation used in [RFC-1952]:
 145
 146             +-------+
 147             |  Foo  |   One byte.
 148             +-------+
 149
 150             +---+---+
 151             |  Foo  |   Two bytes; that is, some of the vertical bars
 152             +---+---+   can be missing.
 153
 154             +=======+
 155             |  Foo  |   Zero or more bytes.
 156             +=======+
 157
 158         In this document, a boxed byte or a byte sequence declared
 159         using this notation is called "a field". The example field
 160         above would be called "the Foo field" or plain "Foo".
 161
 162         If there are many fields, they may be split to multiple lines.
 163         This is indicated with an arrow ("--->"):
 164
 165             +=====+
 166             | Foo |
 167             +=====+
 168
 169                  +=====+
 170             ---> | Bar |
 171                  +=====+
 172
 173         The above is equivalent to this:
 174
 175             +=====+=====+
 176             | Foo | Bar |
 177             +=====+=====+
 178
 179
 180 1.2. Multibyte Integers
 181
 182         Multibyte integers of static length, such as CRC values,
 183         are stored in little endian byte order (least significant
 184         byte first).
 185
 186         When smaller values are more likely than bigger values (for
 187         example file sizes), multibyte integers are encoded in a
 188         variable-length representation:
 189           - Numbers in the range [0, 127] are copied as is, and take
 190             one byte of space.
 191           - Bigger numbers will occupy two or more bytes. All but the
 192             last byte of the multibyte representation have the highest
 193             (eighth) bit set.
 194
 195         For now, the value of the variable-length integers is limited
 196         to 63 bits, which limits the encoded size of the integer to
 197         nine bytes. These limits may be increased in the future if
 198         needed.
 199
 200         The following C code illustrates encoding and decoding of
 201         variable-length integers. The functions return the number of
 202         bytes occupied by the integer (1-9), or zero on error.
 203
 204             #include <stddef.h>
 205             #include <inttypes.h>
 206
 207             size_t
 208             encode(uint8_t buf[static 9], uint64_t num)
 209             {
 210                 if (num > UINT64_MAX / 2)
 211                     return 0;
 212
 213                 size_t i = 0;
 214
 215                 while (num >= 0x80) {
 216                     buf[i++] = (uint8_t)(num) | 0x80;
 217                     num >>= 7;
 218                 }
 219
 220                 buf[i++] = (uint8_t)(num);
 221
 222                 return i;
 223             }
 224
 225             size_t
 226             decode(const uint8_t buf[], size_t size_max, uint64_t *num)
 227             {
 228                 if (size_max == 0)
 229                     return 0;
 230
 231                 if (size_max > 9)
 232                     size_max = 9;
 233
 234                 *num = buf[0] & 0x7F;
 235                 size_t i = 0;
 236
 237                 while (buf[i++] & 0x80) {
 238                     if (i >= size_max || buf[i] == 0x00)
 239                         return 0;
 240
 241                     *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
 242                 }
 243
 244                 return i;
 245             }
 246
 247
 248 2. Overall Structure of .xz File
 249
 250         A standalone .xz files consist of one or more Streams which may
 251         have Stream Padding between or after them:
 252
 253             +========+================+========+================+
 254             | Stream | Stream Padding | Stream | Stream Padding | ...
 255             +========+================+========+================+
 256
 257         The sizes of Stream and Stream Padding are always multiples
 258         of four bytes, thus the size of every valid .xz file MUST be
 259         a multiple of four bytes.
 260
 261         While a typical file contains only one Stream and no Stream
 262         Padding, a decoder handling standalone .xz files SHOULD support
 263         files that have more than one Stream or Stream Padding.
 264
 265         In contrast to standalone .xz files, when the .xz file format
 266         is used as an internal part of some other file format or
 267         communication protocol, it usually is expected that the decoder
 268         stops after the first Stream, and doesn't look for Stream
 269         Padding or possibly other Streams.
 270
 271
 272 2.1. Stream
 273
 274         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 275         |     Stream Header     | Block | Block | ... | Block |
 276         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 277
 278              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 279         ---> | Index |     Stream Footer     |
 280              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 281
 282         All the above fields have a size that is a multiple of four. If
 283         Stream is used as an internal part of another file format, it
 284         is RECOMMENDED to make the Stream start at an offset that is
 285         a multiple of four bytes.
 286
 287         Stream Header, Index, and Stream Footer are always present in
 288         a Stream. The maximum size of the Index field is 16 GiB (2^34).
 289
 290         There are zero or more Blocks. The maximum number of Blocks is
 291         limited only by the maximum size of the Index field.
 292
 293         Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
 294         The same limit applies to the total amount of uncompressed
 295         data stored in a Stream.
 296
 297         If an implementation supports handling .xz files with multiple
 298         concatenated Streams, it MAY apply the above limits to the file
 299         as a whole instead of limiting per Stream basis.
 300
 301
 302 2.1.1. Stream Header
 303
 304         +---+---+---+---+---+---+-------+------+--+--+--+--+
 305         |  Header Magic Bytes   | Stream Flags |   CRC32   |
 306         +---+---+---+---+---+---+-------+------+--+--+--+--+
 307
 308
 309 2.1.1.1. Header Magic Bytes
 310
 311         The first six (6) bytes of the Stream are so called Header
 312         Magic Bytes. They can be used to identify the file type.
 313
 314             Using a C array and ASCII:
 315             const uint8_t HEADER_MAGIC[6]
 316                     = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
 317
 318             In plain hexadecimal:
 319             FD 37 7A 58 5A 00
 320
 321         Notes:
 322           - The first byte (0xFD) was chosen so that the files cannot
 323             be erroneously detected as being in .lzma format, in which
 324             the first byte is in the range [0x00, 0xE0].
 325           - The sixth byte (0x00) was chosen to prevent applications
 326             from misdetecting the file as a text file.
 327
 328         If the Header Magic Bytes don't match, the decoder MUST
 329         indicate an error.
 330
 331
 332 2.1.1.2. Stream Flags
 333
 334         The first byte of Stream Flags is always a null byte. In the
 335         future, this byte may be used to indicate a new Stream version
 336         or other Stream properties.
 337
 338         The second byte of Stream Flags is a bit field:
 339
 340             Bit(s)  Mask  Description
 341              0-3    0x0F  Type of Check (see Section 3.4):
 342                               ID    Size      Check name
 343                               0x00   0 bytes  None
 344                               0x01   4 bytes  CRC32
 345                               0x02   4 bytes  (Reserved)
 346                               0x03   4 bytes  (Reserved)
 347                               0x04   8 bytes  CRC64
 348                               0x05   8 bytes  (Reserved)
 349                               0x06   8 bytes  (Reserved)
 350                               0x07  16 bytes  (Reserved)
 351                               0x08  16 bytes  (Reserved)
 352                               0x09  16 bytes  (Reserved)
 353                               0x0A  32 bytes  SHA-256
 354                               0x0B  32 bytes  (Reserved)
 355                               0x0C  32 bytes  (Reserved)
 356                               0x0D  64 bytes  (Reserved)
 357                               0x0E  64 bytes  (Reserved)
 358                               0x0F  64 bytes  (Reserved)
 359              4-7    0xF0  Reserved for future use; MUST be zero for now.
 360
 361         Implementations SHOULD support at least the Check IDs 0x00
 362         (None) and 0x01 (CRC32). Supporting other Check IDs is
 363         OPTIONAL. If an unsupported Check is used, the decoder SHOULD
 364         indicate a warning or error.
 365
 366         If any reserved bit is set, the decoder MUST indicate an error.
 367         It is possible that there is a new field present which the
 368         decoder is not aware of, and can thus parse the Stream Header
 369         incorrectly.
 370
 371
 372 2.1.1.3. CRC32
 373
 374         The CRC32 is calculated from the Stream Flags field. It is
 375         stored as an unsigned 32-bit little endian integer. If the
 376         calculated value does not match the stored one, the decoder
 377         MUST indicate an error.
 378
 379         The idea is that Stream Flags would always be two bytes, even
 380         if new features are needed. This way old decoders will be able
 381         to verify the CRC32 calculated from Stream Flags, and thus
 382         distinguish between corrupt files (CRC32 doesn't match) and
 383         files that the decoder doesn't support (CRC32 matches but
 384         Stream Flags has reserved bits set).
 385
 386
 387 2.1.2. Stream Footer
 388
 389         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 390         | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
 391         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 392
 393
 394 2.1.2.1. CRC32
 395
 396         The CRC32 is calculated from the Backward Size and Stream Flags
 397         fields. It is stored as an unsigned 32-bit little endian
 398         integer. If the calculated value does not match the stored one,
 399         the decoder MUST indicate an error.
 400
 401         The reason to have the CRC32 field before the Backward Size and
 402         Stream Flags fields is to keep the four-byte fields aligned to
 403         a multiple of four bytes.
 404
 405
 406 2.1.2.2. Backward Size
 407
 408         Backward Size is stored as a 32-bit little endian integer,
 409         which indicates the size of the Index field as multiple of
 410         four bytes, minimum value being four bytes:
 411
 412             real_backward_size = (stored_backward_size + 1) * 4;
 413
 414         If the stored value does not match the real size of the Index
 415         field, the decoder MUST indicate an error.
 416
 417         Using a fixed-size integer to store Backward Size makes
 418         it slightly simpler to parse the Stream Footer when the
 419         application needs to parse the Stream backwards.
 420
 421
 422 2.1.2.3. Stream Flags
 423
 424         This is a copy of the Stream Flags field from the Stream
 425         Header. The information stored to Stream Flags is needed
 426         when parsing the Stream backwards. The decoder MUST compare
 427         the Stream Flags fields in both Stream Header and Stream
 428         Footer, and indicate an error if they are not identical.
 429
 430
 431 2.1.2.4. Footer Magic Bytes
 432
 433         As the last step of the decoding process, the decoder MUST
 434         verify the existence of Footer Magic Bytes. If they don't
 435         match, an error MUST be indicated.
 436
 437             Using a C array and ASCII:
 438             const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
 439
 440             In hexadecimal:
 441             59 5A
 442
 443         The primary reason to have Footer Magic Bytes is to make
 444         it easier to detect incomplete files quickly, without
 445         uncompressing. If the file does not end with Footer Magic Bytes
 446         (excluding Stream Padding described in Section 2.2), it cannot
 447         be undamaged, unless someone has intentionally appended garbage
 448         after the end of the Stream.
 449
 450
 451 2.2. Stream Padding
 452
 453         Only the decoders that support decoding of concatenated Streams
 454         MUST support Stream Padding.
 455
 456         Stream Padding MUST contain only null bytes. To preserve the
 457         four-byte alignment of consecutive Streams, the size of Stream
 458         Padding MUST be a multiple of four bytes. Empty Stream Padding
 459         is allowed. If these requirements are not met, the decoder MUST
 460         indicate an error.
 461
 462         Note that non-empty Stream Padding is allowed at the end of the
 463         file; there doesn't need to be a new Stream after non-empty
 464         Stream Padding. This can be convenient in certain situations
 465         [GNU-tar].
 466
 467         The possibility of Stream Padding MUST be taken into account
 468         when designing an application that parses Streams backwards,
 469         and the application supports concatenated Streams.
 470
 471
 472 3. Block
 473
 474         +==============+=================+===============+=======+
 475         | Block Header | Compressed Data | Block Padding | Check |
 476         +==============+=================+===============+=======+
 477
 478
 479 3.1. Block Header
 480
 481         +-------------------+-------------+=================+
 482         | Block Header Size | Block Flags | Compressed Size |
 483         +-------------------+-------------+=================+
 484
 485              +===================+======================+
 486         ---> | Uncompressed Size | List of Filter Flags |
 487              +===================+======================+
 488
 489              +================+--+--+--+--+
 490         ---> | Header Padding |   CRC32   |
 491              +================+--+--+--+--+
 492
 493
 494 3.1.1. Block Header Size
 495
 496         This field overlaps with the Index Indicator field (see
 497         Section 4.1).
 498
 499         This field contains the size of the Block Header field,
 500         including the Block Header Size field itself. Valid values are
 501         in the range [0x01, 0xFF], which indicate the size of the Block
 502         Header as multiples of four bytes, minimum size being eight
 503         bytes:
 504
 505             real_header_size = (encoded_header_size + 1) * 4;
 506
 507         If a Block Header bigger than 1024 bytes is needed in the
 508         future, a new field can be added between the Block Header and
 509         Compressed Data fields. The presence of this new field would
 510         be indicated in the Block Header field.
 511
 512
 513 3.1.2. Block Flags
 514
 515         The Block Flags field is a bit field:
 516
 517             Bit(s)  Mask  Description
 518              0-1    0x03  Number of filters (1-4)
 519              2-5    0x3C  Reserved for future use; MUST be zero for now.
 520               6     0x40  The Compressed Size field is present.
 521               7     0x80  The Uncompressed Size field is present.
 522
 523         If any reserved bit is set, the decoder MUST indicate an error.
 524         It is possible that there is a new field present which the
 525         decoder is not aware of, and can thus parse the Block Header
 526         incorrectly.
 527
 528
 529 3.1.3. Compressed Size
 530
 531         This field is present only if the appropriate bit is set in
 532         the Block Flags field (see Section 3.1.2).
 533
 534         The Compressed Size field contains the size of the Compressed
 535         Data field, which MUST be non-zero. Compressed Size is stored
 536         using the encoding described in Section 1.2. If the Compressed
 537         Size doesn't match the size of the Compressed Data field, the
 538         decoder MUST indicate an error.
 539
 540
 541 3.1.4. Uncompressed Size
 542
 543         This field is present only if the appropriate bit is set in
 544         the Block Flags field (see Section 3.1.2).
 545
 546         The Uncompressed Size field contains the size of the Block
 547         after uncompressing. Uncompressed Size is stored using the
 548         encoding described in Section 1.2. If the Uncompressed Size
 549         does not match the real uncompressed size, the decoder MUST
 550         indicate an error.
 551
 552         Storing the Compressed Size and Uncompressed Size fields serves
 553         several purposes:
 554           - The decoder knows how much memory it needs to allocate
 555             for a temporary buffer in multithreaded mode.
 556           - Simple error detection: wrong size indicates a broken file.
 557           - Seeking forwards to a specific location in streamed mode.
 558
 559         It should be noted that the only reliable way to determine
 560         the real uncompressed size is to uncompress the Block,
 561         because the Block Header and Index fields may contain
 562         (intentionally or unintentionally) invalid information.
 563
 564
 565 3.1.5. List of Filter Flags
 566
 567         +================+================+     +================+
 568         | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
 569         +================+================+     +================+
 570
 571         The number of Filter Flags fields is stored in the Block Flags
 572         field (see Section 3.1.2).
 573
 574         The format of each Filter Flags field is as follows:
 575
 576             +===========+====================+===================+
 577             | Filter ID | Size of Properties | Filter Properties |
 578             +===========+====================+===================+
 579
 580         Both Filter ID and Size of Properties are stored using the
 581         encoding described in Section 1.2. Size of Properties indicates
 582         the size of the Filter Properties field as bytes. The list of
 583         officially defined Filter IDs and the formats of their Filter
 584         Properties are described in Section 5.3.
 585
 586         Filter IDs greater than or equal to 0x4000_0000_0000_0000
 587         (2^62) are reserved for implementation-specific internal use.
 588         These Filter IDs MUST never be used in List of Filter Flags.
 589
 590
 591 3.1.6. Header Padding
 592
 593         This field contains as many null byte as it is needed to make
 594         the Block Header have the size specified in Block Header Size.
 595         If any of the bytes are not null bytes, the decoder MUST
 596         indicate an error. It is possible that there is a new field
 597         present which the decoder is not aware of, and can thus parse
 598         the Block Header incorrectly.
 599
 600
 601 3.1.7. CRC32
 602
 603         The CRC32 is calculated over everything in the Block Header
 604         field except the CRC32 field itself. It is stored as an
 605         unsigned 32-bit little endian integer. If the calculated
 606         value does not match the stored one, the decoder MUST indicate
 607         an error.
 608
 609         By verifying the CRC32 of the Block Header before parsing the
 610         actual contents allows the decoder to distinguish between
 611         corrupt and unsupported files.
 612
 613
 614 3.2. Compressed Data
 615
 616         The format of Compressed Data depends on Block Flags and List
 617         of Filter Flags. Excluding the descriptions of the simplest
 618         filters in Section 5.3, the format of the filter-specific
 619         encoded data is out of scope of this document.
 620
 621
 622 3.3. Block Padding
 623
 624         Block Padding MUST contain 0-3 null bytes to make the size of
 625         the Block a multiple of four bytes. This can be needed when
 626         the size of Compressed Data is not a multiple of four. If any
 627         of the bytes in Block Padding are not null bytes, the decoder
 628         MUST indicate an error.
 629
 630
 631 3.4. Check
 632
 633         The type and size of the Check field depends on which bits
 634         are set in the Stream Flags field (see Section 2.1.1.2).
 635
 636         The Check, when used, is calculated from the original
 637         uncompressed data. If the calculated Check does not match the
 638         stored one, the decoder MUST indicate an error. If the selected
 639         type of Check is not supported by the decoder, it SHOULD
 640         indicate a warning or error.
 641
 642
 643 4. Index
 644
 645         +-----------------+===================+
 646         | Index Indicator | Number of Records |
 647         +-----------------+===================+
 648
 649              +=================+===============+-+-+-+-+
 650         ---> | List of Records | Index Padding | CRC32 |
 651              +=================+===============+-+-+-+-+
 652
 653         Index serves several purposes. Using it, one can
 654           - verify that all Blocks in a Stream have been processed;
 655           - find out the uncompressed size of a Stream; and
 656           - quickly access the beginning of any Block (random access).
 657
 658
 659 4.1. Index Indicator
 660
 661         This field overlaps with the Block Header Size field (see
 662         Section 3.1.1). The value of Index Indicator is always 0x00.
 663
 664
 665 4.2. Number of Records
 666
 667         This field indicates how many Records there are in the List
 668         of Records field, and thus how many Blocks there are in the
 669         Stream. The value is stored using the encoding described in
 670         Section 1.2. If the decoder has decoded all the Blocks of the
 671         Stream, and then notices that the Number of Records doesn't
 672         match the real number of Blocks, the decoder MUST indicate an
 673         error.
 674
 675
 676 4.3. List of Records
 677
 678         List of Records consists of as many Records as indicated by the
 679         Number of Records field:
 680
 681             +========+========+
 682             | Record | Record | ...
 683             +========+========+
 684
 685         Each Record contains information about one Block:
 686
 687             +===============+===================+
 688             | Unpadded Size | Uncompressed Size |
 689             +===============+===================+
 690
 691         If the decoder has decoded all the Blocks of the Stream, it
 692         MUST verify that the contents of the Records match the real
 693         Unpadded Size and Uncompressed Size of the respective Blocks.
 694
 695         Implementation hint: It is possible to verify the Index with
 696         constant memory usage by calculating for example SHA-256 of
 697         both the real size values and the List of Records, then
 698         comparing the hash values. Implementing this using
 699         non-cryptographic hash like CRC32 SHOULD be avoided unless
 700         small code size is important.
 701
 702         If the decoder supports random-access reading, it MUST verify
 703         that Unpadded Size and Uncompressed Size of every completely
 704         decoded Block match the sizes stored in the Index. If only
 705         partial Block is decoded, the decoder MUST verify that the
 706         processed sizes don't exceed the sizes stored in the Index.
 707
 708
 709 4.3.1. Unpadded Size
 710
 711         This field indicates the size of the Block excluding the Block
 712         Padding field. That is, Unpadded Size is the size of the Block
 713         Header, Compressed Data, and Check fields. Unpadded Size is
 714         stored using the encoding described in Section 1.2. The value
 715         MUST never be zero; with the current structure of Blocks, the
 716         actual minimum value for Unpadded Size is five.
 717
 718         Implementation note: Because the size of the Block Padding
 719         field is not included in Unpadded Size, calculating the total
 720         size of a Stream or doing random-access reading requires
 721         calculating the actual size of the Blocks by rounding Unpadded
 722         Sizes up to the next multiple of four.
 723
 724         The reason to exclude Block Padding from Unpadded Size is to
 725         ease making a raw copy of Compressed Data without Block
 726         Padding. This can be useful, for example, if someone wants
 727         to convert Streams to some other file format quickly.
 728
 729
 730 4.3.2. Uncompressed Size
 731
 732         This field indicates the Uncompressed Size of the respective
 733         Block as bytes. The value is stored using the encoding
 734         described in Section 1.2.
 735
 736
 737 4.4. Index Padding
 738
 739         This field MUST contain 0-3 null bytes to pad the Index to
 740         a multiple of four bytes. If any of the bytes are not null
 741         bytes, the decoder MUST indicate an error.
 742
 743
 744 4.5. CRC32
 745
 746         The CRC32 is calculated over everything in the Index field
 747         except the CRC32 field itself. The CRC32 is stored as an
 748         unsigned 32-bit little endian integer. If the calculated
 749         value does not match the stored one, the decoder MUST indicate
 750         an error.
 751
 752
 753 5. Filter Chains
 754
 755         The Block Flags field defines how many filters are used. When
 756         more than one filter is used, the filters are chained; that is,
 757         the output of one filter is the input of another filter. The
 758         following figure illustrates the direction of data flow.
 759
 760                     v   Uncompressed Data   ^
 761                     |       Filter 0        |
 762             Encoder |       Filter 1        | Decoder
 763                     |       Filter n        |
 764                     v    Compressed Data    ^
 765
 766
 767 5.1. Alignment
 768
 769         Alignment of uncompressed input data is usually the job of
 770         the application producing the data. For example, to get the
 771         best results, an archiver tool should make sure that all
 772         PowerPC executable files in the archive stream start at
 773         offsets that are multiples of four bytes.
 774
 775         Some filters, for example LZMA2, can be configured to take
 776         advantage of specified alignment of input data. Note that
 777         taking advantage of aligned input can be beneficial also when
 778         a filter is not the first filter in the chain. For example,
 779         if you compress PowerPC executables, you may want to use the
 780         PowerPC filter and chain that with the LZMA2 filter. Because
 781         not only the input but also the output alignment of the PowerPC
 782         filter is four bytes, it is now beneficial to set LZMA2
 783         settings so that the LZMA2 encoder can take advantage of its
 784         four-byte-aligned input data.
 785
 786         The output of the last filter in the chain is stored to the
 787         Compressed Data field, which is is guaranteed to be aligned
 788         to a multiple of four bytes relative to the beginning of the
 789         Stream. This can increase
 790           - speed, if the filtered data is handled multiple bytes at
 791             a time by the filter-specific encoder and decoder,
 792             because accessing aligned data in computer memory is
 793             usually faster; and
 794           - compression ratio, if the output data is later compressed
 795             with an external compression tool.
 796
 797
 798 5.2. Security
 799
 800         If filters would be allowed to be chained freely, it would be
 801         possible to create malicious files, that would be very slow to
 802         decode. Such files could be used to create denial of service
 803         attacks.
 804
 805         Slow files could occur when multiple filters are chained:
 806
 807             v   Compressed input data
 808             |   Filter 1 decoder (last filter)
 809             |   Filter 0 decoder (non-last filter)
 810             v   Uncompressed output data
 811
 812         The decoder of the last filter in the chain produces a lot of
 813         output from little input. Another filter in the chain takes the
 814         output of the last filter, and produces very little output
 815         while consuming a lot of input. As a result, a lot of data is
 816         moved inside the filter chain, but the filter chain as a whole
 817         gets very little work done.
 818
 819         To prevent this kind of slow files, there are restrictions on
 820         how the filters can be chained. These restrictions MUST be
 821         taken into account when designing new filters.
 822
 823         The maximum number of filters in the chain has been limited to
 824         four, thus there can be at maximum of three non-last filters.
 825         Of these three non-last filters, only two are allowed to change
 826         the size of the data.
 827
 828         The non-last filters, that change the size of the data, MUST
 829         have a limit how much the decoder can compress the data: the
 830         decoder SHOULD produce at least n bytes of output when the
 831         filter is given 2n bytes of input. This  limit is not
 832         absolute, but significant deviations MUST be avoided.
 833
 834         The above limitations guarantee that if the last filter in the
 835         chain produces 4n bytes of output, the chain as a whole will
 836         produce at least n bytes of output.
 837
 838
 839 5.3. Filters
 840
 841 5.3.1. LZMA2
 842
 843         LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
 844         compression algorithm with high compression ratio and fast
 845         decompression. LZMA is based on LZ77 and range coding
 846         algorithms.
 847
 848         LZMA2 is an extension on top of the original LZMA. LZMA2 uses
 849         LZMA internally, but adds support for flushing the encoder,
 850         uncompressed chunks, eases stateful decoder implementations,
 851         and improves support for multithreading. Thus, the plain LZMA
 852         will not be supported in this file format.
 853
 854             Filter ID:                  0x21
 855             Size of Filter Properties:  1 byte
 856             Changes size of data:       Yes
 857             Allow as a non-last filter: No
 858             Allow as the last filter:   Yes
 859
 860             Preferred alignment:
 861                 Input data:             Adjustable to 1/2/4/8/16 byte(s)
 862                 Output data:            1 byte
 863
 864         The format of the one-byte Filter Properties field is as
 865         follows:
 866
 867             Bits   Mask   Description
 868             0-5    0x3F   Dictionary Size
 869             6-7    0xC0   Reserved for future use; MUST be zero for now.
 870
 871         Dictionary Size is encoded with one-bit mantissa and five-bit
 872         exponent. The smallest dictionary size is 4 KiB and the biggest
 873         is 4 GiB.
 874
 875             Raw value   Mantissa   Exponent   Dictionary size
 876                 0           2         11         4 KiB
 877                 1           3         11         6 KiB
 878                 2           2         12         8 KiB
 879                 3           3         12        12 KiB
 880                 4           2         13        16 KiB
 881                 5           3         13        24 KiB
 882                 6           2         14        32 KiB
 883               ...         ...        ...      ...
 884                35           3         27       768 MiB
 885                36           2         28      1024 MiB
 886                37           3         29      1536 MiB
 887                38           2         30      2048 MiB
 888                39           3         30      3072 MiB
 889                40           2         31      4096 MiB - 1 B
 890
 891         Instead of having a table in the decoder, the dictionary size
 892         can be decoded using the following C code:
 893
 894             const uint8_t bits = get_dictionary_flags() & 0x3F;
 895             if (bits > 40)
 896                 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
 897
 898             uint32_t dictionary_size;
 899             if (bits == 40) {
 900                 dictionary_size = UINT32_MAX;
 901             } else {
 902                 dictionary_size = 2 | (bits & 1);
 903                 dictionary_size <<= bits / 2 + 11;
 904             }
 905
 906
 907 5.3.2. Branch/Call/Jump Filters for Executables
 908
 909         These filters convert relative branch, call, and jump
 910         instructions to their absolute counterparts in executable
 911         files. This conversion increases redundancy and thus
 912         compression ratio.
 913
 914             Size of Filter Properties:  0 or 4 bytes
 915             Changes size of data:       No
 916             Allow as a non-last filter: Yes
 917             Allow as the last filter:   No
 918
 919         Below is the list of filters in this category. The alignment
 920         is the same for both input and output data.
 921
 922             Filter ID   Alignment   Description
 923               0x04       1 byte     x86 filter (BCJ)
 924               0x05       4 bytes    PowerPC (big endian) filter
 925               0x06      16 bytes    IA64 filter
 926               0x07       4 bytes    ARM filter [1]
 927               0x08       2 bytes    ARM Thumb filter [1]
 928               0x09       4 bytes    SPARC filter
 929               0x0A       4 bytes    ARM64 filter [2]
 930               0x0B       2 bytes    RISC-V filter
 931
 932               [1] These are for little endian instruction encoding.
 933                   This must not be confused with data endianness.
 934                   A processor configured for big endian data access
 935                   may still use little endian instruction encoding.
 936                   The filters don't care about the data endianness.
 937
 938               [2] 4096-byte alignment gives the best results
 939                   because the address in the ADRP instruction
 940                   is a multiple of 4096 bytes.
 941
 942         If the size of Filter Properties is four bytes, the Filter
 943         Properties field contains the start offset used for address
 944         conversions. It is stored as an unsigned 32-bit little endian
 945         integer. The start offset MUST be a multiple of the alignment
 946         of the filter as listed in the table above; if it isn't, the
 947         decoder MUST indicate an error. If the size of Filter
 948         Properties is zero, the start offset is zero.
 949
 950         Setting the start offset may be useful if an executable has
 951         multiple sections, and there are many cross-section calls.
 952         Taking advantage of this feature usually requires usage of
 953         the Subblock filter, whose design is not complete yet.
 954
 955
 956 5.3.3. Delta
 957
 958         The Delta filter may increase compression ratio when the value
 959         of the next byte correlates with the value of an earlier byte
 960         at specified distance.
 961
 962             Filter ID:                  0x03
 963             Size of Filter Properties:  1 byte
 964             Changes size of data:       No
 965             Allow as a non-last filter: Yes
 966             Allow as the last filter:   No
 967
 968             Preferred alignment:
 969                 Input data:             1 byte
 970                 Output data:            Same as the original input data
 971
 972         The Properties byte indicates the delta distance, which can be
 973         1-256 bytes backwards from the current byte: 0x00 indicates
 974         distance of 1 byte and 0xFF distance of 256 bytes.
 975
 976
 977 5.3.3.1. Format of the Encoded Output
 978
 979         The code below illustrates both encoding and decoding with
 980         the Delta filter.
 981
 982             // Distance is in the range [1, 256].
 983             const unsigned int distance = get_properties_byte() + 1;
 984             uint8_t pos = 0;
 985             uint8_t delta[256];
 986
 987             memset(delta, 0, sizeof(delta));
 988
 989             while (1) {
 990                 const int byte = read_byte();
 991                 if (byte == EOF)
 992                     break;
 993
 994                 uint8_t tmp = delta[(uint8_t)(distance + pos)];
 995                 if (is_encoder) {
 996                     tmp = (uint8_t)(byte) - tmp;
 997                     delta[pos] = (uint8_t)(byte);
 998                 } else {
 999                     tmp = (uint8_t)(byte) + tmp;
1000                     delta[pos] = tmp;
1001                 }
1002
1003                 write_byte(tmp);
1004                 --pos;
1005             }
1006
1007
1008 5.4. Custom Filter IDs
1009
1010         If a developer wants to use custom Filter IDs, there are two
1011         choices. The first choice is to contact Lasse Collin and ask
1012         him to allocate a range of IDs for the developer.
1013
1014         The second choice is to generate a 40-bit random integer
1015         which the developer can use as a personal Developer ID.
1016         To minimize the risk of collisions, Developer ID has to be
1017         a randomly generated integer, not manually selected "hex word".
1018         The following command, which works on many free operating
1019         systems, can be used to generate Developer ID:
1020
1021             dd if=/dev/urandom bs=5 count=1 | hexdump
1022
1023         The developer can then use the Developer ID to create unique
1024         (well, hopefully unique) Filter IDs.
1025
1026             Bits    Mask                    Description
1027              0-15   0x0000_0000_0000_FFFF   Filter ID
1028             16-55   0x00FF_FFFF_FFFF_0000   Developer ID
1029             56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F
1030
1031         The resulting 63-bit integer will use 9 bytes of space when
1032         stored using the encoding described in Section 1.2. To get
1033         a shorter ID, see the beginning of this Section how to
1034         request a custom ID range.
1035
1036
1037 5.4.1. Reserved Custom Filter ID Ranges
1038
1039         Range                       Description
1040         0x0000_0300 - 0x0000_04FF   Reserved to ease .7z compatibility
1041         0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
1042         0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
1043
1044
1045 6. Cyclic Redundancy Checks
1046
1047         There are several incompatible variations to calculate CRC32
1048         and CRC64. For simplicity and clarity, complete examples are
1049         provided to calculate the checks as they are used in this file
1050         format. Implementations MAY use different code as long as it
1051         gives identical results.
1052
1053         The program below reads data from standard input, calculates
1054         the CRC32 and CRC64 values, and prints the calculated values
1055         as big endian hexadecimal strings to standard output.
1056
1057             #include <stddef.h>
1058             #include <inttypes.h>
1059             #include <stdio.h>
1060
1061             uint32_t crc32_table[256];
1062             uint64_t crc64_table[256];
1063
1064             void
1065             init(void)
1066             {
1067                 static const uint32_t poly32 = UINT32_C(0xEDB88320);
1068                 static const uint64_t poly64
1069                         = UINT64_C(0xC96C5795D7870F42);
1070
1071                 for (size_t i = 0; i < 256; ++i) {
1072                     uint32_t crc32 = i;
1073                     uint64_t crc64 = i;
1074
1075                     for (size_t j = 0; j < 8; ++j) {
1076                         if (crc32 & 1)
1077                             crc32 = (crc32 >> 1) ^ poly32;
1078                         else
1079                             crc32 >>= 1;
1080
1081                         if (crc64 & 1)
1082                             crc64 = (crc64 >> 1) ^ poly64;
1083                         else
1084                             crc64 >>= 1;
1085                     }
1086
1087                     crc32_table[i] = crc32;
1088                     crc64_table[i] = crc64;
1089                 }
1090             }
1091
1092             uint32_t
1093             crc32(const uint8_t *buf, size_t size, uint32_t crc)
1094             {
1095                 crc = ~crc;
1096                 for (size_t i = 0; i < size; ++i)
1097                     crc = crc32_table[buf[i] ^ (crc & 0xFF)]
1098                             ^ (crc >> 8);
1099                 return ~crc;
1100             }
1101
1102             uint64_t
1103             crc64(const uint8_t *buf, size_t size, uint64_t crc)
1104             {
1105                 crc = ~crc;
1106                 for (size_t i = 0; i < size; ++i)
1107                     crc = crc64_table[buf[i] ^ (crc & 0xFF)]
1108                             ^ (crc >> 8);
1109                 return ~crc;
1110             }
1111
1112             int
1113             main()
1114             {
1115                 init();
1116
1117                 uint32_t value32 = 0;
1118                 uint64_t value64 = 0;
1119                 uint64_t total_size = 0;
1120                 uint8_t buf[8192];
1121
1122                 while (1) {
1123                     const size_t buf_size
1124                             = fread(buf, 1, sizeof(buf), stdin);
1125                     if (buf_size == 0)
1126                         break;
1127
1128                     total_size += buf_size;
1129                     value32 = crc32(buf, buf_size, value32);
1130                     value64 = crc64(buf, buf_size, value64);
1131                 }
1132
1133                 printf("Bytes:  %" PRIu64 "\n", total_size);
1134                 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
1135                 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
1136
1137                 return 0;
1138             }
1139
1140
1141 7. References
1142
1143         LZMA SDK - The original LZMA implementation
1144         https://7-zip.org/sdk.html
1145
1146         LZMA Utils - LZMA adapted to POSIX-like systems
1147         https://tukaani.org/lzma/
1148
1149         XZ Utils - The next generation of LZMA Utils
1150         https://xz.tukaani.org/xz-utils/
1151
1152         [RFC-1952]
1153         GZIP file format specification version 4.3
1154         https://www.ietf.org/rfc/rfc1952.txt
1155           - Notation of byte boxes in section "2.1. Overall conventions"
1156
1157         [RFC-2119]
1158         Key words for use in RFCs to Indicate Requirement Levels
1159         https://www.ietf.org/rfc/rfc2119.txt
1160
1161         [GNU-tar]
1162         GNU tar 1.35 manual
1163         https://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
1164           - Node 9.4.2 "Blocking Factor", paragraph that begins
1165             "gzip will complain about trailing garbage"
1166           - Note that this URL points to the latest version of the
1167             manual, and may some day not contain the note which is in
1168             1.35. For the exact version of the manual, download GNU
1169             tar 1.35: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.35.tar.gz
1170