doc/xz-file-format.txt

   1
   2 The .xz File Format
   3 ===================
   4
   5 Version 1.1.0 (2022-12-11)
   6
   7
   8         0. Preface
   9            0.1. Notices and Acknowledgements
  10            0.2. Getting the Latest Version
  11            0.3. Version History
  12         1. Conventions
  13            1.1. Byte and Its Representation
  14            1.2. Multibyte Integers
  15         2. Overall Structure of .xz File
  16            2.1. Stream
  17                 2.1.1. Stream Header
  18                        2.1.1.1. Header Magic Bytes
  19                        2.1.1.2. Stream Flags
  20                        2.1.1.3. CRC32
  21                 2.1.2. Stream Footer
  22                        2.1.2.1. CRC32
  23                        2.1.2.2. Backward Size
  24                        2.1.2.3. Stream Flags
  25                        2.1.2.4. Footer Magic Bytes
  26            2.2. Stream Padding
  27         3. Block
  28            3.1. Block Header
  29                 3.1.1. Block Header Size
  30                 3.1.2. Block Flags
  31                 3.1.3. Compressed Size
  32                 3.1.4. Uncompressed Size
  33                 3.1.5. List of Filter Flags
  34                 3.1.6. Header Padding
  35                 3.1.7. CRC32
  36            3.2. Compressed Data
  37            3.3. Block Padding
  38            3.4. Check
  39         4. Index
  40            4.1. Index Indicator
  41            4.2. Number of Records
  42            4.3. List of Records
  43                 4.3.1. Unpadded Size
  44                 4.3.2. Uncompressed Size
  45            4.4. Index Padding
  46            4.5. CRC32
  47         5. Filter Chains
  48            5.1. Alignment
  49            5.2. Security
  50            5.3. Filters
  51                 5.3.1. LZMA2
  52                 5.3.2. Branch/Call/Jump Filters for Executables
  53                 5.3.3. Delta
  54                        5.3.3.1. Format of the Encoded Output
  55            5.4. Custom Filter IDs
  56                 5.4.1. Reserved Custom Filter ID Ranges
  57         6. Cyclic Redundancy Checks
  58         7. References
  59
  60
  61 0. Preface
  62
  63         This document describes the .xz file format (filename suffix
  64         ".xz", MIME type "application/x-xz"). It is intended that this
  65         this format replace the old .lzma format used by LZMA SDK and
  66         LZMA Utils.
  67
  68
  69 0.1. Notices and Acknowledgements
  70
  71         This file format was designed by Lasse Collin
  72         <lasse.collin@tukaani.org> and Igor Pavlov.
  73
  74         Special thanks for helping with this document goes to
  75         Ville Koskinen. Thanks for helping with this document goes to
  76         Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
  77
  78         This document has been put into the public domain.
  79
  80
  81 0.2. Getting the Latest Version
  82
  83         The latest official version of this document can be downloaded
  84         from <http://tukaani.org/xz/xz-file-format.txt>.
  85
  86         Specific versions of this document have a filename
  87         xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
  88         For example, the version 1.0.0 of this document is available
  89         at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>.
  90
  91
  92 0.3. Version History
  93
  94         Version   Date          Description
  95
  96         1.1.0     2022-12-11    Added ARM64 filter and clarified 32-bit
  97                                 ARM endianness in Section 5.3.2,
  98                                 language improvements in Section 5.4
  99
 100         1.0.4     2009-08-27    Language improvements in Sections 1.2,
 101                                 2.1.1.2, 3.1.1, 3.1.2, and 5.3.1
 102
 103         1.0.3     2009-06-05    Spelling fixes in Sections 5.1 and 5.4
 104
 105         1.0.2     2009-06-04    Typo fixes in Sections 4 and 5.3.1
 106
 107         1.0.1     2009-06-01    Typo fix in Section 0.3 and minor
 108                                 clarifications to Sections 2, 2.2,
 109                                 3.3, 4.4, and 5.3.2
 110
 111         1.0.0     2009-01-14    The first official version
 112
 113
 114 1. Conventions
 115
 116         The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
 117         "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
 118         document are to be interpreted as described in [RFC-2119].
 119
 120         Indicating a warning means displaying a message, returning
 121         appropriate exit status, or doing something else to let the
 122         user know that something worth warning occurred. The operation
 123         SHOULD still finish if a warning is indicated.
 124
 125         Indicating an error means displaying a message, returning
 126         appropriate exit status, or doing something else to let the
 127         user know that something prevented successfully finishing the
 128         operation. The operation MUST be aborted once an error has
 129         been indicated.
 130
 131
 132 1.1. Byte and Its Representation
 133
 134         In this document, byte is always 8 bits.
 135
 136         A "null byte" has all bits unset. That is, the value of a null
 137         byte is 0x00.
 138
 139         To represent byte blocks, this document uses notation that
 140         is similar to the notation used in [RFC-1952]:
 141
 142             +-------+
 143             |  Foo  |   One byte.
 144             +-------+
 145
 146             +---+---+
 147             |  Foo  |   Two bytes; that is, some of the vertical bars
 148             +---+---+   can be missing.
 149
 150             +=======+
 151             |  Foo  |   Zero or more bytes.
 152             +=======+
 153
 154         In this document, a boxed byte or a byte sequence declared
 155         using this notation is called "a field". The example field
 156         above would be called "the Foo field" or plain "Foo".
 157
 158         If there are many fields, they may be split to multiple lines.
 159         This is indicated with an arrow ("--->"):
 160
 161             +=====+
 162             | Foo |
 163             +=====+
 164
 165                  +=====+
 166             ---> | Bar |
 167                  +=====+
 168
 169         The above is equivalent to this:
 170
 171             +=====+=====+
 172             | Foo | Bar |
 173             +=====+=====+
 174
 175
 176 1.2. Multibyte Integers
 177
 178         Multibyte integers of static length, such as CRC values,
 179         are stored in little endian byte order (least significant
 180         byte first).
 181
 182         When smaller values are more likely than bigger values (for
 183         example file sizes), multibyte integers are encoded in a
 184         variable-length representation:
 185           - Numbers in the range [0, 127] are copied as is, and take
 186             one byte of space.
 187           - Bigger numbers will occupy two or more bytes. All but the
 188             last byte of the multibyte representation have the highest
 189             (eighth) bit set.
 190
 191         For now, the value of the variable-length integers is limited
 192         to 63 bits, which limits the encoded size of the integer to
 193         nine bytes. These limits may be increased in the future if
 194         needed.
 195
 196         The following C code illustrates encoding and decoding of
 197         variable-length integers. The functions return the number of
 198         bytes occupied by the integer (1-9), or zero on error.
 199
 200             #include <stddef.h>
 201             #include <inttypes.h>
 202
 203             size_t
 204             encode(uint8_t buf[static 9], uint64_t num)
 205             {
 206                 if (num > UINT64_MAX / 2)
 207                     return 0;
 208
 209                 size_t i = 0;
 210
 211                 while (num >= 0x80) {
 212                     buf[i++] = (uint8_t)(num) | 0x80;
 213                     num >>= 7;
 214                 }
 215
 216                 buf[i++] = (uint8_t)(num);
 217
 218                 return i;
 219             }
 220
 221             size_t
 222             decode(const uint8_t buf[], size_t size_max, uint64_t *num)
 223             {
 224                 if (size_max == 0)
 225                     return 0;
 226
 227                 if (size_max > 9)
 228                     size_max = 9;
 229
 230                 *num = buf[0] & 0x7F;
 231                 size_t i = 0;
 232
 233                 while (buf[i++] & 0x80) {
 234                     if (i >= size_max || buf[i] == 0x00)
 235                         return 0;
 236
 237                     *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
 238                 }
 239
 240                 return i;
 241             }
 242
 243
 244 2. Overall Structure of .xz File
 245
 246         A standalone .xz files consist of one or more Streams which may
 247         have Stream Padding between or after them:
 248
 249             +========+================+========+================+
 250             | Stream | Stream Padding | Stream | Stream Padding | ...
 251             +========+================+========+================+
 252
 253         The sizes of Stream and Stream Padding are always multiples
 254         of four bytes, thus the size of every valid .xz file MUST be
 255         a multiple of four bytes.
 256
 257         While a typical file contains only one Stream and no Stream
 258         Padding, a decoder handling standalone .xz files SHOULD support
 259         files that have more than one Stream or Stream Padding.
 260
 261         In contrast to standalone .xz files, when the .xz file format
 262         is used as an internal part of some other file format or
 263         communication protocol, it usually is expected that the decoder
 264         stops after the first Stream, and doesn't look for Stream
 265         Padding or possibly other Streams.
 266
 267
 268 2.1. Stream
 269
 270         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 271         |     Stream Header     | Block | Block | ... | Block |
 272         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 273
 274              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 275         ---> | Index |     Stream Footer     |
 276              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 277
 278         All the above fields have a size that is a multiple of four. If
 279         Stream is used as an internal part of another file format, it
 280         is RECOMMENDED to make the Stream start at an offset that is
 281         a multiple of four bytes.
 282
 283         Stream Header, Index, and Stream Footer are always present in
 284         a Stream. The maximum size of the Index field is 16 GiB (2^34).
 285
 286         There are zero or more Blocks. The maximum number of Blocks is
 287         limited only by the maximum size of the Index field.
 288
 289         Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
 290         The same limit applies to the total amount of uncompressed
 291         data stored in a Stream.
 292
 293         If an implementation supports handling .xz files with multiple
 294         concatenated Streams, it MAY apply the above limits to the file
 295         as a whole instead of limiting per Stream basis.
 296
 297
 298 2.1.1. Stream Header
 299
 300         +---+---+---+---+---+---+-------+------+--+--+--+--+
 301         |  Header Magic Bytes   | Stream Flags |   CRC32   |
 302         +---+---+---+---+---+---+-------+------+--+--+--+--+
 303
 304
 305 2.1.1.1. Header Magic Bytes
 306
 307         The first six (6) bytes of the Stream are so called Header
 308         Magic Bytes. They can be used to identify the file type.
 309
 310             Using a C array and ASCII:
 311             const uint8_t HEADER_MAGIC[6]
 312                     = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
 313
 314             In plain hexadecimal:
 315             FD 37 7A 58 5A 00
 316
 317         Notes:
 318           - The first byte (0xFD) was chosen so that the files cannot
 319             be erroneously detected as being in .lzma format, in which
 320             the first byte is in the range [0x00, 0xE0].
 321           - The sixth byte (0x00) was chosen to prevent applications
 322             from misdetecting the file as a text file.
 323
 324         If the Header Magic Bytes don't match, the decoder MUST
 325         indicate an error.
 326
 327
 328 2.1.1.2. Stream Flags
 329
 330         The first byte of Stream Flags is always a null byte. In the
 331         future, this byte may be used to indicate a new Stream version
 332         or other Stream properties.
 333
 334         The second byte of Stream Flags is a bit field:
 335
 336             Bit(s)  Mask  Description
 337              0-3    0x0F  Type of Check (see Section 3.4):
 338                               ID    Size      Check name
 339                               0x00   0 bytes  None
 340                               0x01   4 bytes  CRC32
 341                               0x02   4 bytes  (Reserved)
 342                               0x03   4 bytes  (Reserved)
 343                               0x04   8 bytes  CRC64
 344                               0x05   8 bytes  (Reserved)
 345                               0x06   8 bytes  (Reserved)
 346                               0x07  16 bytes  (Reserved)
 347                               0x08  16 bytes  (Reserved)
 348                               0x09  16 bytes  (Reserved)
 349                               0x0A  32 bytes  SHA-256
 350                               0x0B  32 bytes  (Reserved)
 351                               0x0C  32 bytes  (Reserved)
 352                               0x0D  64 bytes  (Reserved)
 353                               0x0E  64 bytes  (Reserved)
 354                               0x0F  64 bytes  (Reserved)
 355              4-7    0xF0  Reserved for future use; MUST be zero for now.
 356
 357         Implementations SHOULD support at least the Check IDs 0x00
 358         (None) and 0x01 (CRC32). Supporting other Check IDs is
 359         OPTIONAL. If an unsupported Check is used, the decoder SHOULD
 360         indicate a warning or error.
 361
 362         If any reserved bit is set, the decoder MUST indicate an error.
 363         It is possible that there is a new field present which the
 364         decoder is not aware of, and can thus parse the Stream Header
 365         incorrectly.
 366
 367
 368 2.1.1.3. CRC32
 369
 370         The CRC32 is calculated from the Stream Flags field. It is
 371         stored as an unsigned 32-bit little endian integer. If the
 372         calculated value does not match the stored one, the decoder
 373         MUST indicate an error.
 374
 375         The idea is that Stream Flags would always be two bytes, even
 376         if new features are needed. This way old decoders will be able
 377         to verify the CRC32 calculated from Stream Flags, and thus
 378         distinguish between corrupt files (CRC32 doesn't match) and
 379         files that the decoder doesn't support (CRC32 matches but
 380         Stream Flags has reserved bits set).
 381
 382
 383 2.1.2. Stream Footer
 384
 385         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 386         | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
 387         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 388
 389
 390 2.1.2.1. CRC32
 391
 392         The CRC32 is calculated from the Backward Size and Stream Flags
 393         fields. It is stored as an unsigned 32-bit little endian
 394         integer. If the calculated value does not match the stored one,
 395         the decoder MUST indicate an error.
 396
 397         The reason to have the CRC32 field before the Backward Size and
 398         Stream Flags fields is to keep the four-byte fields aligned to
 399         a multiple of four bytes.
 400
 401
 402 2.1.2.2. Backward Size
 403
 404         Backward Size is stored as a 32-bit little endian integer,
 405         which indicates the size of the Index field as multiple of
 406         four bytes, minimum value being four bytes:
 407
 408             real_backward_size = (stored_backward_size + 1) * 4;
 409
 410         If the stored value does not match the real size of the Index
 411         field, the decoder MUST indicate an error.
 412
 413         Using a fixed-size integer to store Backward Size makes
 414         it slightly simpler to parse the Stream Footer when the
 415         application needs to parse the Stream backwards.
 416
 417
 418 2.1.2.3. Stream Flags
 419
 420         This is a copy of the Stream Flags field from the Stream
 421         Header. The information stored to Stream Flags is needed
 422         when parsing the Stream backwards. The decoder MUST compare
 423         the Stream Flags fields in both Stream Header and Stream
 424         Footer, and indicate an error if they are not identical.
 425
 426
 427 2.1.2.4. Footer Magic Bytes
 428
 429         As the last step of the decoding process, the decoder MUST
 430         verify the existence of Footer Magic Bytes. If they don't
 431         match, an error MUST be indicated.
 432
 433             Using a C array and ASCII:
 434             const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
 435
 436             In hexadecimal:
 437             59 5A
 438
 439         The primary reason to have Footer Magic Bytes is to make
 440         it easier to detect incomplete files quickly, without
 441         uncompressing. If the file does not end with Footer Magic Bytes
 442         (excluding Stream Padding described in Section 2.2), it cannot
 443         be undamaged, unless someone has intentionally appended garbage
 444         after the end of the Stream.
 445
 446
 447 2.2. Stream Padding
 448
 449         Only the decoders that support decoding of concatenated Streams
 450         MUST support Stream Padding.
 451
 452         Stream Padding MUST contain only null bytes. To preserve the
 453         four-byte alignment of consecutive Streams, the size of Stream
 454         Padding MUST be a multiple of four bytes. Empty Stream Padding
 455         is allowed. If these requirements are not met, the decoder MUST
 456         indicate an error.
 457
 458         Note that non-empty Stream Padding is allowed at the end of the
 459         file; there doesn't need to be a new Stream after non-empty
 460         Stream Padding. This can be convenient in certain situations
 461         [GNU-tar].
 462
 463         The possibility of Stream Padding MUST be taken into account
 464         when designing an application that parses Streams backwards,
 465         and the application supports concatenated Streams.
 466
 467
 468 3. Block
 469
 470         +==============+=================+===============+=======+
 471         | Block Header | Compressed Data | Block Padding | Check |
 472         +==============+=================+===============+=======+
 473
 474
 475 3.1. Block Header
 476
 477         +-------------------+-------------+=================+
 478         | Block Header Size | Block Flags | Compressed Size |
 479         +-------------------+-------------+=================+
 480
 481              +===================+======================+
 482         ---> | Uncompressed Size | List of Filter Flags |
 483              +===================+======================+
 484
 485              +================+--+--+--+--+
 486         ---> | Header Padding |   CRC32   |
 487              +================+--+--+--+--+
 488
 489
 490 3.1.1. Block Header Size
 491
 492         This field overlaps with the Index Indicator field (see
 493         Section 4.1).
 494
 495         This field contains the size of the Block Header field,
 496         including the Block Header Size field itself. Valid values are
 497         in the range [0x01, 0xFF], which indicate the size of the Block
 498         Header as multiples of four bytes, minimum size being eight
 499         bytes:
 500
 501             real_header_size = (encoded_header_size + 1) * 4;
 502
 503         If a Block Header bigger than 1024 bytes is needed in the
 504         future, a new field can be added between the Block Header and
 505         Compressed Data fields. The presence of this new field would
 506         be indicated in the Block Header field.
 507
 508
 509 3.1.2. Block Flags
 510
 511         The Block Flags field is a bit field:
 512
 513             Bit(s)  Mask  Description
 514              0-1    0x03  Number of filters (1-4)
 515              2-5    0x3C  Reserved for future use; MUST be zero for now.
 516               6     0x40  The Compressed Size field is present.
 517               7     0x80  The Uncompressed Size field is present.
 518
 519         If any reserved bit is set, the decoder MUST indicate an error.
 520         It is possible that there is a new field present which the
 521         decoder is not aware of, and can thus parse the Block Header
 522         incorrectly.
 523
 524
 525 3.1.3. Compressed Size
 526
 527         This field is present only if the appropriate bit is set in
 528         the Block Flags field (see Section 3.1.2).
 529
 530         The Compressed Size field contains the size of the Compressed
 531         Data field, which MUST be non-zero. Compressed Size is stored
 532         using the encoding described in Section 1.2. If the Compressed
 533         Size doesn't match the size of the Compressed Data field, the
 534         decoder MUST indicate an error.
 535
 536
 537 3.1.4. Uncompressed Size
 538
 539         This field is present only if the appropriate bit is set in
 540         the Block Flags field (see Section 3.1.2).
 541
 542         The Uncompressed Size field contains the size of the Block
 543         after uncompressing. Uncompressed Size is stored using the
 544         encoding described in Section 1.2. If the Uncompressed Size
 545         does not match the real uncompressed size, the decoder MUST
 546         indicate an error.
 547
 548         Storing the Compressed Size and Uncompressed Size fields serves
 549         several purposes:
 550           - The decoder knows how much memory it needs to allocate
 551             for a temporary buffer in multithreaded mode.
 552           - Simple error detection: wrong size indicates a broken file.
 553           - Seeking forwards to a specific location in streamed mode.
 554
 555         It should be noted that the only reliable way to determine
 556         the real uncompressed size is to uncompress the Block,
 557         because the Block Header and Index fields may contain
 558         (intentionally or unintentionally) invalid information.
 559
 560
 561 3.1.5. List of Filter Flags
 562
 563         +================+================+     +================+
 564         | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
 565         +================+================+     +================+
 566
 567         The number of Filter Flags fields is stored in the Block Flags
 568         field (see Section 3.1.2).
 569
 570         The format of each Filter Flags field is as follows:
 571
 572             +===========+====================+===================+
 573             | Filter ID | Size of Properties | Filter Properties |
 574             +===========+====================+===================+
 575
 576         Both Filter ID and Size of Properties are stored using the
 577         encoding described in Section 1.2. Size of Properties indicates
 578         the size of the Filter Properties field as bytes. The list of
 579         officially defined Filter IDs and the formats of their Filter
 580         Properties are described in Section 5.3.
 581
 582         Filter IDs greater than or equal to 0x4000_0000_0000_0000
 583         (2^62) are reserved for implementation-specific internal use.
 584         These Filter IDs MUST never be used in List of Filter Flags.
 585
 586
 587 3.1.6. Header Padding
 588
 589         This field contains as many null byte as it is needed to make
 590         the Block Header have the size specified in Block Header Size.
 591         If any of the bytes are not null bytes, the decoder MUST
 592         indicate an error. It is possible that there is a new field
 593         present which the decoder is not aware of, and can thus parse
 594         the Block Header incorrectly.
 595
 596
 597 3.1.7. CRC32
 598
 599         The CRC32 is calculated over everything in the Block Header
 600         field except the CRC32 field itself. It is stored as an
 601         unsigned 32-bit little endian integer. If the calculated
 602         value does not match the stored one, the decoder MUST indicate
 603         an error.
 604
 605         By verifying the CRC32 of the Block Header before parsing the
 606         actual contents allows the decoder to distinguish between
 607         corrupt and unsupported files.
 608
 609
 610 3.2. Compressed Data
 611
 612         The format of Compressed Data depends on Block Flags and List
 613         of Filter Flags. Excluding the descriptions of the simplest
 614         filters in Section 5.3, the format of the filter-specific
 615         encoded data is out of scope of this document.
 616
 617
 618 3.3. Block Padding
 619
 620         Block Padding MUST contain 0-3 null bytes to make the size of
 621         the Block a multiple of four bytes. This can be needed when
 622         the size of Compressed Data is not a multiple of four. If any
 623         of the bytes in Block Padding are not null bytes, the decoder
 624         MUST indicate an error.
 625
 626
 627 3.4. Check
 628
 629         The type and size of the Check field depends on which bits
 630         are set in the Stream Flags field (see Section 2.1.1.2).
 631
 632         The Check, when used, is calculated from the original
 633         uncompressed data. If the calculated Check does not match the
 634         stored one, the decoder MUST indicate an error. If the selected
 635         type of Check is not supported by the decoder, it SHOULD
 636         indicate a warning or error.
 637
 638
 639 4. Index
 640
 641         +-----------------+===================+
 642         | Index Indicator | Number of Records |
 643         +-----------------+===================+
 644
 645              +=================+===============+-+-+-+-+
 646         ---> | List of Records | Index Padding | CRC32 |
 647              +=================+===============+-+-+-+-+
 648
 649         Index serves several purposes. Using it, one can
 650           - verify that all Blocks in a Stream have been processed;
 651           - find out the uncompressed size of a Stream; and
 652           - quickly access the beginning of any Block (random access).
 653
 654
 655 4.1. Index Indicator
 656
 657         This field overlaps with the Block Header Size field (see
 658         Section 3.1.1). The value of Index Indicator is always 0x00.
 659
 660
 661 4.2. Number of Records
 662
 663         This field indicates how many Records there are in the List
 664         of Records field, and thus how many Blocks there are in the
 665         Stream. The value is stored using the encoding described in
 666         Section 1.2. If the decoder has decoded all the Blocks of the
 667         Stream, and then notices that the Number of Records doesn't
 668         match the real number of Blocks, the decoder MUST indicate an
 669         error.
 670
 671
 672 4.3. List of Records
 673
 674         List of Records consists of as many Records as indicated by the
 675         Number of Records field:
 676
 677             +========+========+
 678             | Record | Record | ...
 679             +========+========+
 680
 681         Each Record contains information about one Block:
 682
 683             +===============+===================+
 684             | Unpadded Size | Uncompressed Size |
 685             +===============+===================+
 686
 687         If the decoder has decoded all the Blocks of the Stream, it
 688         MUST verify that the contents of the Records match the real
 689         Unpadded Size and Uncompressed Size of the respective Blocks.
 690
 691         Implementation hint: It is possible to verify the Index with
 692         constant memory usage by calculating for example SHA-256 of
 693         both the real size values and the List of Records, then
 694         comparing the hash values. Implementing this using
 695         non-cryptographic hash like CRC32 SHOULD be avoided unless
 696         small code size is important.
 697
 698         If the decoder supports random-access reading, it MUST verify
 699         that Unpadded Size and Uncompressed Size of every completely
 700         decoded Block match the sizes stored in the Index. If only
 701         partial Block is decoded, the decoder MUST verify that the
 702         processed sizes don't exceed the sizes stored in the Index.
 703
 704
 705 4.3.1. Unpadded Size
 706
 707         This field indicates the size of the Block excluding the Block
 708         Padding field. That is, Unpadded Size is the size of the Block
 709         Header, Compressed Data, and Check fields. Unpadded Size is
 710         stored using the encoding described in Section 1.2. The value
 711         MUST never be zero; with the current structure of Blocks, the
 712         actual minimum value for Unpadded Size is five.
 713
 714         Implementation note: Because the size of the Block Padding
 715         field is not included in Unpadded Size, calculating the total
 716         size of a Stream or doing random-access reading requires
 717         calculating the actual size of the Blocks by rounding Unpadded
 718         Sizes up to the next multiple of four.
 719
 720         The reason to exclude Block Padding from Unpadded Size is to
 721         ease making a raw copy of Compressed Data without Block
 722         Padding. This can be useful, for example, if someone wants
 723         to convert Streams to some other file format quickly.
 724
 725
 726 4.3.2. Uncompressed Size
 727
 728         This field indicates the Uncompressed Size of the respective
 729         Block as bytes. The value is stored using the encoding
 730         described in Section 1.2.
 731
 732
 733 4.4. Index Padding
 734
 735         This field MUST contain 0-3 null bytes to pad the Index to
 736         a multiple of four bytes. If any of the bytes are not null
 737         bytes, the decoder MUST indicate an error.
 738
 739
 740 4.5. CRC32
 741
 742         The CRC32 is calculated over everything in the Index field
 743         except the CRC32 field itself. The CRC32 is stored as an
 744         unsigned 32-bit little endian integer. If the calculated
 745         value does not match the stored one, the decoder MUST indicate
 746         an error.
 747
 748
 749 5. Filter Chains
 750
 751         The Block Flags field defines how many filters are used. When
 752         more than one filter is used, the filters are chained; that is,
 753         the output of one filter is the input of another filter. The
 754         following figure illustrates the direction of data flow.
 755
 756                     v   Uncompressed Data   ^
 757                     |       Filter 0        |
 758             Encoder |       Filter 1        | Decoder
 759                     |       Filter n        |
 760                     v    Compressed Data    ^
 761
 762
 763 5.1. Alignment
 764
 765         Alignment of uncompressed input data is usually the job of
 766         the application producing the data. For example, to get the
 767         best results, an archiver tool should make sure that all
 768         PowerPC executable files in the archive stream start at
 769         offsets that are multiples of four bytes.
 770
 771         Some filters, for example LZMA2, can be configured to take
 772         advantage of specified alignment of input data. Note that
 773         taking advantage of aligned input can be beneficial also when
 774         a filter is not the first filter in the chain. For example,
 775         if you compress PowerPC executables, you may want to use the
 776         PowerPC filter and chain that with the LZMA2 filter. Because
 777         not only the input but also the output alignment of the PowerPC
 778         filter is four bytes, it is now beneficial to set LZMA2
 779         settings so that the LZMA2 encoder can take advantage of its
 780         four-byte-aligned input data.
 781
 782         The output of the last filter in the chain is stored to the
 783         Compressed Data field, which is is guaranteed to be aligned
 784         to a multiple of four bytes relative to the beginning of the
 785         Stream. This can increase
 786           - speed, if the filtered data is handled multiple bytes at
 787             a time by the filter-specific encoder and decoder,
 788             because accessing aligned data in computer memory is
 789             usually faster; and
 790           - compression ratio, if the output data is later compressed
 791             with an external compression tool.
 792
 793
 794 5.2. Security
 795
 796         If filters would be allowed to be chained freely, it would be
 797         possible to create malicious files, that would be very slow to
 798         decode. Such files could be used to create denial of service
 799         attacks.
 800
 801         Slow files could occur when multiple filters are chained:
 802
 803             v   Compressed input data
 804             |   Filter 1 decoder (last filter)
 805             |   Filter 0 decoder (non-last filter)
 806             v   Uncompressed output data
 807
 808         The decoder of the last filter in the chain produces a lot of
 809         output from little input. Another filter in the chain takes the
 810         output of the last filter, and produces very little output
 811         while consuming a lot of input. As a result, a lot of data is
 812         moved inside the filter chain, but the filter chain as a whole
 813         gets very little work done.
 814
 815         To prevent this kind of slow files, there are restrictions on
 816         how the filters can be chained. These restrictions MUST be
 817         taken into account when designing new filters.
 818
 819         The maximum number of filters in the chain has been limited to
 820         four, thus there can be at maximum of three non-last filters.
 821         Of these three non-last filters, only two are allowed to change
 822         the size of the data.
 823
 824         The non-last filters, that change the size of the data, MUST
 825         have a limit how much the decoder can compress the data: the
 826         decoder SHOULD produce at least n bytes of output when the
 827         filter is given 2n bytes of input. This  limit is not
 828         absolute, but significant deviations MUST be avoided.
 829
 830         The above limitations guarantee that if the last filter in the
 831         chain produces 4n bytes of output, the chain as a whole will
 832         produce at least n bytes of output.
 833
 834
 835 5.3. Filters
 836
 837 5.3.1. LZMA2
 838
 839         LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
 840         compression algorithm with high compression ratio and fast
 841         decompression. LZMA is based on LZ77 and range coding
 842         algorithms.
 843
 844         LZMA2 is an extension on top of the original LZMA. LZMA2 uses
 845         LZMA internally, but adds support for flushing the encoder,
 846         uncompressed chunks, eases stateful decoder implementations,
 847         and improves support for multithreading. Thus, the plain LZMA
 848         will not be supported in this file format.
 849
 850             Filter ID:                  0x21
 851             Size of Filter Properties:  1 byte
 852             Changes size of data:       Yes
 853             Allow as a non-last filter: No
 854             Allow as the last filter:   Yes
 855
 856             Preferred alignment:
 857                 Input data:             Adjustable to 1/2/4/8/16 byte(s)
 858                 Output data:            1 byte
 859
 860         The format of the one-byte Filter Properties field is as
 861         follows:
 862
 863             Bits   Mask   Description
 864             0-5    0x3F   Dictionary Size
 865             6-7    0xC0   Reserved for future use; MUST be zero for now.
 866
 867         Dictionary Size is encoded with one-bit mantissa and five-bit
 868         exponent. The smallest dictionary size is 4 KiB and the biggest
 869         is 4 GiB.
 870
 871             Raw value   Mantissa   Exponent   Dictionary size
 872                 0           2         11         4 KiB
 873                 1           3         11         6 KiB
 874                 2           2         12         8 KiB
 875                 3           3         12        12 KiB
 876                 4           2         13        16 KiB
 877                 5           3         13        24 KiB
 878                 6           2         14        32 KiB
 879               ...         ...        ...      ...
 880                35           3         27       768 MiB
 881                36           2         28      1024 MiB
 882                37           3         29      1536 MiB
 883                38           2         30      2048 MiB
 884                39           3         30      3072 MiB
 885                40           2         31      4096 MiB - 1 B
 886
 887         Instead of having a table in the decoder, the dictionary size
 888         can be decoded using the following C code:
 889
 890             const uint8_t bits = get_dictionary_flags() & 0x3F;
 891             if (bits > 40)
 892                 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
 893
 894             uint32_t dictionary_size;
 895             if (bits == 40) {
 896                 dictionary_size = UINT32_MAX;
 897             } else {
 898                 dictionary_size = 2 | (bits & 1);
 899                 dictionary_size <<= bits / 2 + 11;
 900             }
 901
 902
 903 5.3.2. Branch/Call/Jump Filters for Executables
 904
 905         These filters convert relative branch, call, and jump
 906         instructions to their absolute counterparts in executable
 907         files. This conversion increases redundancy and thus
 908         compression ratio.
 909
 910             Size of Filter Properties:  0 or 4 bytes
 911             Changes size of data:       No
 912             Allow as a non-last filter: Yes
 913             Allow as the last filter:   No
 914
 915         Below is the list of filters in this category. The alignment
 916         is the same for both input and output data.
 917
 918             Filter ID   Alignment   Description
 919               0x04       1 byte     x86 filter (BCJ)
 920               0x05       4 bytes    PowerPC (big endian) filter
 921               0x06      16 bytes    IA64 filter
 922               0x07       4 bytes    ARM filter [1]
 923               0x08       2 bytes    ARM Thumb filter [1]
 924               0x09       4 bytes    SPARC filter
 925               0x0A       4 bytes    ARM64 filter [2]
 926
 927               [1] These are for little endian instruction encoding.
 928                   This must not be confused with data endianness.
 929                   A processor configured for big endian data access
 930                   may still use little endian instruction encoding.
 931                   The filters don't care about the data endianness.
 932
 933               [2] 4096-byte alignment gives the best results
 934                   because the address in the ADRP instruction
 935                   is a multiple of 4096 bytes.
 936
 937         If the size of Filter Properties is four bytes, the Filter
 938         Properties field contains the start offset used for address
 939         conversions. It is stored as an unsigned 32-bit little endian
 940         integer. The start offset MUST be a multiple of the alignment
 941         of the filter as listed in the table above; if it isn't, the
 942         decoder MUST indicate an error. If the size of Filter
 943         Properties is zero, the start offset is zero.
 944
 945         Setting the start offset may be useful if an executable has
 946         multiple sections, and there are many cross-section calls.
 947         Taking advantage of this feature usually requires usage of
 948         the Subblock filter, whose design is not complete yet.
 949
 950
 951 5.3.3. Delta
 952
 953         The Delta filter may increase compression ratio when the value
 954         of the next byte correlates with the value of an earlier byte
 955         at specified distance.
 956
 957             Filter ID:                  0x03
 958             Size of Filter Properties:  1 byte
 959             Changes size of data:       No
 960             Allow as a non-last filter: Yes
 961             Allow as the last filter:   No
 962
 963             Preferred alignment:
 964                 Input data:             1 byte
 965                 Output data:            Same as the original input data
 966
 967         The Properties byte indicates the delta distance, which can be
 968         1-256 bytes backwards from the current byte: 0x00 indicates
 969         distance of 1 byte and 0xFF distance of 256 bytes.
 970
 971
 972 5.3.3.1. Format of the Encoded Output
 973
 974         The code below illustrates both encoding and decoding with
 975         the Delta filter.
 976
 977             // Distance is in the range [1, 256].
 978             const unsigned int distance = get_properties_byte() + 1;
 979             uint8_t pos = 0;
 980             uint8_t delta[256];
 981
 982             memset(delta, 0, sizeof(delta));
 983
 984             while (1) {
 985                 const int byte = read_byte();
 986                 if (byte == EOF)
 987                     break;
 988
 989                 uint8_t tmp = delta[(uint8_t)(distance + pos)];
 990                 if (is_encoder) {
 991                     tmp = (uint8_t)(byte) - tmp;
 992                     delta[pos] = (uint8_t)(byte);
 993                 } else {
 994                     tmp = (uint8_t)(byte) + tmp;
 995                     delta[pos] = tmp;
 996                 }
 997
 998                 write_byte(tmp);
 999                 --pos;
1000             }
1001
1002
1003 5.4. Custom Filter IDs
1004
1005         If a developer wants to use custom Filter IDs, there are two
1006         choices. The first choice is to contact Lasse Collin and ask
1007         him to allocate a range of IDs for the developer.
1008
1009         The second choice is to generate a 40-bit random integer
1010         which the developer can use as a personal Developer ID.
1011         To minimize the risk of collisions, Developer ID has to be
1012         a randomly generated integer, not manually selected "hex word".
1013         The following command, which works on many free operating
1014         systems, can be used to generate Developer ID:
1015
1016             dd if=/dev/urandom bs=5 count=1 | hexdump
1017
1018         The developer can then use the Developer ID to create unique
1019         (well, hopefully unique) Filter IDs.
1020
1021             Bits    Mask                    Description
1022              0-15   0x0000_0000_0000_FFFF   Filter ID
1023             16-55   0x00FF_FFFF_FFFF_0000   Developer ID
1024             56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F
1025
1026         The resulting 63-bit integer will use 9 bytes of space when
1027         stored using the encoding described in Section 1.2. To get
1028         a shorter ID, see the beginning of this Section how to
1029         request a custom ID range.
1030
1031
1032 5.4.1. Reserved Custom Filter ID Ranges
1033
1034         Range                       Description
1035         0x0000_0300 - 0x0000_04FF   Reserved to ease .7z compatibility
1036         0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
1037         0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
1038
1039
1040 6. Cyclic Redundancy Checks
1041
1042         There are several incompatible variations to calculate CRC32
1043         and CRC64. For simplicity and clarity, complete examples are
1044         provided to calculate the checks as they are used in this file
1045         format. Implementations MAY use different code as long as it
1046         gives identical results.
1047
1048         The program below reads data from standard input, calculates
1049         the CRC32 and CRC64 values, and prints the calculated values
1050         as big endian hexadecimal strings to standard output.
1051
1052             #include <stddef.h>
1053             #include <inttypes.h>
1054             #include <stdio.h>
1055
1056             uint32_t crc32_table[256];
1057             uint64_t crc64_table[256];
1058
1059             void
1060             init(void)
1061             {
1062                 static const uint32_t poly32 = UINT32_C(0xEDB88320);
1063                 static const uint64_t poly64
1064                         = UINT64_C(0xC96C5795D7870F42);
1065
1066                 for (size_t i = 0; i < 256; ++i) {
1067                     uint32_t crc32 = i;
1068                     uint64_t crc64 = i;
1069
1070                     for (size_t j = 0; j < 8; ++j) {
1071                         if (crc32 & 1)
1072                             crc32 = (crc32 >> 1) ^ poly32;
1073                         else
1074                             crc32 >>= 1;
1075
1076                         if (crc64 & 1)
1077                             crc64 = (crc64 >> 1) ^ poly64;
1078                         else
1079                             crc64 >>= 1;
1080                     }
1081
1082                     crc32_table[i] = crc32;
1083                     crc64_table[i] = crc64;
1084                 }
1085             }
1086
1087             uint32_t
1088             crc32(const uint8_t *buf, size_t size, uint32_t crc)
1089             {
1090                 crc = ~crc;
1091                 for (size_t i = 0; i < size; ++i)
1092                     crc = crc32_table[buf[i] ^ (crc & 0xFF)]
1093                             ^ (crc >> 8);
1094                 return ~crc;
1095             }
1096
1097             uint64_t
1098             crc64(const uint8_t *buf, size_t size, uint64_t crc)
1099             {
1100                 crc = ~crc;
1101                 for (size_t i = 0; i < size; ++i)
1102                     crc = crc64_table[buf[i] ^ (crc & 0xFF)]
1103                             ^ (crc >> 8);
1104                 return ~crc;
1105             }
1106
1107             int
1108             main()
1109             {
1110                 init();
1111
1112                 uint32_t value32 = 0;
1113                 uint64_t value64 = 0;
1114                 uint64_t total_size = 0;
1115                 uint8_t buf[8192];
1116
1117                 while (1) {
1118                     const size_t buf_size
1119                             = fread(buf, 1, sizeof(buf), stdin);
1120                     if (buf_size == 0)
1121                         break;
1122
1123                     total_size += buf_size;
1124                     value32 = crc32(buf, buf_size, value32);
1125                     value64 = crc64(buf, buf_size, value64);
1126                 }
1127
1128                 printf("Bytes:  %" PRIu64 "\n", total_size);
1129                 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
1130                 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
1131
1132                 return 0;
1133             }
1134
1135
1136 7. References
1137
1138         LZMA SDK - The original LZMA implementation
1139         http://7-zip.org/sdk.html
1140
1141         LZMA Utils - LZMA adapted to POSIX-like systems
1142         http://tukaani.org/lzma/
1143
1144         XZ Utils - The next generation of LZMA Utils
1145         http://tukaani.org/xz/
1146
1147         [RFC-1952]
1148         GZIP file format specification version 4.3
1149         http://www.ietf.org/rfc/rfc1952.txt
1150           - Notation of byte boxes in section "2.1. Overall conventions"
1151
1152         [RFC-2119]
1153         Key words for use in RFCs to Indicate Requirement Levels
1154         http://www.ietf.org/rfc/rfc2119.txt
1155
1156         [GNU-tar]
1157         GNU tar 1.21 manual
1158         http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
1159           - Node 9.4.2 "Blocking Factor", paragraph that begins
1160             "gzip will complain about trailing garbage"
1161           - Note that this URL points to the latest version of the
1162             manual, and may some day not contain the note which is in
1163             1.21. For the exact version of the manual, download GNU
1164             tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz
1165