   1
   2 The .xz File Format
   3 ===================
   4
   5 Version 1.2.1 (2024-04-08)
   6
   7
   8         0. Preface
   9            0.1. Notices and Acknowledgements
  10            0.2. Getting the Latest Version
  11            0.3. Version History
  12         1. Conventions
  13            1.1. Byte and Its Representation
  14            1.2. Multibyte Integers
  15         2. Overall Structure of .xz File
  16            2.1. Stream
  17                 2.1.1. Stream Header
  18                        2.1.1.1. Header Magic Bytes
  19                        2.1.1.2. Stream Flags
  20                        2.1.1.3. CRC32
  21                 2.1.2. Stream Footer
  22                        2.1.2.1. CRC32
  23                        2.1.2.2. Backward Size
  24                        2.1.2.3. Stream Flags
  25                        2.1.2.4. Footer Magic Bytes
  26            2.2. Stream Padding
  27         3. Block
  28            3.1. Block Header
  29                 3.1.1. Block Header Size
  30                 3.1.2. Block Flags
  31                 3.1.3. Compressed Size
  32                 3.1.4. Uncompressed Size
  33                 3.1.5. List of Filter Flags
  34                 3.1.6. Header Padding
  35                 3.1.7. CRC32
  36            3.2. Compressed Data
  37            3.3. Block Padding
  38            3.4. Check
  39         4. Index
  40            4.1. Index Indicator
  41            4.2. Number of Records
  42            4.3. List of Records
  43                 4.3.1. Unpadded Size
  44                 4.3.2. Uncompressed Size
  45            4.4. Index Padding
  46            4.5. CRC32
  47         5. Filter Chains
  48            5.1. Alignment
  49            5.2. Security
  50            5.3. Filters
  51                 5.3.1. LZMA2
  52                 5.3.2. Branch/Call/Jump Filters for Executables
  53                 5.3.3. Delta
  54                        5.3.3.1. Format of the Encoded Output
  55            5.4. Custom Filter IDs
  56                 5.4.1. Reserved Custom Filter ID Ranges
  57         6. Cyclic Redundancy Checks
  58         7. References
  59
  60
  61 0. Preface
  62
  63         This document describes the .xz file format (filename suffix
  64         ".xz", MIME type "application/x-xz"). It is intended that this
   65         format replace the old .lzma format used by LZMA SDK and
  66         LZMA Utils.
  67
  68
  69 0.1. Notices and Acknowledgements
  70
  71         This file format was designed by Lasse Collin
  72         <lasse.collin@tukaani.org> and Igor Pavlov.
  73
   74         Special thanks for helping with this document go to
   75         Ville Koskinen. Thanks for helping with this document go to
  76         Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
  77
  78         This document has been put into the public domain.
  79
  80
  81 0.2. Getting the Latest Version
  82
  83         The latest official version of this document can be downloaded
  84         from <https://tukaani.org/xz/xz-file-format.txt>.
  85
  86         Specific versions of this document have a filename
  87         xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
  88         For example, the version 1.0.0 of this document is available
  89         at <https://tukaani.org/xz/xz-file-format-1.0.0.txt>.
  90
  91
  92 0.3. Version History
  93
  94         Version   Date          Description
  95
  96         1.2.1     2024-04-08    The URLs of this specification and
  97                                 XZ Utils were changed back to the
  98                                 original ones in Sections 0.2 and 7.
  99
 100         1.2.0     2024-01-19    Added RISC-V filter and updated URLs in
 101                                 Sections 0.2 and 7. The URL of this
 102                                 specification was changed.
 103
 104         1.1.0     2022-12-11    Added ARM64 filter and clarified 32-bit
 105                                 ARM endianness in Section 5.3.2,
 106                                 language improvements in Section 5.4
 107
 108         1.0.4     2009-08-27    Language improvements in Sections 1.2,
 109                                 2.1.1.2, 3.1.1, 3.1.2, and 5.3.1
 110
 111         1.0.3     2009-06-05    Spelling fixes in Sections 5.1 and 5.4
 112
 113         1.0.2     2009-06-04    Typo fixes in Sections 4 and 5.3.1
 114
 115         1.0.1     2009-06-01    Typo fix in Section 0.3 and minor
 116                                 clarifications to Sections 2, 2.2,
 117                                 3.3, 4.4, and 5.3.2
 118
 119         1.0.0     2009-01-14    The first official version
 120
 121
 122 1. Conventions
 123
 124         The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
 125         "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
 126         document are to be interpreted as described in [RFC-2119].
 127
  128         Indicating a warning means displaying a message, returning an
  129         appropriate exit status, or doing something else to let the
  130         user know that something worth warning about occurred. The
  131         operation SHOULD still finish if a warning is indicated.
  132
  133         Indicating an error means displaying a message, returning an
  134         appropriate exit status, or doing something else to let the
 135         user know that something prevented successfully finishing the
 136         operation. The operation MUST be aborted once an error has
 137         been indicated.
 138
 139
 140 1.1. Byte and Its Representation
 141
  142         In this document, a byte is always 8 bits.
 143
 144         A "null byte" has all bits unset. That is, the value of a null
 145         byte is 0x00.
 146
 147         To represent byte blocks, this document uses notation that
 148         is similar to the notation used in [RFC-1952]:
 149
 150             +-------+
 151             |  Foo  |   One byte.
 152             +-------+
 153
 154             +---+---+
 155             |  Foo  |   Two bytes; that is, some of the vertical bars
 156             +---+---+   can be missing.
 157
 158             +=======+
 159             |  Foo  |   Zero or more bytes.
 160             +=======+
 161
 162         In this document, a boxed byte or a byte sequence declared
 163         using this notation is called "a field". The example field
 164         above would be called "the Foo field" or plain "Foo".
 165
  166         If there are many fields, they may be split across multiple
  167         lines. This is indicated with an arrow ("--->"):
 168
 169             +=====+
 170             | Foo |
 171             +=====+
 172
 173                  +=====+
 174             ---> | Bar |
 175                  +=====+
 176
 177         The above is equivalent to this:
 178
 179             +=====+=====+
 180             | Foo | Bar |
 181             +=====+=====+
 182
 183
 184 1.2. Multibyte Integers
 185
 186         Multibyte integers of static length, such as CRC values,
 187         are stored in little endian byte order (least significant
 188         byte first).
 189
 190         When smaller values are more likely than bigger values (for
 191         example file sizes), multibyte integers are encoded in a
 192         variable-length representation:
 193           - Numbers in the range [0, 127] are copied as is, and take
 194             one byte of space.
 195           - Bigger numbers will occupy two or more bytes. All but the
 196             last byte of the multibyte representation have the highest
 197             (eighth) bit set.
 198
 199         For now, the value of the variable-length integers is limited
 200         to 63 bits, which limits the encoded size of the integer to
 201         nine bytes. These limits may be increased in the future if
 202         needed.
 203
 204         The following C code illustrates encoding and decoding of
 205         variable-length integers. The functions return the number of
 206         bytes occupied by the integer (1-9), or zero on error.
 207
 208             #include <stddef.h>
 209             #include <inttypes.h>
 210
 211             size_t
 212             encode(uint8_t buf[static 9], uint64_t num)
 213             {
 214                 if (num > UINT64_MAX / 2)
 215                     return 0;
 216
 217                 size_t i = 0;
 218
 219                 while (num >= 0x80) {
 220                     buf[i++] = (uint8_t)(num) | 0x80;
 221                     num >>= 7;
 222                 }
 223
 224                 buf[i++] = (uint8_t)(num);
 225
 226                 return i;
 227             }
 228
 229             size_t
 230             decode(const uint8_t buf[], size_t size_max, uint64_t *num)
 231             {
 232                 if (size_max == 0)
 233                     return 0;
 234
 235                 if (size_max > 9)
 236                     size_max = 9;
 237
 238                 *num = buf[0] & 0x7F;
 239                 size_t i = 0;
 240
 241                 while (buf[i++] & 0x80) {
 242                     if (i >= size_max || buf[i] == 0x00)
 243                         return 0;
 244
 245                     *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
 246                 }
 247
 248                 return i;
 249             }
 250
 251
 252 2. Overall Structure of .xz File
 253
  254         A standalone .xz file consists of one or more Streams, which may
  255         have Stream Padding between or after them:
 256
 257             +========+================+========+================+
 258             | Stream | Stream Padding | Stream | Stream Padding | ...
 259             +========+================+========+================+
 260
 261         The sizes of Stream and Stream Padding are always multiples
 262         of four bytes, thus the size of every valid .xz file MUST be
 263         a multiple of four bytes.
 264
 265         While a typical file contains only one Stream and no Stream
 266         Padding, a decoder handling standalone .xz files SHOULD support
 267         files that have more than one Stream or Stream Padding.
 268
 269         In contrast to standalone .xz files, when the .xz file format
 270         is used as an internal part of some other file format or
 271         communication protocol, it usually is expected that the decoder
 272         stops after the first Stream, and doesn't look for Stream
 273         Padding or possibly other Streams.
 274
 275
 276 2.1. Stream
 277
 278         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 279         |     Stream Header     | Block | Block | ... | Block |
 280         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
 281
 282              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 283         ---> | Index |     Stream Footer     |
 284              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
 285
 286         All the above fields have a size that is a multiple of four. If
 287         Stream is used as an internal part of another file format, it
 288         is RECOMMENDED to make the Stream start at an offset that is
 289         a multiple of four bytes.
 290
 291         Stream Header, Index, and Stream Footer are always present in
 292         a Stream. The maximum size of the Index field is 16 GiB (2^34).
 293
 294         There are zero or more Blocks. The maximum number of Blocks is
 295         limited only by the maximum size of the Index field.
 296
 297         Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
 298         The same limit applies to the total amount of uncompressed
 299         data stored in a Stream.
 300
  301         If an implementation supports handling .xz files with multiple
  302         concatenated Streams, it MAY apply the above limits to the file
  303         as a whole instead of to each Stream separately.
 304
 305
 306 2.1.1. Stream Header
 307
 308         +---+---+---+---+---+---+-------+------+--+--+--+--+
 309         |  Header Magic Bytes   | Stream Flags |   CRC32   |
 310         +---+---+---+---+---+---+-------+------+--+--+--+--+
 311
 312
 313 2.1.1.1. Header Magic Bytes
 314
  315         The first six (6) bytes of the Stream are the so-called Header
 316         Magic Bytes. They can be used to identify the file type.
 317
 318             Using a C array and ASCII:
 319             const uint8_t HEADER_MAGIC[6]
 320                     = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
 321
 322             In plain hexadecimal:
 323             FD 37 7A 58 5A 00
 324
 325         Notes:
 326           - The first byte (0xFD) was chosen so that the files cannot
 327             be erroneously detected as being in .lzma format, in which
 328             the first byte is in the range [0x00, 0xE0].
 329           - The sixth byte (0x00) was chosen to prevent applications
 330             from misdetecting the file as a text file.
 331
 332         If the Header Magic Bytes don't match, the decoder MUST
 333         indicate an error.
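
              As an illustration only, the check above could be sketched in
              C as follows. The function name is_xz_stream_start is
              hypothetical and not part of this specification.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Header Magic Bytes as listed in Section 2.1.1.1. */
static const uint8_t HEADER_MAGIC[6]
        = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };

/*
 * Return true if buf begins with the Header Magic Bytes.
 * A buffer shorter than six bytes can never match.
 */
bool
is_xz_stream_start(const uint8_t *buf, size_t size)
{
    if (size < sizeof(HEADER_MAGIC))
        return false;

    return memcmp(buf, HEADER_MAGIC, sizeof(HEADER_MAGIC)) == 0;
}
```

              A decoder would indicate an error when this function returns
              false for the first six bytes of the Stream.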
 334
 335
 336 2.1.1.2. Stream Flags
 337
 338         The first byte of Stream Flags is always a null byte. In the
 339         future, this byte may be used to indicate a new Stream version
 340         or other Stream properties.
 341
 342         The second byte of Stream Flags is a bit field:
 343
 344             Bit(s)  Mask  Description
 345              0-3    0x0F  Type of Check (see Section 3.4):
 346                               ID    Size      Check name
 347                               0x00   0 bytes  None
 348                               0x01   4 bytes  CRC32
 349                               0x02   4 bytes  (Reserved)
 350                               0x03   4 bytes  (Reserved)
 351                               0x04   8 bytes  CRC64
 352                               0x05   8 bytes  (Reserved)
 353                               0x06   8 bytes  (Reserved)
 354                               0x07  16 bytes  (Reserved)
 355                               0x08  16 bytes  (Reserved)
 356                               0x09  16 bytes  (Reserved)
 357                               0x0A  32 bytes  SHA-256
 358                               0x0B  32 bytes  (Reserved)
 359                               0x0C  32 bytes  (Reserved)
 360                               0x0D  64 bytes  (Reserved)
 361                               0x0E  64 bytes  (Reserved)
 362                               0x0F  64 bytes  (Reserved)
 363              4-7    0xF0  Reserved for future use; MUST be zero for now.
 364
 365         Implementations SHOULD support at least the Check IDs 0x00
 366         (None) and 0x01 (CRC32). Supporting other Check IDs is
 367         OPTIONAL. If an unsupported Check is used, the decoder SHOULD
 368         indicate a warning or error.
 369
 370         If any reserved bit is set, the decoder MUST indicate an error.
 371         It is possible that there is a new field present which the
 372         decoder is not aware of, and can thus parse the Stream Header
 373         incorrectly.
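
              The following sketch shows one possible way to parse Stream
              Flags and to look up the size of the Check field from the
              table above. The names check_size and parse_stream_flags are
              hypothetical. Rejecting a non-null first byte is an
              assumption of this sketch; this section only states that the
              byte is always null.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Size of the Check field in bytes for a Check ID (see the table). */
size_t
check_size(uint8_t check_id)
{
    /* ID 0x00 stands alone; IDs 0x01-0x0F share a size in groups
     * of three, which the index expression below exploits. */
    static const size_t sizes[6] = { 0, 4, 8, 16, 32, 64 };

    return sizes[((check_id & 0x0F) + 2) / 3];
}

/*
 * Parse the two-byte Stream Flags field. Returns false if the first
 * byte is not null (an assumption of this sketch) or if a reserved
 * bit is set in the second byte.
 */
bool
parse_stream_flags(const uint8_t flags[2], uint8_t *check_id)
{
    if (flags[0] != 0x00 || (flags[1] & 0xF0) != 0)
        return false;

    *check_id = flags[1] & 0x0F;
    return true;
}
```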
 374
 375
 376 2.1.1.3. CRC32
 377
 378         The CRC32 is calculated from the Stream Flags field. It is
 379         stored as an unsigned 32-bit little endian integer. If the
 380         calculated value does not match the stored one, the decoder
 381         MUST indicate an error.
 382
 383         The idea is that Stream Flags would always be two bytes, even
 384         if new features are needed. This way old decoders will be able
 385         to verify the CRC32 calculated from Stream Flags, and thus
 386         distinguish between corrupt files (CRC32 doesn't match) and
 387         files that the decoder doesn't support (CRC32 matches but
 388         Stream Flags has reserved bits set).
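
              A minimal sketch of this verification, assuming that the
              12-byte Stream Header is already in memory. The CRC32 is
              assumed to be the common variant with the reflected
              polynomial 0xEDB88320 (see Section 6). The names crc32_calc,
              read32le, and stream_header_crc_ok are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32 (reflected polynomial 0xEDB88320); slow but short. */
uint32_t
crc32_calc(const uint8_t *buf, size_t size)
{
    uint32_t crc = 0xFFFFFFFF;

    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320 : crc >> 1;
    }

    return ~crc;
}

/* Read a 32-bit little endian integer (see Section 1.2). */
uint32_t
read32le(const uint8_t *buf)
{
    return (uint32_t)buf[0] | ((uint32_t)buf[1] << 8)
            | ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
}

/*
 * Bytes 0-5 of the Stream Header are the Header Magic Bytes,
 * bytes 6-7 are Stream Flags, and bytes 8-11 hold the CRC32.
 */
bool
stream_header_crc_ok(const uint8_t header[12])
{
    return crc32_calc(header + 6, 2) == read32le(header + 8);
}
```

              Checking the CRC32 before looking at the reserved bits lets
              a decoder tell a corrupt header from an unsupported one.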
 389
 390
 391 2.1.2. Stream Footer
 392
 393         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 394         | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
 395         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
 396
 397
 398 2.1.2.1. CRC32
 399
 400         The CRC32 is calculated from the Backward Size and Stream Flags
 401         fields. It is stored as an unsigned 32-bit little endian
 402         integer. If the calculated value does not match the stored one,
 403         the decoder MUST indicate an error.
 404
 405         The reason to have the CRC32 field before the Backward Size and
 406         Stream Flags fields is to keep the four-byte fields aligned to
 407         a multiple of four bytes.
 408
 409
 410 2.1.2.2. Backward Size
 411
  412         Backward Size is stored as a 32-bit little endian integer,
  413         which indicates the size of the Index field as a multiple of
  414         four bytes, the minimum value being four bytes:
 415
 416             real_backward_size = (stored_backward_size + 1) * 4;
 417
 418         If the stored value does not match the real size of the Index
 419         field, the decoder MUST indicate an error.
 420
 421         Using a fixed-size integer to store Backward Size makes
 422         it slightly simpler to parse the Stream Footer when the
 423         application needs to parse the Stream backwards.
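
              The conversion can be sketched as follows, assuming that the
              caller passes a pointer to the four-byte Backward Size
              field. The function name is hypothetical. A 64-bit result
              is used so that the largest stored value (0xFFFFFFFF) does
              not overflow; it yields 2^34 bytes, which matches the Index
              size limit given in Section 2.1.

```c
#include <stdint.h>

/*
 * Convert the stored Backward Size (32-bit little endian) to the
 * real size of the Index field in bytes.
 */
uint64_t
real_backward_size(const uint8_t stored[4])
{
    uint32_t stored_backward_size = (uint32_t)stored[0]
            | ((uint32_t)stored[1] << 8)
            | ((uint32_t)stored[2] << 16)
            | ((uint32_t)stored[3] << 24);

    /* Widen before adding one to avoid 32-bit wraparound. */
    return ((uint64_t)stored_backward_size + 1) * 4;
}
```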
 424
 425
 426 2.1.2.3. Stream Flags
 427
 428         This is a copy of the Stream Flags field from the Stream
 429         Header. The information stored to Stream Flags is needed
 430         when parsing the Stream backwards. The decoder MUST compare
 431         the Stream Flags fields in both Stream Header and Stream
 432         Footer, and indicate an error if they are not identical.
 433
 434
 435 2.1.2.4. Footer Magic Bytes
 436
 437         As the last step of the decoding process, the decoder MUST
 438         verify the existence of Footer Magic Bytes. If they don't
 439         match, an error MUST be indicated.
 440
 441             Using a C array and ASCII:
 442             const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
 443
 444             In hexadecimal:
 445             59 5A
 446
 447         The primary reason to have Footer Magic Bytes is to make
 448         it easier to detect incomplete files quickly, without
 449         uncompressing. If the file does not end with Footer Magic Bytes
 450         (excluding Stream Padding described in Section 2.2), it cannot
 451         be undamaged, unless someone has intentionally appended garbage
 452         after the end of the Stream.
 453
 454
 455 2.2. Stream Padding
 456
  457         Decoders that support decoding of concatenated Streams MUST
  458         also support Stream Padding; other decoders need not.
 459
 460         Stream Padding MUST contain only null bytes. To preserve the
 461         four-byte alignment of consecutive Streams, the size of Stream
 462         Padding MUST be a multiple of four bytes. Empty Stream Padding
 463         is allowed. If these requirements are not met, the decoder MUST
 464         indicate an error.
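
              One possible way to check these two requirements is sketched
              below. The function name stream_padding_ok is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if the given bytes form valid Stream Padding: the size
 * is a multiple of four (zero is allowed) and every byte is null.
 */
bool
stream_padding_ok(const uint8_t *padding, size_t size)
{
    if (size % 4 != 0)
        return false;

    for (size_t i = 0; i < size; ++i) {
        if (padding[i] != 0x00)
            return false;
    }

    return true;
}
```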
 465
 466         Note that non-empty Stream Padding is allowed at the end of the
 467         file; there doesn't need to be a new Stream after non-empty
 468         Stream Padding. This can be convenient in certain situations
 469         [GNU-tar].
 470
  471         The possibility of Stream Padding MUST be taken into account
  472         when designing an application that parses Streams backwards
  473         and supports concatenated Streams.
 474
 475
 476 3. Block
 477
 478         +==============+=================+===============+=======+
 479         | Block Header | Compressed Data | Block Padding | Check |
 480         +==============+=================+===============+=======+
 481
 482
 483 3.1. Block Header
 484
 485         +-------------------+-------------+=================+
 486         | Block Header Size | Block Flags | Compressed Size |
 487         +-------------------+-------------+=================+
 488
 489              +===================+======================+
 490         ---> | Uncompressed Size | List of Filter Flags |
 491              +===================+======================+
 492
 493              +================+--+--+--+--+
 494         ---> | Header Padding |   CRC32   |
 495              +================+--+--+--+--+
 496
 497
 498 3.1.1. Block Header Size
 499
 500         This field overlaps with the Index Indicator field (see
 501         Section 4.1).
 502
 503         This field contains the size of the Block Header field,
 504         including the Block Header Size field itself. Valid values are
 505         in the range [0x01, 0xFF], which indicate the size of the Block
 506         Header as multiples of four bytes, minimum size being eight
 507         bytes:
 508
 509             real_header_size = (encoded_header_size + 1) * 4;
 510
 511         If a Block Header bigger than 1024 bytes is needed in the
 512         future, a new field can be added between the Block Header and
 513         Compressed Data fields. The presence of this new field would
 514         be indicated in the Block Header field.
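
              A sketch of decoding this field follows. The function name
              is hypothetical, and returning zero for the value 0x00
              (which marks the Index, see Section 4.1) is a convention of
              this sketch only.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the real size of the Block Header in bytes (8-1024), or
 * zero if the byte is 0x00 and thus is an Index Indicator instead.
 */
size_t
real_header_size(uint8_t encoded_header_size)
{
    if (encoded_header_size == 0x00)
        return 0;

    return ((size_t)encoded_header_size + 1) * 4;
}
```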
 515
 516
 517 3.1.2. Block Flags
 518
 519         The Block Flags field is a bit field:
 520
 521             Bit(s)  Mask  Description
 522              0-1    0x03  Number of filters (1-4)
 523              2-5    0x3C  Reserved for future use; MUST be zero for now.
 524               6     0x40  The Compressed Size field is present.
 525               7     0x80  The Uncompressed Size field is present.
 526
 527         If any reserved bit is set, the decoder MUST indicate an error.
 528         It is possible that there is a new field present which the
 529         decoder is not aware of, and can thus parse the Block Header
 530         incorrectly.
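
              The bit field could be unpacked as sketched below. The
              structure and function names are hypothetical. The sketch
              assumes that the two-bit value 0-3 in the lowest bits
              corresponds to 1-4 filters.

```c
#include <stdbool.h>
#include <stdint.h>

struct block_flags {
    unsigned filter_count;          /* 1-4 */
    bool has_compressed_size;
    bool has_uncompressed_size;
};

/* Returns false if any reserved bit (mask 0x3C) is set. */
bool
parse_block_flags(uint8_t flags, struct block_flags *out)
{
    if (flags & 0x3C)
        return false;

    out->filter_count = (flags & 0x03) + 1U;
    out->has_compressed_size = (flags & 0x40) != 0;
    out->has_uncompressed_size = (flags & 0x80) != 0;
    return true;
}
```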
 531
 532
 533 3.1.3. Compressed Size
 534
 535         This field is present only if the appropriate bit is set in
 536         the Block Flags field (see Section 3.1.2).
 537
 538         The Compressed Size field contains the size of the Compressed
 539         Data field, which MUST be non-zero. Compressed Size is stored
 540         using the encoding described in Section 1.2. If the Compressed
 541         Size doesn't match the size of the Compressed Data field, the
 542         decoder MUST indicate an error.
 543
 544
 545 3.1.4. Uncompressed Size
 546
 547         This field is present only if the appropriate bit is set in
 548         the Block Flags field (see Section 3.1.2).
 549
 550         The Uncompressed Size field contains the size of the Block
 551         after uncompressing. Uncompressed Size is stored using the
 552         encoding described in Section 1.2. If the Uncompressed Size
 553         does not match the real uncompressed size, the decoder MUST
 554         indicate an error.
 555
 556         Storing the Compressed Size and Uncompressed Size fields serves
 557         several purposes:
 558           - The decoder knows how much memory it needs to allocate
 559             for a temporary buffer in multithreaded mode.
 560           - Simple error detection: wrong size indicates a broken file.
 561           - Seeking forwards to a specific location in streamed mode.
 562
 563         It should be noted that the only reliable way to determine
 564         the real uncompressed size is to uncompress the Block,
 565         because the Block Header and Index fields may contain
 566         (intentionally or unintentionally) invalid information.
 567
 568
 569 3.1.5. List of Filter Flags
 570
 571         +================+================+     +================+
 572         | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
 573         +================+================+     +================+
 574
 575         The number of Filter Flags fields is stored in the Block Flags
 576         field (see Section 3.1.2).
 577
 578         The format of each Filter Flags field is as follows:
 579
 580             +===========+====================+===================+
 581             | Filter ID | Size of Properties | Filter Properties |
 582             +===========+====================+===================+
 583
 584         Both Filter ID and Size of Properties are stored using the
 585         encoding described in Section 1.2. Size of Properties indicates
 586         the size of the Filter Properties field as bytes. The list of
 587         officially defined Filter IDs and the formats of their Filter
 588         Properties are described in Section 5.3.
 589
 590         Filter IDs greater than or equal to 0x4000_0000_0000_0000
 591         (2^62) are reserved for implementation-specific internal use.
 592         These Filter IDs MUST never be used in List of Filter Flags.
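
              The following sketch parses one Filter Flags field. It
              repeats decode() from Section 1.2 so that it is
              self-contained; struct filter_flags and parse_filter_flags
              are hypothetical names. Whether a particular Filter ID is
              supported is not checked here.

```c
#include <stddef.h>
#include <stdint.h>

/* Variable-length integer decoder as given in Section 1.2. */
size_t
decode(const uint8_t buf[], size_t size_max, uint64_t *num)
{
    if (size_max == 0)
        return 0;

    if (size_max > 9)
        size_max = 9;

    *num = buf[0] & 0x7F;
    size_t i = 0;

    while (buf[i++] & 0x80) {
        if (i >= size_max || buf[i] == 0x00)
            return 0;

        *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
    }

    return i;
}

struct filter_flags {
    uint64_t id;
    uint64_t props_size;
    const uint8_t *props;       /* points into the caller's buffer */
};

/* Returns the number of bytes consumed, or zero on error. */
size_t
parse_filter_flags(const uint8_t *buf, size_t size,
        struct filter_flags *out)
{
    size_t pos = decode(buf, size, &out->id);
    if (pos == 0 || out->id >= UINT64_C(0x4000000000000000))
        return 0;

    size_t n = decode(buf + pos, size - pos, &out->props_size);
    if (n == 0)
        return 0;

    pos += n;

    /* Filter Properties must fit in the remaining input. */
    if (out->props_size > size - pos)
        return 0;

    out->props = buf + pos;
    return pos + (size_t)out->props_size;
}
```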
 593
 594
 595 3.1.6. Header Padding
 596
  597         This field contains as many null bytes as are needed to make
  598         the Block Header have the size specified in Block Header Size.
 599         If any of the bytes are not null bytes, the decoder MUST
 600         indicate an error. It is possible that there is a new field
 601         present which the decoder is not aware of, and can thus parse
 602         the Block Header incorrectly.
 603
 604
 605 3.1.7. CRC32
 606
 607         The CRC32 is calculated over everything in the Block Header
 608         field except the CRC32 field itself. It is stored as an
 609         unsigned 32-bit little endian integer. If the calculated
 610         value does not match the stored one, the decoder MUST indicate
 611         an error.
 612
  613         Verifying the CRC32 of the Block Header before parsing the
 614         actual contents allows the decoder to distinguish between
 615         corrupt and unsupported files.
 616
 617
 618 3.2. Compressed Data
 619
 620         The format of Compressed Data depends on Block Flags and List
 621         of Filter Flags. Excluding the descriptions of the simplest
 622         filters in Section 5.3, the format of the filter-specific
 623         encoded data is out of scope of this document.
 624
 625
 626 3.3. Block Padding
 627
 628         Block Padding MUST contain 0-3 null bytes to make the size of
 629         the Block a multiple of four bytes. This can be needed when
 630         the size of Compressed Data is not a multiple of four. If any
 631         of the bytes in Block Padding are not null bytes, the decoder
 632         MUST indicate an error.
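
              Because the sizes of Block Header and Check are multiples of
              four bytes, the amount of Block Padding depends only on the
              size of Compressed Data. A sketch, with a hypothetical
              function name:

```c
#include <stddef.h>
#include <stdint.h>

/* Number of null bytes (0-3) needed after Compressed Data. */
size_t
block_padding_size(uint64_t compressed_size)
{
    return (size_t)((4 - compressed_size % 4) % 4);
}
```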
 633
 634
 635 3.4. Check
 636
  637         The type and size of the Check field depend on which bits
 638         are set in the Stream Flags field (see Section 2.1.1.2).
 639
 640         The Check, when used, is calculated from the original
 641         uncompressed data. If the calculated Check does not match the
 642         stored one, the decoder MUST indicate an error. If the selected
 643         type of Check is not supported by the decoder, it SHOULD
 644         indicate a warning or error.
 645
 646
 647 4. Index
 648
 649         +-----------------+===================+
 650         | Index Indicator | Number of Records |
 651         +-----------------+===================+
 652
 653              +=================+===============+-+-+-+-+
 654         ---> | List of Records | Index Padding | CRC32 |
 655              +=================+===============+-+-+-+-+
 656
 657         Index serves several purposes. Using it, one can
 658           - verify that all Blocks in a Stream have been processed;
 659           - find out the uncompressed size of a Stream; and
 660           - quickly access the beginning of any Block (random access).
 661
 662
 663 4.1. Index Indicator
 664
 665         This field overlaps with the Block Header Size field (see
 666         Section 3.1.1). The value of Index Indicator is always 0x00.
 667
 668
 669 4.2. Number of Records
 670
 671         This field indicates how many Records there are in the List
 672         of Records field, and thus how many Blocks there are in the
 673         Stream. The value is stored using the encoding described in
 674         Section 1.2. If the decoder has decoded all the Blocks of the
 675         Stream, and then notices that the Number of Records doesn't
 676         match the real number of Blocks, the decoder MUST indicate an
 677         error.
 678
 679
 680 4.3. List of Records
 681
 682         List of Records consists of as many Records as indicated by the
 683         Number of Records field:
 684
 685             +========+========+
 686             | Record | Record | ...
 687             +========+========+
 688
 689         Each Record contains information about one Block:
 690
 691             +===============+===================+
 692             | Unpadded Size | Uncompressed Size |
 693             +===============+===================+
 694
 695         If the decoder has decoded all the Blocks of the Stream, it
 696         MUST verify that the contents of the Records match the real
 697         Unpadded Size and Uncompressed Size of the respective Blocks.
 698
 699         Implementation hint: It is possible to verify the Index with
 700         constant memory usage by calculating for example SHA-256 of
 701         both the real size values and the List of Records, then
  702         comparing the hash values. Implementing this using a
  703         non-cryptographic hash like CRC32 SHOULD be avoided unless
 704         small code size is important.
 705
 706         If the decoder supports random-access reading, it MUST verify
 707         that Unpadded Size and Uncompressed Size of every completely
 708         decoded Block match the sizes stored in the Index. If only
  709         part of a Block is decoded, the decoder MUST verify that the
 710         processed sizes don't exceed the sizes stored in the Index.
 711
 712
 713 4.3.1. Unpadded Size
 714
 715         This field indicates the size of the Block excluding the Block
 716         Padding field. That is, Unpadded Size is the size of the Block
 717         Header, Compressed Data, and Check fields. Unpadded Size is
 718         stored using the encoding described in Section 1.2. The value
 719         MUST never be zero; with the current structure of Blocks, the
 720         actual minimum value for Unpadded Size is five.
 721
 722         Implementation note: Because the size of the Block Padding
 723         field is not included in Unpadded Size, calculating the total
 724         size of a Stream or doing random-access reading requires
 725         calculating the actual size of the Blocks by rounding Unpadded
 726         Sizes up to the next multiple of four.
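
              The rounding described in the implementation note can be
              sketched as follows. Both function names are hypothetical,
              and overflow checking is omitted for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Round Unpadded Size up to the next multiple of four. */
uint64_t
padded_block_size(uint64_t unpadded_size)
{
    return (unpadded_size + 3) & ~UINT64_C(3);
}

/* Total size of all Blocks, given the Unpadded Size of each Record. */
uint64_t
blocks_total_size(const uint64_t *unpadded_sizes, size_t count)
{
    uint64_t total = 0;

    for (size_t i = 0; i < count; ++i)
        total += padded_block_size(unpadded_sizes[i]);

    return total;
}
```

              The running total after the first k Records gives the offset
              of Block k relative to the end of the Stream Header, which is
              what random-access reading needs.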
 727
 728         The reason to exclude Block Padding from Unpadded Size is to
 729         ease making a raw copy of Compressed Data without Block
 730         Padding. This can be useful, for example, if someone wants
 731         to convert Streams to some other file format quickly.
 732
 733
 734 4.3.2. Uncompressed Size
 735
 736         This field indicates the Uncompressed Size of the respective
 737         Block as bytes. The value is stored using the encoding
 738         described in Section 1.2.
 739
 740
 741 4.4. Index Padding
 742
 743         This field MUST contain 0-3 null bytes to pad the Index to
 744         a multiple of four bytes. If any of the bytes are not null
 745         bytes, the decoder MUST indicate an error.
 746
 747
 748 4.5. CRC32
 749
 750         The CRC32 is calculated over everything in the Index field
 751         except the CRC32 field itself. The CRC32 is stored as an
 752         unsigned 32-bit little endian integer. If the calculated
 753         value does not match the stored one, the decoder MUST indicate
 754         an error.
 755
 756
 757 5. Filter Chains
 758
 759         The Block Flags field defines how many filters are used. When
 760         more than one filter is used, the filters are chained; that is,
 761         the output of one filter is the input of another filter. The
 762         following figure illustrates the direction of data flow.
 763
 764                     v   Uncompressed Data   ^
 765                     |       Filter 0        |
 766             Encoder |       Filter 1        | Decoder
 767                     |       Filter n        |
 768                     v    Compressed Data    ^
 769
 770
 771 5.1. Alignment
 772
 773         Alignment of uncompressed input data is usually the job of
 774         the application producing the data. For example, to get the
 775         best results, an archiver tool should make sure that all
 776         PowerPC executable files in the archive stream start at
 777         offsets that are multiples of four bytes.
 778
 779         Some filters, for example LZMA2, can be configured to take
 780         advantage of a specified alignment of input data. Note that
 781         taking advantage of aligned input can also be beneficial when
 782         a filter is not the first filter in the chain. For example,
 783         if you compress PowerPC executables, you may want to use the
 784         PowerPC filter and chain that with the LZMA2 filter. Because
 785         not only the input but also the output alignment of the PowerPC
 786         filter is four bytes, it is now beneficial to set LZMA2
 787         settings so that the LZMA2 encoder can take advantage of its
 788         four-byte-aligned input data.
 789
 790         The output of the last filter in the chain is stored to the
 791         Compressed Data field, which is guaranteed to be aligned
 792         to a multiple of four bytes relative to the beginning of the
 793         Stream. This can increase
 794           - speed, if the filtered data is handled multiple bytes at
 795             a time by the filter-specific encoder and decoder,
 796             because accessing aligned data in computer memory is
 797             usually faster; and
 798           - compression ratio, if the output data is later compressed
 799             with an external compression tool.
 800
 801
 802 5.2. Security
 803
 804         If filters could be chained freely, it would be possible to
 805         create malicious files that would be very slow to decode.
 806         Such files could be used to mount denial-of-service
 807         attacks.
 808
 809         Slow files could occur when multiple filters are chained:
 810
 811             v   Compressed input data
 812             |   Filter 1 decoder (last filter)
 813             |   Filter 0 decoder (non-last filter)
 814             v   Uncompressed output data
 815
 816         The decoder of the last filter in the chain produces a lot of
 817         output from little input. Another filter in the chain takes the
 818         output of the last filter, and produces very little output
 819         while consuming a lot of input. As a result, a lot of data is
 820         moved inside the filter chain, but the filter chain as a whole
 821         gets very little work done.
 822
 823         To prevent such slow files, there are restrictions on
 824         how the filters can be chained. These restrictions MUST be
 825         taken into account when designing new filters.
 826
 827         The maximum number of filters in the chain has been limited to
 828         four; thus, there can be at most three non-last filters.
 829         Of these three non-last filters, only two are allowed to change
 830         the size of the data.
 831
 832         The non-last filters that change the size of the data MUST
 833         have a limit on how much the decoder can compress the data:
 834         the decoder SHOULD produce at least n bytes of output when
 835         the filter is given 2n bytes of input. This limit is not
 836         absolute, but significant deviations MUST be avoided.
 837
 838         The above limitations guarantee that if the last filter in the
 839         chain produces 4n bytes of output, the chain as a whole will
 840         produce at least n bytes of output.
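
             The guarantee follows because each of the at most two
             size-changing non-last filters at most halves its input
             when decoding: 4n bytes become at least 2n and then at
             least n bytes. As an illustration only, the chaining
             restrictions could be checked as sketched below. The
             structure and function names are hypothetical; the three
             flags correspond to the "Changes size of data", "Allow as
             a non-last filter", and "Allow as the last filter"
             properties listed for each filter in Section 5.3. The
             sketch assumes that a chain has at least one filter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-filter properties, as listed in Section 5.3. */
struct filter_info {
    bool changes_size;
    bool allow_non_last;
    bool allow_last;
};

/* chain[0] is Filter 0 and chain[count - 1] is the last filter. */
bool
chain_is_allowed(const struct filter_info *chain, size_t count)
{
    if (count < 1 || count > 4)
        return false;

    /* Count the non-last filters that change the size of the data. */
    size_t size_changing = 0;
    for (size_t i = 0; i + 1 < count; ++i) {
        if (!chain[i].allow_non_last)
            return false;

        if (chain[i].changes_size)
            ++size_changing;
    }

    if (size_changing > 2)
        return false;

    return chain[count - 1].allow_last;
}
```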
 841
 842
 843 5.3. Filters
 844
 845 5.3.1. LZMA2
 846
 847         LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
 848         compression algorithm with high compression ratio and fast
 849         decompression. LZMA is based on LZ77 and range coding
 850         algorithms.
 851
 852         LZMA2 is an extension on top of the original LZMA. LZMA2 uses
 853         LZMA internally, but adds support for flushing the encoder
 854         and for uncompressed chunks, eases stateful decoder
 855         implementations, and improves support for multithreading.
 856         Thus, plain LZMA is not supported in this file format.
 857
 858             Filter ID:                  0x21
 859             Size of Filter Properties:  1 byte
 860             Changes size of data:       Yes
 861             Allow as a non-last filter: No
 862             Allow as the last filter:   Yes
 863
 864             Preferred alignment:
 865                 Input data:             Adjustable to 1/2/4/8/16 byte(s)
 866                 Output data:            1 byte
 867
 868         The format of the one-byte Filter Properties field is as
 869         follows:
 870
 871             Bits   Mask   Description
 872             0-5    0x3F   Dictionary Size
 873             6-7    0xC0   Reserved for future use; MUST be zero for now.
 874
 875         Dictionary Size is encoded with one-bit mantissa and five-bit
 876         exponent. The smallest dictionary size is 4 KiB and the biggest
 877         is 4 GiB.
 878
 879             Raw value   Mantissa   Exponent   Dictionary size
 880                 0           2         11         4 KiB
 881                 1           3         11         6 KiB
 882                 2           2         12         8 KiB
 883                 3           3         12        12 KiB
 884                 4           2         13        16 KiB
 885                 5           3         13        24 KiB
 886                 6           2         14        32 KiB
 887               ...         ...        ...      ...
 888                35           3         28       768 MiB
 889                36           2         29      1024 MiB
 890                37           3         29      1536 MiB
 891                38           2         30      2048 MiB
 892                39           3         30      3072 MiB
 893                40           2         31      4096 MiB - 1 B
 894
 895         Instead of having a table in the decoder, the dictionary size
 896         can be decoded using the following C code:
 897
 898             const uint8_t bits = get_dictionary_flags() & 0x3F;
 899             if (bits > 40)
 900                 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
 901
 902             uint32_t dictionary_size;
 903             if (bits == 40) {
 904                 dictionary_size = UINT32_MAX;
 905             } else {
 906                 dictionary_size = 2 | (bits & 1);
 907                 dictionary_size <<= bits / 2 + 11;
 908             }
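
             The following sketch wraps the decoding logic above in
             a function and adds the inverse mapping that an encoder
             might use to pick the smallest raw value whose dictionary
             size is at least the requested size. The function names
             and the convention of returning zero for an invalid
             Filter Properties byte are assumptions made for this
             example. Unlike the code above, the byte is not masked,
             so a value with the reserved bits set is rejected.

```c
#include <stdint.h>

/* Returns the dictionary size in bytes, or 0 if the Filter
 * Properties byte is invalid (reserved bits set or value > 40). */
uint32_t
lzma2_dict_size_decode(uint8_t props)
{
    if (props > 40)
        return 0;

    if (props == 40)
        return UINT32_MAX;

    /* Mantissa is 2 or 3; exponent is props / 2 + 11. */
    return (UINT32_C(2) | (props & 1)) << (props / 2 + 11);
}

/* Returns the smallest raw value (0-40) whose dictionary size
 * is at least `wanted` bytes. */
uint8_t
lzma2_dict_size_encode(uint32_t wanted)
{
    uint8_t props = 0;
    while (props < 40 && lzma2_dict_size_decode(props) < wanted)
        ++props;

    return props;
}
```

             A linear search is used for clarity; with only 41 possible
             values its cost is negligible.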
 909
 910
 911 5.3.2. Branch/Call/Jump Filters for Executables
 912
 913         These filters convert relative branch, call, and jump
 914         instructions to their absolute counterparts in executable
 915         files. This conversion increases redundancy and thus
 916         compression ratio.
 917
 918             Size of Filter Properties:  0 or 4 bytes
 919             Changes size of data:       No
 920             Allow as a non-last filter: Yes
 921             Allow as the last filter:   No
 922
 923         Below is the list of filters in this category. The alignment
 924         is the same for both input and output data.
 925
 926             Filter ID   Alignment   Description
 927               0x04       1 byte     x86 filter (BCJ)
 928               0x05       4 bytes    PowerPC (big endian) filter
 929               0x06      16 bytes    IA64 filter
 930               0x07       4 bytes    ARM filter [1]
 931               0x08       2 bytes    ARM Thumb filter [1]
 932               0x09       4 bytes    SPARC filter
 933               0x0A       4 bytes    ARM64 filter [2]
 934               0x0B       2 bytes    RISC-V filter
 935
 936               [1] These are for little endian instruction encoding.
 937                   This must not be confused with data endianness.
 938                   A processor configured for big endian data access
 939                   may still use little endian instruction encoding.
 940                   The filters don't care about the data endianness.
 941
 942               [2] 4096-byte alignment gives the best results
 943                   because the address in the ADRP instruction
 944                   is a multiple of 4096 bytes.
 945
 946         If the size of Filter Properties is four bytes, the Filter
 947         Properties field contains the start offset used for address
 948         conversions. It is stored as an unsigned 32-bit little endian
 949         integer. The start offset MUST be a multiple of the alignment
 950         of the filter as listed in the table above; if it isn't, the
 951         decoder MUST indicate an error. If the size of Filter
 952         Properties is zero, the start offset is zero.
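
             As a sketch of the rules above, a decoder could validate
             the Filter Properties of these filters as follows. The
             function names and the convention of returning zero for
             an unknown Filter ID are assumptions made for this
             example; the alignments are taken from the table above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Alignment of a Branch/Call/Jump filter, or 0 if filter_id
 * is not one of the Filter IDs listed in the table. */
uint32_t
bcj_alignment(uint64_t filter_id)
{
    switch (filter_id) {
    case 0x04: return 1;   /* x86 */
    case 0x05: return 4;   /* PowerPC (big endian) */
    case 0x06: return 16;  /* IA64 */
    case 0x07: return 4;   /* ARM */
    case 0x08: return 2;   /* ARM Thumb */
    case 0x09: return 4;   /* SPARC */
    case 0x0A: return 4;   /* ARM64 */
    case 0x0B: return 2;   /* RISC-V */
    default:   return 0;
    }
}

/* Stores the start offset to *start_offset and returns true,
 * or returns false if the decoder has to indicate an error. */
bool
bcj_parse_start_offset(uint64_t filter_id, const uint8_t *props,
        size_t props_size, uint32_t *start_offset)
{
    const uint32_t alignment = bcj_alignment(filter_id);
    if (alignment == 0)
        return false;

    if (props_size == 0) {
        *start_offset = 0;
        return true;
    }

    if (props_size != 4)
        return false;

    /* Unsigned 32-bit little endian integer */
    const uint32_t offset = (uint32_t)props[0]
            | ((uint32_t)props[1] << 8)
            | ((uint32_t)props[2] << 16)
            | ((uint32_t)props[3] << 24);

    if (offset % alignment != 0)
        return false;

    *start_offset = offset;
    return true;
}
```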
 953
 954         Setting the start offset may be useful if an executable has
 955         multiple sections, and there are many cross-section calls.
 956         Taking advantage of this feature usually requires usage of
 957         the Subblock filter, whose design is not complete yet.
 958
 959
 960 5.3.3. Delta
 961
 962         The Delta filter may increase compression ratio when the value
 963         of the next byte correlates with the value of an earlier byte
 964         at a specified distance.
 965
 966             Filter ID:                  0x03
 967             Size of Filter Properties:  1 byte
 968             Changes size of data:       No
 969             Allow as a non-last filter: Yes
 970             Allow as the last filter:   No
 971
 972             Preferred alignment:
 973                 Input data:             1 byte
 974                 Output data:            Same as the original input data
 975
 976         The Properties byte indicates the delta distance, which can be
 977         1-256 bytes backwards from the current byte: 0x00 indicates a
 978         distance of 1 byte and 0xFF a distance of 256 bytes.
 979
 980
 981 5.3.3.1. Format of the Encoded Output
 982
 983         The code below illustrates both encoding and decoding with
 984         the Delta filter.
 985
 986             // Distance is in the range [1, 256].
 987             const unsigned int distance = get_properties_byte() + 1;
 988             uint8_t pos = 0;
 989             uint8_t delta[256];
 990
 991             memset(delta, 0, sizeof(delta));
 992
 993             while (1) {
 994                 const int byte = read_byte();
 995                 if (byte == EOF)
 996                     break;
 997
 998                 uint8_t tmp = delta[(uint8_t)(distance + pos)];
 999                 if (is_encoder) {
1000                     tmp = (uint8_t)(byte) - tmp;
1001                     delta[pos] = (uint8_t)(byte);
1002                 } else {
1003                     tmp = (uint8_t)(byte) + tmp;
1004                     delta[pos] = tmp;
1005                 }
1006
1007                 write_byte(tmp);
1008                 --pos;
1009             }
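
             The code above relies on read_byte(), write_byte(), and
             is_encoder being provided by the surrounding program. The
             sketch below applies the same logic to a buffer in place
             so that it can be compiled on its own. The structure and
             function names are hypothetical and not part of this
             specification.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct delta_state {
    unsigned int distance;  /* In the range [1, 256] */
    uint8_t pos;
    uint8_t history[256];
};

void
delta_init(struct delta_state *s, uint8_t properties_byte)
{
    s->distance = (unsigned int)properties_byte + 1;
    s->pos = 0;
    memset(s->history, 0, sizeof(s->history));
}

/* Encodes or decodes buf in place. The state is kept between
 * calls, so the data may be processed in several pieces. */
void
delta_run(struct delta_state *s, uint8_t *buf, size_t size,
        bool is_encoder)
{
    for (size_t i = 0; i < size; ++i) {
        /* The byte that was seen `distance` bytes earlier */
        const uint8_t old
                = s->history[(uint8_t)(s->distance + s->pos)];

        if (is_encoder) {
            const uint8_t in = buf[i];
            s->history[s->pos] = in;
            buf[i] = (uint8_t)(in - old);
        } else {
            buf[i] = (uint8_t)(buf[i] + old);
            s->history[s->pos] = buf[i];
        }

        --s->pos;  /* Wraps from 0 to 255 */
    }
}
```

             For example, with the Properties byte 0x00 (distance of
             1 byte), encoding the bytes 1, 2, 3, 4, 5 gives
             1, 1, 1, 1, 1, and decoding that output with a freshly
             initialized state restores the original bytes.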
1010
1011
1012 5.4. Custom Filter IDs
1013
1014         If a developer wants to use custom Filter IDs, there are two
1015         choices. The first choice is to contact Lasse Collin and ask
1016         him to allocate a range of IDs for the developer.
1017
1018         The second choice is to generate a 40-bit random integer
1019         which the developer can use as a personal Developer ID.
1020         To minimize the risk of collisions, the Developer ID has to
1021         be a randomly generated integer, not a manually selected
1022         "hex word". The following command, which works on many free
1023         operating systems, can be used to generate a Developer ID:
1024
1025             dd if=/dev/urandom bs=5 count=1 | hexdump
1026
1027         The developer can then use the Developer ID to create unique
1028         (well, hopefully unique) Filter IDs.
1029
1030             Bits    Mask                    Description
1031              0-15   0x0000_0000_0000_FFFF   Filter ID
1032             16-55   0x00FF_FFFF_FFFF_0000   Developer ID
1033             56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F
1034
1035         The resulting 63-bit integer will use 9 bytes of space when
1036         stored using the encoding described in Section 1.2. To get
1037         a shorter ID, see the beginning of this Section for how to
1038         request a custom ID range.
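
             The bit layout above can be illustrated with the sketch
             below. The function names are hypothetical. The size
             calculation assumes that the encoding of Section 1.2
             stores seven bits of the integer in each byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Combines a 40-bit Developer ID and a 16-bit Filter ID with
 * the static prefix 0x3F. Extra high bits of developer_id
 * are ignored. */
uint64_t
custom_filter_id(uint64_t developer_id, uint16_t filter_id)
{
    return (UINT64_C(0x3F) << 56)
            | ((developer_id & UINT64_C(0xFFFFFFFFFF)) << 16)
            | filter_id;
}

/* Number of bytes needed to store value, assuming seven bits
 * of the integer per byte. */
size_t
multibyte_size(uint64_t value)
{
    size_t size = 1;
    while (value >= 0x80) {
        value >>= 7;
        ++size;
    }

    return size;
}
```

             For example, Developer ID 0x123456789A and Filter ID
             0xBEEF give 0x3F12_3456_789A_BEEF, for which
             multibyte_size() returns 9.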
1039
1040
1041 5.4.1. Reserved Custom Filter ID Ranges
1042
1043         Range                       Description
1044         0x0000_0300 - 0x0000_04FF   Reserved to ease .7z compatibility
1045         0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
1046         0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
1047
1048
1049 6. Cyclic Redundancy Checks
1050
1051         There are several incompatible variations to calculate CRC32
1052         and CRC64. For simplicity and clarity, complete examples are
1053         provided to calculate the checks as they are used in this file
1054         format. Implementations MAY use different code as long as it
1055         gives identical results.
1056
1057         The program below reads data from standard input, calculates
1058         the CRC32 and CRC64 values, and prints the calculated values
1059         as big endian hexadecimal strings to standard output.
1060
1061             #include <stddef.h>
1062             #include <inttypes.h>
1063             #include <stdio.h>
1064
1065             uint32_t crc32_table[256];
1066             uint64_t crc64_table[256];
1067
1068             void
1069             init(void)
1070             {
1071                 static const uint32_t poly32 = UINT32_C(0xEDB88320);
1072                 static const uint64_t poly64
1073                         = UINT64_C(0xC96C5795D7870F42);
1074
1075                 for (size_t i = 0; i < 256; ++i) {
1076                     uint32_t crc32 = i;
1077                     uint64_t crc64 = i;
1078
1079                     for (size_t j = 0; j < 8; ++j) {
1080                         if (crc32 & 1)
1081                             crc32 = (crc32 >> 1) ^ poly32;
1082                         else
1083                             crc32 >>= 1;
1084
1085                         if (crc64 & 1)
1086                             crc64 = (crc64 >> 1) ^ poly64;
1087                         else
1088                             crc64 >>= 1;
1089                     }
1090
1091                     crc32_table[i] = crc32;
1092                     crc64_table[i] = crc64;
1093                 }
1094             }
1095
1096             uint32_t
1097             crc32(const uint8_t *buf, size_t size, uint32_t crc)
1098             {
1099                 crc = ~crc;
1100                 for (size_t i = 0; i < size; ++i)
1101                     crc = crc32_table[buf[i] ^ (crc & 0xFF)]
1102                             ^ (crc >> 8);
1103                 return ~crc;
1104             }
1105
1106             uint64_t
1107             crc64(const uint8_t *buf, size_t size, uint64_t crc)
1108             {
1109                 crc = ~crc;
1110                 for (size_t i = 0; i < size; ++i)
1111                     crc = crc64_table[buf[i] ^ (crc & 0xFF)]
1112                             ^ (crc >> 8);
1113                 return ~crc;
1114             }
1115
1116             int
1117             main()
1118             {
1119                 init();
1120
1121                 uint32_t value32 = 0;
1122                 uint64_t value64 = 0;
1123                 uint64_t total_size = 0;
1124                 uint8_t buf[8192];
1125
1126                 while (1) {
1127                     const size_t buf_size
1128                             = fread(buf, 1, sizeof(buf), stdin);
1129                     if (buf_size == 0)
1130                         break;
1131
1132                     total_size += buf_size;
1133                     value32 = crc32(buf, buf_size, value32);
1134                     value64 = crc64(buf, buf_size, value64);
1135                 }
1136
1137                 printf("Bytes:  %" PRIu64 "\n", total_size);
1138                 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
1139                 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
1140
1141                 return 0;
1142             }
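
             For comparison, the sketch below calculates the same
             checks one bit at a time without lookup tables, and shows
             how a CRC32 value would be laid out as an unsigned 32-bit
             little endian integer. It is slower than the table-based
             program above and is meant only as a cross-check. The
             function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

uint32_t
crc32_bitwise(const uint8_t *buf, size_t size, uint32_t crc)
{
    crc = ~crc;
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j) {
            if (crc & 1)
                crc = (crc >> 1) ^ UINT32_C(0xEDB88320);
            else
                crc >>= 1;
        }
    }
    return ~crc;
}

uint64_t
crc64_bitwise(const uint8_t *buf, size_t size, uint64_t crc)
{
    crc = ~crc;
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j) {
            if (crc & 1)
                crc = (crc >> 1) ^ UINT64_C(0xC96C5795D7870F42);
            else
                crc >>= 1;
        }
    }
    return ~crc;
}

/* Stores value as an unsigned 32-bit little endian integer. */
void
store_le32(uint8_t out[4], uint32_t value)
{
    out[0] = (uint8_t)(value & 0xFF);
    out[1] = (uint8_t)((value >> 8) & 0xFF);
    out[2] = (uint8_t)((value >> 16) & 0xFF);
    out[3] = (uint8_t)((value >> 24) & 0xFF);
}
```

             For the nine ASCII bytes "123456789", both this sketch
             and the program above should give the CRC32 value
             0xCBF43926 and the CRC64 value 0x995DC9BBDF1939FA.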
1143
1144
1145 7. References
1146
1147         LZMA SDK - The original LZMA implementation
1148         https://7-zip.org/sdk.html
1149
1150         LZMA Utils - LZMA adapted to POSIX-like systems
1151         https://tukaani.org/lzma/
1152
1153         XZ Utils - The next generation of LZMA Utils
1154         https://tukaani.org/xz/
1155
1156         [RFC-1952]
1157         GZIP file format specification version 4.3
1158         https://www.ietf.org/rfc/rfc1952.txt
1159           - Notation of byte boxes in section "2.1. Overall conventions"
1160
1161         [RFC-2119]
1162         Key words for use in RFCs to Indicate Requirement Levels
1163         https://www.ietf.org/rfc/rfc2119.txt
1164
1165         [GNU-tar]
1166         GNU tar 1.35 manual
1167         https://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
1168           - Node 9.4.2 "Blocking Factor", paragraph that begins
1169             "gzip will complain about trailing garbage"
1170           - Note that this URL points to the latest version of the
1171             manual, and may some day not contain the note which is in
1172             1.35. For the exact version of the manual, download GNU
1173             tar 1.35: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.35.tar.gz
1174