6 Although PKWARE will attempt to supply current and accurate
7 information relating to its file formats, algorithms, and the
8 subject programs, the possibility of error can not be eliminated.
9 PKWARE therefore expressly disclaims any warranty that the
10 information contained in the associated materials relating to the
11 subject programs and/or the format of the files created or
12 accessed by the subject programs and/or the algorithms used by
13 the subject programs, or any other matter, is current, correct or
14 accurate as delivered. Any risk of damage due to any possible
15 inaccurate information is assumed by the user of the information.
16 Furthermore, the information relating to the subject programs
17 and/or the file formats created or accessed by the subject
18 programs and/or the algorithms used by the subject programs is
19 subject to change without notice.
21 General Format of a ZIP file
22 ----------------------------
24 Files stored in arbitrary order. Large zipfiles can span multiple
27 Overall zipfile format:
29 [local file header + file data + data_descriptor] . . .
30 [central directory] end of central directory record
35 local file header signature 4 bytes (0x04034b50)
36 version needed to extract 2 bytes
37 general purpose bit flag 2 bytes
38 compression method 2 bytes
39 last mod file time 2 bytes
40 last mod file date 2 bytes
42 compressed size 4 bytes
43 uncompressed size 4 bytes
44 filename length 2 bytes
45 extra field length 2 bytes
47 filename (variable size)
48 extra field (variable size)
53 compressed size 4 bytes
54 uncompressed size 4 bytes
56 This descriptor exists only if bit 3 of the general
57 purpose bit flag is set (see below). It is byte aligned
58 and immediately follows the last byte of compressed data.
59 This descriptor is used only when it was not possible to
60 seek in the output zip file, e.g., when the output zip file
61 was standard output or a non seekable device.
63 C. Central directory structure:
65 [file header] . . . end of central dir record
69 central file header signature 4 bytes (0x02014b50)
70 version made by 2 bytes
71 version needed to extract 2 bytes
72 general purpose bit flag 2 bytes
73 compression method 2 bytes
74 last mod file time 2 bytes
75 last mod file date 2 bytes
77 compressed size 4 bytes
78 uncompressed size 4 bytes
79 filename length 2 bytes
80 extra field length 2 bytes
81 file comment length 2 bytes
82 disk number start 2 bytes
83 internal file attributes 2 bytes
84 external file attributes 4 bytes
85 relative offset of local header 4 bytes
87 filename (variable size)
88 extra field (variable size)
89 file comment (variable size)
91 End of central dir record:
93 end of central dir signature 4 bytes (0x06054b50)
94 number of this disk 2 bytes
95 number of the disk with the
96 start of the central directory 2 bytes
97 total number of entries in
98 the central dir on this disk 2 bytes
99 total number of entries in
100 the central dir 2 bytes
101 size of the central directory 4 bytes
102 offset of start of central
103 directory with respect to
104 the starting disk number 4 bytes
105 zipfile comment length 2 bytes
106 zipfile comment (variable size)
108 D. Explanation of fields:
110 version made by (2 bytes)
112 The upper byte indicates the compatibility of the file
113 attribute information. If the external file attributes
114 are compatible with MS-DOS and can be read by PKZIP for
115 DOS version 2.04g then this value will be zero. If these
116 attributes are not compatible, then this value will
117 identify the host system on which the attributes are
118 compatible. Software can use this information to determine
119 the line record format for text files etc. The current
122 0 - MS-DOS and OS/2 (FAT / VFAT / FAT32 file systems)
123 1 - Amiga 2 - VAX/VMS
125 5 - Atari ST 6 - OS/2 H.P.F.S.
126 7 - Macintosh 8 - Z-System
127 9 - CP/M 10 - Windows NTFS
130 The lower byte indicates the version number of the
131 software used to encode the file. The value/10
132 indicates the major version number, and the value
133 mod 10 is the minor version number.
135 version needed to extract (2 bytes)
137 The minimum software version needed to extract the
138 file, mapped as above.
140 general purpose bit flag: (2 bytes)
142 Bit 0: If set, indicates that the file is encrypted.
144 (For Method 6 - Imploding)
145 Bit 1: If the compression method used was type 6,
146 Imploding, then this bit, if set, indicates
147 an 8K sliding dictionary was used. If clear,
148 then a 4K sliding dictionary was used.
149 Bit 2: If the compression method used was type 6,
150 Imploding, then this bit, if set, indicates
151 3 Shannon-Fano trees were used to encode the
152 sliding dictionary output. If clear, then 2
153 Shannon-Fano trees were used.
155 (For Method 8 - Deflating)
157 0 0 Normal (-en) compression option was used.
158 0 1 Maximum (-ex) compression option was used.
159 1 0 Fast (-ef) compression option was used.
160 1 1 Super Fast (-es) compression option was used.
162 Note: Bits 1 and 2 are undefined if the compression
165 Bit 3: If this bit is set, the fields crc-32, compressed
166 size and uncompressed size are set to zero in the
167 local header. The correct values are put in the
168 data descriptor immediately following the compressed
169 data. (Note: PKZIP version 2.04g for DOS only
170 recognizes this bit for method 8 compression, newer
171 versions of PKZIP recognize this bit for any
174 Bit 4: Reserved for use with method 8, for enhanced
177 Bit 5: If this bit is set, this indicates that the file is
178 compressed patched data. (Note: Requires PKZIP
179 version 2.70 or greater)
181 Bit 6: Currently unused.
183 Bit 7: Currently unused.
185 Bit 8: Currently unused.
187 Bit 9: Currently unused.
189 Bit 10: Currently unused.
191 Bit 11: Currently unused.
193 Bit 12: Reserved by PKWARE for enhanced compression.
195 Bit 13: Reserved by PKWARE.
197 Bit 14: Reserved by PKWARE.
199 Bit 15: Reserved by PKWARE.
201 compression method: (2 bytes)
203 (see accompanying documentation for algorithm
206 0 - The file is stored (no compression)
207 1 - The file is Shrunk
208 2 - The file is Reduced with compression factor 1
209 3 - The file is Reduced with compression factor 2
210 4 - The file is Reduced with compression factor 3
211 5 - The file is Reduced with compression factor 4
212 6 - The file is Imploded
213 7 - Reserved for Tokenizing compression algorithm
214 8 - The file is Deflated
215 9 - Reserved for enhanced Deflating
216 10 - PKWARE Date Compression Library Imploding
218 date and time fields: (2 bytes each)
220 The date and time are encoded in standard MS-DOS format.
221 If input came from standard input, the date and time are
222 those at which compression was started for this data.
226 The CRC-32 algorithm was generously contributed by
227 David Schwaderer and can be found in his excellent
228 book "C Programmers Guide to NetBIOS" published by
229 Howard W. Sams & Co. Inc. The 'magic number' for
230 the CRC is 0xdebb20e3. The proper CRC pre and post
231 conditioning is used, meaning that the CRC register
232 is pre-conditioned with all ones (a starting value
233 of 0xffffffff) and the value is post-conditioned by
234 taking the one's complement of the CRC residual.
235 If bit 3 of the general purpose flag is set, this
236 field is set to zero in the local header and the correct
237 value is put in the data descriptor and in the central
240 compressed size: (4 bytes)
241 uncompressed size: (4 bytes)
243 The size of the file compressed and uncompressed,
244 respectively. If bit 3 of the general purpose bit flag
245 is set, these fields are set to zero in the local header
246 and the correct values are put in the data descriptor and
247 in the central directory.
249 filename length: (2 bytes)
250 extra field length: (2 bytes)
251 file comment length: (2 bytes)
253 The length of the filename, extra field, and comment
254 fields respectively. The combined length of any
255 directory record and these three fields should not
256 generally exceed 65,535 bytes. If input came from standard
257 input, the filename length is set to zero.
259 disk number start: (2 bytes)
261 The number of the disk on which this file begins.
263 internal file attributes: (2 bytes)
265 The lowest bit of this field indicates, if set, that
266 the file is apparently an ASCII or text file. If not
267 set, that the file apparently contains binary data.
268 The remaining bits are unused in version 1.0.
270 Bits 1 and 2 are reserved for use by PKWARE.
272 external file attributes: (4 bytes)
274 The mapping of the external attributes is
275 host-system dependent (see 'version made by'). For
276 MS-DOS, the low order byte is the MS-DOS directory
277 attribute byte. If input came from standard input, this
278 field is set to zero.
280 relative offset of local header: (4 bytes)
282 This is the offset from the start of the first disk on
283 which this file appears, to where the local header should
288 The name of the file, with optional relative path.
289 The path stored should not contain a drive or
290 device letter, or a leading slash. All slashes
291 should be forward slashes '/' as opposed to
292 backwards slashes '\' for compatibility with Amiga
293 and Unix file systems etc. If input came from standard
294 input, there is no filename field.
296 extra field: (Variable)
298 This is for future expansion. If additional information
299 needs to be stored in the future, it should be stored
300 here. Earlier versions of the software can then safely
301 skip this file, and find the next file or header. This
302 field will be 0 length in version 1.0.
304 In order to allow different programs and different types
305 of information to be stored in the 'extra' field in .ZIP
306 files, the following structure should be used for all
307 programs storing data in this field:
309 header1+data1 + header2+data2 . . .
311 Each header should consist of:
316 Note: all fields stored in Intel low-byte/high-byte order.
318 The Header ID field indicates the type of data that is in
319 the following data block.
321 Header ID's of 0 thru 31 are reserved for use by PKWARE.
322 The remaining ID's can be used by third party vendors for
325 The current Header ID mappings defined by PKWARE are:
332 0x000f Patch Descriptor
334 Several third party mappings commonly used are:
336 0x4b46 FWKCS MD5 (see below)
339 0x4453 Windows NT security descriptor (binary ACL)
342 0x4c41 OS/2 access control list (text ACL)
343 0x4d49 Info-ZIP VMS (VAX or Alpha)
344 0x5455 extended timestamp
345 0x5855 Info-ZIP Unix (original, also OS/2, NT, etc)
348 0x7855 Info-ZIP Unix (new)
351 The Data Size field indicates the size of the following
352 data block. Programs can use this value to skip to the
353 next header block, passing over any data blocks that are
356 Note: As stated above, the size of the entire .ZIP file
357 header, including the filename, comment, and extra
358 field should not exceed 64K in size.
360 In case two different programs should appropriate the same
361 Header ID value, it is strongly recommended that each
362 program place a unique signature of at least two bytes in
363 size (and preferably 4 bytes or bigger) at the start of
364 each data area. Every program should verify that its
365 unique signature is present, in addition to the Header ID
366 value being correct, before assuming that it is a block of
371 The following is the layout of the OS/2 attributes "extra"
372 block. (Last Revision 09/05/95)
374 Note: all fields stored in Intel low-byte/high-byte order.
376 Value Size Description
377 ----- ---- -----------
378 (OS/2) 0x0009 2 bytes Tag for this "extra" block type
379 TSize 2 bytes Size for the following data block
380 BSize 4 bytes Uncompressed Block Size
381 CType 2 bytes Compression type
382 EACRC 4 bytes CRC value for uncompress block
383 (var) variable Compressed block
385 The OS/2 extended attribute structure (FEA2LIST) is
386 compressed and then stored in it's entirety within this
387 structure. There will only ever be one "block" of data in
392 The following is the layout of the Unix "extra" block.
393 Note: all fields are stored in Intel low-byte/high-byte
396 Value Size Description
397 ----- ---- -----------
398 (UNIX) 0x000d 2 bytes Tag for this "extra" block type
399 TSize 2 bytes Size for the following data block
400 Atime 4 bytes File last access time
401 Mtime 4 bytes File last modification time
402 Uid 2 bytes File user ID
403 Gid 2 bytes File group ID
404 (var) variable Variable length data field
406 The variable length data field will contain file type
407 specific data. Currently the only values allowed are
408 the original "linked to" file names for hard or symbolic
411 -VAX/VMS Extra Field:
413 The following is the layout of the VAX/VMS attributes
416 Note: all fields stored in Intel low-byte/high-byte order.
418 Value Size Description
419 ----- ---- -----------
420 (VMS) 0x000c 2 bytes Tag for this "extra" block type
421 TSize 2 bytes Size of the total "extra" block
422 CRC 4 bytes 32-bit CRC for remainder of the block
423 Tag1 2 bytes VMS attribute tag value #1
424 Size1 2 bytes Size of attribute #1, in bytes
425 (var.) Size1 Attribute #1 data
429 TagN 2 bytes VMS attribute tage value #N
430 SizeN 2 bytes Size of attribute #N, in bytes
431 (var.) SizeN Attribute #N data
435 1. There will be one or more of attributes present, which
436 will each be preceded by the above TagX & SizeX values.
437 These values are identical to the ATR$C_XXXX and
438 ATR$S_XXXX constants which are defined in ATR.H under
439 VMS C. Neither of these values will ever be zero.
441 2. No word alignment or padding is performed.
443 3. A well-behaved PKZIP/VMS program should never produce
444 more than one sub-block with the same TagX value. Also,
445 there will never be more than one "extra" block of type
446 0x000c in a particular directory record.
450 The following is the layout of the NTFS attributes
453 Note: all fields stored in Intel low-byte/high-byte order.
455 Value Size Description
456 ----- ---- -----------
457 (NTFS) 0x000a 2 bytes Tag for this "extra" block type
458 TSize 2 bytes Size of the total "extra" block
459 Reserved 4 bytes Reserved for future use
460 Tag1 2 bytes NTFS attribute tag value #1
461 Size1 2 bytes Size of attribute #1, in bytes
462 (var.) Size1 Attribute #1 data
466 TagN 2 bytes NTFS attribute tage value #N
467 SizeN 2 bytes Size of attribute #N, in bytes
468 (var.) SizeN Attribute #N data
470 For NTFS, values for Tag1 through TagN are as follows:
471 (currently only one set of attributes is defined for NTFS)
474 ----- ---- -----------
475 0x0001 2 bytes Tag for attribute #1
476 Size1 2 bytes Size of attribute #1, in bytes
477 Mtime 8 bytes File last modification time
478 Atime 8 bytes File last access time
479 Ctime 8 bytes File creation time
481 -PATCH Descriptor Extra Field:
483 The following is the layout of the Patch Descriptor "extra"
486 Note: all fields stored in Intel low-byte/high-byte order.
488 Value Size Description
489 ----- ---- -----------
490 (Patch) 0x000f 2 bytes Tag for this "extra" block type
491 TSize 2 bytes Size of the total "extra" block
492 Version 2 bytes Version of the descriptor
493 Flags 4 bytes Actions and reactions (see below)
494 OldSize 4 bytes Size of the file about to be patched
495 OldCRC 4 bytes 32-bit CRC of the file to be patched
496 NewSize 4 bytes Size of the resulting file
497 NewCRC 4 bytes 32-bit CRC of the resulting file
499 Actions and reactions
502 ---- ----------------
503 0 Use for autodetection
506 4-5 Action (see below)
508 8-9 Reaction (see below) to absent file
509 10-11 Reaction (see below) to newer file
510 12-13 Reaction (see below) to unknown file
532 - FWKCS MD5 Extra Field:
534 The FWKCS Contents_Signature System, used in
535 automatically identifying files independent of filename,
536 optionally adds and uses an extra field to support the
537 rapid creation of an enhanced contents_signature:
541 Preface = 'M','D','5'
542 followed by 16 bytes containing the uncompressed file's
543 128_bit MD5 hash(1), low byte first.
545 When FWKCS revises a zipfile central directory to add
546 this extra field for a file, it also replaces the
547 central directory entry for that file's uncompressed
548 filelength with a measured value.
550 FWKCS provides an option to strip this extra field, if
551 present, from a zipfile central directory. In adding
552 this extra field, FWKCS preserves Zipfile Authenticity
553 Verification; if stripping this extra field, FWKCS
554 preserves all versions of AV through PKZIP version 2.04g.
556 FWKCS, and FWKCS Contents_Signature System, are
557 trademarks of Frederick W. Kantor.
559 (1) R. Rivest, RFC1321.TXT, MIT Laboratory for Computer
560 Science and RSA Data Security, Inc., April 1992.
561 ll.76-77: "The MD5 algorithm is being placed in the
562 public domain for review and possible adoption as a
565 file comment: (Variable)
567 The comment for this file.
569 number of this disk: (2 bytes)
571 The number of this disk, which contains central
572 directory end record.
574 number of the disk with the start of the central
577 The number of the disk on which the central
580 total number of entries in the central dir on
583 The number of central directory entries on this disk.
585 total number of entries in the central dir: (2 bytes)
587 The total number of files in the zipfile.
589 size of the central directory: (4 bytes)
591 The size (in bytes) of the entire central directory.
593 offset of start of central directory with respect to
594 the starting disk number: (4 bytes)
596 Offset of the start of the central directory on the
597 disk on which the central directory starts.
599 zipfile comment length: (2 bytes)
601 The length of the comment for this zipfile.
603 zipfile comment: (Variable)
605 The comment for this zipfile.
609 1) All fields unless otherwise noted are unsigned and stored
610 in Intel low-byte:high-byte, low-word:high-word order.
612 2) String fields are not null terminated, since the
613 length is given explicitly.
615 3) Local headers should not span disk boundaries. Also, even
616 though the central directory can span disk boundaries, no
617 single record in the central directory should be split
620 4) The entries in the central directory may not necessarily
621 be in the same order that files appear in the zipfile.
623 UnShrinking - Method 1
624 ----------------------
626 Shrinking is a Dynamic Ziv-Lempel-Welch compression algorithm
627 with partial clearing. The initial code size is 9 bits, and
628 the maximum code size is 13 bits. Shrinking differs from
629 conventional Dynamic Ziv-Lempel-Welch implementations in several
632 1) The code size is controlled by the compressor, and is not
633 automatically increased when codes larger than the current
634 code size are created (but not necessarily used). When
635 the decompressor encounters the code sequence 256
636 (decimal) followed by 1, it should increase the code size
637 read from the input stream to the next bit size. No
638 blocking of the codes is performed, so the next code at
639 the increased size should be read from the input stream
640 immediately after where the previous code at the smaller
641 bit size was read. Again, the decompressor should not
642 increase the code size used until the sequence 256,1 is
645 2) When the table becomes full, total clearing is not
646 performed. Rather, when the compressor emits the code
647 sequence 256,2 (decimal), the decompressor should clear
648 all leaf nodes from the Ziv-Lempel tree, and continue to
649 use the current code size. The nodes that are cleared
650 from the Ziv-Lempel tree are then re-used, with the lowest
651 code value re-used first, and the highest code value
652 re-used last. The compressor can emit the sequence 256,2
655 Expanding - Methods 2-5
656 -----------------------
658 The Reducing algorithm is actually a combination of two
659 distinct algorithms. The first algorithm compresses repeated
660 byte sequences, and the second algorithm takes the compressed
661 stream from the first algorithm and applies a probabilistic
664 The probabilistic compression stores an array of 'follower
665 sets' S(j), for j=0 to 255, corresponding to each possible
666 ASCII character. Each set contains between 0 and 32
667 characters, to be denoted as S(j)[0],...,S(j)[m], where m<32.
668 The sets are stored at the beginning of the data area for a
669 Reduced file, in reverse order, with S(255) first, and S(0)
672 The sets are encoded as { N(j), S(j)[0],...,S(j)[N(j)-1] },
673 where N(j) is the size of set S(j). N(j) can be 0, in which
674 case the follower set for S(j) is empty. Each N(j) value is
675 encoded in 6 bits, followed by N(j) eight bit character values
676 corresponding to S(j)[0] to S(j)[N(j)-1] respectively. If
677 N(j) is 0, then no values for S(j) are stored, and the value
678 for N(j-1) immediately follows.
680 Immediately after the follower sets, is the compressed data
681 stream. The compressed data stream can be interpreted for the
682 probabilistic decompression as follows:
684 let Last-Character <- 0.
686 if the follower set S(Last-Character) is empty then
687 read 8 bits from the input stream, and copy this
688 value to the output stream.
689 otherwise if the follower set S(Last-Character) is non-empty then
690 read 1 bit from the input stream.
691 if this bit is not zero then
692 read 8 bits from the input stream, and copy this
693 value to the output stream.
694 otherwise if this bit is zero then
695 read B(N(Last-Character)) bits from the input
696 stream, and assign this value to I.
697 Copy the value of S(Last-Character)[I] to the
700 assign the last value placed on the output stream to
704 B(N(j)) is defined as the minimal number of bits required to
705 encode the value N(j)-1.
707 The decompressed stream from above can then be expanded to
708 re-create the original file as follows:
713 read 8 bits from the input stream into C.
715 0: if C is not equal to DLE (144 decimal) then
716 copy C to the output stream.
717 otherwise if C is equal to DLE then
720 1: if C is non-zero then
724 otherwise if C is zero then
725 copy the value 144 (decimal) to the output stream.
728 2: let Len <- Len + C
731 3: move backwards D(V,C) bytes in the output stream
732 (if this position is before the start of the output
733 stream, then assume that all the data before the
734 start of the output stream is filled with zeros).
735 copy Len+3 bytes from this position to the output stream.
740 The functions F,L, and D are dependent on the 'compression
741 factor', 1 through 4, and are defined as follows:
743 For compression factor 1:
744 L(X) equals the lower 7 bits of X.
745 F(X) equals 2 if X equals 127 otherwise F(X) equals 3.
746 D(X,Y) equals the (upper 1 bit of X) * 256 + Y + 1.
747 For compression factor 2:
748 L(X) equals the lower 6 bits of X.
749 F(X) equals 2 if X equals 63 otherwise F(X) equals 3.
750 D(X,Y) equals the (upper 2 bits of X) * 256 + Y + 1.
751 For compression factor 3:
752 L(X) equals the lower 5 bits of X.
753 F(X) equals 2 if X equals 31 otherwise F(X) equals 3.
754 D(X,Y) equals the (upper 3 bits of X) * 256 + Y + 1.
755 For compression factor 4:
756 L(X) equals the lower 4 bits of X.
757 F(X) equals 2 if X equals 15 otherwise F(X) equals 3.
758 D(X,Y) equals the (upper 4 bits of X) * 256 + Y + 1.
763 The Imploding algorithm is actually a combination of two distinct
764 algorithms. The first algorithm compresses repeated byte
765 sequences using a sliding dictionary. The second algorithm is
766 used to compress the encoding of the sliding dictionary output,
767 using multiple Shannon-Fano trees.
769 The Imploding algorithm can use a 4K or 8K sliding dictionary
770 size. The dictionary size used can be determined by bit 1 in the
771 general purpose flag word; a 0 bit indicates a 4K dictionary
772 while a 1 bit indicates an 8K dictionary.
774 The Shannon-Fano trees are stored at the start of the compressed
775 file. The number of trees stored is defined by bit 2 in the
776 general purpose flag word; a 0 bit indicates two trees stored, a
777 1 bit indicates three trees are stored. If 3 trees are stored,
778 the first Shannon-Fano tree represents the encoding of the
779 Literal characters, the second tree represents the encoding of
780 the Length information, the third represents the encoding of the
781 Distance information. When 2 Shannon-Fano trees are stored, the
782 Length tree is stored first, followed by the Distance tree.
784 The Literal Shannon-Fano tree, if present is used to represent
785 the entire ASCII character set, and contains 256 values. This
786 tree is used to compress any data not compressed by the sliding
787 dictionary algorithm. When this tree is present, the Minimum
788 Match Length for the sliding dictionary is 3. If this tree is
789 not present, the Minimum Match Length is 2.
791 The Length Shannon-Fano tree is used to compress the Length part
792 of the (length,distance) pairs from the sliding dictionary
793 output. The Length tree contains 64 values, ranging from the
794 Minimum Match Length, to 63 plus the Minimum Match Length.
796 The Distance Shannon-Fano tree is used to compress the Distance
797 part of the (length,distance) pairs from the sliding dictionary
798 output. The Distance tree contains 64 values, ranging from 0 to
799 63, representing the upper 6 bits of the distance value. The
800 distance values themselves will be between 0 and the sliding
801 dictionary size, either 4K or 8K.
803 The Shannon-Fano trees themselves are stored in a compressed
804 format. The first byte of the tree data represents the number of
805 bytes of data representing the (compressed) Shannon-Fano tree
806 minus 1. The remaining bytes represent the Shannon-Fano tree
809 High 4 bits: Number of values at this bit length + 1. (1 - 16)
810 Low 4 bits: Bit Length needed to represent value + 1. (1 - 16)
812 The Shannon-Fano codes can be constructed from the bit lengths
813 using the following algorithm:
815 1) Sort the Bit Lengths in ascending order, while retaining the
816 order of the original lengths stored in the file.
818 2) Generate the Shannon-Fano trees:
823 i <- number of Shannon-Fano codes - 1 (either 255 or 63)
826 Code = Code + CodeIncrement
827 if BitLength(i) <> LastBitLength then
828 LastBitLength=BitLength(i)
829 CodeIncrement = 1 shifted left (16 - LastBitLength)
830 ShannonCode(i) = Code
834 3) Reverse the order of all the bits in the above ShannonCode()
835 vector, so that the most significant bit becomes the least
836 significant bit. For example, the value 0x1234 (hex) would
839 4) Restore the order of Shannon-Fano codes as originally stored
844 This example will show the encoding of a Shannon-Fano tree
845 of size 8. Notice that the actual Shannon-Fano trees used
846 for Imploding are either 64 or 256 entries in size.
848 Example: 0x02, 0x42, 0x01, 0x13
850 The first byte indicates 3 values in this table. Decoding the
852 0x42 = 5 codes of 3 bits long
853 0x01 = 1 code of 2 bits long
854 0x13 = 2 codes of 4 bits long
856 This would generate the original bit length array of:
857 (3, 3, 3, 3, 3, 2, 4, 4)
859 There are 8 codes in this table for the values 0 thru 7. Using
860 the algorithm to obtain the Shannon-Fano codes produces:
862 Reversed Order Original
863 Val Sorted Constructed Code Value Restored Length
864 --- ------ ----------------- -------- -------- ------
865 0: 2 1100000000000000 11 101 3
866 1: 3 1010000000000000 101 001 3
867 2: 3 1000000000000000 001 110 3
868 3: 3 0110000000000000 110 010 3
869 4: 3 0100000000000000 010 100 3
870 5: 3 0010000000000000 100 11 2
871 6: 4 0001000000000000 1000 1000 4
872 7: 4 0000000000000000 0000 0000 4
874 The values in the Val, Order Restored and Original Length columns
875 now represent the Shannon-Fano encoding tree that can be used for
876 decoding the Shannon-Fano encoded data. How to parse the
877 variable length Shannon-Fano values from the data stream is beyond
878 the scope of this document. (See the references listed at the end of
879 this document for more information.) However, traditional decoding
880 schemes used for Huffman variable length decoding, such as the
881 Greenlaw algorithm, can be successfully applied.
883 The compressed data stream begins immediately after the
884 compressed Shannon-Fano data. The compressed data stream can be
885 interpreted as follows:
888 read 1 bit from input stream.
890 if this bit is non-zero then (encoded data is literal data)
891 if Literal Shannon-Fano tree is present
892 read and decode character using Literal Shannon-Fano tree.
894 read 8 bits from input stream.
895 copy character to the output stream.
896 otherwise (encoded data is sliding dictionary match)
897 if 8K dictionary size
898 read 7 bits for offset Distance (lower 7 bits of offset).
900 read 6 bits for offset Distance (lower 6 bits of offset).
902 using the Distance Shannon-Fano tree, read and decode the
903 upper 6 bits of the Distance value.
905 using the Length Shannon-Fano tree, read and decode
908 Length <- Length + Minimum Match Length
910 if Length = 63 + Minimum Match Length
911 read 8 bits from the input stream,
912 add this value to Length.
914 move backwards Distance+1 bytes in the output stream, and
915 copy Length characters from this position to the output
916 stream. (if this position is before the start of the output
917 stream, then assume that all the data before the start of
918 the output stream is filled with zeros).
921 Tokenizing - Method 7
924 This method is not used by PKZIP.
929 The Deflate algorithm is similar to the Implode algorithm using
930 a sliding dictionary of up to 32K with secondary compression
931 from Huffman/Shannon-Fano codes.
933 The compressed data is stored in blocks with a header describing
934 the block and the Huffman codes used in the data block. The header
935 format is as follows:
937 Bit 0: Last Block bit This bit is set to 1 if this is the last
938 compressed block in the data.
940 00 (0) - Block is stored - All stored data is byte aligned.
941 Skip bits until next byte, then next word = block
942 length, followed by the ones compliment of the block
943 length word. Remaining data in block is the stored
946 01 (1) - Use fixed Huffman codes for literal and distance codes.
947 Lit Code Bits Dist Code Bits
948 --------- ---- --------- ----
954 Literal codes 286-287 and distance codes 30-31 are
955 never used but participate in the huffman construction.
957 10 (2) - Dynamic Huffman codes. (See expanding Huffman codes)
959 11 (3) - Reserved - Flag a "Error in compressed data" if seen.
961 Expanding Huffman Codes
962 -----------------------
963 If the data block is stored with dynamic Huffman codes, the Huffman
964 codes are sent in the following compressed format:
966 5 Bits: # of Literal codes sent - 256 (256 - 286)
967 All other codes are never sent.
968 5 Bits: # of Dist codes - 1 (1 - 32)
969 4 Bits: # of Bit Length codes - 3 (3 - 19)
971 The Huffman codes are sent as bit lengths and the codes are built as
972 described in the implode algorithm. The bit lengths themselves are
973 compressed with Huffman codes. There are 19 bit length codes:
975 0 - 15: Represent bit lengths of 0 - 15
976 16: Copy the previous bit length 3 - 6 times.
977 The next 2 bits indicate repeat length (0 = 3, ... ,3 = 6)
978 Example: Codes 8, 16 (+2 bits 11), 16 (+2 bits 10) will
979 expand to 12 bit lengths of 8 (1 + 6 + 5)
980 17: Repeat a bit length of 0 for 3 - 10 times. (3 bits of length)
981 18: Repeat a bit length of 0 for 11 - 138 times (7 bits of length)
983 The lengths of the bit length codes are sent packed 3 bits per value
984 (0 - 7) in the following order:
986 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
988 The Huffman codes should be built as described in the Implode algorithm
989 except codes are assigned starting at the shortest bit length, i.e. the
990 shortest code should be all 0's rather than all 1's. Also, codes with
991 a bit length of zero do not participate in the tree construction. The
992 codes are then used to decode the bit lengths for the literal and
995 The bit lengths for the literal tables are sent first with the number
996 of entries sent described by the 5 bits sent earlier. There are up
997 to 286 literal characters; the first 256 represent the respective 8
998 bit character, code 256 represents the End-Of-Block code, the remaining
999 29 codes represent copy lengths of 3 thru 258. There are up to 30
1000 distance codes representing distances from 1 thru 32k as described
1005 Extra Extra Extra Extra
1006 Code Bits Length Code Bits Lengths Code Bits Lengths Code Bits Length(s)
1007 ---- ---- ------ ---- ---- ------- ---- ---- ------- ---- ---- ---------
1008 257 0 3 265 1 11,12 273 3 35-42 281 5 131-162
1009 258 0 4 266 1 13,14 274 3 43-50 282 5 163-194
1010 259 0 5 267 1 15,16 275 3 51-58 283 5 195-226
1011 260 0 6 268 1 17,18 276 3 59-66 284 5 227-257
1012 261 0 7 269 2 19-22 277 4 67-82 285 0 258
1013 262 0 8 270 2 23-26 278 4 83-98
1014 263 0 9 271 2 27-30 279 4 99-114
1015 264 0 10 272 2 31-34 280 4 115-130
1019 Extra Extra Extra Extra
1020 Code Bits Dist Code Bits Dist Code Bits Distance Code Bits Distance
1021 ---- ---- ---- ---- ---- ------ ---- ---- -------- ---- ---- --------
1022 0 0 1 8 3 17-24 16 7 257-384 24 11 4097-6144
1023 1 0 2 9 3 25-32 17 7 385-512 25 11 6145-8192
1024 2 0 3 10 4 33-48 18 8 513-768 26 12 8193-12288
1025 3 0 4 11 4 49-64 19 8 769-1024 27 12 12289-16384
1026 4 1 5,6 12 5 65-96 20 9 1025-1536 28 13 16385-24576
1027 5 1 7,8 13 5 97-128 21 9 1537-2048 29 13 24577-32768
1028 6 2 9-12 14 6 129-192 22 10 2049-3072
1029 7 2 13-16 15 6 193-256 23 10 3073-4096
1031 The compressed data stream begins immediately after the
1032 compressed header data. The compressed data stream can be
1033 interpreted as follows:
1036 read header from input stream.
1039 skip bits until byte aligned
1040 read count and 1's compliment of count
1041 copy count bytes data block
1043 loop until end of block code sent
1044 decode literal character from input stream
1046 copy character to the output stream
1048 if literal = end of block
1051 decode distance from input stream
1053 move backwards distance bytes in the output stream, and
1054 copy length characters from this position to the output
1057 while not last block
1059 if data descriptor exists
1060 skip bits until byte aligned
1067 The encryption used in PKZIP was generously supplied by Roger
1068 Schlafly. PKWARE is grateful to Mr. Schlafly for his expert
1069 help and advice in the field of data encryption.
1071 PKZIP encrypts the compressed data stream. Encrypted files must
1072 be decrypted before they can be extracted.
1074 Each encrypted file has an extra 12 bytes stored at the start of
1075 the data area defining the encryption header for that file. The
1076 encryption header is originally set to random values, and then
1077 itself encrypted, using three, 32-bit keys. The key values are
1078 initialized using the supplied encryption password. After each byte
1079 is encrypted, the keys are then updated using pseudo-random number
1080 generation techniques in combination with the same CRC-32 algorithm
1081 used in PKZIP and described elsewhere in this document.
1083 The following is the basic steps required to decrypt a file:
1085 1) Initialize the three 32-bit keys with the password.
1086 2) Read and decrypt the 12-byte encryption header, further
1087 initializing the encryption keys.
1088 3) Read and decrypt the compressed data stream using the
1091 Step 1 - Initializing the encryption keys
1092 -----------------------------------------
1098 loop for i <- 0 to length(password)-1
1099 update_keys(password(i))
1102 Where update_keys() is defined as:
1105 Key(0) <- crc32(key(0),char)
1106 Key(1) <- Key(1) + (Key(0) & 000000ffH)
1107 Key(1) <- Key(1) * 134775813 + 1
1108 Key(2) <- crc32(key(2),key(1) >> 24)
1111 Where crc32(old_crc,char) is a routine that given a CRC value and a
1112 character, returns an updated CRC value after applying the CRC-32
1113 algorithm described elsewhere in this document.
1115 Step 2 - Decrypting the encryption header
1116 -----------------------------------------
1118 The purpose of this step is to further initialize the encryption
1119 keys, based on random data, to render a plaintext attack on the
1122 Read the 12-byte encryption header into Buffer, in locations
1123 Buffer(0) thru Buffer(11).
1125 loop for i <- 0 to 11
1126 C <- buffer(i) ^ decrypt_byte()
1131 Where decrypt_byte() is defined as:
1133 unsigned char decrypt_byte()
1134 local unsigned short temp
1136 decrypt_byte <- (temp * (temp ^ 1)) >> 8
1139 After the header is decrypted, the last 1 or 2 bytes in Buffer
1140 should be the high-order word/byte of the CRC for the file being
1141 decrypted, stored in Intel low-byte/high-byte order. Versions of
1142 PKZIP prior to 2.0 used a 2 byte CRC check; a 1 byte CRC check is
1143 used on versions after 2.0. This can be used to test if the password
1144 supplied is correct or not.
1146 Step 3 - Decrypting the compressed data stream
1147 ----------------------------------------------
1149 The compressed data stream can be decrypted as follows:
1152 read a character into C
1153 Temp <- C ^ decrypt_byte()
1158 In addition to the above mentioned contributors to PKZIP and PKUNZIP,
1159 I would like to extend special thanks to Robert Mahoney for suggesting
1160 the extension .ZIP for this software.
1164 Fiala, Edward R., and Greene, Daniel H., "Data compression with
1165 finite windows", Communications of the ACM, Volume 32, Number 4,
1166 April 1989, pages 490-505.
1168 Held, Gilbert, "Data Compression, Techniques and Applications,
1169 Hardware and Software Considerations", John Wiley & Sons, 1987.
1171 Huffman, D.A., "A method for the construction of minimum-redundancy
1172 codes", Proceedings of the IRE, Volume 40, Number 9, September 1952,
1175 Nelson, Mark, "LZW Data Compression", Dr. Dobbs Journal, Volume 14,
1176 Number 10, October 1989, pages 29-37.
1178 Nelson, Mark, "The Data Compression Book", M&T Books, 1991.
1180 Storer, James A., "Data Compression, Methods and Theory",
1181 Computer Science Press, 1988
1183 Welch, Terry, "A Technique for High-Performance Data Compression",
1184 IEEE Computer, Volume 17, Number 6, June 1984, pages 8-19.
1186 Ziv, J. and Lempel, A., "A universal algorithm for sequential data
1187 compression", Communications of the ACM, Volume 30, Number 6,
1188 June 1987, pages 520-540.
1190 Ziv, J. and Lempel, A., "Compression of individual sequences via
1191 variable-rate coding", IEEE Transactions on Information Theory,
1192 Volume 24, Number 5, September 1978, pages 530-536.