1 File: APPNOTE.TXT - .ZIP File Format Specification
\r
2 Version: 5.2 - NOTIFICATION OF CHANGE
\r
4 Copyright (c) 1989 - 2003 PKWARE Inc., All Rights Reserved.
\r
10 Although PKWARE will attempt to supply current and accurate
\r
11 information relating to its file formats, algorithms, and the
\r
12 subject programs, the possibility of error or omission can not
\r
13 be eliminated. PKWARE therefore expressly disclaims any warranty
\r
14 that the information contained in the associated materials relating
\r
15 to the subject programs and/or the format of the files created or
\r
16 accessed by the subject programs and/or the algorithms used by
\r
17 the subject programs, or any other matter, is current, correct or
\r
18 accurate as delivered. Any risk of damage due to any possible
\r
19 inaccurate information is assumed by the user of the information.
\r
20 Furthermore, the information relating to the subject programs
\r
21 and/or the file formats created or accessed by the subject
\r
22 programs and/or the algorithms used by the subject programs is
\r
23 subject to change without notice.
\r
25 If the version of this file is marked as a NOTIFICATION OF CHANGE,
\r
26 the content defines an Early Feature Specification (EFS) change
\r
27 to the .ZIP file format that may be subject to modification prior
\r
28 to publication of the Final Feature Specification (FFS). This
\r
29 document may also contain information on Planned Feature
\r
30 Specifications (PFS) defining recognized future extensions.
\r
33 General Format of a .ZIP file
\r
34 -----------------------------
\r
36 Files stored in arbitrary order. Large .ZIP files can span multiple
\r
37 diskette media or be split into user-defined segment sizes.
\r
39 Overall .ZIP file format:
\r
41 [local file header 1]
\r
47 [local file header n]
\r
51 [zip64 end of central directory record]
\r
52 [zip64 end of central directory locator]
\r
53 [end of central directory record]
\r
56 A. Local file header:
\r
58 local file header signature 4 bytes (0x04034b50)
\r
59 version needed to extract 2 bytes
\r
60 general purpose bit flag 2 bytes
\r
61 compression method 2 bytes
\r
62 last mod file time 2 bytes
\r
63 last mod file date 2 bytes
64 crc-32 4 bytes
\r
65 compressed size 4 bytes
\r
66 uncompressed size 4 bytes
\r
67 file name length 2 bytes
\r
68 extra field length 2 bytes
\r
70 file name (variable size)
\r
71 extra field (variable size)
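As an illustration only, the fixed portion of the local file header can be unpacked with the following Python sketch. It assumes the 4-byte crc-32 field sits between the date and compressed size fields, as the field explanations in section H imply. The names parse_local_header, LOCAL_HEADER and the dictionary keys are hypothetical and not part of this specification.

```python
import struct

# Fixed-size part of the local file header, low byte first:
# signature, version needed, flags, method, mod time, mod date,
# crc-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")
LOCAL_SIGNATURE = 0x04034B50


def parse_local_header(buf, offset=0):
    """Return the local header fields and the offset of the file data."""
    (sig, version, flags, method, mtime, mdate, crc32,
     csize, usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(buf, offset)
    if sig != LOCAL_SIGNATURE:
        raise ValueError("not a local file header")
    name_start = offset + LOCAL_HEADER.size
    extra_start = name_start + name_len
    data_start = extra_start + extra_len
    return {
        "version_needed": version,
        "flags": flags,
        "method": method,
        "mod_time": mtime,
        "mod_date": mdate,
        "crc32": crc32,
        "compressed_size": csize,
        "uncompressed_size": usize,
        "file_name": bytes(buf[name_start:extra_start]),
        "extra_field": bytes(buf[extra_start:data_start]),
        "data_offset": data_start,
    }
```

The returned data_offset points at the first byte of the compressed or stored data that follows the header.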
\r
75 Immediately following the local header for a file
\r
76 is the compressed or stored data for the file.
\r
77 The series of [local file header][file data][data
\r
78 descriptor] repeats for each file in the .ZIP archive.
80 C. Data descriptor:
82 crc-32 4 bytes
\r
83 compressed size 4 bytes
\r
84 uncompressed size 4 bytes
\r
86 This descriptor exists only if bit 3 of the general
\r
87 purpose bit flag is set (see below). It is byte aligned
\r
88 and immediately follows the last byte of compressed data.
\r
89 This descriptor is used only when it was not possible to
\r
90 seek in the output .ZIP file, e.g., when the output .ZIP file
\r
91 was standard output or a non-seekable device. For Zip64 format
\r
92 archives, the compressed and uncompressed sizes are 8 bytes each.
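A minimal Python sketch of reading the data descriptor follows. It assumes the descriptor begins with the 4-byte crc-32 (as the description of bit 3 states), followed by the two size fields, which are 4 bytes each or 8 bytes each for Zip64 archives. The function name and the zip64 parameter are hypothetical.

```python
import struct


def parse_data_descriptor(buf, offset, zip64=False):
    """Return (crc32, compressed size, uncompressed size, next offset)."""
    # Sizes widen from 4 to 8 bytes for Zip64 format archives.
    fmt = "<IQQ" if zip64 else "<III"
    crc32, csize, usize = struct.unpack_from(fmt, buf, offset)
    return crc32, csize, usize, offset + struct.calcsize(fmt)
```

The caller is expected to pass the offset of the byte that immediately follows the last byte of compressed data.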
\r
94 D. Central directory structure:
\r
101 [digital signature]
\r
105 central file header signature 4 bytes (0x02014b50)
\r
106 version made by 2 bytes
\r
107 version needed to extract 2 bytes
\r
108 general purpose bit flag 2 bytes
\r
109 compression method 2 bytes
\r
110 last mod file time 2 bytes
\r
111 last mod file date 2 bytes
112 crc-32 4 bytes
\r
113 compressed size 4 bytes
\r
114 uncompressed size 4 bytes
\r
115 file name length 2 bytes
\r
116 extra field length 2 bytes
\r
117 file comment length 2 bytes
\r
118 disk number start 2 bytes
\r
119 internal file attributes 2 bytes
\r
120 external file attributes 4 bytes
\r
121 relative offset of local header 4 bytes
\r
123 file name (variable size)
\r
124 extra field (variable size)
\r
125 file comment (variable size)
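The central file header can be walked in the same manner as the local header. The Python sketch below is illustrative only: it assumes a 4-byte crc-32 field precedes the compressed size (as the field explanations describe), that all entries lie in one contiguous buffer, and that the caller already knows the starting offset and entry count. The names used are hypothetical.

```python
import struct

# signature, version made by, version needed, flags, method, time, date,
# crc-32, compressed size, uncompressed size,
# name / extra / comment lengths, disk number start, internal attributes,
# external attributes, relative offset of local header.
CENTRAL_HEADER = struct.Struct("<I6H3I5H2I")
CENTRAL_SIGNATURE = 0x02014B50


def iter_central_directory(buf, offset, count):
    """Yield (file name, local header offset, next offset) per entry."""
    for _ in range(count):
        fields = CENTRAL_HEADER.unpack_from(buf, offset)
        if fields[0] != CENTRAL_SIGNATURE:
            raise ValueError("not a central file header")
        name_len, extra_len, comment_len = fields[10:13]
        name_start = offset + CENTRAL_HEADER.size
        name = bytes(buf[name_start:name_start + name_len])
        # Skip the three variable-size fields to reach the next header.
        offset = name_start + name_len + extra_len + comment_len
        yield name, fields[16], offset
```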
\r
129 header signature 4 bytes (0x05054b50)
\r
130 size of data 2 bytes
\r
131 signature data (variable size)
\r
133 E. Zip64 end of central directory record
\r
135 zip64 end of central dir
\r
136 signature 4 bytes (0x06064b50)
\r
137 size of zip64 end of central
\r
138 directory record 8 bytes
\r
139 version made by 2 bytes
\r
140 version needed to extract 2 bytes
\r
141 number of this disk 4 bytes
\r
142 number of the disk with the
\r
143 start of the central directory 4 bytes
\r
144 total number of entries in the
\r
145 central directory on this disk 8 bytes
\r
146 total number of entries in the
\r
147 central directory 8 bytes
\r
148 size of the central directory 8 bytes
\r
149 offset of start of central
\r
150 directory with respect to
\r
151 the starting disk number 8 bytes
\r
152 zip64 extensible data sector (variable size)
\r
154 F. Zip64 end of central directory locator
\r
156 zip64 end of central dir locator
\r
157 signature 4 bytes (0x07064b50)
\r
158 number of the disk with the
\r
159 start of the zip64 end of
\r
160 central directory 4 bytes
\r
161 relative offset of the zip64
\r
162 end of central directory record 8 bytes
\r
163 total number of disks 4 bytes
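The following Python sketch shows one way a reader might follow the locator to the Zip64 end of central directory record. It assumes a single-segment archive held entirely in memory, so that the relative offset can be used directly as an index, and it assumes the locator immediately precedes the end of central directory record, as the overall .ZIP file format layout suggests. All names are hypothetical.

```python
import struct

ZIP64_LOCATOR = struct.Struct("<IIQI")      # signature, disk, offset, disks
ZIP64_EOCD = struct.Struct("<IQHHIIQQQQ")   # fixed part, before data sector


def read_zip64_end_of_central_directory(data, eocd_pos):
    """Return Zip64 totals, or None if no locator precedes eocd_pos."""
    loc_pos = eocd_pos - ZIP64_LOCATOR.size
    if loc_pos < 0:
        return None
    sig, _disk, record_pos, _disks = ZIP64_LOCATOR.unpack_from(data, loc_pos)
    if sig != 0x07064B50:
        return None
    fields = ZIP64_EOCD.unpack_from(data, record_pos)
    if fields[0] != 0x06064B50:
        raise ValueError("locator does not point at a Zip64 record")
    return {
        "total_entries": fields[7],
        "cd_size": fields[8],
        "cd_offset": fields[9],
    }
```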
\r
165 G. End of central directory record:
\r
167 end of central dir signature 4 bytes (0x06054b50)
\r
168 number of this disk 2 bytes
\r
169 number of the disk with the
\r
170 start of the central directory 2 bytes
\r
171 total number of entries in the
\r
172 central directory on this disk 2 bytes
\r
173 total number of entries in
\r
174 the central directory 2 bytes
\r
175 size of the central directory 4 bytes
\r
176 offset of start of central
\r
177 directory with respect to
\r
178 the starting disk number 4 bytes
\r
179 .ZIP file comment length 2 bytes
\r
180 .ZIP file comment (variable size)
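Because the end of central directory record is the last structure in the file and its comment is at most 65,535 bytes long, a reader can locate it by searching backwards from the end of the file. The Python sketch below is one possible approach and not a required algorithm. It assumes the whole archive is in memory, and the function and key names are hypothetical.

```python
import struct

# signature, this disk, disk with start of central directory,
# entries on this disk, total entries, size, offset, comment length.
EOCD = struct.Struct("<IHHHHIIH")
EOCD_SIGNATURE = b"PK\x05\x06"  # 0x06054b50 stored low byte first


def find_end_of_central_directory(data):
    """Return (position, fields) of the end of central directory record."""
    start = max(0, len(data) - (EOCD.size + 0xFFFF))
    pos = data.rfind(EOCD_SIGNATURE, start)
    while pos != -1:
        if pos + EOCD.size <= len(data):
            fields = EOCD.unpack_from(data, pos)
            # Accept only a record whose comment ends exactly at end of file.
            if pos + EOCD.size + fields[7] == len(data):
                return pos, {
                    "total_entries": fields[4],
                    "cd_size": fields[5],
                    "cd_offset": fields[6],
                    "comment": bytes(data[pos + EOCD.size:]),
                }
        # Keep looking for an earlier candidate signature.
        pos = data.rfind(EOCD_SIGNATURE, start, pos + len(EOCD_SIGNATURE) - 1)
    raise ValueError("end of central directory record not found")
```

Checking that the comment length matches the remaining bytes guards against a signature that merely happens to occur inside the comment.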
\r
182 H. Explanation of fields:
\r
184 version made by (2 bytes)
\r
186 The upper byte indicates the compatibility of the file
\r
187 attribute information. If the external file attributes
\r
188 are compatible with MS-DOS and can be read by PKZIP for
\r
189 DOS version 2.04g then this value will be zero. If these
\r
190 attributes are not compatible, then this value will
\r
191 identify the host system on which the attributes are
\r
192 compatible. Software can use this information to determine
\r
193 the line record format for text files etc. The current mappings are:
\r
196 0 - MS-DOS and OS/2 (FAT / VFAT / FAT32 file systems)
\r
197 1 - Amiga 2 - OpenVMS
\r
198 3 - Unix 4 - VM/CMS
\r
199 5 - Atari ST 6 - OS/2 H.P.F.S.
\r
200 7 - Macintosh 8 - Z-System
\r
201 9 - CP/M 10 - Windows NTFS
\r
202 11 - MVS (OS/390 - Z/OS) 12 - VSE
\r
203 13 - Acorn Risc 14 - VFAT
\r
204 15 - alternate MVS 16 - BeOS
\r
205 17 - Tandem 18 - OS/400
\r
206 19 thru 255 - unused
\r
208 The lower byte indicates the version number of the
\r
209 software used to encode the file. The value/10
\r
210 indicates the major version number, and the value
\r
211 mod 10 is the minor version number.
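The split of the "version made by" value into host system and version number can be sketched in Python as follows. The table reproduces the mappings listed above; the function name is hypothetical.

```python
HOST_SYSTEMS = {
    0: "MS-DOS and OS/2 (FAT / VFAT / FAT32 file systems)",
    1: "Amiga", 2: "OpenVMS", 3: "Unix", 4: "VM/CMS", 5: "Atari ST",
    6: "OS/2 H.P.F.S.", 7: "Macintosh", 8: "Z-System", 9: "CP/M",
    10: "Windows NTFS", 11: "MVS (OS/390 - Z/OS)", 12: "VSE",
    13: "Acorn Risc", 14: "VFAT", 15: "alternate MVS", 16: "BeOS",
    17: "Tandem", 18: "OS/400",
}


def decode_version_made_by(value):
    """Return (host system name, major version, minor version)."""
    host = (value >> 8) & 0xFF      # upper byte: attribute compatibility
    version = value & 0xFF          # lower byte: encoding software version
    return HOST_SYSTEMS.get(host, "unused"), version // 10, version % 10
```

For example, a value of 0x0314 decodes to host system 3 (Unix) and version 2.0.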
\r
213 version needed to extract (2 bytes)
\r
215 The minimum software version needed to extract the
\r
216 file, mapped as above. For Zip64 format archives,
\r
217 this value should not be less than 45.
\r
219 general purpose bit flag: (2 bytes)
\r
221 Bit 0: If set, indicates that the file is encrypted.
\r
223 (For Method 6 - Imploding)
\r
224 Bit 1: If the compression method used was type 6,
\r
225 Imploding, then this bit, if set, indicates
\r
226 an 8K sliding dictionary was used. If clear,
\r
227 then a 4K sliding dictionary was used.
\r
228 Bit 2: If the compression method used was type 6,
\r
229 Imploding, then this bit, if set, indicates
\r
230 3 Shannon-Fano trees were used to encode the
\r
231 sliding dictionary output. If clear, then 2
\r
232 Shannon-Fano trees were used.
\r
234 (For Methods 8 and 9 - Deflating)
\r
236 0 0 Normal (-en) compression option was used.
\r
237 0 1 Maximum (-exx/-ex) compression option was used.
\r
238 1 0 Fast (-ef) compression option was used.
\r
239 1 1 Super Fast (-es) compression option was used.
\r
241 Note: Bits 1 and 2 are undefined if the compression
\r
242 method is any other.
\r
244 Bit 3: If this bit is set, the fields crc-32, compressed
\r
245 size and uncompressed size are set to zero in the
\r
246 local header. The correct values are put in the
\r
247 data descriptor immediately following the compressed
\r
248 data. (Note: PKZIP version 2.04g for DOS only
\r
249 recognizes this bit for method 8 compression, newer
\r
250 versions of PKZIP recognize this bit for any
\r
251 compression method.)
\r
253 Bit 4: Reserved for use with method 8, for enhanced
\r
256 Bit 5: If this bit is set, this indicates that the file is
\r
257 compressed patched data. (Note: Requires PKZIP
\r
258 version 2.70 or greater)
\r
260 Bit 6: Strong encryption. If this bit is set, you should
\r
261 set the version needed to extract value to at least
\r
262 50 and you must also set bit 0. If AES encryption
\r
263 is used, the version needed to extract value must
\r
266 Bit 7: Currently unused.
\r
268 Bit 8: Currently unused.
\r
270 Bit 9: Currently unused.
\r
272 Bit 10: Currently unused.
\r
274 Bit 11: Currently unused.
\r
276 Bit 12: Reserved by PKWARE for enhanced compression.
\r
278 Bit 13: Reserved by PKWARE.
\r
280 Bit 14: Reserved by PKWARE.
\r
282 Bit 15: Reserved by PKWARE.
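The bit assignments above can be tested with simple masks, as in the following Python sketch. It assumes the two columns of the Deflating option table are Bit 2 and Bit 1, in that order; the function and key names are hypothetical.

```python
# Index is (bit 2, bit 1) read as a two-bit number.
DEFLATE_OPTIONS = {0: "normal", 1: "maximum", 2: "fast", 3: "super fast"}


def decode_general_purpose_flags(flags, method):
    """Interpret the general purpose bit flag for a given method."""
    info = {
        "encrypted": bool(flags & 0x0001),          # bit 0
        "data_descriptor": bool(flags & 0x0008),    # bit 3
        "patched_data": bool(flags & 0x0020),       # bit 5
        "strong_encryption": bool(flags & 0x0040),  # bit 6
    }
    if method == 6:
        # Imploding: bit 1 selects dictionary size, bit 2 the tree count.
        info["dictionary_size"] = 8192 if flags & 0x0002 else 4096
        info["shannon_fano_trees"] = 3 if flags & 0x0004 else 2
    elif method in (8, 9):
        # Deflating: bits 1 and 2 record the compression option used.
        info["deflate_option"] = DEFLATE_OPTIONS[(flags >> 1) & 0x3]
    return info
```

Bits 1 and 2 are left uninterpreted for every other method, since they are undefined there.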
\r
284 compression method: (2 bytes)
\r
286 (see accompanying documentation for algorithm
\r
289 0 - The file is stored (no compression)
\r
290 1 - The file is Shrunk
\r
291 2 - The file is Reduced with compression factor 1
\r
292 3 - The file is Reduced with compression factor 2
\r
293 4 - The file is Reduced with compression factor 3
\r
294 5 - The file is Reduced with compression factor 4
\r
295 6 - The file is Imploded
\r
296 7 - Reserved for Tokenizing compression algorithm
\r
297 8 - The file is Deflated
\r
298 9 - Enhanced Deflating using Deflate64(tm)
\r
299 10 - PKWARE Data Compression Library Imploding
\r
300 11 - Reserved by PKWARE
\r
301 12 - File is compressed using BZIP2 algorithm
\r
303 date and time fields: (2 bytes each)
\r
305 The date and time are encoded in standard MS-DOS format.
\r
306 If input came from standard input, the date and time are
\r
307 those at which compression was started for this data.
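This document does not spell out the MS-DOS encoding. The Python sketch below assumes the conventional layout: the date holds the year since 1980 in bits 9-15, the month in bits 5-8 and the day in bits 0-4; the time holds the hour in bits 11-15, the minute in bits 5-10 and the seconds divided by two in bits 0-4. The function name is hypothetical.

```python
import datetime


def dos_to_datetime(dos_date, dos_time):
    """Convert MS-DOS date and time words to a naive datetime."""
    year = ((dos_date >> 9) & 0x7F) + 1980
    month = (dos_date >> 5) & 0x0F
    day = dos_date & 0x1F
    hour = (dos_time >> 11) & 0x1F
    minute = (dos_time >> 5) & 0x3F
    second = (dos_time & 0x1F) * 2   # stored with two-second resolution
    return datetime.datetime(year, month, day, hour, minute, second)
```

The format carries no time zone, and datetime raises ValueError for out-of-range fields such as a month of zero.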
\r
311 The CRC-32 algorithm was generously contributed by
\r
312 David Schwaderer and can be found in his excellent
\r
313 book "C Programmers Guide to NetBIOS" published by
\r
314 Howard W. Sams & Co. Inc. The 'magic number' for
\r
315 the CRC is 0xdebb20e3. The proper CRC pre and post
\r
316 conditioning is used, meaning that the CRC register
\r
317 is pre-conditioned with all ones (a starting value
\r
318 of 0xffffffff) and the value is post-conditioned by
\r
319 taking the one's complement of the CRC residual.
\r
320 If bit 3 of the general purpose flag is set, this
\r
321 field is set to zero in the local header and the correct
\r
322 value is put in the data descriptor and in the central directory.
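The pre and post conditioning described above can be illustrated with a bit-at-a-time Python sketch. The reflected polynomial constant 0xEDB88320 is not stated in this document; it is assumed here because it is the constant used by the common CRC-32 implementation found in zlib. The function names are hypothetical, and a table-driven routine would normally be used for speed.

```python
import struct


def crc32_register(data, register=0xFFFFFFFF):
    """Run data through the CRC register, starting from all ones."""
    for byte in data:
        register ^= byte
        for _ in range(8):
            if register & 1:
                register = (register >> 1) ^ 0xEDB88320
            else:
                register >>= 1
    return register


def crc32(data):
    """Post-condition the register by taking its one's complement."""
    return crc32_register(data) ^ 0xFFFFFFFF


def residual_is_magic(data):
    """Check the 'magic number': data followed by its own CRC, stored
    low byte first, leaves 0xdebb20e3 in the register."""
    return crc32_register(data + struct.pack("<I", crc32(data))) == 0xDEBB20E3
```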
\r
325 compressed size: (4 bytes)
\r
326 uncompressed size: (4 bytes)
\r
328 The size of the file compressed and uncompressed,
\r
329 respectively. If bit 3 of the general purpose bit flag
\r
330 is set, these fields are set to zero in the local header
\r
331 and the correct values are put in the data descriptor and
\r
332 in the central directory. If an archive is in zip64 format
\r
333 and the value in this field is 0xFFFFFFFF, the size will be
\r
334 in the corresponding 8 byte zip64 extended information extra field.
\r
337 file name length: (2 bytes)
\r
338 extra field length: (2 bytes)
\r
339 file comment length: (2 bytes)
\r
341 The length of the file name, extra field, and comment
\r
342 fields respectively. The combined length of any
\r
343 directory record and these three fields should not
\r
344 generally exceed 65,535 bytes. If input came from standard
\r
345 input, the file name length is set to zero.
\r
347 disk number start: (2 bytes)
\r
349 The number of the disk on which this file begins. If an
\r
350 archive is in zip64 format and the value in this field is
\r
351 0xFFFF, the disk number will be in the corresponding 4 byte zip64
\r
352 extended information extra field.
\r
354 internal file attributes: (2 bytes)
\r
356 The lowest bit of this field indicates, if set, that
\r
357 the file is apparently an ASCII or text file. If not
\r
358 set, that the file apparently contains binary data.
\r
359 The remaining bits are unused in version 1.0.
\r
361 Bits 1 and 2 are reserved for use by PKWARE.
\r
363 external file attributes: (4 bytes)
\r
365 The mapping of the external attributes is
\r
366 host-system dependent (see 'version made by'). For
\r
367 MS-DOS, the low order byte is the MS-DOS directory
\r
368 attribute byte. If input came from standard input, this
\r
369 field is set to zero.
\r
371 relative offset of local header: (4 bytes)
\r
373 This is the offset from the start of the first disk on
\r
374 which this file appears, to where the local header should
\r
375 be found. If an archive is in zip64 format and the value
\r
376 in this field is 0xFFFFFFFF, the offset will be in the
\r
377 corresponding 8 byte zip64 extended information extra field.
\r
379 file name: (Variable)
\r
381 The name of the file, with optional relative path.
\r
382 The path stored should not contain a drive or
\r
383 device letter, or a leading slash. All slashes
\r
384 should be forward slashes '/' as opposed to
\r
385 backwards slashes '\' for compatibility with Amiga
\r
386 and Unix file systems etc. If input came from standard
\r
387 input, there is no file name field.
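A writer might prepare a host path for storage along the lines of the following Python sketch. It handles only the three rules stated above (no drive letter, no leading slash, forward slashes only); it assumes a single-letter drive prefix and makes no attempt to reject ".." components. The function name is hypothetical.

```python
def normalize_zip_name(path):
    """Return path in the form recommended for the file name field."""
    name = path.replace("\\", "/")          # forward slashes only
    if len(name) >= 2 and name[0].isalpha() and name[1] == ":":
        name = name[2:]                     # drop a drive letter such as C:
    return name.lstrip("/")                 # no leading slash
```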
\r
389 extra field: (Variable)
\r
391 This is for expansion. If additional information
\r
392 needs to be stored for special needs or for specific
\r
393 platforms, it should be stored here. Earlier versions
\r
394 of the software can then safely skip this field, and
\r
395 find the next file or header. This field will be 0
\r
396 length in version 1.0.
\r
398 In order to allow different programs and different types
\r
399 of information to be stored in the 'extra' field in .ZIP
\r
400 files, the following structure should be used for all
\r
401 programs storing data in this field:
\r
403 header1+data1 + header2+data2 . . .
\r
405 Each header should consist of:
\r
407 Header ID - 2 bytes
\r
408 Data Size - 2 bytes
\r
410 Note: all fields stored in Intel low-byte/high-byte order.
\r
412 The Header ID field indicates the type of data that is in
\r
413 the following data block.
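The header1+data1 + header2+data2 structure can be traversed as in the Python sketch below, which uses the Data Size value to step from one block to the next. It is an illustration only; the function name is hypothetical.

```python
import struct


def iter_extra_blocks(extra):
    """Yield (Header ID, data) for each block in an extra field."""
    offset = 0
    while offset + 4 <= len(extra):
        # Each header is a 2-byte Header ID and a 2-byte Data Size.
        header_id, size = struct.unpack_from("<HH", extra, offset)
        offset += 4
        if offset + size > len(extra):
            raise ValueError("extra field block overruns the field")
        yield header_id, bytes(extra[offset:offset + size])
        offset += size
```

A reader that does not recognize a Header ID can simply ignore the yielded data and continue with the next block.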
\r
415 Header ID's of 0 thru 31 are reserved for use by PKWARE.
\r
416 The remaining ID's can be used by third party vendors for
\r
419 The current Header ID mappings defined by PKWARE are:
\r
421 0x0001 ZIP64 extended information extra field
\r
423 0x0008 Reserved for future Unicode file name data (PFS)
\r
428 0x000f Patch Descriptor
\r
429 0x0014 PKCS#7 Store for X.509 Certificates
\r
430 0x0015 X.509 Certificate ID and Signature for
\r
432 0x0016 X.509 Certificate ID for Central Directory
\r
433 0x0017 Strong Encryption Header
\r
434 0x0018 Record Management Controls
\r
435 0x0065 IBM S/390 (Z390), AS/400 (I400) attributes
\r
437 0x0066 IBM S/390 (Z390), AS/400 (I400) attributes
\r
440 Third party mappings commonly used are:
\r
443 0x2605 ZipIt Macintosh
\r
444 0x2705 ZipIt Macintosh 1.3.5+
\r
446 0x2805 ZipIt Macintosh 1.3.5+
\r
447 0x334d Info-ZIP Macintosh
\r
448 0x4341 Acorn/SparkFS
\r
449 0x4453 Windows NT security descriptor (binary ACL)
\r
452 0x4b46 FWKCS MD5 (see below)
\r
453 0x4c41 OS/2 access control list (text ACL)
\r
454 0x4d49 Info-ZIP OpenVMS
\r
455 0x4f4c Xceed original location extra field
\r
456 0x5356 AOS/VS (ACL)
\r
457 0x5455 extended timestamp
\r
458 0x554e Xceed unicode extra field
\r
459 0x5855 Info-ZIP Unix (original, also OS/2, NT, etc)
\r
462 0x7855 Info-ZIP Unix (new)
\r
465 Detailed descriptions of Extra Fields defined by third
\r
466 party mappings will be documented as information on
\r
467 these data structures is made available to PKWARE.
\r
468 PKWARE does not guarantee the accuracy of any published
\r
471 The Data Size field indicates the size of the following
\r
472 data block. Programs can use this value to skip to the
\r
473 next header block, passing over any data blocks that are
\r
476 Note: As stated above, the size of the entire .ZIP file
\r
477 header, including the file name, comment, and extra
\r
478 field should not exceed 64K in size.
\r
480 In case two different programs should appropriate the same
\r
481 Header ID value, it is strongly recommended that each
\r
482 program place a unique signature of at least two bytes in
\r
483 size (and preferably 4 bytes or bigger) at the start of
\r
484 each data area. Every program should verify that its
\r
485 unique signature is present, in addition to the Header ID
\r
486 value being correct, before assuming that it is a block of
\r
491 The following is the layout of the OS/2 attributes "extra"
\r
492 block. (Last Revision 09/05/95)
\r
494 Note: all fields stored in Intel low-byte/high-byte order.
\r
496 Value Size Description
\r
497 ----- ---- -----------
\r
498 (OS/2) 0x0009 2 bytes Tag for this "extra" block type
\r
499 TSize 2 bytes Size for the following data block
\r
500 BSize 4 bytes Uncompressed Block Size
\r
501 CType 2 bytes Compression type
\r
502 EACRC 4 bytes CRC value for uncompressed block
\r
503 (var) variable Compressed block
\r
505 The OS/2 extended attribute structure (FEA2LIST) is
\r
506 compressed and then stored in its entirety within this
\r
507 structure. There will only ever be one "block" of data in
\r
512 The following is the layout of the Unix "extra" block.
\r
513 Note: all fields are stored in Intel low-byte/high-byte
\r
516 Value Size Description
\r
517 ----- ---- -----------
\r
518 (UNIX) 0x000d 2 bytes Tag for this "extra" block type
\r
519 TSize 2 bytes Size for the following data block
\r
520 Atime 4 bytes File last access time
\r
521 Mtime 4 bytes File last modification time
\r
522 Uid 2 bytes File user ID
\r
523 Gid 2 bytes File group ID
\r
524 (var) variable Variable length data field
\r
526 The variable length data field will contain file type
\r
527 specific data. Currently the only values allowed are
\r
528 the original "linked to" file names for hard or symbolic
\r
529 links, and the major and minor device node numbers for
\r
530 character and block device nodes. Since device nodes
\r
531 cannot be either symbolic or hard links, only one set of
\r
532 variable length data is stored. Link files will have the
\r
533 name of the original file stored. This name is NOT NULL
\r
534 terminated. Its size can be determined by checking TSize -
\r
535 12. Device entries will have eight bytes stored as two 4
\r
536 byte entries (in little endian format). The first entry
\r
537 will be the major device number, and the second the minor device number.
\r
541 -OpenVMS Extra Field:
\r
543 The following is the layout of the OpenVMS attributes
\r
546 Note: all fields stored in Intel low-byte/high-byte order.
\r
548 Value Size Description
\r
549 ----- ---- -----------
\r
550 (VMS) 0x000c 2 bytes Tag for this "extra" block type
\r
551 TSize 2 bytes Size of the total "extra" block
\r
552 CRC 4 bytes 32-bit CRC for remainder of the block
\r
553 Tag1 2 bytes OpenVMS attribute tag value #1
\r
554 Size1 2 bytes Size of attribute #1, in bytes
\r
555 (var.) Size1 Attribute #1 data
\r
559 TagN 2 bytes OpenVMS attribute tag value #N
\r
560 SizeN 2 bytes Size of attribute #N, in bytes
\r
561 (var.) SizeN Attribute #N data
\r
565 1. There will be one or more attributes present, which
\r
566 will each be preceded by the above TagX & SizeX values.
\r
567 These values are identical to the ATR$C_XXXX and
\r
568 ATR$S_XXXX constants which are defined in ATR.H under
\r
569 OpenVMS C. Neither of these values will ever be zero.
\r
571 2. No word alignment or padding is performed.
\r
573 3. A well-behaved PKZIP/OpenVMS program should never produce
\r
574 more than one sub-block with the same TagX value. Also,
\r
575 there will never be more than one "extra" block of type
\r
576 0x000c in a particular directory record.
\r
580 The following is the layout of the NTFS attributes
\r
581 "extra" block. (Note: At this time the Mtime, Atime
\r
582 and Ctime values may be used on any WIN32 system.)
\r
584 Note: all fields stored in Intel low-byte/high-byte order.
\r
586 Value Size Description
\r
587 ----- ---- -----------
\r
588 (NTFS) 0x000a 2 bytes Tag for this "extra" block type
\r
589 TSize 2 bytes Size of the total "extra" block
\r
590 Reserved 4 bytes Reserved for future use
\r
591 Tag1 2 bytes NTFS attribute tag value #1
\r
592 Size1 2 bytes Size of attribute #1, in bytes
\r
593 (var.) Size1 Attribute #1 data
\r
597 TagN 2 bytes NTFS attribute tag value #N
\r
598 SizeN 2 bytes Size of attribute #N, in bytes
\r
599 (var.) SizeN Attribute #N data
\r
601 For NTFS, values for Tag1 through TagN are as follows:
\r
602 (currently only one set of attributes is defined for NTFS)
\r
604 Tag Size Description
\r
605 ----- ---- -----------
\r
606 0x0001 2 bytes Tag for attribute #1
\r
607 Size1 2 bytes Size of attribute #1, in bytes
\r
608 Mtime 8 bytes File last modification time
\r
609 Atime 8 bytes File last access time
\r
610 Ctime 8 bytes File creation time
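The layout above does not state how the 8-byte time values are encoded. The Python sketch below assumes they are Win32 FILETIME values, that is, counts of 100-nanosecond intervals since January 1, 1601 (UTC); this is an assumption and not something this document confirms. The function names are hypothetical, and block_data is taken to be the bytes that follow the 0x000a tag and TSize.

```python
import datetime
import struct

FILETIME_EPOCH = datetime.datetime(1601, 1, 1, tzinfo=datetime.timezone.utc)


def filetime_to_datetime(filetime):
    """Convert an assumed FILETIME value to a UTC datetime."""
    return FILETIME_EPOCH + datetime.timedelta(microseconds=filetime // 10)


def parse_ntfs_times(block_data):
    """Return (Mtime, Atime, Ctime) from attribute 0x0001, or None."""
    offset = 4                                  # skip the Reserved field
    while offset + 4 <= len(block_data):
        tag, size = struct.unpack_from("<HH", block_data, offset)
        offset += 4
        if tag == 0x0001 and size >= 24 and offset + 24 <= len(block_data):
            times = struct.unpack_from("<QQQ", block_data, offset)
            return tuple(filetime_to_datetime(t) for t in times)
        offset += size                          # pass over unknown tags
    return None
```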
\r
612 -PATCH Descriptor Extra Field:
\r
614 The following is the layout of the Patch Descriptor "extra"
\r
617 Note: all fields stored in Intel low-byte/high-byte order.
\r
619 Value Size Description
\r
620 ----- ---- -----------
\r
621 (Patch) 0x000f 2 bytes Tag for this "extra" block type
\r
622 TSize 2 bytes Size of the total "extra" block
\r
623 Version 2 bytes Version of the descriptor
\r
624 Flags 4 bytes Actions and reactions (see below)
\r
625 OldSize 4 bytes Size of the file about to be patched
\r
626 OldCRC 4 bytes 32-bit CRC of the file to be patched
\r
627 NewSize 4 bytes Size of the resulting file
\r
628 NewCRC 4 bytes 32-bit CRC of the resulting file
\r
630 Actions and reactions
\r
633 ---- ----------------
\r
634 0 Use for autodetection
\r
635 1 Treat as selfpatch
\r
637 4-5 Action (see below)
\r
639 8-9 Reaction (see below) to absent file
\r
640 10-11 Reaction (see below) to newer file
\r
641 12-13 Reaction (see below) to unknown file
\r
663 Patch support is provided by PKPatchMaker(tm) technology and is
\r
664 covered under U.S. Patents and Patents Pending.
\r
666 -PKCS#7 Store for X.509 Certificates:
\r
668 Note: all fields stored in Intel low-byte/high-byte order.
\r
670 Value Size Description
\r
671 ----- ---- -----------
\r
672 (Store) 0x0014 2 bytes Tag for this "extra" block type
\r
673 TSize 2 bytes Size of the store data
\r
674 (var) TSize Data about the store
\r
677 -X.509 Certificate ID and Signature for individual file:
\r
679 Note: all fields stored in Intel low-byte/high-byte order.
\r
681 Value Size Description
\r
682 ----- ---- -----------
\r
683 (CID) 0x0015 2 bytes Tag for this "extra" block type
\r
684 TSize 2 bytes Size of data that follows
\r
687 -X.509 Certificate ID and Signature for central directory:
\r
689 Note: all fields stored in Intel low-byte/high-byte order.
\r
691 Value Size Description
\r
692 ----- ---- -----------
\r
693 (CDID) 0x0016 2 bytes Tag for this "extra" block type
\r
694 TSize 2 bytes Size of data that follows
\r
697 -Strong Encryption Header (EFS):
\r
699 Value Size Description
\r
700 ----- ---- -----------
\r
701 0x0017 2 bytes Tag for this "extra" block type
\r
702 TSize 2 bytes Size of data that follows
\r
703 Format 2 bytes Format definition for this record
\r
704 AlgID 2 bytes Encryption algorithm identifier
\r
705 Bitlen 2 bytes Bit length of encryption key
\r
706 Flags 2 bytes Processing flags
\r
707 (var) TSize Reserved for future certificate data
\r
710 -Record Management Controls:
\r
712 Value Size Description
\r
713 ----- ---- -----------
\r
714 (Rec-CTL) 0x0018 2 bytes Tag for this "extra" block type
\r
715 CSize 2 bytes Size of total extra block data
\r
716 Tag1 2 bytes Record control attribute 1
\r
717 Size1 2 bytes Size of attribute 1, in bytes
\r
718 Data Size1 Attribute 1 data
\r
722 TagN 2 bytes Record control attribute N
\r
723 SizeN 2 bytes Size of attribute N, in bytes
\r
724 Data SizeN Attribute N data
\r
728 The following is the layout of the MVS "extra" block.
\r
729 Note: Some fields are stored in Big Endian format.
\r
730 All text is in EBCDIC format unless otherwise specified.
\r
732 Value Size Description
\r
733 ----- ---- -----------
\r
734 (MVS) 0x0065 2 bytes Tag for this "extra" block type
\r
735 TSize 2 bytes Size for the following data block
\r
736 ID 4 bytes EBCDIC "Z390" 0xE9F3F9F0 or
\r
737 "T4MV" for TargetFour
\r
738 (var) TSize-4 Attribute data
\r
741 -OS/400 Extra Field:
\r
743 The following is the layout of the OS/400 "extra" block.
\r
744 Note: Some fields are stored in Big Endian format.
\r
745 All text is in EBCDIC format unless otherwise specified.
\r
747 Value Size Description
\r
748 ----- ---- -----------
\r
749 (OS400) 0x0065 2 bytes Tag for this "extra" block type
\r
750 TSize 2 bytes Size for the following data block
\r
751 ID 4 bytes EBCDIC "I400" 0xC9F4F0F0 or
\r
752 "T4MV" for TargetFour
\r
753 (var) TSize-4 Attribute data
\r
756 -ZipIt Macintosh Extra Field (long):
\r
758 The following is the layout of the ZipIt extra block
\r
759 for Macintosh. The local-header and central-header versions
\r
760 are identical. This block must be present if the file is
\r
761 stored MacBinary-encoded and it should not be used if the file
\r
762 is not stored MacBinary-encoded.
\r
764 Value Size Description
\r
765 ----- ---- -----------
\r
766 (Mac2) 0x2605 Short tag for this extra block type
\r
767 TSize Short total data size for this block
\r
768 "ZPIT" beLong extra-field signature
\r
769 FnLen Byte length of FileName
\r
770 FileName variable full Macintosh filename
\r
771 FileType Byte[4] four-byte Mac file type string
\r
772 Creator Byte[4] four-byte Mac creator string
\r
775 -ZipIt Macintosh Extra Field (short, for files):
\r
777 The following is the layout of a shortened variant of the
\r
778 ZipIt extra block for Macintosh (without "full name" entry).
\r
779 This variant is used by ZipIt 1.3.5 and newer for entries of
\r
780 files (not directories) that do not have a MacBinary encoded
\r
781 file. The local-header and central-header versions are identical.
\r
783 Value Size Description
\r
784 ----- ---- -----------
\r
785 (Mac2b) 0x2705 Short tag for this extra block type
\r
786 TSize Short total data size for this block (12)
\r
787 "ZPIT" beLong extra-field signature
\r
788 FileType Byte[4] four-byte Mac file type string
\r
789 Creator Byte[4] four-byte Mac creator string
\r
790 fdFlags beShort attributes from FInfo.frFlags,
\r
792 0x0000 beShort reserved, may be omitted
\r
795 -ZipIt Macintosh Extra Field (short, for directories):
\r
797 The following is the layout of a shortened variant of the
\r
798 ZipIt extra block for Macintosh used only for directory
\r
799 entries. This variant is used by ZipIt 1.3.5 and newer to
\r
800 save some optional Mac-specific information about directories.
\r
801 The local-header and central-header versions are identical.
\r
803 Value Size Description
\r
804 ----- ---- -----------
\r
805 (Mac2c) 0x2805 Short tag for this extra block type
\r
806 TSize Short total data size for this block (12)
\r
807 "ZPIT" beLong extra-field signature
\r
808 frFlags beShort attributes from DInfo.frFlags, may
\r
810 View beShort ZipIt view flag, may be omitted
\r
813 The View field specifies ZipIt-internal settings as follows:
\r
816 bit 0 if set, the folder is shown expanded (open)
\r
817 when the archive contents are viewed in ZipIt.
\r
818 bits 1-15 reserved, zero;
\r
821 -ZIP64 Extended Information Extra Field:
\r
823 The following is the layout of the ZIP64 extended
\r
824 information "extra" block. If one of the size or
\r
825 offset fields in the Local or Central directory
\r
826 record is too small to hold the required data,
\r
827 a ZIP64 extended information record is created.
\r
828 The order of the fields in the ZIP64 extended
\r
829 information record is fixed, but the fields will
\r
830 only appear if the corresponding Local or Central
\r
831 directory record field is set to 0xFFFF or 0xFFFFFFFF.
\r
833 Note: all fields stored in Intel low-byte/high-byte order.
\r
835 Value Size Description
\r
836 ----- ---- -----------
\r
837 (ZIP64) 0x0001 2 bytes Tag for this "extra" block type
\r
838 Size 2 bytes Size of this "extra" block
\r
840 Size 8 bytes Original uncompressed file size
\r
842 Size 8 bytes Size of compressed data
\r
844 Offset 8 bytes Offset of local header record
\r
846 Number 4 bytes Number of the disk on which
\r
849 This entry in the Local header must include BOTH original
\r
850 and compressed file sizes.
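Because the fields of the ZIP64 extended information record appear only when the corresponding header field holds 0xFFFF or 0xFFFFFFFF, a reader has to consult the header values while decoding the record. The Python sketch below illustrates this for a central directory entry. It assumes the fixed order is uncompressed size, compressed size, local header offset, disk number; the function and parameter names are hypothetical.

```python
import struct


def apply_zip64_extra(data, usize, csize, header_offset, disk_start):
    """Replace saturated header values with those from the ZIP64 record.

    data is the content of the 0x0001 block, after its tag and size.
    """
    pos = 0

    def take(fmt):
        nonlocal pos
        (value,) = struct.unpack_from(fmt, data, pos)
        pos += struct.calcsize(fmt)
        return value

    if usize == 0xFFFFFFFF:
        usize = take("<Q")
    if csize == 0xFFFFFFFF:
        csize = take("<Q")
    if header_offset == 0xFFFFFFFF:
        header_offset = take("<Q")
    if disk_start == 0xFFFF:
        disk_start = take("<I")
    return usize, csize, header_offset, disk_start
```

Values that are not saturated in the header are returned unchanged, and no bytes are consumed for them.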
\r
852 -FWKCS MD5 Extra Field:
\r
854 The FWKCS Contents_Signature System, used in
\r
855 automatically identifying files independent of file name,
\r
856 optionally adds and uses an extra field to support the
\r
857 rapid creation of an enhanced contents_signature:
\r
861 Preface = 'M','D','5'
\r
862 followed by 16 bytes containing the uncompressed file's
\r
863 128_bit MD5 hash(1), low byte first.
\r
865 When FWKCS revises a .ZIP file central directory to add
\r
866 this extra field for a file, it also replaces the
\r
867 central directory entry for that file's uncompressed
\r
868 file length with a measured value.
\r
870 FWKCS provides an option to strip this extra field, if
\r
871 present, from a .ZIP file central directory. In adding
\r
872 this extra field, FWKCS preserves .ZIP file Authenticity
\r
873 Verification; if stripping this extra field, FWKCS
\r
874 preserves all versions of AV through PKZIP version 2.04g.
\r
876 FWKCS, and FWKCS Contents_Signature System, are
\r
877 trademarks of Frederick W. Kantor.
\r
879 (1) R. Rivest, RFC1321.TXT, MIT Laboratory for Computer
\r
880 Science and RSA Data Security, Inc., April 1992.
\r
881 ll.76-77: "The MD5 algorithm is being placed in the
\r
882 public domain for review and possible adoption as a
\r
885 file comment: (Variable)
\r
887 The comment for this file.
\r
889 number of this disk: (2 bytes)
\r
891 The number of this disk, which contains central
\r
892 directory end record. If an archive is in zip64 format
\r
893 and the value in this field is 0xFFFF, the value will
\r
894 be in the corresponding 4 byte zip64 end of central
\r
898 number of the disk with the start of the central
\r
899 directory: (2 bytes)
\r
901 The number of the disk on which the central
\r
902 directory starts. If an archive is in zip64 format
\r
903 and the value in this field is 0xFFFF, the value will
\r
904 be in the corresponding 4 byte zip64 end of central
\r
907 total number of entries in the central dir on
\r
908 this disk: (2 bytes)
\r
910 The number of central directory entries on this disk.
\r
911 If an archive is in zip64 format and the value in
\r
912 this field is 0xFFFF, the count will be in the
\r
913 corresponding 8 byte zip64 end of central
\r
916 total number of entries in the central dir: (2 bytes)
\r
918 The total number of files in the .ZIP file. If an
\r
919 archive is in zip64 format and the value in this field
\r
920 is 0xFFFF, the count will be in the corresponding 8 byte
\r
921 zip64 end of central directory field.
\r
923 size of the central directory: (4 bytes)
\r
925 The size (in bytes) of the entire central directory.
\r
926 If an archive is in zip64 format and the value in
\r
927 this field is 0xFFFFFFFF, the size will be in the
\r
928 corresponding 8 byte zip64 end of central
\r
931 offset of start of central directory with respect to
\r
932 the starting disk number: (4 bytes)
\r
934 Offset of the start of the central directory on the
\r
935 disk on which the central directory starts. If an
\r
936 archive is in zip64 format and the value in this
\r
937 field is 0xFFFFFFFF, the offset will be in the
\r
938 corresponding 8 byte zip64 end of central
\r
941 .ZIP file comment length: (2 bytes)
\r
943 The length of the comment for this .ZIP file.
\r
945 .ZIP file comment: (Variable)
\r
947 The comment for this .ZIP file.
\r
949 zip64 extensible data sector (variable size)
\r
951 (currently reserved for use by PKWARE)
\r
956 1) All fields unless otherwise noted are unsigned and stored
\r
957 in Intel low-byte:high-byte, low-word:high-word order.
\r
959 2) String fields are not null terminated, since the
\r
960 length is given explicitly.
\r
3) Local headers should not span disk boundaries. Also, even
though the central directory can span disk boundaries, no
single record in the central directory should be split
across disk boundaries.
\r
967 4) The entries in the central directory may not necessarily
\r
968 be in the same order that files appear in the .ZIP file.
\r
5) Spanned/Split archives created using PKZIP for Windows
(V2.50 or greater), PKZIP Command Line (V2.50 or greater),
or PKZIP Explorer will include a special spanning
signature as the first 4 bytes of the first segment of
the archive. This signature (0x08074b50) will be
followed immediately by the local header signature for
the first file in the archive. A special spanning
marker may also appear in spanned/split archives if the
spanning or splitting process starts but only requires
one segment. In this case the 0x08074b50 signature
will be replaced with the temporary spanning marker
signature of 0x30304b50. Spanned/split archives
created with this special signature are compatible with
all versions of PKZIP from PKWARE. Split archives can
only be uncompressed by other versions of PKZIP that
know how to create a split archive.
\r
987 6) If one of the fields in the end of central directory
\r
988 record is too small to hold required data, the field
\r
989 should be set to -1 (0xFFFF or 0xFFFFFFFF) and the
\r
990 Zip64 format record should be created.
\r
7) The end of central directory record and the
Zip64 end of central directory locator record must
reside on the same disk when splitting or spanning
an archive.
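
As an illustration of notes 1, 2 and 6, the fixed-size part of the
end of central directory record can be unpacked as unsigned,
little-endian values. The sketch below is not part of the format
definition. The name parse_eocd, the dictionary keys and the
needs_zip64 flag are illustrative assumptions; the signature value
is the end of central directory signature (0x06054b50).

```python
import struct

EOCD_SIGNATURE = 0x06054B50
# "<" selects little-endian (Intel) order without padding;
# I is an unsigned 4 byte field, H an unsigned 2 byte field.
EOCD_LAYOUT = "<IHHHHIIH"


def parse_eocd(record):
    """Unpack an end of central directory record (hypothetical helper)."""
    (signature, this_disk, cd_start_disk, entries_this_disk,
     entries_total, cd_size, cd_offset, comment_len) = struct.unpack_from(
        EOCD_LAYOUT, record)
    if signature != EOCD_SIGNATURE:
        raise ValueError("not an end of central directory record")
    fixed = struct.calcsize(EOCD_LAYOUT)  # size of the fixed part
    # The comment is not null terminated; its length is given explicitly.
    comment = record[fixed:fixed + comment_len]
    # A field holding all ones defers to the zip64 record (note 6).
    needs_zip64 = (
        0xFFFF in (this_disk, cd_start_disk, entries_this_disk, entries_total)
        or 0xFFFFFFFF in (cd_size, cd_offset)
    )
    return {
        "this_disk": this_disk,
        "cd_start_disk": cd_start_disk,
        "entries_this_disk": entries_this_disk,
        "entries_total": entries_total,
        "cd_size": cd_size,
        "cd_offset": cd_offset,
        "comment": comment,
        "needs_zip64": needs_zip64,
    }
```

A reader that sees needs_zip64 set would continue with the zip64
end of central directory record instead of trusting these fields.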
\r
997 UnShrinking - Method 1
\r
998 ----------------------
\r
Shrinking is a Dynamic Ziv-Lempel-Welch compression algorithm
with partial clearing. The initial code size is 9 bits, and
the maximum code size is 13 bits. Shrinking differs from
conventional Dynamic Ziv-Lempel-Welch implementations in several
respects:
\r
1) The code size is controlled by the compressor, and is not
automatically increased when codes larger than the current
code size are created (but not necessarily used). When
the decompressor encounters the code sequence 256
(decimal) followed by 1, it should increase the code size
read from the input stream to the next bit size. No
blocking of the codes is performed, so the next code at
the increased size should be read from the input stream
immediately after where the previous code at the smaller
bit size was read. Again, the decompressor should not
increase the code size used until the sequence 256,1 is
encountered.
\r
2) When the table becomes full, total clearing is not
performed. Rather, when the compressor emits the code
sequence 256,2 (decimal), the decompressor should clear
all leaf nodes from the Ziv-Lempel tree, and continue to
use the current code size. The nodes that are cleared
from the Ziv-Lempel tree are then re-used, with the lowest
code value re-used first, and the highest code value
re-used last. The compressor can emit the sequence 256,2
at any time.
\r
1029 Expanding - Methods 2-5
\r
1030 -----------------------
\r
1032 The Reducing algorithm is actually a combination of two
\r
1033 distinct algorithms. The first algorithm compresses repeated
\r
1034 byte sequences, and the second algorithm takes the compressed
\r
1035 stream from the first algorithm and applies a probabilistic
\r
1036 compression method.
\r
The probabilistic compression stores an array of 'follower
sets' S(j), for j=0 to 255, corresponding to each possible
ASCII character. Each set contains between 0 and 32
characters, to be denoted as S(j)[0],...,S(j)[m], where m<32.
The sets are stored at the beginning of the data area for a
Reduced file, in reverse order, with S(255) first, and S(0)
last.
\r
1046 The sets are encoded as { N(j), S(j)[0],...,S(j)[N(j)-1] },
\r
1047 where N(j) is the size of set S(j). N(j) can be 0, in which
\r
1048 case the follower set for S(j) is empty. Each N(j) value is
\r
1049 encoded in 6 bits, followed by N(j) eight bit character values
\r
1050 corresponding to S(j)[0] to S(j)[N(j)-1] respectively. If
\r
1051 N(j) is 0, then no values for S(j) are stored, and the value
\r
1052 for N(j-1) immediately follows.
\r
1054 Immediately after the follower sets, is the compressed data
\r
1055 stream. The compressed data stream can be interpreted for the
\r
1056 probabilistic decompression as follows:
\r
let Last-Character <- 0.
loop until done
    if the follower set S(Last-Character) is empty then
        read 8 bits from the input stream, and copy this
        value to the output stream.
    otherwise if the follower set S(Last-Character) is non-empty then
        read 1 bit from the input stream.
        if this bit is not zero then
            read 8 bits from the input stream, and copy this
            value to the output stream.
        otherwise if this bit is zero then
            read B(N(Last-Character)) bits from the input
            stream, and assign this value to I.
            Copy the value of S(Last-Character)[I] to the
            output stream.

    assign the last value placed on the output stream to
    Last-Character.
end loop
\r
1078 B(N(j)) is defined as the minimal number of bits required to
\r
1079 encode the value N(j)-1.
\r
1081 The decompressed stream from above can then be expanded to
\r
1082 re-create the original file as follows:
\r
let State <- 0.

loop until done
    read 8 bits from the input stream into C.
    case State of
        0:  if C is not equal to DLE (144 decimal) then
                copy C to the output stream.
            otherwise if C is equal to DLE then
                let State <- 1.

        1:  if C is non-zero then
                let V <- C.
                let Len <- L(V)
                let State <- F(Len).
            otherwise if C is zero then
                copy the value 144 (decimal) to the output stream.
                let State <- 0

        2:  let Len <- Len + C
            let State <- 3.

        3:  move backwards D(V,C) bytes in the output stream
            (if this position is before the start of the output
            stream, then assume that all the data before the
            start of the output stream is filled with zeros).
            copy Len+3 bytes from this position to the output stream.
            let State <- 0.
    end case
end loop
\r
1114 The functions F,L, and D are dependent on the 'compression
\r
1115 factor', 1 through 4, and are defined as follows:
\r
1117 For compression factor 1:
\r
1118 L(X) equals the lower 7 bits of X.
\r
1119 F(X) equals 2 if X equals 127 otherwise F(X) equals 3.
\r
1120 D(X,Y) equals the (upper 1 bit of X) * 256 + Y + 1.
\r
1121 For compression factor 2:
\r
1122 L(X) equals the lower 6 bits of X.
\r
1123 F(X) equals 2 if X equals 63 otherwise F(X) equals 3.
\r
1124 D(X,Y) equals the (upper 2 bits of X) * 256 + Y + 1.
\r
1125 For compression factor 3:
\r
1126 L(X) equals the lower 5 bits of X.
\r
1127 F(X) equals 2 if X equals 31 otherwise F(X) equals 3.
\r
1128 D(X,Y) equals the (upper 3 bits of X) * 256 + Y + 1.
\r
1129 For compression factor 4:
\r
1130 L(X) equals the lower 4 bits of X.
\r
1131 F(X) equals 2 if X equals 15 otherwise F(X) equals 3.
\r
1132 D(X,Y) equals the (upper 4 bits of X) * 256 + Y + 1.
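
The expansion stage and the functions L, F and D can be sketched
as a small state machine. This is an illustrative reading of the
description above, not a reference implementation. The name
expand_reduced is an assumption, and the input is assumed to be
the byte stream already produced by the probabilistic stage.

```python
DLE = 144


def expand_reduced(data, factor):
    """Expand the repeated-sequence stage for compression factor 1-4."""
    if factor not in (1, 2, 3, 4):
        raise ValueError("compression factor must be 1 through 4")
    len_bits = 8 - factor              # L(X) keeps the lower 7, 6, 5 or 4 bits
    len_mask = (1 << len_bits) - 1     # F(X) is 2 when X equals this value
    out = bytearray()
    state = 0
    v = 0
    length = 0
    for c in data:
        if state == 0:
            if c != DLE:
                out.append(c)
            else:
                state = 1
        elif state == 1:
            if c != 0:
                v = c
                length = v & len_mask                     # L(V)
                state = 2 if length == len_mask else 3    # F(Len)
            else:
                out.append(DLE)                           # escaped literal 144
                state = 0
        elif state == 2:
            length += c
            state = 3
        else:
            distance = (v >> len_bits) * 256 + c + 1      # D(V, C)
            start = len(out) - distance
            for i in range(length + 3):
                pos = start + i
                # Data before the start of the output counts as zeros.
                out.append(out[pos] if pos >= 0 else 0)
            state = 0
    return bytes(out)
```

For example, with factor 4 the input bytes 'a', 'b', 'c', 144, 1, 2
describe a copy of Len+3 = 4 bytes from 3 bytes back, so the
output is "abcabca".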
\r
1134 Imploding - Method 6
\r
1135 --------------------
\r
1137 The Imploding algorithm is actually a combination of two distinct
\r
1138 algorithms. The first algorithm compresses repeated byte
\r
1139 sequences using a sliding dictionary. The second algorithm is
\r
1140 used to compress the encoding of the sliding dictionary output,
\r
1141 using multiple Shannon-Fano trees.
\r
1143 The Imploding algorithm can use a 4K or 8K sliding dictionary
\r
1144 size. The dictionary size used can be determined by bit 1 in the
\r
1145 general purpose flag word; a 0 bit indicates a 4K dictionary
\r
1146 while a 1 bit indicates an 8K dictionary.
\r
1148 The Shannon-Fano trees are stored at the start of the compressed
\r
1149 file. The number of trees stored is defined by bit 2 in the
\r
1150 general purpose flag word; a 0 bit indicates two trees stored, a
\r
1151 1 bit indicates three trees are stored. If 3 trees are stored,
\r
1152 the first Shannon-Fano tree represents the encoding of the
\r
1153 Literal characters, the second tree represents the encoding of
\r
1154 the Length information, the third represents the encoding of the
\r
1155 Distance information. When 2 Shannon-Fano trees are stored, the
\r
1156 Length tree is stored first, followed by the Distance tree.
\r
The Literal Shannon-Fano tree, if present, is used to represent
the entire ASCII character set, and contains 256 values. This
tree is used to compress any data not compressed by the sliding
dictionary algorithm. When this tree is present, the Minimum
Match Length for the sliding dictionary is 3. If this tree is
not present, the Minimum Match Length is 2.
\r
1165 The Length Shannon-Fano tree is used to compress the Length part
\r
1166 of the (length,distance) pairs from the sliding dictionary
\r
1167 output. The Length tree contains 64 values, ranging from the
\r
1168 Minimum Match Length, to 63 plus the Minimum Match Length.
\r
1170 The Distance Shannon-Fano tree is used to compress the Distance
\r
1171 part of the (length,distance) pairs from the sliding dictionary
\r
1172 output. The Distance tree contains 64 values, ranging from 0 to
\r
1173 63, representing the upper 6 bits of the distance value. The
\r
1174 distance values themselves will be between 0 and the sliding
\r
1175 dictionary size, either 4K or 8K.
\r
The Shannon-Fano trees themselves are stored in a compressed
format. The first byte of the tree data represents the number of
bytes of data representing the (compressed) Shannon-Fano tree
minus 1. The remaining bytes represent the Shannon-Fano tree
data encoded as:

    High 4 bits: Number of values at this bit length + 1. (1 - 16)
    Low  4 bits: Bit Length needed to represent value + 1. (1 - 16)
\r
1186 The Shannon-Fano codes can be constructed from the bit lengths
\r
1187 using the following algorithm:
\r
1) Sort the Bit Lengths in ascending order, while retaining the
   order of the original lengths stored in the file.

2) Generate the Shannon-Fano trees:

   Code <- 0
   CodeIncrement <- 0
   LastBitLength <- 0
   i <- number of Shannon-Fano codes - 1   (either 255 or 63)

   loop while i >= 0
       Code = Code + CodeIncrement
       if BitLength(i) <> LastBitLength then
           LastBitLength=BitLength(i)
           CodeIncrement = 1 shifted left (16 - LastBitLength)
       ShannonCode(i) = Code
       i <- i - 1
   end loop

3) Reverse the order of all the bits in the above ShannonCode()
   vector, so that the most significant bit becomes the least
   significant bit. For example, the value 0x1234 (hex) would
   become 0x2C48 (hex).

4) Restore the order of Shannon-Fano codes as originally stored
   within the file.
\r
1218 This example will show the encoding of a Shannon-Fano tree
\r
1219 of size 8. Notice that the actual Shannon-Fano trees used
\r
1220 for Imploding are either 64 or 256 entries in size.
\r
1222 Example: 0x02, 0x42, 0x01, 0x13
\r
The first byte (0x02) indicates that 3 bytes of tree data follow.
Decoding these bytes:

    0x42 = 5 codes of 3 bits long
    0x01 = 1 code of 2 bits long
    0x13 = 2 codes of 4 bits long

This would generate the original bit length array of:
(3, 3, 3, 3, 3, 2, 4, 4)

There are 8 codes in this table for the values 0 thru 7. Using
the algorithm to obtain the Shannon-Fano codes produces:

                                Reversed   Order      Original
Val  Sorted  Constructed Code   Value      Restored   Length
---  ------  ----------------   --------   --------   --------
0:     2     1100000000000000   11         101        3
1:     3     1010000000000000   101        001        3
2:     3     1000000000000000   001        110        3
3:     3     0110000000000000   110        010        3
4:     3     0100000000000000   010        100        3
5:     3     0010000000000000   100        11         2
6:     4     0001000000000000   1000       1000       4
7:     4     0000000000000000   0000       0000       4
\r
1248 The values in the Val, Order Restored and Original Length columns
\r
1249 now represent the Shannon-Fano encoding tree that can be used for
\r
1250 decoding the Shannon-Fano encoded data. How to parse the
\r
1251 variable length Shannon-Fano values from the data stream is beyond
\r
1252 the scope of this document. (See the references listed at the end of
\r
1253 this document for more information.) However, traditional decoding
\r
1254 schemes used for Huffman variable length decoding, such as the
\r
1255 Greenlaw algorithm, can be successfully applied.
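
The construction steps and the worked example above can be
checked with a short sketch. The function names read_bit_lengths,
reverse16 and shannon_fano_codes are illustrative assumptions.
The sort relies on Python's sorted() being stable, which is one
way to retain the original order among equal bit lengths.

```python
def read_bit_lengths(tree):
    """Expand the compressed tree bytes into one bit length per value."""
    count = tree[0] + 1                  # first byte = number of bytes - 1
    lengths = []
    for b in tree[1:1 + count]:
        repeat = (b >> 4) + 1            # high 4 bits: number of values - 1
        bit_length = (b & 0x0F) + 1      # low 4 bits: bit length - 1
        lengths.extend([bit_length] * repeat)
    return lengths


def reverse16(value):
    """Reverse the order of the bits in a 16-bit value."""
    return int(format(value, "016b")[::-1], 2)


def shannon_fano_codes(lengths):
    """Return the bit-reversed code for each value, in original order."""
    # Step 1: stable sort of the value indices by bit length.
    order = sorted(range(len(lengths)), key=lambda k: lengths[k])
    code = 0
    increment = 0
    last_length = 0
    codes = [0] * len(lengths)
    # Step 2: walk the sorted list from the last entry to the first.
    for position in range(len(order) - 1, -1, -1):
        index = order[position]
        code += increment
        if lengths[index] != last_length:
            last_length = lengths[index]
            increment = 1 << (16 - last_length)
        # Steps 3 and 4: reverse the bits, store at the original index.
        codes[index] = reverse16(code)
    return codes
```

Applied to the example bytes 0x02, 0x42, 0x01, 0x13, the codes
printed with their own bit lengths match the Order Restored
column: 101, 001, 110, 010, 100, 11, 1000, 0000.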
\r
1257 The compressed data stream begins immediately after the
\r
1258 compressed Shannon-Fano data. The compressed data stream can be
\r
1259 interpreted as follows:
\r
loop until done
    read 1 bit from input stream.

    if this bit is non-zero then       (encoded data is literal data)
        if Literal Shannon-Fano tree is present
            read and decode character using Literal Shannon-Fano tree.
        otherwise
            read 8 bits from input stream.
        copy character to the output stream.
    otherwise              (encoded data is sliding dictionary match)
        if 8K dictionary size
            read 7 bits for offset Distance (lower 7 bits of offset).
        otherwise
            read 6 bits for offset Distance (lower 6 bits of offset).

        using the Distance Shannon-Fano tree, read and decode the
          upper 6 bits of the Distance value.

        using the Length Shannon-Fano tree, read and decode
          the Length value.

        Length <- Length + Minimum Match Length

        if Length = 63 + Minimum Match Length
            read 8 bits from the input stream,
            add this value to Length.

        move backwards Distance+1 bytes in the output stream, and
        copy Length characters from this position to the output
        stream. (if this position is before the start of the output
        stream, then assume that all the data before the start of
        the output stream is filled with zeros).
end loop
\r
1295 Tokenizing - Method 7
\r
1296 --------------------
\r
1298 This method is not used by PKZIP.
\r
1300 Deflating - Method 8
\r
1301 --------------------
\r
1303 The Deflate algorithm is similar to the Implode algorithm using
\r
1304 a sliding dictionary of up to 32K with secondary compression
\r
1305 from Huffman/Shannon-Fano codes.
\r
1307 The compressed data is stored in blocks with a header describing
\r
1308 the block and the Huffman codes used in the data block. The header
\r
1309 format is as follows:
\r
Bit 0: Last Block bit     This bit is set to 1 if this is the last
                          compressed block in the data.
Bits 1-2: Block type
   00 (0) - Block is stored - All stored data is byte aligned.
            Skip bits until next byte, then next word = block
            length, followed by the one's complement of the block
            length word. Remaining data in block is the stored
            data.

   01 (1) - Use fixed Huffman codes for literal and distance codes.
            Lit Code    Bits             Dist Code   Bits
            ---------   ----             ---------   ----
              0 - 143    8                 0 - 31      5
            144 - 255    9
            256 - 279    7
            280 - 287    8

            Literal codes 286-287 and distance codes 30-31 are
            never used but participate in the Huffman construction.

   10 (2) - Dynamic Huffman codes.  (See expanding Huffman codes)

   11 (3) - Reserved - Flag an "Error in compressed data" if seen.
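
The stored block layout can be illustrated with a short sketch.
It is not a full Deflate reader. The name read_stored_block is an
assumption, and the block header is assumed to begin at bit 0
(the least significant bit) of the byte at the given position,
as it does for the first block of a stream.

```python
import struct


def read_stored_block(buf, pos):
    """Read a stored block whose 3 header bits start at buf[pos]."""
    header = buf[pos]
    is_last = bool(header & 0x01)        # bit 0: last block
    block_type = (header >> 1) & 0x03    # bits 1-2: block type
    if block_type != 0:
        raise ValueError("not a stored block")
    # The rest of the header byte is skipped; the block length and its
    # one's complement follow as two little-endian 16-bit words.
    length, check = struct.unpack_from("<HH", buf, pos + 1)
    if length ^ check != 0xFFFF:
        raise ValueError("error in compressed data")
    start = pos + 5
    data = buf[start:start + length]
    if len(data) != length:
        raise ValueError("stored block is truncated")
    return data, is_last, start + length
```

The returned position is where the next block header would begin
when is_last is false.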
\r
1335 Expanding Huffman Codes
\r
1336 -----------------------
\r
1337 If the data block is stored with dynamic Huffman codes, the Huffman
\r
1338 codes are sent in the following compressed format:
\r
5 Bits: # of Literal codes sent - 257 (257 - 286)
        All other codes are never sent.
5 Bits: # of Dist codes - 1           (1 - 32)
4 Bits: # of Bit Length codes - 4     (4 - 19)
\r
1345 The Huffman codes are sent as bit lengths and the codes are built as
\r
1346 described in the implode algorithm. The bit lengths themselves are
\r
1347 compressed with Huffman codes. There are 19 bit length codes:
\r
1349 0 - 15: Represent bit lengths of 0 - 15
\r
1350 16: Copy the previous bit length 3 - 6 times.
\r
1351 The next 2 bits indicate repeat length (0 = 3, ... ,3 = 6)
\r
1352 Example: Codes 8, 16 (+2 bits 11), 16 (+2 bits 10) will
\r
1353 expand to 12 bit lengths of 8 (1 + 6 + 5)
\r
1354 17: Repeat a bit length of 0 for 3 - 10 times. (3 bits of length)
\r
1355 18: Repeat a bit length of 0 for 11 - 138 times (7 bits of length)
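
The repeat codes can be illustrated with a short sketch. The name
expand_code_lengths is an assumption, and the input is assumed to
be a list of (code, extra) pairs in which extra is the value of
the extra bits already read from the stream (ignored for codes
0 - 15).

```python
def expand_code_lengths(symbols):
    """Expand bit length codes 0 - 18 into a flat list of bit lengths."""
    lengths = []
    for code, extra in symbols:
        if 0 <= code <= 15:
            lengths.append(code)                     # literal bit length
        elif code == 16:
            if not lengths:
                raise ValueError("code 16 has no previous bit length")
            lengths.extend([lengths[-1]] * (3 + extra))   # 2 extra bits
        elif code == 17:
            lengths.extend([0] * (3 + extra))        # 3 extra bits
        elif code == 18:
            lengths.extend([0] * (11 + extra))       # 7 extra bits
        else:
            raise ValueError("invalid bit length code")
    return lengths
```

With the example above, the codes 8, 16 (extra bits 11) and
16 (extra bits 10) expand to twelve bit lengths of 8.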
\r
1357 The lengths of the bit length codes are sent packed 3 bits per value
\r
1358 (0 - 7) in the following order:
\r
1360 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
\r
The Huffman codes should be built as described in the Implode algorithm
except codes are assigned starting at the shortest bit length, i.e. the
shortest code should be all 0's rather than all 1's. Also, codes with
a bit length of zero do not participate in the tree construction. The
codes are then used to decode the bit lengths for the literal and
distance tables.
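
One common way to assign such codes is to count the codes at each
bit length and hand out consecutive values, shortest lengths
first. The sketch below follows that approach as an assumption
about how the rule above is usually implemented; the name
deflate_codes is illustrative. Codes are returned as bit strings
written most significant bit first, and how they are packed into
the stream is not covered here.

```python
def deflate_codes(lengths):
    """Assign a code to every value with a non-zero bit length."""
    max_len = max(lengths, default=0)
    count = [0] * (max_len + 1)
    for n in lengths:
        if n:                     # zero lengths do not participate
            count[n] += 1
    # First code value for each bit length, starting from all 0's.
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + count[bits - 1]) << 1
        next_code[bits] = code
    codes = []
    for n in lengths:
        if n == 0:
            codes.append(None)    # value is not used
        else:
            codes.append(format(next_code[n], "0{}b".format(n)))
            next_code[n] += 1
    return codes
```

For the bit lengths (3, 3, 3, 3, 3, 2, 4, 4) this yields 010, 011,
100, 101, 110, 00, 1110 and 1111, so the shortest code is all 0's.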
\r
The bit lengths for the literal tables are sent first with the number
of entries sent described by the 5 bits sent earlier. There are up
to 286 literal characters; the first 256 represent the respective 8
bit character, code 256 represents the End-Of-Block code, the remaining
29 codes represent copy lengths of 3 thru 258. There are up to 30
distance codes representing distances from 1 thru 32k as described
below.
\r
    Extra                Extra                Extra                Extra
Code Bits Length     Code Bits Lengths    Code Bits Lengths    Code Bits Length(s)
---- ---- ---------  ---- ---- ---------  ---- ---- ---------  ---- ---- ---------
 257    0 3           265    1 11,12       273    3 35-42       281    5 131-162
 258    0 4           266    1 13,14       274    3 43-50       282    5 163-194
 259    0 5           267    1 15,16       275    3 51-58       283    5 195-226
 260    0 6           268    1 17,18       276    3 59-66       284    5 227-257
 261    0 7           269    2 19-22       277    4 67-82       285    0 258
 262    0 8           270    2 23-26       278    4 83-98
 263    0 9           271    2 27-30       279    4 99-114
 264    0 10          272    2 31-34       280    4 115-130
\r
    Extra            Extra              Extra                Extra
Code Bits Dist   Code Bits Dist     Code Bits Distance   Code Bits Distance
---- ---- -----  ---- ---- -------  ---- ---- ---------  ---- ---- -----------
   0    0 1         8    3 17-24      16    7 257-384      24   11 4097-6144
   1    0 2         9    3 25-32      17    7 385-512      25   11 6145-8192
   2    0 3        10    4 33-48      18    8 513-768      26   12 8193-12288
   3    0 4        11    4 49-64      19    8 769-1024     27   12 12289-16384
   4    1 5,6      12    5 65-96      20    9 1025-1536    28   13 16385-24576
   5    1 7,8      13    5 97-128     21    9 1537-2048    29   13 24577-32768
   6    2 9-12     14    6 129-192    22   10 2049-3072
   7    2 13-16    15    6 193-256    23   10 3073-4096
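
Both tables follow a regular pattern: each code covers
2^(extra bits) consecutive values, starting where the previous
code ended. The sketch below rebuilds the first value of each
code from the extra bit counts alone. The names base_values,
match_length and match_distance are illustrative assumptions,
and extra_value stands for the extra bits already read from the
stream.

```python
# Extra bit counts for length codes 257-284 and distance codes 0-29.
LENGTH_EXTRA_BITS = [0] * 8 + [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4 + [5] * 4
DIST_EXTRA_BITS = [0, 0, 0, 0] + [b for b in range(1, 14) for _ in range(2)]


def base_values(first, extra_bits):
    """Return the first value covered by each code."""
    bases = []
    value = first
    for bits in extra_bits:
        bases.append(value)
        value += 1 << bits        # each code spans 2**bits values
    return bases


# Code 285 is the exception: it always means a length of 258.
LENGTH_BASE = base_values(3, LENGTH_EXTRA_BITS) + [258]
DIST_BASE = base_values(1, DIST_EXTRA_BITS)


def match_length(code, extra_value):
    """Copy length for length code 257-285."""
    return LENGTH_BASE[code - 257] + extra_value


def match_distance(code, extra_value):
    """Distance for distance code 0-29."""
    return DIST_BASE[code] + extra_value
```

For instance, length code 265 with an extra bit of 1 gives a copy
length of 12, and distance code 29 with all 13 extra bits set
gives a distance of 32768.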
\r
1405 The compressed data stream begins immediately after the
\r
1406 compressed header data. The compressed data stream can be
\r
1407 interpreted as follows:
\r
do
   read header from input stream.

   if stored block
      skip bits until byte aligned
      read count and 1's complement of count
      copy count bytes data block
   otherwise
      loop until end of block code sent
         decode literal character from input stream
         if literal < 256
            copy character to the output stream
         otherwise
            if literal = end of block
               break from loop
            otherwise
               decode distance from input stream

               move backwards distance bytes in the output stream, and
               copy length characters from this position to the output
               stream.
      end loop
while not last block

if data descriptor exists
   skip bits until byte aligned
   read crc and sizes
endif
\r
1438 Enhanced Deflating - Method 9
\r
1439 -----------------------------
\r
1441 The Enhanced Deflating algorithm is similar to Deflate but
\r
1442 uses a sliding dictionary of up to 64K. Deflate64(tm) is supported
\r
1443 by the Deflate extractor.
\r
BZIP2 - Method 12
-----------------
\r
1448 BZIP2 is an open-source data compression algorithm developed by
\r
1449 Julian Seward. Information and source code for this algorithm
\r
1450 can be found on the internet.
\r
1452 Traditional PKWARE Encryption
\r
1453 -----------------------------
\r
The following information discusses the decryption steps
required to support traditional PKWARE encryption. This
form of encryption is considered weak by today's standards
and its use is recommended only for situations with
low security needs or for compatibility with older .ZIP
applications.
\r
1465 The encryption used in PKZIP was generously supplied by Roger
\r
1466 Schlafly. PKWARE is grateful to Mr. Schlafly for his expert
\r
1467 help and advice in the field of data encryption.
\r
1469 PKZIP encrypts the compressed data stream. Encrypted files must
\r
1470 be decrypted before they can be extracted.
\r
1472 Each encrypted file has an extra 12 bytes stored at the start of
\r
1473 the data area defining the encryption header for that file. The
\r
1474 encryption header is originally set to random values, and then
\r
1475 itself encrypted, using three, 32-bit keys. The key values are
\r
1476 initialized using the supplied encryption password. After each byte
\r
1477 is encrypted, the keys are then updated using pseudo-random number
\r
1478 generation techniques in combination with the same CRC-32 algorithm
\r
1479 used in PKZIP and described elsewhere in this document.
\r
The following are the basic steps required to decrypt a file:

1) Initialize the three 32-bit keys with the password.
2) Read and decrypt the 12-byte encryption header, further
   initializing the encryption keys.
3) Read and decrypt the compressed data stream using the
   encryption keys.
\r
1489 Step 1 - Initializing the encryption keys
\r
1490 -----------------------------------------
\r
Key(0) <- 305419896
Key(1) <- 591751049
Key(2) <- 878082192

loop for i <- 0 to length(password)-1
    update_keys(password(i))
end loop

Where update_keys() is defined as:

update_keys(char):
    Key(0) <- crc32(Key(0),char)
    Key(1) <- Key(1) + (Key(0) & 000000ffH)
    Key(1) <- Key(1) * 134775813 + 1
    Key(2) <- crc32(Key(2),Key(1) >> 24)
end update_keys
\r
1509 Where crc32(old_crc,char) is a routine that given a CRC value and a
\r
1510 character, returns an updated CRC value after applying the CRC-32
\r
1511 algorithm described elsewhere in this document.
\r
1513 Step 2 - Decrypting the encryption header
\r
1514 -----------------------------------------
\r
The purpose of this step is to further initialize the encryption
keys, based on random data, to render a plaintext attack on the
data ineffective.

Read the 12-byte encryption header into Buffer, in locations
Buffer(0) thru Buffer(11).

loop for i <- 0 to 11
    C <- buffer(i) ^ decrypt_byte()
    update_keys(C)
    buffer(i) <- C
end loop

Where decrypt_byte() is defined as:

unsigned char decrypt_byte()
    local unsigned short temp
    temp <- Key(2) | 2
    decrypt_byte <- (temp * (temp ^ 1)) >> 8
end decrypt_byte
\r
1537 After the header is decrypted, the last 1 or 2 bytes in Buffer
\r
1538 should be the high-order word/byte of the CRC for the file being
\r
1539 decrypted, stored in Intel low-byte/high-byte order. Versions of
\r
1540 PKZIP prior to 2.0 used a 2 byte CRC check; a 1 byte CRC check is
\r
1541 used on versions after 2.0. This can be used to test if the password
\r
1542 supplied is correct or not.
\r
1544 Step 3 - Decrypting the compressed data stream
\r
1545 ----------------------------------------------
\r
1547 The compressed data stream can be decrypted as follows:
\r
loop until done
    read a character into C
    Temp <- C ^ decrypt_byte()
    update_keys(temp)
    output Temp
end loop
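
The three steps can be sketched together as follows. This is an
illustration only, not a reference implementation. The class name
TraditionalKeys and its method names are assumptions, crc32_step
is a table-driven form of the single-byte CRC-32 update, and the
encrypt method is simply the inverse of the decryption loop,
added so the sketch can be exercised without a real archive.

```python
def _make_crc_table():
    """Build the standard CRC-32 lookup table (polynomial 0xEDB88320)."""
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
        table.append(c)
    return table


_CRC_TABLE = _make_crc_table()


def crc32_step(crc, byte):
    """crc32(old_crc, char): update a CRC value with one byte."""
    return _CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)


class TraditionalKeys:
    def __init__(self, password):
        # Step 1: initialize the keys, then mix in the password bytes.
        self.k0, self.k1, self.k2 = 305419896, 591751049, 878082192
        for b in password:
            self.update_keys(b)

    def update_keys(self, byte):
        self.k0 = crc32_step(self.k0, byte)
        self.k1 = (self.k1 + (self.k0 & 0xFF)) & 0xFFFFFFFF
        self.k1 = (self.k1 * 134775813 + 1) & 0xFFFFFFFF
        self.k2 = crc32_step(self.k2, self.k1 >> 24)

    def decrypt_byte(self):
        temp = (self.k2 | 2) & 0xFFFF          # unsigned short
        return ((temp * (temp ^ 1)) >> 8) & 0xFF

    def decrypt(self, data):
        """Steps 2 and 3: the same loop handles header and data."""
        out = bytearray()
        for c in data:
            plain = c ^ self.decrypt_byte()
            self.update_keys(plain)
            out.append(plain)
        return bytes(out)

    def encrypt(self, data):
        """Inverse of decrypt (an assumption, used for testing only)."""
        out = bytearray()
        for plain in data:
            out.append(plain ^ self.decrypt_byte())
            self.update_keys(plain)
        return bytes(out)
```

A caller would pass the 12-byte encryption header through
decrypt() first, compare its last byte (or last two bytes) with
the CRC as described in Step 2, and then decrypt the compressed
data with the same object.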
\r
1557 Strong Encryption (EFS)
\r
1558 -----------------------
\r
1560 Version 5.x of this specification includes support for strong
\r
1561 encryption algorithms. These algorithms can be used with either
\r
1562 a password or an X.509v3 digital certificate to encrypt each file.
\r
1563 This format specification supports either password or certificate
\r
1564 based encryption to meet the security needs of today, to enable
\r
1565 interoperability between users within both PKI and non-PKI
\r
1566 environments, and to ensure interoperability between different
\r
1567 computing platforms that are running a ZIP program.
\r
Password-based encryption is the most common form of encryption
people are familiar with. However, inherent weaknesses with
passwords (e.g. susceptibility to dictionary/brute force attack)
as well as password management and support issues make certificate-
based encryption a more secure and scalable option. Industry
efforts and support are defining and moving towards more advanced
security solutions built around X.509v3 digital certificates and
Public Key Infrastructures (PKI) because of the greater scalability,
administrative options, and more robust security over traditional
password-based encryption.
\r
1580 Most standard encryption algorithms are supported with this
\r
1581 specification. Reference implementations for many of these
\r
1582 algorithms are available from either commercial or open source
\r
1583 distributors. Readily available cryptographic toolkits make
\r
1584 implementation of the encryption features straight-forward.
\r
The algorithms introduced in Version 5.0 of this specification
include:
\r
1589 RC2 40 bit, 64 bit, and 128 bit
\r
1590 RC4 40 bit, 64 bit, and 128 bit
\r
1592 3DES 112 bit and 168 bit
\r
1594 Version 5.1 adds support for the following:
\r
1596 AES 128 bit, 192 bit, and 256 bit
\r
1598 The details of the strong encryption specification for
\r
1599 certificates remain under development as design and testing
\r
1600 issues are worked out for the range of algorithms, encryption
\r
1601 methods, certificate processing and cross-platform support
\r
1602 necessary to meet the advanced security needs of .ZIP file
\r
1603 users today and in the future.
\r
1605 This feature specification is intended to support basic
\r
1606 encryption needs of today, such as password support. However
\r
1607 this specification is also designed to lay the foundation for
\r
1608 future advanced security needs.
\r
1610 Password-based encryption using strong encryption algorithms
\r
1611 operates similarly to the traditional PKWARE encryption defined
\r
1612 in this format. Additional data structures are added to
\r
1613 support the processing needs of the strong algorithms.
\r
1615 The Strong Encryption data structures are:
\r
1. Bits 0 and 6 of the General Purpose bit flag in both local
   and central header records. Both bits set indicates strong
   encryption.
\r
1622 2. Extra Field 0x0017 in central header only.
\r
1624 Fields to consider in this record are:
\r
1626 Format - the data format identifier for this record. The only
\r
1627 value allowed at this time is the integer value 2.
\r
1629 AlgId - integer identifier of the encryption algorithm from the
\r
1633 0x6602 - RC2 (version needed to extract < 5.2)
\r
1639 0x6702 - RC2 (version needed to extract >= 5.2)
\r
1641 0xFFFF - Unknown algorithm
\r
1643 Bitlen - Explicit bit length of key
\r
1652 Flags - Processing flags needed for decryption
\r
1654 0x0001 - Password is required to decrypt
\r
1655 0x0002 - reserved for certificates only
\r
1656 0x0003 - Password or certificate required to decrypt
\r
1658 Values > 0x0003 reserved for certificate processing
\r
3. Decryption header record preceding compressed file data.
\r
1663 -Decryption Header:
\r
Value      Size      Description
-----      ----      -----------
IVSize     2 bytes   Size of initialization vector (IV)
IVData     IVSize    Initialization vector for this file
Size       4 bytes   Size of remaining decryption header data
Format     2 bytes   Format definition for this record
AlgID      2 bytes   Encryption algorithm identifier
Bitlen     2 bytes   Bit length of encryption key
Flags      2 bytes   Processing flags
ErdSize    2 bytes   Size of Encrypted Random Data
ErdData    ErdSize   Encrypted Random Data
Reserved1  4 bytes   Reserved certificate data
Reserved2  (var)     Reserved for certificate data
VSize      2 bytes   Size of password validation data
VData      VSize-4   Password validation data
VCRC32     4 bytes   CRC32 of password validation data
\r
IVData - The size of the IV should match the algorithm block size.
         The IVData can be completely random data. If the size of
         the randomly generated data does not match the block size
         it should be padded with zeros. If IVSize is 0,
         then IV = CRC32 + 64-bit File Size.
\r
1688 Format - the data format identifier for this record. The only
\r
1689 value allowed at this time is the integer value 3.
\r
1691 AlgId - integer identifier of the encryption algorithm from the
\r
1695 0x6602 - RC2 (version needed to extract < 5.2)
\r
1701 0x6702 - RC2 (version needed to extract >= 5.2)
\r
1703 0xFFFF - Unknown algorithm
\r
1705 Bitlen - Explicit bit length of key
\r
1714 Flags - Processing flags needed for decryption
\r
1716 0x0001 - Password is required to decrypt
\r
1717 0x0002 - reserved for certificates only
\r
1718 0x0003 - Password or certificate required to decrypt
\r
1720 Values > 0x0003 reserved for certificate processing
\r
ErdData - Encrypted random data is used to generate a file
          session key for encrypting each file. SHA1 is
          used to calculate hash data used to derive keys.
          File session keys are derived from a master session
          key generated from the user-supplied password.
\r
1728 Reserved1 - Reserved for certificate processing, if value is
\r
1729 zero, then Reserved2 data is absent.
\r
1731 VSize - This size value will always include the 4 bytes of the
\r
1732 VCRC32 data and will be greater than 4 bytes.
\r
1734 VData - Random data for password validation. This data is VSize
\r
1735 in length and VSize must be a multiple of the encryption
\r
1736 block size. VCRC32 is a checksum value of VData. VSize,
\r
1737 VData, and VCRC32 are stored encrypted and start the
\r
1738 stream of encrypted data for a file.
\r
Strong Encryption is always applied to a file after compression. The
block oriented algorithms all operate in Cipher Block Chaining (CBC)
mode. The block size used for AES encryption is 16 bytes. All other
block algorithms use a block size of 8 bytes. Two IDs are defined for
RC2 to account for a discrepancy found in the implementation of the RC2
algorithm in the cryptographic library on Windows XP SP1 and all
earlier versions of Windows.
\r
1748 A pseudo-code representation of the encryption process is as follows:
\r
Password = GetUserPassword()
RD  = Random()
ERD = Encrypt(RD,DeriveKey(SHA1(Password)))
For Each File
   IV = Random()
   VData = Random()
   FileSessionKey = DeriveKey(SHA1(RD, IV))
   Encrypt(VData + FileData,FileSessionKey)
Done
\r
1760 The function names and parameter requirements will depend on
\r
1761 the choice of the cryptographic toolkit selected. Almost any
\r
1762 toolkit supporting the reference implementations for each
\r
1763 algorithm can be used. The RSA BSAFE(r), OpenSSL, and Microsoft's
\r
1764 CryptoAPI libraries are all known to work well.
\r
1766 The features set forth in the Strong Encryption (EFS) specification are
\r
1767 covered by a pending patent application.
\r
1773 In order for the .ZIP file format to remain a viable definition, this
\r
1774 specification should be considered as open for periodic review and
\r
1775 revision. Although this format was originally designed with a
\r
1776 certain level of extensibility, not all changes in technology
\r
1777 (present or future) were or will be necessarily considered in its
\r
1778 design. If your application requires new definitions to the
\r
1779 extensible sections in this format, or if you would like to
\r
1780 submit new data structures, please forward your request to
\r
1781 zipformat@pkware.com. All submissions will be reviewed by the
\r
1782 ZIP File Specification Committee for possible inclusion into
\r
1783 future versions of this specification. Periodic revisions
\r
1784 to this specification will be published to ensure interoperability.
\r
1789 In addition to the above mentioned contributors to PKZIP and PKUNZIP,
\r
1790 I would like to extend special thanks to Robert Mahoney for suggesting
\r
1791 the extension .ZIP for this software.
\r
1795 Fiala, Edward R., and Greene, Daniel H., "Data compression with
\r
1796 finite windows", Communications of the ACM, Volume 32, Number 4,
\r
1797 April 1989, pages 490-505.
\r
1799 Held, Gilbert, "Data Compression, Techniques and Applications,
\r
1800 Hardware and Software Considerations", John Wiley & Sons, 1987.
\r
Huffman, D.A., "A method for the construction of minimum-redundancy
codes", Proceedings of the IRE, Volume 40, Number 9, September 1952,
pages 1098-1101.
\r
1806 Nelson, Mark, "LZW Data Compression", Dr. Dobbs Journal, Volume 14,
\r
1807 Number 10, October 1989, pages 29-37.
\r
1809 Nelson, Mark, "The Data Compression Book", M&T Books, 1991.
\r
1811 Storer, James A., "Data Compression, Methods and Theory",
\r
1812 Computer Science Press, 1988
\r
1814 Welch, Terry, "A Technique for High-Performance Data Compression",
\r
1815 IEEE Computer, Volume 17, Number 6, June 1984, pages 8-19.
\r
Ziv, J. and Lempel, A., "A universal algorithm for sequential data
compression", IEEE Transactions on Information Theory, Volume 23,
Number 3, May 1977, pages 337-343.
\r
1821 Ziv, J. and Lempel, A., "Compression of individual sequences via
\r
1822 variable-rate coding", IEEE Transactions on Information Theory,
\r
1823 Volume 24, Number 5, September 1978, pages 530-536.
\r