source/libs/zziplib/zziplib-0.13.62/docs/zzip-parse.htm

<section> <date> 17. December 2002 </date>
<h2> ZIP Format </h2>        About Zip Parsing Internals...

<!--border-->

<section>
<h3> ZIP Trailer Block </h3>

<P>
  The general ZIP file format is written sequentially - each file
  being added gets a local file header followed by its (usually
  deflated) data. When all files are written, a central directory
  is written - and this central directory may even span multiple
  disks. Each disk gets a descriptor block that contains a pointer
  to the start of the central directory. This descriptor is always
  written last, and therefore we call it the "ZIP File Trailer Block".
</P>
<P>
  So we know that this ZIP Trailer is always at the end of a zip
  file and that it has a fixed length, with a magic four-byte value
  at its block start. That should make it easy to detect zip files,
  but in the real world it is not that easy - it is allowed to add a
  zip archive comment text <em>after</em> the Trailer block. That is
  rarely used these days, but it means that a zip reader must be
  ready to search for the Trailer block, starting at the end of the
  file and scanning backwards for the Trailer magic (it is "PK\5\6").
</P>
<P>
  That is what the internal function __zip_find_disk_trailer is
  used for. It is somewhat optimized, as we try to use the mmap
  features of the underlying operating system. The returned structure
  is called zzip_disk_trailer in the library source code, and we
  actually need only two values from it: u_rootseek and u_rootsize.
  The first can be used to lseek to the start of the central
  directory, and the second tells us the byte size of the central
  directory.
</P>
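<P>
  The backwards search can be sketched as follows. This is a minimal
  illustration, not the library's own code: the names find_trailer and
  trailer_le32 are hypothetical, the buffer is assumed to hold the tail
  of the zip file already in memory, and the 22-byte size of the fixed
  block is taken from the published ZIP layout rather than from the
  text above.
</P>

```c
#include <stddef.h>
#include <string.h>

#define TRAILER_FIXED_SIZE 22   /* fixed part of the trailer block */

/* Read a 32-bit little-endian value. */
unsigned long trailer_le32(const unsigned char *p)
{
    return (unsigned long) p[0] | ((unsigned long) p[1] << 8) |
           ((unsigned long) p[2] << 16) | ((unsigned long) p[3] << 24);
}

/* Hypothetical helper: scan the tail of a zip file from its end
 * towards its start and return the offset of the trailer magic
 * "PK\5\6" within the buffer, or -1 if it is not found. */
long find_trailer(const unsigned char *tail, size_t len)
{
    static const unsigned char magic[4] = { 'P', 'K', 5, 6 };
    size_t pos;

    if (len < TRAILER_FIXED_SIZE)
        return -1;                  /* too short to hold a trailer */

    /* The last possible start still leaves room for the fixed block;
     * anything behind it would be archive comment text. */
    pos = len - TRAILER_FIXED_SIZE;
    for (;;) {
        if (memcmp(tail + pos, magic, sizeof magic) == 0)
            return (long) pos;
        if (pos == 0)
            break;
        pos--;
    }
    return -1;
}
```

<P>
  In the standard layout the size of the central directory sits at
  byte 12 of the trailer and its seek value at byte 16, so a caller
  could read both with trailer_le32 once the offset is known.
</P>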

</section><section>
<h3> ZIP Central Directory </h3>

<P>
  So here we are at the central directory. The disk trailer also
  told us how many entries there are, but it is not that easy to
  read them. Each directory entry (zzip_root_dirent type) again has
  a magic value up front, followed by a few items, but they are all
  in a DOS format - consider the timestamps - and at least the
  size/seek values are in Intel (little-endian) byte order. So we
  want to parse them into a format that is easier to handle in
  internal code.
</P>
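<P>
  As an illustration of that parsing step, here is a small sketch of
  reading little-endian items and unpacking a DOS timestamp. The
  helper names and the dos_stamp struct are hypothetical and not part
  of zziplib; the bit layout is the usual DOS date/time packing.
</P>

```c
#include <stdint.h>

/* Read little-endian items regardless of the host byte order. */
uint16_t get_le16(const unsigned char *p)
{
    return (uint16_t) (p[0] | (p[1] << 8));
}

uint32_t get_le32(const unsigned char *p)
{
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
           ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/* Hypothetical plain representation of a DOS timestamp. */
struct dos_stamp {
    int year, month, day;
    int hour, minute, second;
};

/* DOS date: bits 15-9 years since 1980, bits 8-5 month, bits 4-0 day.
 * DOS time: bits 15-11 hour, bits 10-5 minute, bits 4-0 seconds / 2. */
struct dos_stamp decode_dos_stamp(uint16_t dos_date, uint16_t dos_time)
{
    struct dos_stamp s;
    s.year   = 1980 + ((dos_date >> 9) & 0x7f);
    s.month  = (dos_date >> 5) & 0x0f;
    s.day    = dos_date & 0x1f;
    s.hour   = (dos_time >> 11) & 0x1f;
    s.minute = (dos_time >> 5) & 0x3f;
    s.second = (dos_time & 0x1f) * 2;
    return s;
}
```

<P>
  Reading byte by byte instead of casting the buffer to an integer
  pointer avoids both byte order and alignment problems on non-Intel
  hosts.
</P>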
<P>
  That is also needed for another reason - three items in the
  directory entry are the size values of three variable-length
  fields that follow right after the directory entry. That's right,
  three of these. The first variable-length field is the filename of
  this directory entry. In other words, the root directory entry
  does not contain a seek value for where the filename starts; the
  start of the filename is given implicitly by the end address of
  the directory entry.
</P>
<P>
  The size value for the filename simply says how long the
  filename is - however, and more importantly, it allows us to
  compute the start of the next variable-length field, called the
  extra info field. We do not need any value from that extra info
  block (it holds unix filemode bits when packed under unix), but
  we can be quite sure that this field is not empty either. And
  that was the second variable-length field.
</P>
<P>
  There is a third variable-length field however - the comment
  field. It was used heavily in the good old DOS days. We are not
  used to it anymore since filenames are generally self-descriptive
  today, but in the DOS days a filename was 8+3 chars maximum - and
  it was the comment field that told users what is in there. It
  turned out that many software archives used the zip format as
  their primary distribution format for just that purpose - to be
  able to attach a comment line to each entry.
</P>
<P>
  Now, each of these three variable-length fields has an item in
  the directory entry header telling its size. After the three
  variable-length fields the next directory entry follows
  immediately. Yes, again there is no seek value here - we have to
  take the sum of the three field sizes and add it to the end
  address of the fixed directory entry header, just to get to the
  next entry.
</P>
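<P>
  The walk from one entry to the next can be sketched like this. The
  function name count_dirents is hypothetical, the whole central
  directory is assumed to be in memory already, and the 46-byte header
  size, the "PK\1\2" magic and the item offsets 28/30/32 are taken
  from the published ZIP layout rather than from the text above.
</P>

```c
#include <stddef.h>
#include <string.h>

#define DIRENT_FIXED_SIZE 46    /* fixed part of a directory entry */

/* Read a 16-bit little-endian size item. */
unsigned dirent_le16(const unsigned char *p)
{
    return (unsigned) p[0] | ((unsigned) p[1] << 8);
}

/* Hypothetical helper: walk all entries of an in-memory central
 * directory. Returns the number of entries seen, or -1 when an
 * entry has a bad magic or runs past the end of the block. */
long count_dirents(const unsigned char *dir, size_t dirsize)
{
    static const unsigned char magic[4] = { 'P', 'K', 1, 2 };
    size_t pos = 0;
    long count = 0;

    while (pos + DIRENT_FIXED_SIZE <= dirsize) {
        const unsigned char *entry = dir + pos;
        size_t namlen, extras, comment, total;

        if (memcmp(entry, magic, sizeof magic) != 0)
            return -1;

        namlen  = dirent_le16(entry + 28);  /* filename size   */
        extras  = dirent_le16(entry + 30);  /* extra info size */
        comment = dirent_le16(entry + 32);  /* comment size    */

        /* There is no seek value: the next entry starts after the
         * fixed header plus the three variable-length fields. */
        total = DIRENT_FIXED_SIZE + namlen + extras + comment;
        if (total > dirsize - pos)
            return -1;

        pos += total;
        count++;
    }
    return count;
}
```

<P>
  The filename of an entry would be found at entry + 46 in this
  sketch; note that it is not null-terminated in the external format.
</P>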

</section><section>
<h3> Internal Directory </h3>

<P>
  Now, the external ZIP format is more complicated than we need. We
  cut it down to the bare minimum we actually use. The fields in the
  entry are parsed into a directly usable format, and from the
  variable-length fields we keep only the filename. We also ensure
  that the filename gets a trailing null byte, so it can safely be
  passed down into libc routines.
</P>
<P>
  There is another trick by the way - we use the u_rootsize value
  to malloc a single block for the internal directory. That ensures
  the internal root directory entries are in nearby locations,
  including the filenames themselves, which we put in between the
  dirent entries. That is not only similar to the external directory
  format: when calling readdir, or when looking for a matching
  filename during a zzip_open call, it also ensures that memory is
  fetched in a linear fashion. Modern cpu architectures are able
  to burst through it.
</P>
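<P>
  A minimal sketch of such a packed directory follows. The struct
  packed_hdr and both functions are hypothetical and do not mirror
  zziplib's actual internal layout; for simplicity the sketch sizes
  the block exactly from the names, whereas the text above describes
  sizing it from u_rootsize.
</P>

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: fixed items first, name stored inline. */
struct packed_hdr {
    uint32_t off;       /* seek value of the local file header      */
    uint32_t reclen;    /* bytes to the next record, 0 for the last */
    char     name[];    /* null-terminated filename                 */
};

/* Size of one record, rounded up so records stay 4-byte aligned. */
static size_t record_size(size_t namlen)
{
    size_t n = offsetof(struct packed_hdr, name) + namlen + 1;
    return (n + 3u) & ~(size_t) 3u;
}

/* Pack all names and seek values into one malloc'ed block. */
struct packed_hdr *pack_names(const char *const *names,
                              const uint32_t *offs, size_t count)
{
    size_t total = 0, i;
    unsigned char *block, *p;

    if (count == 0)
        return NULL;
    for (i = 0; i < count; i++)
        total += record_size(strlen(names[i]));
    block = malloc(total);
    if (block == NULL)
        return NULL;

    p = block;
    for (i = 0; i < count; i++) {
        struct packed_hdr *h = (struct packed_hdr *) p;
        size_t namlen = strlen(names[i]);
        size_t reclen = record_size(namlen);

        h->off = offs[i];
        h->reclen = (i + 1 < count) ? (uint32_t) reclen : 0;
        memcpy(h->name, names[i], namlen + 1);  /* includes the null */
        p += reclen;
    }
    return (struct packed_hdr *) block;
}

/* Linear scan: memory is touched strictly front to back. */
const struct packed_hdr *find_name(const struct packed_hdr *dir,
                                   const char *name)
{
    const struct packed_hdr *h = dir;
    while (h != NULL) {
        if (strcmp(h->name, name) == 0)
            return h;
        if (h->reclen == 0)
            break;
        h = (const struct packed_hdr *)
            ((const unsigned char *) h + h->reclen);
    }
    return NULL;
}
```

<P>
  The whole directory is released with a single free() of the pointer
  returned by pack_names.
</P>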
<P>
  One might think of using a more complicated internal directory
  format - like hash tables or something. However, they all suffer
  from the fact that memory access patterns will be somewhat random,
  which eats a lot of speed. It is hard to predict under what
  circumstances they would give us a benefit, but the problem is
  certainly not hypothetical: there are zip archives with 13k+
  entries. In a real filesystem people will not put 13k files into
  one directory, of course - but in the zip central directory all
  entries are listed side by side with their subdirectory paths
  attached. So, if the original subtree had a number of directories,
  their files all end up side by side in the zip's central directory.
</P>

</section><section>
<h3> File Entry </h3>

<P>
  The zip directory entry has one value that is called z_off in the
  zziplib sources - it is the seek value to the start of the actual
  file data, or more correctly, it points to the "local file header".
  Each file data block is framed by a little header in front and,
  optionally, a little block behind. There is not much interesting
  information in these framing blocks, as the values are duplicates
  of the ones found in the zip central directory - however, we must
  skip the local file header (and a possible duplicate of filename
  and extrainfo) to arrive at the actual file data.
</P>
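<P>
  Skipping the local file header could look roughly like this. The
  function name file_data_offset is hypothetical, the archive is
  assumed to be in memory, and the 30-byte header size, the "PK\3\4"
  magic and the size items at bytes 26 and 28 come from the published
  ZIP layout rather than from the text above.
</P>

```c
#include <stddef.h>
#include <string.h>

#define LOCAL_HDR_FIXED_SIZE 30     /* fixed part of the local header */

/* Hypothetical helper: given the seek value of a local file header,
 * return the offset where the actual file data starts, or -1 if
 * the header is missing or does not fit into the archive. */
long file_data_offset(const unsigned char *zip, size_t zipsize,
                      size_t hdr_off)
{
    static const unsigned char magic[4] = { 'P', 'K', 3, 4 };
    const unsigned char *h;
    size_t namlen, extras, start;

    if (hdr_off > zipsize || zipsize - hdr_off < LOCAL_HDR_FIXED_SIZE)
        return -1;

    h = zip + hdr_off;
    if (memcmp(h, magic, sizeof magic) != 0)
        return -1;

    /* The duplicate filename and extra info follow the fixed header;
     * their sizes are little-endian 16-bit items. */
    namlen = (size_t) h[26] | ((size_t) h[27] << 8);
    extras = (size_t) h[28] | ((size_t) h[29] << 8);

    start = hdr_off + LOCAL_HDR_FIXED_SIZE + namlen + extras;
    if (start > zipsize)
        return -1;
    return (long) start;
}
```

<P>
  The sizes must be read from the local header itself here, since the
  extra info stored there may differ in length from the copy in the
  central directory.
</P>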
<P>
  Once we are at the start of the actual file data, we can finally
  read data. The zziplib library only knows about two choices,
  defined by the value in the z_compr field. A value of "0" means
  "stored": the data has been stored in uncompressed format, so that
  we can just copy it out of the file to the application buffer.
</P>
<P>
  A value of "8" means "deflated", and here we initialize zlib, and
  all file data is decompressed before copying it to the application
  buffer. Care must be taken here since the sizes of the zlib input
  data and the decompressed data may differ significantly. The zlib
  compression does not even obey byte boundaries - a few bits may
  expand to hundreds of bytes. That's why each ZZIP_FILE has a
  decompression buffer attached.
</P>
<P>
  All the other z_compr values are only of historical meaning:
  the infozip unix tools will only create stored or deflated
  content, and the same applies to pkzip 2.x tools. If there is any
  value other than "0" or "8", then zziplib cannot decompress it,
  simple as that.
</P>
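<P>
  The decision on the compression value can be sketched as a simple
  dispatch. The enum, pick_read_mode and read_stored are hypothetical
  names for illustration; the deflated branch would hand the data to
  zlib, which is left out here to keep the sketch self-contained.
</P>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical read modes derived from the compression value. */
enum read_mode { READ_STORED, READ_DEFLATED, READ_UNSUPPORTED };

enum read_mode pick_read_mode(unsigned compr)
{
    switch (compr) {
    case 0:  return READ_STORED;        /* plain copy             */
    case 8:  return READ_DEFLATED;      /* needs zlib inflate     */
    default: return READ_UNSUPPORTED;   /* cannot be decompressed */
    }
}

/* Stored data: copy at most `remaining` bytes into the application
 * buffer and report how many bytes were actually copied. */
size_t read_stored(const unsigned char *data, size_t remaining,
                   void *buf, size_t len)
{
    size_t n = len < remaining ? len : remaining;
    memcpy(buf, data, n);
    return n;
}
```

<P>
  A caller would check the mode once when the file is opened and
  fail early on READ_UNSUPPORTED instead of discovering the problem
  on the first read.
</P>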

</section><section>
<h3> ZZIP_DIR / ZZIP_FILE </h3>

<P>
  The internal ZZIP_DIR structure stores a posix handle to the
  zip file, and a pointer to the parsed central directory block.
  One can use readdir/rewinddir to walk each entry in the central
  directory and compare against the filenames attached. That is
  what is done during a zzip_open call to find the file entry.
</P>
<P>
  There are a few more fields in the ZZIP_DIR structure, and
  most of these are related to the use of this struct as a
  shared resource. You can use zzip_file_open to walk the
  preparsed central directory and return a new ZZIP_FILE handle
  for the matching entry.
</P>
<P>
  That ZZIP_FILE handle contains a back pointer to the ZZIP_DIR
  that it was made from - and the back pointer also serves as a flag
  that the ZZIP_FILE handle points to a file within a ZIP file, as
  opposed to wrapping a real file in the real directory tree.
  Each ZZIP_FILE increments a shared counter, so that the
  next dir_close is deferred until all ZZIP_FILE handles have been
  destroyed.
</P>
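<P>
  The deferred close can be illustrated with a small reference
  counting sketch. All names here (my_dir, my_file, live_dirs and the
  functions) are hypothetical stand-ins and not the zziplib structures;
  the live_dirs counter exists only to make the effect observable.
</P>

```c
#include <stdlib.h>

struct my_dir  { int refcount; };
struct my_file { struct my_dir *dir; };  /* dir != NULL: zip member */

int live_dirs = 0;      /* number of directories not yet released */

struct my_dir *my_dir_open(void)
{
    struct my_dir *d = malloc(sizeof *d);
    if (d == NULL)
        return NULL;
    d->refcount = 1;    /* the reference held by the caller */
    live_dirs++;
    return d;
}

/* Drop one reference; the last one out releases the directory. */
static void my_dir_unref(struct my_dir *d)
{
    if (--d->refcount == 0) {
        free(d);
        live_dirs--;
    }
}

void my_dir_close(struct my_dir *d)
{
    my_dir_unref(d);    /* deferred while files still point here */
}

struct my_file *my_file_open(struct my_dir *d)
{
    struct my_file *f = malloc(sizeof *f);
    if (f == NULL)
        return NULL;
    f->dir = d;         /* back pointer to the shared directory */
    d->refcount++;
    return f;
}

void my_file_close(struct my_file *f)
{
    if (f->dir != NULL)
        my_dir_unref(f->dir);
    free(f);
}
```

<P>
  With this scheme the caller may close the directory handle right
  after opening a file from it; the directory stays alive until the
  last file handle is closed.
</P>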
<P>
  Another optimization is the cache pointer in the ZZIP_DIR. It is
  quite common to read data entries sequentially: the zip directory
  is scanned for files matching a specific pattern, and when a match
  is seen, that file is opened. However, each ZZIP_FILE needs a
  decompression buffer, so we keep a cache of the last one freed, and
  it can be picked up right away by the next zzip_file_open.
</P>
<P>
  Note that when zzip_open() is called multiple times directly, each
  call will open and parse a zip directory of its own. That's bloat
  both in terms of memory consumption and execution speed. One should
  try to take advantage of the feature that multiple ZZIP_FILE's can
  share a common ZZIP_DIR with a common preparsed copy of the
  zip's central directory. That can be done directly by using
  zzip_file_open, which uses a ZZIP_DIR as a factory for ZZIP_FILE,
  but zzip_freopen can also be used to reuse the old internal
  central directory instead of parsing it again.
</P>
<P>
  And while zzip_freopen releases the old ZZIP_FILE handle,
  only reusing the ZZIP_DIR attached, there is another routine,
  zzip_open_shared, that creates a new ZZIP_FILE from an existing
  ZZIP_FILE. There is no need to worry about problems when a
  filepath given to zzip_freopen() happens to be in another place,
  another directory, another zip archive. In that case, the old
  zzip's internal directory is freed and the other's directory is
  read - the preparsed central directory is only reused if that is
  actually possible.
</P>

</section></section>