Public Git Hosting - git.git/commit

cat-file: support "unordered" output for --batch-all-objects

If you're going to access the contents of every object in a
packfile, it's generally much more efficient to do so in
pack order, rather than in hash order. That increases the
locality of access within the packfile, which in turn is
friendlier to the delta base cache, since the packfile puts
related deltas next to each other. By contrast, hash order
is effectively random, since the sha1 has no discernible
relationship to the content.

This patch introduces an "--unordered" option to cat-file
which iterates over packs in pack-order under the hood. You
can see the results when dumping all of the file content:

  $ time ./git cat-file --batch-all-objects --buffer --batch | wc -c
  6883195596

  real 0m44.491s
  user 0m42.902s
  sys 0m5.230s

  $ time ./git cat-file --unordered \
                        --batch-all-objects --buffer --batch | wc -c
  6883195596

  real 0m6.075s
  user 0m4.774s
  sys 0m3.548s

Same output, different order, way faster. The same speed-up
applies even if you end up accessing the object content in a
different process, like:

  git cat-file --batch-all-objects --buffer --batch-check |
  grep blob |
  git cat-file --batch='%(objectname) %(rest)' |
  wc -c

Adding "--unordered" to the first command drops the runtime
in git.git from 24s to 3.5s.

  Side note: there are actually further speedups available
  for doing it all in-process now. Since we are outputting
  the object content during the actual pack iteration, we
  know where to find the object and could skip the extra
  lookup done by oid_object_info(). This patch stops short
  of that optimization since the underlying API isn't ready
  for us to make those sorts of direct requests.

So if --unordered is so much better, why not make it the
default? Two reasons:

  1. We've promised in the documentation that --batch-all-objects
     outputs in hash order. Since cat-file is plumbing,
     people may be relying on that default, and we can't
     change it.

  2. It's actually _slower_ for some cases. We have to
     compute the pack revindex to walk in pack order. And
     our de-duplication step uses an oidset, rather than a
     sort-and-dedup, which can end up being more expensive.
     If we're just accessing the type and size of each
     object, for example, like:

       git cat-file --batch-all-objects --buffer --batch-check

     my best-of-five warm cache timings go from 900ms to
     1100ms using --unordered. Though it's possible in a
     cold-cache or under memory pressure that we could do
     better, since we'd have better locality within the
     packfile.

And one final question: why is it "--unordered" and not
"--pack-order"? The answer is again two-fold:

  1. "pack order" isn't a well-defined thing across the
     whole set of objects. We're hitting loose objects, as
     well as objects in multiple packs, and the only
     ordering we're promising is _within_ a single pack. The
     rest is apparently random.

  2. The point here is optimization. So we don't want to
     promise any particular ordering, but only to say that
     we will choose an ordering which is likely to be
     efficient for accessing the object content. That leaves
     the door open for further changes in the future without
     having to add another compatibility option.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

commit	0750bb5b51f021ecad6f33b7ec88cdfc2a8cdff4
author	Jeff King <peff@peff.net>
	Fri, 10 Aug 2018 23:24:57 +0000 (10 19:24 -0400)
committer	Junio C Hamano <gitster@pobox.com>
	Mon, 13 Aug 2018 20:48:31 +0000 (13 13:48 -0700)
tree	526409555378591f44aeeaf319189c0d49d46d93	tree \| snapshot (tar.gz zip)
parent	b1adb38458679af98a057bcdc988a7f6ce1247d6	commit \| diff

Documentation/git-cat-file.txt		diff \| blob \| blame \| history
builtin/cat-file.c		diff \| blob \| blame \| history
t/t1006-cat-file.sh		diff \| blob \| blame \| history