From: Junio C Hamano <junkio@cox.net>
Subject: Re: Make "git clone" less of a deathly quiet experience
Date: Sun, 12 Feb 2006 19:36:41 -0800
Message-ID: <7v4q3453qu.fsf@assigned-by-dhcp.cox.net>
References: <Pine.LNX.4.64.0602102018250.3691@g5.osdl.org>
    <7vwtg2o37c.fsf@assigned-by-dhcp.cox.net>
    <Pine.LNX.4.64.0602110943170.3691@g5.osdl.org>
    <1139685031.4183.31.camel@evo.keithp.com> <43EEAEF3.7040202@op5.se>
    <1139717510.4183.34.camel@evo.keithp.com>
    <46a038f90602121806jfcaac41tb98b8b4cd4c07c23@mail.gmail.com>
Content-Type: text/plain; charset=us-ascii
Cc: Keith Packard <keithp@keithp.com>, Andreas Ericsson <ae@op5.se>,
    Linus Torvalds <torvalds@osdl.org>,
    Git Mailing List <git@vger.kernel.org>,
    Petr Baudis <pasky@suse.cz>
Return-path: <git-owner@vger.kernel.org>
In-Reply-To: <46a038f90602121806jfcaac41tb98b8b4cd4c07c23@mail.gmail.com>
    (Martin Langhoff's message of "Mon, 13 Feb 2006 15:06:42 +1300")

Martin Langhoff <martin.langhoff@gmail.com> writes:

> +1... there should be an easy-to-compute threshold trigger to say --
> hey, let's quit being smart and send this client the packs we got and
> get it over with. Or perhaps a client flag so large projects can
> recommend that uses do their initial clone with --gimme-all-packs?

What upload-pack does boils down to:

 * find out the latest of what the client has and what the client
   asked for.

 * run "rev-list --objects ^client ours" to make a list of
   objects the client needs. The actual command line has multiple
   "^client" arguments to exclude what does not need to be sent,
   and multiple "ours" arguments to include the refs asked for.
   When you are doing a full clone, ^client is empty and ours is
   essentially all of our refs.

 * feed that output to "pack-objects --stdout" and send out
   the resulting pack.

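To make the second step concrete, here is a minimal sketch of how the rev-list argument list could be assembled from what the client asked for and what it already has. The function name `rev_list_args` and the sample ref and object names are hypothetical, and this only builds the argument list; it is not the code upload-pack itself uses.

```python
def rev_list_args(wants, haves):
    """Build an argv for the object-listing step (illustrative only).

    wants: refs or object names the client asked for ("ours").
    haves: object names the client already has ("^client").
    """
    args = ["git", "rev-list", "--objects"]
    args += list(wants)                 # positive arguments: what to include
    args += ["^" + h for h in haves]    # negative arguments: what to exclude
    return args


# An incremental fetch excludes what the client already has.
incremental = rev_list_args(["refs/heads/master"], ["1234abcd"])
# A full clone has nothing to exclude and asks for everything.
full_clone = rev_list_args(["--all"], [])

print(incremental)
print(full_clone)
```

The output of such a command is what would then be piped into "pack-objects --stdout".
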
If you run this command:

    $ git-rev-list --objects --all |
      git-pack-objects --stdout >/dev/null

It would say some things. The phases of operation are:

    Generating pack...
    Counting objects XXXX...
    Done counting XXXX objects.
    Packing XXXXX objects.....

Phase (1). From the time it says "Generating pack..." up to
"Done counting XXXX objects.", the time is spent by rev-list
listing all the objects to be sent out.

Phase (2). After that, it decides which object to delta
against which other object, while twenty or so dots are
printed after "Packing XXXXX objects." (see the #git IRC log
from a couple of days ago; Linus describes how pack building
works).

Phase (3). After the dots stop, the program becomes silent.
That is where it actually does the delta compression and the
writeout.

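As an illustration of what phase (1) hands over to phase (2), here is a small sketch that parses "rev-list --objects" style output, where each line is an object name optionally followed by a space and a pathname, and groups the object names by pathname. The helper names and the sample data are made up for the example, and the grouping is a deliberate simplification: the real pack-objects heuristics for picking delta candidates are more involved than this.

```python
from collections import defaultdict


def parse_rev_list_objects(text):
    """Yield (object_name, path) pairs from rev-list --objects style output.

    Lines without a pathname (commits, for example) get path None.
    """
    for line in text.splitlines():
        if not line:
            continue
        name, sep, path = line.partition(" ")
        yield name, (path if sep else None)


def group_by_path(pairs):
    """Collect object names that share a pathname, keeping input order."""
    groups = defaultdict(list)
    for name, path in pairs:
        groups[path].append(name)
    return dict(groups)


# Made-up sample: one commit and three path-bearing objects.
sample = "\n".join([
    "a" * 40,
    "b" * 40 + " Makefile",
    "c" * 40 + " Makefile",
    "d" * 40 + " README",
])
groups = group_by_path(parse_rev_list_objects(sample))
print(groups["Makefile"])
```

Objects that end up in the same group are the ones most likely to delta well against each other.
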
You would notice that quite a lot of time is spent in all
phases.

There is an internal hook to create a full repository pack inside
upload-pack (which is what runs on the other end when you run
fetch-pack or clone-pack), but it works slightly differently
from what you are suggesting, in that it still tries to do the
"correct" thing. It still runs "rev-list --objects --all", so
"dangling objects" are never sent out.

We could cheat in all phases to speed things up, at the expense
of ending up sending excess objects. So let's pretend we
decided to treat everything in .git/objects/pack/pack-* (and
the ones found in alternates as well) as having interesting
objects.

(1) This part unfortunately cannot be totally eliminated. By
    assuming all packs are interesting, we could use the object
    names from the pack index, which is a lot cheaper than
    rev-list object traversal. We still need to run rev-list
    --objects --all --unpacked to pick up loose objects, which
    we would not be able to find by looking at the pack index.

    This however needs to be done in conjunction with the second
    phase change. pack-objects depends on the hint that rev-list
    --objects output gives it to group the blobs and trees with
    the same pathnames together, and that greatly affects the
    packing efficiency. Unfortunately the pack index does not
    have that information -- it knows neither types nor
    pathnames. Type is relatively cheap to obtain, but pathnames
    for blob objects are inherently unavailable.

(2) This part can be mostly eliminated for already packed
    objects, because we have already decided to cheat by sending
    everything, so we can just reuse how objects are deltified
    in existing packs. It still needs to be done for the loose
    objects we collected to fill the gap in (1).

(3) This also can be sped up by reusing what is already in
    packs. The pack index records the starting (but not the
    ending) offset of each object in the pack, so we can sort by
    offset to find out which part of the existing pack
    corresponds to which object, and use that to reorder the
    objects in the final pack. This needs to be done somewhat
    carefully to preserve the locality of objects (again, see
    the #git log). The deltifying and compressing for loose
    objects cannot be avoided.

    While we are writing things out in (3), we need to keep
    track of the running SHA1 sum of what we write out so that
    we can fill in the correct checksum at the end, but I am
    guessing that is relatively cheap compared to the
    deltification and compression cost we are currently paying.

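The index-based shortcuts in (1) and (3) can be sketched roughly as follows. This assumes the version-1 pack index layout (a 256-entry fanout table of 4-byte big-endian counts, then one 24-byte entry per object holding a 4-byte pack offset and a 20-byte object name, then two 20-byte checksums) and a pack that ends with a 20-byte SHA1 trailer. All function names are hypothetical, the index is synthetic, and nothing here deals with delta bases or object locality, so take it as an illustration rather than a design.

```python
import hashlib
import struct


def read_idx_v1(idx):
    """Return [(offset, object_name_hex)] from version-1 pack index bytes."""
    fanout = struct.unpack(">256I", idx[:1024])
    count = fanout[255]                  # last fanout slot = total object count
    entries = []
    for i in range(count):
        start = 1024 + 24 * i
        (offset,) = struct.unpack(">I", idx[start:start + 4])
        entries.append((offset, idx[start + 4:start + 24].hex()))
    return entries


def object_extents(entries, pack_size):
    """Map object name -> (start, end) byte range inside the pack.

    Only start offsets are recorded, so each object ends where the next
    one (by offset) starts; the last one ends at the 20-byte pack trailer.
    """
    by_offset = sorted(entries)
    ends = [off for off, _ in by_offset[1:]] + [pack_size - 20]
    return {name: (off, end) for (off, name), end in zip(by_offset, ends)}


def write_with_checksum(chunks):
    """Concatenate chunks and append the SHA1 of everything written."""
    running = hashlib.sha1()
    out = bytearray()
    for chunk in chunks:
        running.update(chunk)            # keep the running sum as we go
        out += chunk
    out += running.digest()
    return bytes(out)


# Synthetic two-object index; entries are sorted by name, not by offset.
names = [bytes([0x11]) * 20, bytes([0x22]) * 20]
offsets = [100, 12]
fanout = [sum(1 for n in names if n[0] <= i) for i in range(256)]
idx = struct.pack(">256I", *fanout)
idx += b"".join(struct.pack(">I", o) + n for o, n in zip(offsets, names))
idx += bytes(40)                         # placeholder checksums, not verified

extents = object_extents(read_idx_v1(idx), pack_size=200)
packed = write_with_checksum([b"PACK", b"data"])
print(extents)
```

With the extents in hand, the already-compressed bytes of each packed object could be copied straight from the old pack, leaving only the loose objects to be deltified and compressed.
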
NB. In the #git log, Linus made it sound like I am clueless
about how a pack is generated, but if you check commit 9d5ab96,
the "recency of delta is inherited from base" trick, one of the
tricks that have a big performance impact, was done by me ;-).