Documentation/technical/partial-clone.txt

   1 Partial Clone Design Notes
   2 ==========================
   3
   4 The "Partial Clone" feature is a performance optimization for Git that
   5 allows Git to function without having a complete copy of the repository.
   6 The goal of this work is to allow Git to better handle extremely large
   7 repositories.
   8
   9 During clone and fetch operations, Git downloads the complete contents
  10 and history of the repository.  This includes all commits, trees, and
  11 blobs for the complete life of the repository.  For extremely large
  12 repositories, clones can take hours (or days) and consume 100+GiB of disk
  13 space.
  14
  15 Often in these repositories there are many blobs and trees that the user
  16 does not need, such as:
  17
  18   1. files outside of the user's work area in the tree.  For example, in
  19      a repository with 500K directories and 3.5M files in every commit,
  20      we can avoid downloading many objects if the user only needs a
  21      narrow "cone" of the source tree.
  22
  23   2. large binary assets.  For example, in a repository where large build
  24      artifacts are checked into the tree, we can avoid downloading all
  25      previous versions of these non-mergeable binary assets and only
  26      download versions that are actually referenced.
  27
  28 Partial clone allows us to avoid downloading such unneeded objects *in
  29 advance* during clone and fetch operations and thereby reduce download
  30 times and disk usage.  Missing objects can later be "demand fetched"
  31 if/when needed.
  32
  33 Use of partial clone requires that the user be online and the origin
  34 remote be available for on-demand fetching of missing objects.  This may
  35 or may not be problematic for the user.  For example, if the user can
  36 stay within the pre-selected subset of the source tree, they may not
  37 encounter any missing objects.  Alternatively, the user could try to
  38 pre-fetch various objects if they know that they are going offline.
  39
  40
  41 Non-Goals
  42 ---------
  43
  44 Partial clone is a mechanism to limit the number of blobs and trees downloaded
  45 *within* a given range of commits -- and is therefore independent of and not
  46 intended to conflict with existing DAG-level mechanisms to limit the set of
  47 requested commits (i.e. shallow clone, single branch, or fetch '<refspec>').
  48
  49
  50 Design Overview
  51 ---------------
  52
  53 Partial clone logically consists of the following parts:
  54
  55 - A mechanism for the client to describe unneeded or unwanted objects to
  56   the server.
  57
  58 - A mechanism for the server to omit such unwanted objects from packfiles
  59   sent to the client.
  60
  61 - A mechanism for the client to gracefully handle missing objects (that
  62   were previously omitted by the server).
  63
  64 - A mechanism for the client to backfill missing objects as needed.
  65
  66
  67 Design Details
  68 --------------
  69
  70 - A new pack-protocol capability "filter" is added to the fetch-pack and
  71   upload-pack negotiation.
  72
  73   This uses the existing capability discovery mechanism.
  74   See "filter" in Documentation/technical/pack-protocol.txt.
  75
  76 - Clients pass a "filter-spec" to clone and fetch which is passed to the
  77   server to request filtering during packfile construction.
  78
  79   There are various filters available to accommodate different situations.
  80   See "--filter=<filter-spec>" in Documentation/rev-list-options.txt.
  81
  82 - On the server pack-objects applies the requested filter-spec as it
  83   creates "filtered" packfiles for the client.
  84
  85   These filtered packfiles are *incomplete* in the traditional sense because
  86   they may contain objects that reference objects not contained in the
  87   packfile and that the client doesn't already have.  For example, the
  88   filtered packfile may contain trees or tags that reference missing blobs
  89   or commits that reference missing trees.
  90
  91 - On the client these incomplete packfiles are marked as "promisor packfiles"
  92   and treated differently by various commands.
  93
  94 - On the client a repository extension is added to the local config to
  95   prevent older versions of git from failing mid-operation because of
  96   missing objects that they cannot handle.
  97   See "extensions.partialClone" in Documentation/technical/repository-version.txt.
  98
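The effect of server-side filtering can be illustrated with a toy object
graph.  This is a minimal sketch, not how pack-objects is implemented: the
object names, the `OBJECTS` table, and the `filtered_pack()` and
`dangling_references()` helpers are all hypothetical, and the "omit every
blob" rule is only similar in spirit to a filter-spec such as `blob:none`.

```python
# Toy object graph: oid -> (type, list of referenced oids).
# Illustrative only; this is not Git's on-disk or in-memory data model.
OBJECTS = {
    "commit1": ("commit", ["tree1"]),
    "tree1": ("tree", ["blob_a", "blob_b"]),
    "blob_a": ("blob", []),
    "blob_b": ("blob", []),
}


def reachable(objects, tips):
    """Return every object ID reachable from the given tips."""
    seen, stack = set(), list(tips)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(objects[oid][1])
    return seen


def filtered_pack(objects, tips, omit_types):
    """Hypothetical stand-in for a filter-spec: drop objects of some types."""
    return {
        oid for oid in reachable(objects, tips)
        if objects[oid][0] not in omit_types
    }


def dangling_references(objects, pack):
    """Objects referenced from inside the pack but not contained in it."""
    return {ref for oid in pack for ref in objects[oid][1] if ref not in pack}


pack = filtered_pack(OBJECTS, ["commit1"], omit_types={"blob"})
missing = dangling_references(OBJECTS, pack)
```

Here `pack` holds only the commit and the tree, while `missing` lists the
two blobs that the tree still references.  That gap is what makes the
packfile "incomplete" in the traditional sense.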
  99
 100 Handling Missing Objects
 101 ------------------------
 102
 103 - An object may be missing due to a partial clone or fetch, or missing due
 104   to repository corruption.  To differentiate these cases, the local
 105   repository specially indicates such filtered packfiles obtained from the
 106   promisor remote as "promisor packfiles".
 107
 108   These promisor packfiles consist of a "<name>.promisor" file with
 109   arbitrary contents (like the "<name>.keep" files), in addition to
 110   their "<name>.pack" and "<name>.idx" files.
 111
 112 - The local repository considers a "promisor object" to be an object that
 113   it knows (to the best of its ability) that the promisor remote has promised
 114   that it has, either because the local repository has that object in one of
 115   its promisor packfiles, or because another promisor object refers to it.
 116
 117   When Git encounters a missing object, Git can see if it is a promisor object
 118   and handle it appropriately.  If not, Git can report a corruption.
 119
 120   This means that there is no need for the client to explicitly maintain an
 121   expensive-to-modify list of missing objects.[a]
 122
 123 - Since almost all Git code currently expects any referenced object to be
 124   present locally and because we do not want to force every command to do
 125   a dry-run first, a fallback mechanism is added to allow Git to attempt
 126   to dynamically fetch missing objects from the promisor remote.
 127
 128   When the normal object lookup fails to find an object, Git invokes
 129   fetch-object to try to get the object from the server and then retry
 130   the object lookup.  This allows objects to be "faulted in" without
 131   complicated prediction algorithms.
 132
 133   For efficiency reasons, Git does not check whether the missing object
 134   is actually a promisor object before fetching it.
 135
 136   Dynamic object fetching tends to be slow as objects are fetched one at
 137   a time.
 138
 139 - `checkout` (and any other command using `unpack-trees`) has been taught
 140   to bulk pre-fetch all required missing blobs in a single batch.
 141
 142 - `rev-list` has been taught to print missing objects.
 143
 144   This can be used by other commands to bulk prefetch objects.
 145   For example, a "git log -p A..B" may internally want to first do
 146   something like "git rev-list --objects --quiet --missing=print A..B"
 147   and prefetch those objects in bulk.
 148
 149 - `fsck` has been updated to be fully aware of promisor objects.
 150
 151 - `repack` in GC has been updated to not touch promisor packfiles at all,
 152   and to only repack other objects.
 153
 154 - The global variable "fetch_if_missing" is used to control whether an
 155   object lookup will attempt to dynamically fetch a missing object or
 156   report an error.
 157
 158   We are not happy with this global variable and would like to remove it,
 159   but that requires significant refactoring of the object code to pass an
 160   additional flag.  We hope that concurrent efforts to add an ODB API can
 161   encompass this.
 162
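The promisor-object rule, the lookup fallback, and the bulk pre-fetch
described in this section can be sketched together.  This is a simplified
model under stated assumptions, not Git's object code: `ToyRepo`, its
attributes, and `MissingObjectError` are hypothetical names, objects are
reduced to a list of the IDs they reference, and we assume that dynamically
fetched objects end up in a promisor packfile.

```python
class MissingObjectError(Exception):
    """Raised when an object is absent and may not (or cannot) be fetched."""


class ToyRepo:
    """Hypothetical in-memory model; not Git's actual object store."""

    def __init__(self, local, promisor_packed, remote):
        self.local = dict(local)                     # oid -> referenced oids
        self.promisor_packed = set(promisor_packed)  # oids in promisor packfiles
        self.remote = remote                         # what the promisor remote serves
        self.fetch_if_missing = True                 # models the global described above
        self.fetch_count = 0                         # round trips made so far

    def is_promisor_object(self, oid):
        """True if oid is in a promisor packfile or referenced from one."""
        if oid in self.promisor_packed:
            return True
        return any(oid in self.local.get(p, ()) for p in self.promisor_packed)

    def fetch(self, oids):
        """One round trip to the promisor remote for a batch of objects."""
        self.fetch_count += 1
        for oid in oids:
            if oid in self.remote:
                self.local[oid] = self.remote[oid]
                # Assumption: fetched objects are stored in a promisor packfile.
                self.promisor_packed.add(oid)

    def lookup(self, oid):
        """Normal lookup, falling back to a dynamic fetch and a retry."""
        if oid not in self.local:
            if not self.fetch_if_missing:
                raise MissingObjectError(oid)
            # No is_promisor_object() check is made before fetching.
            self.fetch([oid])
            if oid not in self.local:
                raise MissingObjectError(oid)
        return self.local[oid]

    def prefetch(self, oids):
        """Bulk pre-fetch: request every missing object in a single batch."""
        missing = [oid for oid in oids if oid not in self.local]
        if missing:
            self.fetch(missing)
        return missing


remote = {
    "tree1": ["blob_a", "blob_b", "blob_c"],
    "blob_a": [], "blob_b": [], "blob_c": [],
}
repo = ToyRepo(local={"tree1": remote["tree1"]},
               promisor_packed={"tree1"}, remote=remote)
```

In this model `repo.lookup("blob_a")` faults the blob in with one round
trip, whereas `repo.prefetch(["blob_a", "blob_b", "blob_c"])` collects the
remaining two blobs in a single additional round trip instead of one per
object.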
 163
 164 Fetching Missing Objects
 165 ------------------------
 166
 167 - Fetching of objects is done using the existing transport mechanism via
 168   transport_fetch_refs(), setting a new transport option
 169   TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are
 170   desired, not any object that they refer to.
 171
 172   Because some transports invoke fetch_pack() in the same process, fetch_pack()
 173   has been updated to not use any object flags when the corresponding argument
 174   (no_dependents) is set.
 175
 176 - The local repository sends a request with the hashes of all requested
 177   objects as "want" lines, and does not perform any packfile negotiation.
 178   It then receives a packfile.
 179
 180 - Because we are reusing the existing fetch-pack mechanism, fetching
 181   currently fetches all objects referred to by the requested objects, even
 182   though they are not necessary.
 183
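The shape of such a request can be sketched with pkt-line framing, in which
each line is prefixed by four hex digits giving its total length.  This is a
rough illustration of a "want"-only request in the style of pack protocol
V0, not the code path Git uses: `pkt_line()` and `build_want_request()` are
hypothetical helpers, and the capability handling is reduced to appending
names to the first "want" line.

```python
FLUSH_PKT = b"0000"  # flush-pkt: marks the end of the "want" list


def pkt_line(payload: bytes) -> bytes:
    """Frame a payload as a pkt-line: 4 hex digits of total length, then data."""
    return b"%04x" % (len(payload) + 4) + payload


def build_want_request(oids, capabilities=()):
    """Build a request made only of "want" lines, with no "have" lines.

    Hypothetical helper: the first "want" line carries the capability
    list; the request ends with a flush-pkt and a "done" line.
    """
    if not oids:
        raise ValueError("at least one object ID is required")
    lines = []
    for i, oid in enumerate(oids):
        line = "want " + oid
        if i == 0 and capabilities:
            line += " " + " ".join(capabilities)
        lines.append(pkt_line((line + "\n").encode("ascii")))
    return b"".join(lines) + FLUSH_PKT + pkt_line(b"done\n")


# A made-up 40-hex-digit object ID, used purely for illustration.
example_oid = "a" * 40
request = build_want_request([example_oid])
```

Because no "have" lines are sent, the server has nothing to negotiate
against and simply answers with a packfile.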
 184
 185 Current Limitations
 186 -------------------
 187
 188 - The remote used for a partial clone (or the first partial fetch
 189   following a regular clone) is marked as the "promisor remote".
 190
 191   We are currently limited to a single promisor remote and only that
 192   remote may be used for subsequent partial fetches.
 193
 194   We accept this limitation because we believe initial users of this
 195   feature will be using it on repositories with a strong single central
 196   server.
 197
 198 - Dynamic object fetching will only ask the promisor remote for missing
 199   objects.  We assume that the promisor remote has a complete view of the
 200   repository and can satisfy all such requests.
 201
 202 - Repack essentially treats promisor and non-promisor packfiles as two
 203   distinct partitions and does not mix them.  Repack currently only works
 204   on non-promisor packfiles and loose objects.
 205
 206 - Dynamic object fetching invokes fetch-pack once *for each item*
 207   because most algorithms stumble upon a missing object and need to have
 208   it resolved before continuing their work.  This may incur significant
 209   overhead -- and multiple authentication requests -- if many objects are
 210   needed.
 211
 212 - Dynamic object fetching currently uses the existing pack protocol V0
 213   which means that each object is requested via fetch-pack.  The server
 214   will send a full set of info/refs when the connection is established.
 215   If there are a large number of refs, this may incur significant overhead.
 216
 217
 218 Future Work
 219 -----------
 220
 221 - Allow more than one promisor remote and define a strategy for fetching
 222   missing objects from specific promisor remotes or for iterating over the
 223   set of promisor remotes until a missing object is found.
 224
 225   A user might want to have multiple geographically-close cache servers
 226   for fetching missing blobs while continuing to do filtered `git-fetch`
 227   commands from the central server, for example.
 228
 229   Or the user might want to work in a triangular work flow with multiple
 230   promisor remotes that each have an incomplete view of the repository.
 231
 232 - Allow repack to work on promisor packfiles (while keeping them distinct
 233   from non-promisor packfiles).
 234
 235 - Allow non-pathname-based filters to make use of packfile bitmaps (when
 236   present).  This was just an omission during the initial implementation.
 237
 238 - Investigate use of a long-running process to dynamically fetch a series
 239   of objects, such as proposed in [5,6] to reduce process startup and
 240   overhead costs.
 241
 242   It would be nice if pack protocol V2 could allow that long-running
 243   process to make a series of requests over a single long-running
 244   connection.
 245
 246 - Investigate pack protocol V2 to avoid the info/refs broadcast on
 247   each connection with the server to dynamically fetch missing objects.
 248
 249 - Investigate the need to handle loose promisor objects.
 250
 251   Objects in promisor packfiles are allowed to reference missing objects
 252   that can be dynamically fetched from the server.  An assumption was
 253   made that loose objects are only created locally and therefore should
 254   not reference a missing object.  We may need to revisit that assumption
 255   if, for example, we dynamically fetch a missing tree and store it as a
 256   loose object rather than a single object packfile.
 257
 258   This does not necessarily mean we need to mark loose objects as promisor;
 259   it may be sufficient to relax the object lookup or is-promisor functions.
 260
 261
 262 Non-Tasks
 263 ---------
 264
 265 - Every time the subject of "demand loading blobs" comes up it seems
 266   that someone suggests that the server be allowed to "guess" and send
 267   additional objects that may be related to the requested objects.
 268
 269   No work has gone into actually doing that; we're just documenting that
 270   it is a common suggestion.  We're not sure how it would work and have
 271   no plans to work on it.
 272
 273   It is valid for the server to send more objects than requested (even
 274   for a dynamic object fetch), but we are not building on that.
 275
 276
 277 Footnotes
 278 ---------
 279
 280 [a] expensive-to-modify list of missing objects:  Earlier in the design of
 281     partial clone we discussed the need for a single list of missing objects.
 282     This would essentially be a sorted linear list of OIDs that were
 283     omitted by the server during a clone or subsequent fetches.
 284
 285     This file would need to be loaded into memory on every object lookup.
 286     It would need to be read, updated, and re-written (like the .git/index)
 287     on every explicit "git fetch" command *and* on any dynamic object fetch.
 288
 289     The cost to read, update, and write this file could add significant
 290     overhead to every command if there are many missing objects.  For example,
 291     if there are 100M missing blobs, this file would be at least 2GiB on disk.
 292
 293     With the "promisor" concept, we *infer* a missing object based upon the
 294     type of packfile that references it.
 295
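The size estimate in footnote [a] can be checked with a short calculation.
This is a back-of-the-envelope sketch that assumes each entry is a raw
20-byte binary object ID with no header, index, or per-entry overhead;
neither assumption is stated in the footnote, and `oid_list_size_bytes()`
is a hypothetical helper.

```python
def oid_list_size_bytes(num_objects, oid_bytes=20):
    """Size of a flat list of binary object IDs, ignoring any overhead."""
    return num_objects * oid_bytes


size_bytes = oid_list_size_bytes(100_000_000)  # 100M missing blobs
size_gib = size_bytes / 2**30                  # convert bytes to GiB
```

Under these assumptions the bare list comes to 2,000,000,000 bytes, or
roughly 1.86 GiB, which is the same order of magnitude as the figure quoted
in the footnote.  Any header, index, or per-entry metadata would add to it.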
 296
 297 Related Links
 298 -------------
 299 [0] https://bugs.chromium.org/p/git/issues/detail?id=2
 300     Chromium work item for: Partial Clone
 301
 302 [1] https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/
 303     Subject: [RFC] Add support for downloading blobs on demand
 304     Date: Fri, 13 Jan 2017 10:52:53 -0500
 305
 306 [2] https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/
 307     Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
 308     Date: Fri, 29 Sep 2017 13:11:36 -0700
 309
 310 [3] https://public-inbox.org/git/20170426221346.25337-1-jonathantanmy@google.com/
 311     Subject: Proposal for missing blob support in Git repos
 312     Date: Wed, 26 Apr 2017 15:13:46 -0700
 313
 314 [4] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/
 315     Subject: [PATCH 00/10] RFC Partial Clone and Fetch
 316     Date: Wed,  8 Mar 2017 18:50:29 +0000
 317
 318 [5] https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/
 319     Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module
 320     Date: Fri,  5 May 2017 11:27:52 -0400
 321
 322 [6] https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/
 323     Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand
 324     Date: Fri, 14 Jul 2017 09:26:50 -0400