From 0df1370df327fbc3a6e940c251ac5704d6dc5b1f Mon Sep 17 00:00:00 2001
From: "Kyle J. McKay"
Date: Sat, 16 Dec 2017 03:57:04 -0800
Subject: [PATCH] gc.sh: the new order

In the very beginning Girocco used repack -A -d when running garbage
collection.  That quickly became -a -d instead.  Eventually though that
migrated back to -A -d.

The flip-flops represent a switch between wanting to be as efficient as
possible, avoiding loose objects lying around if at all possible, and the
need to keep objects that have recently become unreachable or are in the
process of becoming reachable around long enough for them to become
reachable (via a ref change) or no longer be needed (for a ref notification
message).

On the one hand the "-a -d" options give efficiency but risk repository
corruption due to Git race conditions with simultaneous pushes and garbage
collection.  On the other hand the "-A -d" options can produce extensive
loose object splatter resulting in lost efficiency and disk space.  With
"-A -d" the "git prune" command must be used and that has its own race
conditions.

By using a number of extra helpers in the hooks and configuration options,
Girocco has gotten along with a "git repack -A -d" + "git prune
--expire=1.day.ago" + hard-linking loose objects into child forks technique
that eliminates most of the race conditions (the "Push Pack Redux" described
in README-GC was not handled) and makes simultaneous pushing and garbage
collection safe at the expense of some serious loose object splatter when
commits are discarded (either via non-fast-forward updates or outright ref
deletion).

The current situation remains unsatisfactory from an efficiency standpoint.

Now we revamp the way Girocco does garbage collection and eliminate use of
both the -A and -d options.  In fact, we eliminate use of git repack and git
prune entirely in order to remedy the situation.  To get a finer degree of
control over what happens and when, we switch to using git pack-objects
directly.
With the exception of the "--write-bitmap-index" option, all the options we
might want to pass to pack-objects have been available since before Git
v1.6.6 which is now the minimum Girocco requires.  That makes it not really
all that complex to use pack-objects directly.

The new garbage collection strategy itself is not quite so simple.  We adopt
a new four phase garbage collection strategy:

  I) Create a new all-in-one pack (this pack may have a bitmap)
 II) Create a pack of "recently reachable" objects and "friends"
III) Create a pack of "unreachable" objects if any child forks exist
 IV) Migrate all remaining loose objects into a pack

Phase I corresponds basically to "git repack -a".  Note that there's no
deletion or removal going on during phase I.  Non-forks will get a bitmap
(with Git v2.1.0 or later) and this pack will serve as the "virtual bundle".

Phase II became possible when Girocco started keeping a record of ref
changes in the reflogs subdirectory (the pre-receive hook and the update.sh
script are responsible for this).  We simply do another full packing after
adding temporary refs for everything in the reflogs directory (that's not
too old) and, in a nod to linked working trees, ref logs, index files and
detached HEADs are included as "friends".  By manipulating the packing
options it's easy to exclude everything that was already packed in phase I.

Phase III will be skipped unless forks are present.  One more iteration over
the same set of refs Phase II used but passing the "--keep-unreachable"
option and excluding everything packed in either phase I or phase II (or in
phase III packs hard-linked into the fork from its parent) gives us this
pack.  It won't appear in the project itself, but will be hard-linked to all
immediate child forks which will in turn hard-link it down to their
children, if any, when they run gc.
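The phase I step can be sketched with a few commands.  This is illustrative
only -- a simplified, self-contained approximation rather than the actual
gc.sh code -- but it shows the core idea: drive pack-objects directly the
way "git repack -a" would, with no deletion step afterwards:

```shell
# Illustrative sketch of phase I (not the actual gc.sh code): build a
# single all-in-one pack of everything reachable by calling pack-objects
# directly.  A throwaway bare repository is created just for the demo.
set -e
export GIT_AUTHOR_NAME=t GIT_AUTHOR_EMAIL=t@example.com
export GIT_COMMITTER_NAME=t GIT_COMMITTER_EMAIL=t@example.com

repo=$(mktemp -d)
git init --bare -q "$repo"

# Seed one commit so there is something reachable to pack
tree=$(GIT_DIR=$repo git mktree </dev/null)
commit=$(GIT_DIR=$repo git commit-tree "$tree" -m init)
GIT_DIR=$repo git update-ref refs/heads/master "$commit"

# Phase I: pack everything reachable from all refs into one new pack.
# "--all" implies "--revs" (so stdin is read for extra revs; none here)
# and pack-objects prints the name (hash) of the pack it wrote.
phase1=$(GIT_DIR=$repo git pack-objects --all --non-empty \
	--delta-base-offset "$repo/objects/pack/pack" </dev/null)
echo "phase I pack: $phase1"
```

Note that nothing is deleted here; removal of the old redundant packs is a
separate, later step.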
At this point any non-keep packs that existed prior to phase I can be
removed (this includes phase III packs that had been hard-linked down from
the parent).  Special techniques are used to avoid a "Push Pack Redux" race
and to not remove any packs that have been "freshened" by some kind of
simultaneous non-Girocco loose object activity.

Phase IV can be accomplished simply by packing all loose objects that are
not already in a phase I or phase II pack.  We deliberately exclude any
phase III packs here in an attempt to make sure that any unreachable loose
objects live on for at least one min_gc_interval.  This again is a nod to
possible linked working trees.  A final "git prune-packed" takes care of
removing loose object detritus.

At first blush this may appear to be more work.  It is.  But only very
slightly.  Previously the "git repack -A -d" was doing a full reachability
trace and so was the following "git prune".  The "git prune-packed"
operation (which is very fast, slow file systems notwithstanding) was being
done internally by the "git repack" command.  In effect there's only one
additional reachability trace that was not being done before and then only
if any forks are present.  It has the benefit of having a nice new
all-in-one pack to work with that has just recently been in the disk cache.

There is the savings from not having to deal with loose object splatter, but
unless the file system really underperforms that will probably not quite
offset the time for the extra reachability trace in phase III.  Also it will
not actually pack anything unless there have been discarded commits.  For
that matter, phase II also will not pack anything unless there have been
discarded commits.  Phase IV will not produce any pack unless some
non-Girocco activity has produced loose objects (Girocco takes care to make
sure its external VCS imports pack up their leavings).  Also phase IV does
NOT do a reachability trace, it only looks in the loose object directories.
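The phase IV loose-object migration can be sketched as follows.  Again this
is a simplified approximation and not the actual gc.sh code; the point it
demonstrates is that no reachability trace is involved -- the loose object
fan-out directories are simply scanned and everything found is packed:

```shell
# Illustrative sketch of phase IV (not the actual gc.sh code): migrate
# whatever loose objects remain into a pack of their own by scanning the
# two-hex-digit fan-out directories, then drop the redundant loose copies.
set -e
repo=$(mktemp -d)
git init --bare -q "$repo"

# Create a loose object so there is something to migrate
blob=$(printf 'hello\n' | GIT_DIR=$repo git hash-object -w --stdin)

# List loose objects by directory scan (no reachability trace!) and feed
# the reassembled object names to pack-objects on stdin
( cd "$repo/objects" && find [0-9a-f][0-9a-f] -type f 2>/dev/null |
	sed 's,/,,' ) |
	GIT_DIR=$repo git pack-objects -q "$repo/objects/pack/pack" >/dev/null

# Finally, remove loose objects that are now available from a pack
GIT_DIR=$repo git prune-packed -q
```

After this runs the object is still retrievable (it lives in the new pack)
but its loose copy is gone.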
In the absence of loose objects it takes practically no time at all.

This means that in the absence of discarded commits there will still be only
one pack remaining after gc runs.  (Unless the pack.packSizeLimit config
option has been set and it caused multiple packs to be produced which is
tolerated but not recommended.)

The only things that will be hard-linked down to child forks are the phase
III packs (and then only if they're non-empty).  This represents a huge
improvement over hard-linking all the loose object splatter created by
"git repack -A -d".

As part of this "new order," non-Girocco operations on the repository (such
as use of linked working trees) are "tolerated."  Since Git has no such
thing as a "looseobject.keep" file, there will always be a race possible
between new objects becoming reachable (via a ref update) and garbage
collection having decided they're unreachable and removing them.  The window
remains very small.  Minuscule in fact.  But it does exist.

There is no such problem with simultaneous incoming pushes and garbage
collection (with the new order even the "Push Pack Redux" hole has been
closed).

If that "minuscule" window of opportunity for repository corruption cannot
be accepted, linked working trees (or even using Girocco on non-bare
repositories -- oh, the horror) must not be used and all new content needs
to be "pushed" into the repository from a clone (one that does _not_ make
use of the --reference facility either) or else the timing of gc operations
must be strictly controlled to guarantee that no preexisting unreachable
objects are becoming reachable during gc.

We also take this opportunity to update the README-GC documentation and
modify the perform-pre-gc-linking.sh script to reflect the new order of
things.

Signed-off-by: Kyle J.
McKay --- jobd/README-GC | 306 +++++++++------ jobd/gc.sh | 795 ++++++++++++++++++++++++++++++-------- toolbox/perform-pre-gc-linking.sh | 429 +++++++------------- 3 files changed, 972 insertions(+), 558 deletions(-) rewrite toolbox/perform-pre-gc-linking.sh (76%) diff --git a/jobd/README-GC b/jobd/README-GC index 3a4681c..fc206e3 100644 --- a/jobd/README-GC +++ b/jobd/README-GC @@ -126,6 +126,25 @@ Girocco does not use `git repack -a -d` for garbage collection. (Girocco had a dubious flirtation with `-a -d` until [a7ea68a8a80e43d5][1].) +Girocco does, however, use a technique that bears some resemblance to the +mechanism that `git repack -a -d` uses in that it records the names of all +preexisting non-keep packs just before starting gc and then removes them +after gc completes (provided they have not been "freshened" in the meantime). + +Girocco deals with the "Push Pack Redux" problem by renaming the preexisting +packs while collecting their names. After the rename is complete it checks to +see if any .keep files have appeared for the original name and creates a +corresponding .keep file for the new name (and excludes it from the list). + +If the rename happens in the midst of a "Push Pack Redux", only the renamed +version might be left (hence the transfer of the .keep file). If there is +no .keep file then either no "Push Pack Redux" has started or it's already +finished its ref updates in which case it will be picked up by the future +reachability trace that hasn't started yet. + +The renamed packs get an added suffix so that no incoming push can ever +possibly generate a pack name collision thereby avoiding the problem. + Simultaneous Pushes ~~~~~~~~~~~~~~~~~~~ @@ -186,59 +205,24 @@ at least one day after they initially become unreachable. It uses two mechanisms in combination to achieve this: - 1. A pre-receive hook "touch"es any loose objects that match either the new - _or_ the old ref values for any of the incoming ref changes. - - 2. 
At the beginning of garbage collection (in `gc.sh`) all pre-existing packs - are "touch"ed. - -The `gc.sh` script uses `git repack -A -d` and `git prune --expire=1.day.ago` -which then guarantees that objects are kept for at least 24 hours from the time -they most recently became unreachable. - -When the `-A -d` options cause unreachable objects to pop out of their packs, -Git gives them a modification time matching that of the pack. Failing to -"touch" the packs before starting garbage collection could allow loose objects -to pop out of their packs already more than one day old and get pruned -immediately thereby failing to address the problem. - -By having the pre-receive hook touch loose objects matching either the old or -new value for each ref it guarantees they will be less than one day old. Since -the pre-receive hook runs before refs are updated, any old ref values that -become unreachable and are loose objects will be "fresh" and less than one day -old. - -Git itself will only "freshen" loose objects (or the pack that contains a -non-loose object) when the object is written. This would happen when an -incoming push pack gets unpacked, but that's being disabled by -transfer.unpackLimit = 1 (and very large pushes would never be unpacked -anyway). Git never makes any attempt to "freshen" old ref values at all. - -Since Git v2.2.0, loose objects that are old enough to be pruned will be kept -alive and not pruned if they are reachable from other loose objects that are -not yet old enough to be pruned (see [d3038d22f91aad96][2]). - -Reconsidring the 5 step scenario above, this prevents the problem so long as -Client A takes less than one day to complete its push. - -Consider a slightly different variation where Client B performs the rewind of -`master` (step 2) several days before and then garbage collection happens -shortly thereafter. 
The rewound commit would end up popping out of its pack -and by the time Client A does its push, it would already be over one day old -and garbage collection would remove it and the incoming push from Client A -would not freshen it (it would be neither the old nor the new ref value for the -`some-topic` branch). - -But, because `master` has already been rewound in this variation by the time -Client A does its push, it will always include that rewound and dropped commit -in its push pack and since we have transfer.unpackLimit set to 1 a simultaneous -garbage collection might very well remove the loose object version of the -rewound and dropped commit, but there will be another copy in the push pack -from Client A that persists because its incoming push pack was not unpacked. - -The workaround depends on several things being done, but it is effective -provided pushes don't take more than a day (but that time limit could easily be -extended if necessary). + 1. A pre-receive hook records both the old and new hash values for all ref + changes in a file in the `reflogs` subdirectory. If it cannot record the + hash values for some reason it aborts the push. (The update.sh script does + the same thing for mirrors to guarantee availability for ref change + notifications.) + + 2. As described later, Girocco's garbage collection consists of four phases. + During phase II, all valid recent hashes from the `reflogs` directory are + briefly resurrected to create a pack that contains all recently reachable + objects that are now unreachable. + +The result is that any objects that have become unreachable within the last +day are always kept during gc. (The interval is configurable to be longer, +but is always forced to be a minimum of one day.) + +This gc behavior effectively eliminates the problem provided pushes don't take +more than a day (but that time limit could easily be extended if necessary by +simply bumping up the config item from the default of one day). 
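The recording step described in item 1 above might be sketched like this.
The function name and file layout are illustrative assumptions only, not
the actual hook code; the essential behavior is that every "old new ref"
line is durably recorded and any failure to record aborts the push:

```shell
# Hypothetical sketch (illustrative names, not Girocco's actual hook):
# a pre-receive hook receives "old new refname" lines on stdin.  Each
# line is appended to a timestamped file under the given directory; any
# write failure returns nonzero so the caller can abort the push.
record_ref_changes() {
	_dir=$1
	mkdir -p "$_dir" || return 1
	_log=$_dir/$(date +%s)
	while read -r _old _new _ref; do
		printf '%s %s %s\n' "$_old" "$_new" "$_ref" >>"$_log" || return 1
	done
	# Emit the log path so callers (and later gc runs) can find it
	printf '%s\n' "$_log"
}
```

In an actual pre-receive hook this would run against the hook's own stdin,
something like `record_ref_changes reflogs >/dev/null || exit 1`.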
Shared Repositories @@ -268,14 +252,21 @@ But this has an undesirable side effect. Since mode 0444 files can only be "touched" by the owner, only the initial creator of the loose object or pack will be able to change its modification time (aka "freshen" it). -This would seem to defeat the workaround described in the previous section. +Girocco does not rely on "freshening" happening for normal Girocco operations. + +However, Girocco now "tolerates" linked working trees to its repositories and +while it's not possible to eliminate the race described in (i) from occurring +when non-Girocco activities transition loose objects from unreachable to +reachable during a garbage collection, it can be minimized. + +The "freshening" of loose objects (or their containing packs) plays a crucial +role in minimizing the race condition window. -In order to make the workaround feasible, both the pre-receive hook and the -garbage collection script take great pains to make sure that loose objects and -packs remain writable by both owner and group. By doing this the "touch"ing -that happens in the pre-receive hook (for loose objects) and garbage collection -script (for pre-existing packs and newly loosened objects) will always effect a -change in the modification time of the files being touched. +For this reason, both the pre-receive hook and the garbage collection script +take great pains to make sure that loose objects and packs remain writable by +both owner and group. By doing this the "touch"ing that happens when an object +that already exists is re-created will always effect a change in the +modification time of the files being touched thereby "freshening" them. ------------ Fork Follies @@ -297,71 +288,148 @@ condition with regard to incoming pushes during the reachability trace. One way to prevent corruption of the forks would be to never remove any objects at all. Not very conducive to efficient disk space usage. -But that does point out the problem. 
Forks are potentially corrupted by object -removal. By using `git repack -A -d` that will never remove any objects -- -they may migrate from a loose into a packed state (for reachable objects since -an implicit `git prune-packed` occurs at the end of `git repack -A -d`) or they -may migrate from a packed state to a loose state (for unreachable objects), but -by using the `-A` option no objects will ever be lost. - -That leaves as the only command that can ever actually remove objects the -`git prune --expire=1.day.ago` command. - -After running `git repack -A -d` but _before_ running git prune, all existing -loose objects are hard-linked down into the next level of forks (if any). +If a project has forks, during phase III of garbage collection, any objects +that are found to be unreachable and are currently in packs are repacked into +their own pack which is then hard-linked down to all child forks (but removed +from the project itself). -As far as the forks are concerned the objects haven't been removed even if they -end up being pruned from the parent. Ergo no corruption of the forks can ever -occur. Problem Solved. - ------------------ -Future Directions ------------------ - -Non-bare Repositories -~~~~~~~~~~~~~~~~~~~~~ - -Girocco currently only officially supports "bare" repositories without any ref -logs or linked working trees (linked working trees were introduced with Git -v2.5.0). - -There will always be a race condition with simultaneous garbage collection and -creation of new commits being built with loose objects. This is just inherent -in the way Git works. However, the window for a problem is relatively small -and if that's understood and endurable, Girocco could be more tolerant of -non-bare and/or ref logs and/or linked working trees and be better about not -ignoring them (i.e. dropping those objects) during repacking. 
-
-
-Loose Object Explosion
-~~~~~~~~~~~~~~~~~~~~~~
-
-Use of `git repack -A -d` can have nasty repercussions when large numbers of
-objects become unreachable due to non-fast-forward ref updates.
-
-Git deals with objects far more efficiently when they are all packed up nice
-and neat in one (or just a few) packs.  Having thousands of loose objects
-around impairs efficiency from both a space and speed standpoint.
-
-By using the reflogs files that Girocco maintains (the pre-receive hook and the
-update.sh script write them and the gc.sh script prunes them), it's possible to
-create another pack of "recently become unreachable" objects in lieu of using
-`git repack -A -d` rather than popping unreachable objects out of their pack
-(or just keeping them around indefinitely with the associated disk bloat).
+These special "unreachable" packs are given a unique extension and any such
+packs received from a parent project are treated as .keep packs and also
+hard-linked down into child forks along with any freshly generated phase III
+pack.
-For forks yet another `--keep-unreachable` pack could be made and hard-linked
-into the next level of forks (if any) but _not_ the parent.  This would only be
-necessary if forks were present.
+Since Girocco garbage collection never removes loose objects but only redundant
+packs this is enough to eliminate the problem.
+
+(Loose objects are migrated into a "loose objects pack" but that pack then
+remains until the following gc cycle thereby guaranteeing all loose objects a
+minimum lifetime notwithstanding untimely administrator intervention.)
+
+--------------------
+Linked Working Trees
+--------------------
+
+Along with the new order for garbage collection, Girocco now "tolerates" use
+of linked working trees on its repositories.
Since there's little difference
+between a linked working tree and a non-bare repository this could allow use
+of non-bare repositories with Girocco too, but that's highly discouraged due
+to the problems with pushing to a non-bare HEAD branch and shall not be
+discussed any further. ;)
+
+(For those determined to pursue this option a `push-to-checkout` hook will
+likely be required as well as setting the `receive.denyCurrentBranch` config
+option along with Git version 2.4.0 or later.)
+
+Girocco supports linked working trees by making sure that all objects reachable
+from any index, ref log or detached HEAD that do not end up in the phase I pack
+(and are not borrowed from an alternate) end up in the phase II pack.
+
+Additionally, in order to minimize the window for race condition loss of
+objects due to simultaneous garbage collection while, for example, creating a
+new commit in a linked working tree, Girocco attempts to make sure that all
+unreachable loose objects persist for at least one min_gc_interval time period.
+
+Furthermore, if one of the redundant packs gets "freshened" during garbage
+collection (which could happen during commit creation if one of the loose
+objects being created already exists in a pack), that pack will not be
+removed.  There's a very minuscule window in which the check for "freshened"
+packs could run and find none, the pack could be freshened immediately
+thereafter, and it could still end up removed anyway.
+
+This can't be helped; Git does not have any ".keep" mechanism for loose
+objects.  Girocco itself avoids the problem, case (i) above, by preventing
+incoming pushes from unpacking their packs.
+
+If the minuscule race condition risk remains unacceptable, either linked
+working trees must not be used, the time when Girocco garbage collection runs
+must be strictly controlled or all updates must be pushed in from a clone that
+does _not_ use the `--reference` mechanism to refer to the Girocco repository.
+
+----------------------------------
+Girocco Garbage Collection Process
+----------------------------------
+
+Girocco now uses the new order for garbage collection.
+
+None of the "git gc", "git repack" or "git prune" commands are run either
+directly or indirectly.  New packs are created via use of the
+"git pack-objects" command.  The limited "pruning" that takes place involves
+only pruning loose objects that are packed and removing redundant packs after
+creating new all-in-one packs.
+
+It is the removal of redundant packs that can permanently remove some
+unreachable objects.
+
+Girocco uses a four phase garbage collection strategy (the new order):
+
+  I) Create a new all-in-one pack (this pack may have a bitmap)
+ II) Create a pack of "recently reachable" objects and "friends"
+III) Create a pack of "unreachable" objects if any child forks exist
+ IV) Migrate all remaining loose objects into a pack
+
+Phase I corresponds basically to "git repack -a".  Note that there's no
+deletion or removal going on during phase I.  Non-forks will get a bitmap
+(with Git v2.1.0 or later) and this pack will serve as the "virtual bundle".
+
+Phase II became possible when Girocco started keeping a record of ref changes
+in the reflogs subdirectory (the pre-receive hook and the update.sh script are
+responsible for this).  We simply do another full packing after adding
+temporary refs for everything in the reflogs directory (that's not too old).
+In a nod to linked working trees, anything reachable from ref logs, index files
+or detached HEADs is included as "friends".  By manipulating the packing
+options it's easy to exclude everything that was already packed in phase I.
+
+Phase III will be skipped unless forks are present.  One more iteration over
+the same set of refs Phase II used but passing the "--keep-unreachable" option
+and excluding everything packed in either phase I or phase II (or in phase III
+packs hard-linked into the fork from its parent) gives us this pack.
It won't +appear in the project itself, but will be hard-linked to all immediate child +forks (along with any phase III packs from the parent) which will in turn +hard-link it (and them) down to their children, if any, when they run gc. + +At this point any non-keep packs that existed prior to phase I are removed +(this includes phase III packs that had been hard-linked down from the parent). +Special techniques are used to avoid a "Push Pack Redux" race and to not remove +any packs that have been "freshened" by some kind of simultaneous non-Girocco +loose object activity. + +Phase IV is accomplished simply by packing all loose objects that are not +already in a phase I or phase II pack. Any phase III packs are deliberately +excluded here in an attempt to make sure that any unreachable loose objects +live on for at least one min_gc_interval. + +A final "git prune-packed" takes care of removing loose object detritus. + +Unless there's some non-Girocco activity going on (such as in a linked working +tree) or objects are being discarded (non-fast-forward ref changes or ref +deletions), phases II, III and IV will never create any packs. + +Phase III packs are hard-linked all the way to the bottom of the project's +fork tree as gc progresses through the tree so that the space they occupy does +not balloon up and get multiplied by the total number of forks as it would if +they were allowed to be repacked into a new phase III pack in each fork. 
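The hard-linking of phase III packs down the fork tree can be sketched as
follows.  The directory layout and pack name here are assumed purely for
illustration; the key property demonstrated is that a hard link costs no
additional disk space, so every fork shares one on-disk copy of the pack:

```shell
# Illustrative sketch only (layout and names assumed, not Girocco's actual
# code): hard-link a phase III "unreachable" pack into each immediate child
# fork so its objects survive there after the parent drops its own copy.
set -e
parent=$(mktemp -d)
mkdir -p "$parent/objects/pack" "$parent/forks/child.git/objects/pack"
pack=$parent/objects/pack/pack-1234_u.pack
echo dummy >"$pack"

for fork in "$parent"/forks/*.git; do
	# Hard link shares the inode; fall back to a copy across filesystems
	ln -f "$pack" "$fork/objects/pack/" 2>/dev/null ||
		cp -p "$pack" "$fork/objects/pack/"
done
```

Each fork then repeats the same step for its own children when it runs gc,
so the pack percolates to the bottom of the fork tree without ballooning.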
+ +While this strategy may seem complex, it: + + a) avoids ever creating loose objects for much better efficiency and speed + b) avoids corrupting forks that use the `alternates` mechanism + c) avoids using large amounts of disk space for packs containing mostly + redundant objects + d) maintains "recently become unreachable" objects long enough for meaningful + ref change notifications + e) presents a risk of corruption when simultaneous garbage collection occurs + during new commit creation that's approximately the same as non-Girocco + Git use would have + +Overall this represents a huge improvement over hard-linking all the loose +object splatter created by "git repack -A -d" into child forks. + +As part of this "new order," non-Girocco operations on the repository (such as +use of linked working trees) are now reluctantly "tolerated." -Finally any remaining objects after prune-packable could be migrated into their -own pack which should address non-bare repository usage (with approximately the -same race condition as mentioned above). 
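The "exclude everything that was already packed in phase I" manipulation
mentioned earlier can be sketched with pack-objects' `--honor-pack-keep`
option.  This is a simplified approximation under assumed names -- the
actual gc.sh may use different mechanics -- but the technique itself is
standard: a temporary .keep file shields the phase I pack, so a second full
packing emits only objects that pack does not already hold:

```shell
# Illustrative sketch only (assumed names, not the actual gc.sh code):
# shield the phase I pack with a .keep file, then repack; with no new
# history the second run finds nothing to pack and writes no pack at all.
set -e
export GIT_AUTHOR_NAME=t GIT_AUTHOR_EMAIL=t@example.com
export GIT_COMMITTER_NAME=t GIT_COMMITTER_EMAIL=t@example.com
repo=$(mktemp -d)
git init --bare -q "$repo"
tree=$(GIT_DIR=$repo git mktree </dev/null)
c1=$(GIT_DIR=$repo git commit-tree "$tree" -m one)
GIT_DIR=$repo git update-ref refs/heads/master "$c1"

# "Phase I": everything reachable goes into one pack, then shield it
p1=$(GIT_DIR=$repo git pack-objects --all --non-empty \
	"$repo/objects/pack/pack" </dev/null)
touch "$repo/objects/pack/pack-$p1.keep"

# "Phase II": --honor-pack-keep skips objects held by kept packs, and
# --non-empty means no pack gets written when nothing qualifies
p2=$(GIT_DIR=$repo git pack-objects --all --non-empty --honor-pack-keep \
	"$repo/objects/pack/pack" </dev/null || :)
rm -f "$repo/objects/pack/pack-$p1.keep"
```

In real use phase II would first resurrect the recent reflogs hashes as
temporary refs, so anything recently discarded would land in the new pack.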
[1]: http://repo.or.cz/girocco.git/a7ea68a8a80e43d5 "gc: retain recently unreferenced objects for 1 day, 2015-10-02" -[2]: http://repo.or.cz/git.git/d3038d22f91aad96 - "prune: keep objects reachable from recent objects, 2014-10-15, v2.2.0" - diff --git a/jobd/gc.sh b/jobd/gc.sh index 2a8c94b..ee9fc18 100755 --- a/jobd/gc.sh +++ b/jobd/gc.sh @@ -196,8 +196,9 @@ combine_small_packs() { if [ -n "$1" ] && [ -n "$noreusedeltaopt" ]; then _minsmallpacks=1 fi - _lpo="--exclude-no-idx --exclude-keep --exclude-bitmap --exclude-bndl --quiet" - _lpo="$_lpo --object-limit $var_redelta_threshold objects/pack" + _lpo="--exclude-no-idx --exclude-keep --exclude-bitmap --exclude-bndl" + _lpo="$_lpo --exclude-sfx _u --exclude-sfx _o" + _lpo="$_lpo --quiet --object-limit $var_redelta_threshold objects/pack" while _cnt="$(list_packs --count $_lpo)" || : test "${_cnt:-0}" -ge $_minsmallpacks @@ -346,17 +347,18 @@ lock_gc() { # because it will cause the original packed-refs file to be re-written, but # the new branch creation will not unless we do another pack-refs which might # lead to having in incomplete bundle). Therefore we want to keep a copy of -# the original packed-refs file around. +# the original packed-refs file around. We do the same thing for HEAD. make_repack_dir() { ! [ -d repack ] || rm -rf repack ! 
[ -d repack ] || { echo >&2 "[$proj] cannot remove repack subdirectory"; exit 1; } - mkdir repack repack/refs + mkdir repack repack/refs repack/alt repack/alt/pack [ -d info ] || mkdir info ln -s ../config repack/config ln -s ../info repack/info ln -s ../objects repack/objects ln -s ../../refs repack/refs/refs _lines=$(( $(LC_ALL=C wc -l repack/HEAD.orig cat packed-refs >repack/packed-refs.orig if [ $(LC_ALL=C wc -l &2 "[$proj] error: make_repack_dir failed original packed-refs line count sanity check" @@ -529,6 +531,40 @@ noreusedeltaopt="--no-reuse-delta" alwaysredelta= [ "$(git config --get girocco.redelta 2>/dev/null || :)" != "always" ] || alwaysredelta=1 +# Extract any -f or -F or --no-reuse-object or --no-reuse-delta options +# to be compatible with the old and new gc.sh versions and avoid ugly argument +# duplication in process lists at the same time +# Any options found will override the "girocco.redelta" setting +recompress= +idx=$# +while [ $idx -gt 0 ]; do + idx=$(( $idx - 1 )) + opt="$1" + shift + case "$opt" in + -f|--no-reuse-delta) + alwaysredelta=1 + continue + ;; + -F|--no-reuse-object) + alwaysredelta=1 + recompress=1 + continue + ;; + -?*) + ;; + *) + printf >&2 '%s\n' "bad non-option argument: $opt" + echo >&2 "(Did you perhaps intend to use a --xxx=yyy form?)" + exit 1 + esac + [ -z "$opt" ] || set -- "$@" "$opt" +done +if [ -n "$alwaysredelta" ]; then + noreusedeltaopt="--no-reuse-delta" + [ -z "$recompress" ] || noreusedeltaopt="--no-reuse-object" +fi + trap 'e=$?; rm -f .gc_in_progress; if [ $e != 0 ]; then echo "gc failed dir: $PWD" >&2; fi' EXIT trap 'exit 130' INT trap 'exit 143' TERM @@ -555,6 +591,7 @@ if [ -n "$isminigc" ]; then # Note that .delaygc is ignored here as that's only intended for full gc lock_gc rm -f .allowgc .needsgc + rm -f objects/pack/pack-*_r.keep remove_crud coalesce_reflogs prune_reflogs @@ -650,8 +687,8 @@ fi # Prevent any other simultaneous gc operations lock_gc -# At this point, if .allowgc exists, it's now 
crud to be removed -rm -f .allowgc +# At this point, if .allowgc or .gc_failed exists, it's now crud to be removed +rm -f .allowgc .gc_failed # Ideally we would do this in post-receive, but that would mean duplicating the # logic so it's available in the chroot jail and that's highly undesirable @@ -706,23 +743,18 @@ fi progress "+ [$proj] garbage check ($(date))" newdeltas= -[ -z "$alwaysredelta" ] || newdeltas=-f -for _arg in "$@"; do - case "$_arg" in "-f"|"-F") - newdeltas="$_arg" - esac -done +[ -z "$alwaysredelta" ] || newdeltas="$noreusedeltaopt" if [ -z "$newdeltas" ] && [ -n "$gfi_mirror" ]; then if [ $(list_packs --exclude-no-idx --count objects/pack) -le \ $(list_packs --exclude-no-idx --count --quiet --only gfi-packs) ]; then # Don't bother with repack_gfi_packs since everything's being repacked - newdeltas=-f + newdeltas="--no-reuse-delta" fi fi if [ -z "$newdeltas" ] && [ -n "$noreusedeltaopt" ] && [ $(list_packs --exclude-no-idx --count-objects objects/pack) -le $var_redelta_threshold ]; then # There aren't enough objects to worry about so just redelta to get the best pack - newdeltas=-f + newdeltas="--no-reuse-delta" fi if [ -z "$newdeltas" ]; then # Since we're not going to recompute deltas overall, we need to do the @@ -737,13 +769,15 @@ fi # ## Safe Pruning In Forks ## -## We are about to perform garbage collection. We do NOT use the "git gc" -## command directly as it does not provide enough control over the fine details -## that we require. However, we DO maintain a "gc.pid" file during our garbage +## We are about to perform garbage collection. We do NOT use the "git gc" or +## the "git repack" commands directly as they do not provide enough control over +## the fine details. However, we DO maintain a "gc.pid" file during our garbage ## collection so that a simultaneous "git gc" by an administrator will be ## blocked (and similarly we refuse to start garbage collection if we cannot -## create the "gc.pid" file). 
When we say "gc" in the below description we are -## referring to our "gc.sh" script, NOT the "git gc" command. +## create the "gc.pid" file). +## +## When we say "gc" in the below description we are referring to our "gc.sh" +## script, NOT the "git gc" command. ## ## If the project we are running garbage collection (gc) on has any forks we ## must be careful not to remove any objects that while no longer referenced by @@ -764,8 +798,8 @@ fi ## hard-linked yet another old pack down to its children (not to mention ## loose objects). ## -## 2. As we are now using the "-A" option with "git repack", any new objects -## in the parent that are not referenced by children will continually get +## 2. When using the "-A" option with "git repack", any new objects in the +## parent that are not referenced by children will continually get ## exploded out of the hard-linked pack in the children whenever the ## children run gc. ## @@ -774,61 +808,339 @@ fi ## perform the hard-linking into the children which provides yet another ## source of inefficiency. ## -## Since we are using the "-A" option to "git repack" (that was not always the -## case) to guarantee we can access old ref values for long enough to send out -## a meaningful mail.sh notification, we now have another, more efficient, -## option available to prevent corruption of child forks that continue to refer -## to objects that are no longer reachable from any ref in the parent. +## While we were still using the "-A" option to "git repack" (that was not +## always the case) to guarantee we can access old ref values for long enough +## to send out a meaningful mail.sh notification, another, more efficient, +## option became available to prevent corruption of child forks that continue +## to refer to objects that are no longer reachable from any ref in the parent. ## ## The only things that need be copied (or hard-linked) into the child fork(s) ## are those objects that have become unreachable from any ref in the parent. 
-## They are the only things that could ever be removed by "git prune" and -## therefore the only things we need to prevent the loss of in order to avoid -## corruption of the child fork(s). ## -## Therefore we now use the following strategy instead to avoid excessive disk -## use and lots of unnecessary loose objects in child forks: +## When we were using the "git repack -A -d" + "git prune --expire=1.day.ago" +## technique, the only objects that could ever be removed were loose objects +## that "git prune" determined were expired. In that case, loose objects were +## all that need be hard-linked down to child forks in order to avoid +## corruption of any child fork(s). +## +## The "git repack -A -d" + "git prune --expire=1.day.ago" + hard-linking loose +## objects to child forks technique remains fundamentally sound from the +## perspective of supporting simultaneous gc and push and keeping newly +## unreachable objects around long enough to be sure we can send out meaningful +## ref change notifications and never corrupting any child forks and never +## prolonging the lifetime of large old packs containing mostly duplicate or +## unreachable objects as gc percolates through a project's entire fork tree. +## +## However, that technique suffers from one potential prodigious pitfall. +## +## Unreachable objects come flying out of their packs to splatter all over the +## objects subdirectories possibly creating a huge, inefficient mess. +## +## Often this is not an issue. Even with a lot of rebasing going on, usually +## the only objects that will splatter are some commits, trees and the odd blob +## here and there. Not enough to be overly concerned about. +## +## However, for the repository that frequently experiences a lot of non-fast- +## forward updates and/or outright ref deletion, the number of objects suddenly +## popping out of their packs at "git repack -A -d" time can be overwhelming.
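One rough way to gauge that kind of loose object splatter is simply to count the objects sitting in a repository's two-hex-digit fan-out directories. A hedged sketch (helper name illustrative, not part of gc.sh):

```shell
# Illustrative helper: count loose objects in a repository's objects/
# fan-out directories (the two-hex-digit subdirectories).  A large number
# here right after a "git repack -A -d" is the splatter described above.
count_loose_objects() {
    find "$1/objects" -mindepth 2 -maxdepth 2 -type f \
        -path '*/objects/[0-9a-f][0-9a-f]/*' -print | wc -l
}
```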
+## +## To avoid this issue we now use a four-phase pack creation strategy. +## This will result in creation of up to four packs (instead of at most one). +## +## I. A complete pack (with bitmaps if appropriate) gets created including +## only "reachable" objects from all refs/... refs plus HEAD. This will +## also serve as the virtual bundle for the repository. +## +## II. A pack of recently-became-unreachable objects and friends is created. +## (The "friends" are ref logs, linked working tree HEADs and indices.) +## Because both the pre-receive hook and the update.sh script record all ref +## changes we can easily choose the cutoff point for "recently". +## It is only the fact we maintain those logs in the reflogs subdirectory +## that allows this step to be possible. +## +## III. If the repository has any forks with a non-zero length alternates file, +## yet another pack of "--keep-unreachable" objects is generated that will +## not actually be kept in the parent, but hard-linked into all the forks. +## +## IV. Finally, after running "git prune-packed", any remaining loose objects +## are migrated into a pack of their own. +## +## We then remove any non-.keep packs that existed before we started the +## process, being careful to keep any same-pack pushes for the "Push Pack Redux" +## race condition (see README-GC). ## -## 1. Run "git repack -A -d -l" in the parent BEFORE doing anything about -## child forks. +## By using "git pack-objects" directly we are able to accomplish this with +## very little additional effort. ## -## 2. Hard-link all remaining existing loose objects in the parent into the -## immediate child forks. +## The packs produced by (III) are treated almost like ".keep" packs by child +## forks in that the objects in them are never repacked into any other +## "--keep-unreachable" packs (but they can migrate into phase I or II packs) +## and those phase III packs are then hard-linked into any grandchild forks. ## -## 3. Now run "git prune" in the parent.
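Phase I of the strategy boils down to a single "git pack-objects --all" invocation. The following is a greatly simplified sketch under assumed names (the "phase1_pack" helper and its arguments are illustrative, not the real gc.sh code, which adds many more options and safety checks):

```shell
# Greatly simplified sketch of phase I (illustrative only, NOT gc.sh):
# pack everything reachable from any ref into one new all-in-one pack.
# pack-objects prints the new pack's hash on stdout and writes the files
# as <base>-<hash>.pack and <base>-<hash>.idx.
phase1_pack() {
    gitdir="$1" base="$2"
    git --git-dir="$gitdir" pack-objects --all --delta-base-offset \
        --non-empty -q "$base" </dev/null
}
```

Phases II and III follow the same shape but feed extra refs on stdin and exclude the packs produced by the earlier phases.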
+## This avoids the space explosion that could occur if each fork level ended +## up duplicating the "--keep-unreachable" pack space by repacking those +## objects (essentially breaking the hard-link to the single copy of those +## objects). ## -## With this new strategy we avoid the need to run any "mini" gc maintenance -## before copying (or hard-linking) anything down to the child forks. -## Furthermore, only when the parent performs a non-fast-forward update will -## anything ever be transferred to the children leaving them unperturbed in the -## vast majority of cases. Finally, even if the parent references objects the -## children do not, those objects will no longer continually end up in the -## children as unreachable loose objects after the children run gc. +## While it is true that each level of forks could potentially add yet another +## phase III pack to be hard-linked down to its children, such packs will only +## include unreachable objects not already in any phase III packs that were +## received from the parent. +## +## The space for the phase III packs will not be reclaimed until the gc +## finishes percolating through the entire "fork tree" of a project. +## +## This is not much different than the "git repack -A -d" situation where +## all the loose objects are hard-linked down into child forks. In that +## case forks that actually need any of those objects could gradually reduce +## the number of objects hard-linked into deeper fork levels. +## +## The difference with a phase III "--keep-unreachable" pack is that there +## cannot be any gradual reduction like that since it would require repacking +## the pack and breaking the hard-link thereby increasing storage space. The +## storage will instead always be reclaimed all at once when all of the +## projects in the "fork tree" complete their gc. 
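The pack propagation described above (hard-link, never copy, so the single on-disk copy survives until the whole fork tree finishes gc) can be pictured with a small helper. Names and layout are hypothetical, a sketch only:

```shell
# Hypothetical sketch: share a phase III "_u" pack with a child fork by
# hard-linking its .pack and .idx (never any .keep) into the fork's
# objects/pack directory.  No bytes are copied; the disk space is only
# reclaimed when the last fork in the tree drops its link.
link_u_pack_into_fork() {
    packbase="$1" forkdir="$2"   # e.g. .../pack-<hash>_u and .../fork.git
    for ext in pack idx; do
        [ -f "$packbase.$ext" ] || continue
        ln -f "$packbase.$ext" "$forkdir/objects/pack/${packbase##*/}.$ext"
    done
}
```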
+## +## However, the belief is that the huge space win by having all the +## unreachable objects packed up together far eclipses (when many objects are +## involved, the single-pack version can end up using 1/20th or less of the +## disk space compared to having them all as loose objects) any brief minor +## space savings that might occur under the "git repack -A -d" loose object +## system prior to the gc collection completing for all the projects in the +## "fork tree". +# + +# +## utility functions +# + +# rename_pack oldnamepath newnamepath +# note that .keep files are left untouched and not moved at all! +rename_pack() { + [ $# -eq 2 ] && [ "$1" != "$2" ] || { + echo >&2 "[$proj] incorrect use of rename_pack function" + exit 1 + } + # Git assumes that if the destination of the rename already exists + # that it is, in fact, a copy of the same bytes so silently succeeds + # without doing anything. We duplicate that logic here. + # Git checks for the .idx file first before even trying to use a pack + # so it should be the last moved and the first removed. + for ext in pack bitmap idx; do + [ -f "$1.$ext" ] || continue + ln "$1.$ext" "$2.$ext" >/dev/null 2>&1 || + [ -f "$2.$ext" ] || { + echo >&2 "[$proj] unable to move $1.$ext to $2.$ext" + exit 1 + } + done + for ext in idx pack bitmap; do + rm -f "$1.$ext" + done + return 0 +} + +make_packs_ugw() { + find "$1" -maxdepth 1 -type f ! -perm -ug+w \ + -name "pack-$octet20*.pack" -exec chmod ug+w '{}' + || : +} 2>/dev/null + +vcnt() { + eval "$1="'$(( $# - 1 ))' +} + +get_index_tree() { + if [ -s "$1" ]; then + GIT_INDEX_FILE="$1" + export GIT_INDEX_FILE + git write-tree 2>/dev/null || : + unset GIT_INDEX_FILE + fi +} + +get_detached_head() { + if [ -s "$1" ] && read -r _head <"$1" 2>/dev/null; then + case "$_head" in $octet20*) + echo "$_head" + esac + fi +} + +# compute_extra_reachables +# create lines suitable for a packed-refs file mentioning all the +# other refs we might like to keep. 
+# the current directory MUST be set to the repository's --git-dir +# the following are included: +# * refs mentioned in reflogs/... files +# * tree(s) created from index file(s) +# * detached linked working tree heads +# Resulting objects are tested for existence and uniqified then output +# one per line under a refs/z* namespace +compute_extra_reachables() { + { + digits8='[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]' + find reflogs -mindepth 1 -maxdepth 1 -type f -name "$digits8*" -exec gzip -c -d -f '{}' + | + awk '{print $2; print $3}' + ! [ -f index ] || get_index_tree index + if [ -d worktrees ]; then + find worktrees -mindepth 2 -maxdepth 2 -name HEAD -type f -print | + while read -r lwth; do + get_detached_head "$lwth" + get_index_tree "${lwth%HEAD}index" + done + fi + } | LC_ALL=C sort -u | + git cat-file ${var_have_git_260:+--buffer} --batch-check"${var_have_git_185:+=%(objectname)}" | + awk '!/missing/ {num++; print $1 " " "refs/" substr("zzzzzzzzzzzz", 1, length(num)) "/" num}' +} + +# +## main gc logic # -git pack-refs --all +# Everything else is more efficient if we do this first +# The "--prune" option is the default since v1.5.0 but it serves as "documentation" here +git pack-refs --all --prune +[ -e packed-refs ] || >>packed-refs # should never happen... + +# If we have a logs directory or a worktrees directory expire the ref logs now +# Note that Git itself does not use either --rewrite or --updateref, so neither do we +! [ -d logs ] && ! [ -d worktrees ] || eval git reflog expire --all "${quiet:+>/dev/null 2>&1}" || : + make_repack_dir -touch .gc_in_progress -rm -f .gc_failed bundles/* -rm -f objects/pack/pack-*.bndl -# We use the -A option with git repack so that unreachable objects can live -# on for a time as loose objects. 
This is particularly helpful if we just -# happen to be in the process of sending out a ref update for a ref that was -# force updated and the old ref value would have otherwise been removed by -# repack because it was now unreachable. Admittedly the window for gc to run -# and do that before we manage to send out the ref update is not large, but -# it would not be difficult to create such a situation. Unfortunately, when -# Git unpacks these unreachable objects it will give them the modification -# time of the *.pack file they came out of. This could be very, very old. -# If that happens, the subsequent git prune --expire some_time_ago will still -# remove the object(s) and our pending ref update will still lose out. -# To prevent this from happening and to get the behavior we want, we now -# touch the modification time of all pack-.pack files so that any -# loosened objects get a current time. Git does not provide any other -# mechanism to do this. We do not want to just touch all loose objects -# left after the repack because that would cause objects that were loosened -# previously to live on which we definitely do not want. -list_packs --exclude-no-idx objects/pack | xargs touch -c 2>/dev/null || : +! [ -e .gc_failed ] || exit 1 +rm -f .gc_in_progress # make sure +touch .gc_in_progress # it's truly fresh +rm -f bundles/* objects/pack/pack-*.bndl +# This is perhaps a bit aggressive in that if we're suffering from "Push Pack Redux" +# and somehow we get run again immediately after the run where "Push Pack Redux" happened +# and we have garbage collection forced, there's just the barest, almost negligible, +# possibility that the "Push Pack Redux" ref updates _still_ have not happened and we +# should not be removing _r .keep files. None of the normal Girocco processing can +# cause this. The second run of this script would have to use the force gc option +# for it to even be possible in the first place. 
What's much more likely is that +# the initial run of this script was somehow interrupted in the middle before it +# could get rid of the _r .keep file itself in which case it's better to get rid of +# it now to avoid keeping something around that would perturb our nice and neat gc +rm -f objects/pack/pack-*_r.keep +# We will add .keep files for _u packs if and when we run phase III +# Otherwise they need to not have any .keep files during phases I and II +rm -f objects/pack/pack-*_u.keep + +# We need to make sure that any non-Girocco (barely tolerated) Git object creation +# activity will be able to "freshen" the pack containing a pre-existing object +# that's being written. This really should not be necessary as the pre-receive +# hook should make sure this takes place for any incoming pushes. +# However, do it here anyway just in case. +make_packs_ugw objects/pack + +# This is only effective with Git v2.3.5 and later and it will only matter when +# we are using one of the "internal_rev_list" modes of pack-objects +# (the combine-packs.sh script never uses any of those modes) +# The "git repack" and "git prune" commands always set this internally themselves +# It makes no difference if there's no repository corruption +GIT_REF_PARANOIA=1 && export GIT_REF_PARANOIA + +# All of the options we might want to use with pack-objects were supported +# at some point prior to Git version v1.6.6 which is the minimum version that +# Girocco now requires. Except for one (--write-bitmap-index). Several of them +# are "boilerplate" options we always want to use so we bundle them up here. +pkopt="--delta-base-offset --keep-true-parents --non-empty --all-progress-implied" +# We want to use --include-tag, but before Git v2.10.1 it would leave out +# "middle" tags (e.g.
a tag of a tag of a commit would omit the tagged tag) +# See http://repo.or.cz/git.git/b773ddea2cd3b08c for details +# ("pack-objects: walk tag chains for --include-tag", 2016-09-07, v2.10.1) +# This is not a free check as it matches all refs against refs/tags/ then +# peels all the annotated tags and checks for inclusion. The situation in +# which it would add a tag that was not already included by a reachability +# trace that included tag starting points can only occur if a new tag gets +# pushed during gc pointing to something that would have been packed anyway. +# But, it could happen and, really, compared to gc as a whole it's not that +# expensive to perform (provided we do not get an unconnected pack). +[ -z "$var_have_git_2101" ] || pkopt="$pkopt --include-tag" +pkopt="$pkopt ${quiet:---progress} $packopts" + +# The git pack-objects command only supports bitmaps if all objects are being +# packed (the "--all" option) and the "--stdout" option is NOT being used. +# Additionally, while packing, if any encountered reachable objects are +# determined to be "not wanted" then no bitmap index will be written anyway. +# While it is theoretically possible that a project with a non-empty alternates +# file ends up packing all objects (because it does not actually use any of the +# objects found in the alternates), it's very unlikely. And, in the unlikely +# event that did occur, clients would see a message about only using one bitmap +# because Git can only use one bitmap at a time and at least one of the +# alternates is bound to have a bitmap. Therefore if we see a non-empty +# alternates file, we disable writing bitmaps which avoids the warning and any +# possibility of a client warning as well. Also if we are running anything +# before Git v2.1.0 (the effective version for repack.writeBitmaps=true) then +# we also always disable bitmap writing.
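The two bitmap-disabling conditions above (Git older than v2.1.0, or a non-empty alternates file) reduce to a tiny predicate. A hedged sketch with illustrative names, not the script's own variables:

```shell
# Illustrative predicate: write a bitmap only if the Git version supports
# repack.writeBitmaps (v2.1.0+, second argument non-empty) AND the
# repository's alternates file is empty or absent.
want_bitmap() {
    gitdir="$1" have_git_210="$2"
    [ -n "$have_git_210" ] || return 1
    ! [ -s "$gitdir/objects/info/alternates" ]
}
```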
+wbmopt= +[ -z "$var_have_git_210" ] || wbmopt="--write-bitmap-index" +# More recent versions of pack-objects have optimizations when not using the +# --local option. If we do not have any alternates it's a pointless option. +# If we do have alternates we need to skip writing a bitmap and we cannot +# have a bundle since it must contain all objects. +if [ -n "$isfork" ]; then + lclopt="--local" + wbmopt= + makebndl= +else + lclopt= + makebndl=1 +fi + +# +## Phase I +# + +wbmstr= +[ -n "$wbmopt" ] || wbmstr=" (bitmaps disabled)" +progress "~ [$proj] running primary full gc pack-objects$wbmstr" + +gotforks= +! has_forks_with_alternates "$proj" || gotforks=1 + +# To avoid "Push Pack Redux" (see README-GC), after collecting the initial +# preexisting non-keep pack list, we rename them so that an incoming push +# pack cannot possibly experience a pack name collision. Git does not require +# use of the "default" pack names, simply that the proper extensions are used. +# We rename to insert an "_r" just before the extension to avoid "Push Pack Redux" +# name collisions. Later on we may create an "unreachable" pack for hard-linking +# down into forks and it will have an "_u" inserted just before its extension. 
+packlist="$(list_packs -C objects/pack --exclude-no-idx --exclude-keep --quiet .)" || : +oldpacks= +for oldpack in $packlist; do + oldpack="${oldpack#pack-}" + oldpack="${oldpack%.pack}" + [ -f "objects/pack/pack-$oldpack.pack" ] || { + echo >&2 "[$proj] unable to list old pack files" + exit 1 + } + if [ "${oldpack#*[!0-9a-fA-F]}" != "$oldpack" ]; then + # names not exclusively hexadecimal do not need renaming + oldpacks="${oldpacks:+$oldpacks }$oldpack" + continue + fi + rename_pack "objects/pack/pack-$oldpack" "objects/pack/pack-${oldpack}_r" || { + echo >&2 "[$proj] unable to rename old pack files" + exit 1 + } + # If the oldpack has a .keep now it means a "Push Pack Redux" is actually + # in progress at this moment and we need to .keep the renamed pack, + # otherwise no "Push Pack Redux" has started yet or it has already finished. + # In either case we're okay because if it's just finished then all ref + # changes have already been made so we don't need a .keep and we will + # see the ref changes and grab all the objects via a reachability trace. + # If it hasn't started yet that's okay because we're done moving that + # name so a complete pack will appear under the old name that we'll + # leave alone. + if [ -f "objects/pack/pack-$oldpack.keep" ]; then + echo "Push Pack Redux" >"objects/pack/pack-${oldpack}_r.keep" + else + oldpacks="${oldpacks:+$oldpacks }${oldpack}_r" + fi +done + # We wish to keep deltas from our last full pack so if we're not redeltaing # then make sure the .pack associated with the .bitmap has a newer mod time # (If there is no .bitmap then touch the pack with the most objects instead.) 
@@ -838,98 +1150,242 @@ if [ -z "$newdeltas" ]; then if [ -n "$bmpack" ] && [ -f "$bmpack" ] && [ -s "$bmpack" ]; then sleep 1 touch -c "$bmpack" 2>/dev/null || : + # We must touch .gc_in_progress here to avoid $bmpack looking + # like it's been "freshened" when redundant packs are removed + # It's okay if they have the same mod time, but POSIX does not + # guarantee an ordering for the "touching" that occurs which is + # why this must be a separate command but needs no "sleep 1" + touch .gc_in_progress fi fi -# The git repack command only supports bitmaps if all objects are being packed. -# While it is theoretically possible that a project with a non-empty alternates -# file ends up packing all objects (because it does not actually use any of the -# objects found in the alternates), it's very unlikely. And, in the unlikely -# event that did occur, clients would see a message about only using one bitmap -# because Git can only use one bitmap at a time and at least one of the -# alternates is bound to have a bitmap. Therefore if we see a non-empty -# alternates file, we disable writing bitmaps which avoids the warning and any -# possibility of a client warning as well. -nobm= -[ -z "$var_have_git_172" ] || ! [ -s objects/info/alternates ] || - nobm='-c repack.writebitmaps=false -c pack.writebitmaps=false' -progress "~ [$proj] running full gc repack${nobm:+ (bitmaps disabled)}" -# We run git repack from the repack subdirectory so we can force optimized packs -# to be generated even for repositories that do not have any tagged commits -git --git-dir=repack $nobm repack $packopts -A -d -l $quiet $newdeltas $@ -rm -rf repack -! [ -e .gc_failed ] || exit 1 -# These, if they exist, are now meaningless and need to be removed -rm -f gfi-packs .needsgc .svnpack .svnpackgc -allpacks="$(echo objects/pack/pack-$octet20*.pack)" -curhead="$(cat HEAD)" -pkrf= -! 
[ -e packed-refs ] || pkrf=packed-refs -eval "reposizek=$(( $(echo 0 $(du -k $pkrf $allpacks 2>/dev/null | LC_ALL=C awk '{print $1}') | - LC_ALL=C sed -e 's/ / + /g') ))" -git update-server-info -# The -A option to `git repack` may have caused some loose objects to pop -# out of their packs. We must make these objects group writable so that they -# can be freshened by other pushers. Technically we need only do this for -# push projects but to enable mirror projects to be more easily converted to -# push projects, we go ahead and do it for all projects. -{ find objects/$octet -type f -name "$octet19*" -exec chmod ug+w '{}' + || :; } 2>/dev/null -if has_forks_with_alternates "$proj"; then - progress "~ [$proj] hard-linking loose objects into immediate child forks" +# Now we need to make sure that any "freshening" that takes place will actually +# result in a "newer" modification time than the .gc_in_progress file now has +sleep 1 + +# We run git pack-objects from the repack subdirectory so we can force +# optimized packs to be generated even for repositories that do not have any +# tagged commits +packs="$(git --git-dir=repack pack-objects >repack/packed-refs + +# Subtract the primary refs +GIT_ALTERNATE_OBJECT_DIRECTORIES="$PWD/repack/alt" +export GIT_ALTERNATE_OBJECT_DIRECTORIES + +# For this one we MUST use --local and MUST NOT use --write-bitmap-index +# However, if there is a "logs" subdirectory we need to use --reflog +# We do add it, just in case, if the linked working trees dir is present +# We do not add --indexed-objects as that requires v2.2.0 and it's unclear +# if it properly includes linked working tree index files or not. The +# above compute_extra_reachables has already included all index trees (thereby +# providing proper --indexed-objects support for all Git versions) making the +# option completely unnecessary. +rflopt= +! [ -d logs ] && ! 
[ -d worktrees ] || rflopt=--reflog +spacks="$(git --git-dir=repack pack-objects "$upack.keep" + hlpacks="${hlpacks:+$hlpacks }${upack#objects/pack/pack-}" + done + fi + # Using either --no-reuse-delta or --no-reuse-object together with the + # --keep-unreachable option is a very, very, very bad idea when good + # packs are the desired outcome. If newdeltas are being generated + # then we pack to a temp name, and use combine-packs.sh to get a better + # pack as the result to avoid making a bad --keep-unreachable pack + pfx= + [ -z "$newdeltas" ] || pfx="ku" + upacks="$(git --git-dir=repack pack-objects "$cfg_reporoot/$fork.git/.needsgc" + fi + fi + [ -z "$runupdate" ] || git --git-dir="$cfg_reporoot/$fork.git" update-server-info + # Update the fork's lastparentgc date (must be more recent than $gcstart) + git --git-dir="$cfg_reporoot/$fork.git" config gitweb.lastparentgc "$lastparentgc" + done +fi + +# Now move any primary/supplementary packs back into objects/pack +# then drop any "unfreshened" redundant packs and clear repack/alt + +# First make sure the primary pack(s) have the most recent mod time +[ -z "$packs" ] || printf 'repack/alt/pack/pack-%s.pack\n' $packs | xargs touch -c 2>/dev/null || : + +# Move the packs into place +for pack in $packs $spacks; do + rename_pack "repack/alt/pack/pack-$pack" "objects/pack/pack-$pack" +done + +# It's possible that one of the $oldpacks had a .bitmap, got renamed (along +# with its .bitmap) and then got "freshened" causing us to not remove it +# However, if $wbmopt is set we most likely now have TWO .bitmap packs! +# This can produce ugly warnings we don't want and possibly get the wrong +# bitmap used since only one .bitmap file can ever be used by Git. +# If this has happened, the .bitmap we want to discard will always have +# an _r infix so we can just zap any such now since it will leave the pack. 
+[ -z "$wbmopt" ] || rm -f objects/pack/pack-*_r.bitmap || : + +# Remove the redundant packs that have not since been "freshened" +# This does not completely eliminate the race condition window (Girocco's own +# activities -- gc/fetch/receive are immune to the race) but it substantially +# shrinks it down to just the time after the find but before the following rm +>repack/oldpacks +[ -z "$oldpacks" ] || +printf 'objects/pack/pack-%s.pack\n' $oldpacks | +LC_ALL=C sort >repack/oldpacks +find objects/pack -maxdepth 1 -type f -name "pack-$octet20*.pack" -newer .gc_in_progress -print | +LC_ALL=C sort >repack/freshened +deadpacks="$(LC_ALL=C join -v 1 repack/oldpacks repack/freshened | LC_ALL=C sed 's/\.pack$//')" +[ -z "$deadpacks" ] || +eval echo "$(printf '"%s".* ' $deadpacks)" | xargs rm -f || : + +# No need for this anymore +rm -rf repack/alt +unset GIT_ALTERNATE_OBJECT_DIRECTORIES + +# +## Phase IV +# + +progress "~ [$proj] running gc prune-packed" + +# We do not want the redundant packs or any new "--keep-unreachable" pack(s) to be +# present while running prune-packed. We try to guarantee that any loose object +# that's unreachable persists for at least one $Girocco::Config::min_gc_interval +# (notwithstanding administrator interference to force earlier gc to occur). +# If we were to include the redundant/keep-unreachable pack(s) when running +# prune-packed and a loose unreachable object happened to be duplicated in one +# of them we would end up removing it too soon and void our guarantee.
+git prune-packed $quiet + +progress "~ [$proj] running loose objects gc pack-objects" + +# Although Git v2.10.0 and later support a --pack-loose-unreachable option, +# we MUST NOT use it for these reasons: +# 1) We're not interested in expensive "unreachable" at this point, only "loose" +# 2) It produces simply horrid packs about 3.8x larger than they should be +# 3) We don't require anything more than Git v1.6.6 +lpacks="$(run_combine_packs /dev/null || : + fi + # We need to identify these packs later so we don't combine_packs them + for objpack in $lpacks; do + rename_pack "objects/pack/pack-$objpack" "objects/pack/pack-${objpack}_o" || : + done + # Finally zap the corresponding loose objects + progress "~ [$proj] running packed loose objects gc prune-packed" + git prune-packed $quiet fi -# The git prune command does not take a -q or --quiet but started outputting -# 'Checking connectivity' progress messages in v1.7.9. However, we can -# suppress those by piping through cat as it only activates the progress -# messages when stderr is a tty. We only expire loose objects older than one -# day just in case there's some pending action (such as sending out a ref -# update) in progress that might want to examine them. This may leave us with -# loose objects. That's okay because at the next gc interval, we will always -# run gc if we see any loose objects regardless of whether or not we've seen -# any updates or we've received new linked objects from our parent. Note that -# in order to keep loose objects that just recently became unreferenced but -# have a very old modification date around we rely on some help from both the -# update.sh and hooks/pre-receive scripts. Furthermore, since Git v2.2.0 -# (d3038d22 prune: keep objects reachable from recent objects) an unreachable -# object that would otherwise be pruned (because it's too old) will be kept -# alive by an unreachable object that refers to it that's not old enough to -# be pruned yet.
-prunecmd='git prune --expire 1_day_ago' -[ -n "$show_progress" ] || -prunecmd="{ $prunecmd 2>&1 || touch .gc_failed; } | cat" -progress "~ [$proj] pruning expired unreachable loose objects" -eval "$prunecmd" ! [ -e .gc_failed ] || exit 1 +# These, if they exist, are now meaningless and need to be removed +rm -f gfi-packs .needsgc .svnpack .svnpackgc + +# Make sure this stays up to date +git update-server-info + +# We must make loose objects group writable so that they +# can be freshened by other pushers. Technically we need only do this for +# push projects but to enable mirror projects to be more easily converted to +# push projects, we go ahead and do it for all projects. +# By the time we get here we really shouldn't have any of these, but just in case. +{ find objects/$octet -type f -name "$octet19*" -exec chmod ug+w '{}' + || :; } 2>/dev/null # darcs:// mirrors have a xxx.log file that will grow endlessly # if this is a mirror and the file exists, shorten it to 10000 lines @@ -960,24 +1416,23 @@ fi # Create a matching .bndl header file for the all-in-one pack we just created # but only if we're not a fork (otherwise the bundle would not be complete) # and we are running at least Git version 1.7.2 (pack_is_complete always fails otherwise) -if ! [ -s objects/info/alternates ] && [ -n "$var_have_git_172" ]; then - # There should only be one pack in $allpacks but if there was a - # simultaneous push... +if [ -n "$makebndl" ] && [ -n "$var_have_git_172" ]; then + # There should only be one pack in $packs but do some checking... + # The one we just created will have a .idx and will NOT have a .keep progress "~ [$proj] creating downloadable bundle header" - pkfound= + pkbase= pkhead= - for pk in $allpacks; do - [ -s "$pk" ] || continue - pkbase="${pk%.pack}" - [ -s "$pkbase.idx" ] || continue - !
[ -e "$pkbase.keep" ] || continue - if pkhead="$(pack_is_complete "$PWD/$pk" "$PWD/packed-refs" "$curhead")"; then - pkfound="$pkbase" - break; - fi - done - if [ -n "$pkfound" ] && [ -n "$pkhead" ]; then + IFS= read -r curhead /dev/null | + LC_ALL=C awk '{print $1}') | + LC_ALL=C sed -e 's/ / + /g') ))" config_set_raw girocco.reposizek "${reposizek:-0}" +# Now we're finally done with this +rm -rf repack + +# We didn't used to do anything about rerere or worktrees but we're +# trying to make nice with linked working trees these days :) +# Maybe even non-bare repositories too, but *shush* about those ;) +if [ -n "$var_have_git_250" ] && [ -d worktrees ]; then + # The value "3.months.ago" is hard-coded into gc.c rather than + # having the default be in worktree.c so we must provide it if + # we get nothing out of the gc.worktreePruneExpire config item + # Prior to Git v2.6.0 the config item was gc.pruneworktreesexpire + # however we just always use the newer name no matter what Git version + expiry="$(git config --get gc.worktreePruneExpire 2>/dev/null)" || : + eval git worktree prune --expire '"${expiry:-3.months.ago}"' "${quiet:+>/dev/null 2>&1}" || : +fi +# git rerere does it right and handles its own default/config'd expiration values +! [ -d rr-cache ] || eval git rerere gc "${quiet:+>/dev/null 2>&1}" || : + # We use $gcstart here to avoid a race where a push occurs during the gc itself # and the next future gc could be incorrectly skipped if we used the current # timestamp here instead diff --git a/toolbox/perform-pre-gc-linking.sh b/toolbox/perform-pre-gc-linking.sh dissimilarity index 76% index be1f405..93f34e5 100755 --- a/toolbox/perform-pre-gc-linking.sh +++ b/toolbox/perform-pre-gc-linking.sh @@ -1,280 +1,149 @@ -#!/bin/sh - -# Perform pre-gc linking of unreachable objects to forks - -# It may, under unusual circumstances, be desirable to run git gc -# manually. 
However, running git gc on a project that has forks is -# dangerous as it can reap objects not in use by the project itself -# but which are still in use by one or more forks which do not have -# their own copy since they use an alternates file to refer to them. -# -# Note that a .nogc file should really be created during the manual -# gc operation! -# -# Running this script on a project BEFORE manually running git gc -# on that project prevents this problem from occuring PROVIDED the -# "git gc --prune=all" or "git gc --prune=now" or "git repack -a -d -l" -# options are NOT used. In other words NEVER prune objects immediately -# if the project has ANY forks at all! -# -# During normal gc.sh operations, what this script does is essentially -# what happens AFTER "git repack -A -d -l" but BEFORE "git prune". This -# script does, however, also touch all the .pack files to make sure nothing -# accidentally gets pruned early (just like gc.sh does). -# -# It is enough to run this script before a "git gc" rather than in the -# middle (i.e. after "git repack -A -d -l" but before "git prune") although -# that could result in forks referring to now-loosened objects in the parent. -# This is not a terrible thing and those objects will be propagated into the -# forks before any future "git prune" so nothing will be lost. -# -# However, having any child forks depend on loose objects only available via -# their alternates as unreachable loose objects in those alternates can easily -# be avoided by simply running this script again after running "git gc". In -# other words do this: -# -# 1. Run this script on the project -# 2. Run "git gc" with any options AVOIDING any than cause immediate pruning -# 3. Run this script again on the project (optional but desirable) -# -# Note that running this script WILL make all the project's child forks eligible -# for gc at their next interval (i.e. they will not skip running gc even if it -# ends up not actually doing anything). 
-# -# Loose objects are normally just hard-linked into the child forks, but if the -# "--single-pack" option is used they will instead be combined into a single -# pack and that will be hard-linked into the child forks instead. -# With the --include-packs option packs will also be hard-linked into the -# forks (the old behavior) -- useful before extreme modification of the parent. - -set -e - -. @basedir@/shlib.sh - -umask 002 - -force= -singlepack= -packstoo= -while case "$1" in - --help|-h) - cat < - --force Run even though no .nogc or .bypass file present - --single-pack Hard-link a pack of loose objects down to forks - --include-packs Hard-link packs down to forks as well as loose objects - Name of project (e.g. "git" or "git/fork" etc.) -The --single-pack and --include-packs options are currently incompatible. -EOT - --force) - force=1;; - --single-pack) - singlepack=1;; - --include-packs) - packstoo=1;; - --) - shift; break;; - -?*) - echo "Unknown option: $1" >&2; exit 1;; - *) - ! :;; -esac; do shift; done - -if [ -n "$singlepack" ] && [ -n "$packstoo" ]; then - echo "Currently --include-packs and --single-pack are incompatible." - exit 1 -fi - -proj="${1%.git}" -if [ "$#" -ne 1 ] || [ -z "$proj" ]; then - echo "I need a project name (e.g. \"$(basename "$0") example\")" - echo "(See also help -- \"$(basename "$0") --help\")" - exit 1 -fi -if ! cd "$cfg_reporoot/$proj.git"; then - echo "no such directory: $cfg_reporoot/$proj.git" - exit 1 -fi -apid= -ahost= -{ read -r apid ahost ajunk /dev/null 2>&1 || : -if [ -n "$apid" ] && [ -n "$ahost" ]; then - echo "ERROR: refusing to run, $cfg_reporoot/$proj.git/gc.pid file exists" - echo "ERROR: is gc already running on machine '$ahost' pid '$apid'?" - exit 1 -fi - -if [ -z "$force" ] && ! [ -e .nogc ] && ! 
[ -e .bypass ]; then - echo "WARNING: no .nogc or .bypass file found in $cfg_reporoot/$proj.git" - echo "WARNING: jobd.pl could run gc.sh while you're fussing with $proj" - echo "WARNING: either create one of those files or re-run with --force" - echo "WARNING: (e.g. \"$(basename "$0") --force ${singlepack:+--single-pack }$proj\") to bypass this warning" - echo "WARNING: please remember to remove the file after you're done fussing" - exit 1 -fi - -# date -R is linux-only, POSIX equivalent is '+%a, %d %b %Y %T %z' -datefmt='+%a, %d %b %Y %T %z' - -# make sure combine-packs uses the correct Git executable -run_combine_packs() { - PATH="$var_git_exec_path:$cfg_basedir/bin:$PATH" @basedir@/jobd/combine-packs.sh "$@" -} - -trap 'echo "pre-packing-and-linking failed" >&2; exit 1' EXIT - - -# The following is taken verbatim from gc.sh (with some whitespace adjustment -# and comment removal and ">.gc_failed" commented out and .pack linking added) -# and should be kept in sync with it - - -# This part touches any packs to make sure loosened objects avoid immediate pruning - -# ---- BEGIN DUPLICATED CODE SECTION ONE ---- - -list_packs --exclude-no-idx objects/pack | xargs touch -c 2>/dev/null || : -bmpack="$(list_packs --exclude-no-bitmap --exclude-no-idx --max-matches 1 objects/pack)" -[ -n "$bmpack" ] || bmpack="$(list_packs --exclude-no-idx --max-matches 1 --object-limit -1 --include-boundary objects/pack)" -if [ -n "$bmpack" ] && [ -f "$bmpack" ] && [ -s "$bmpack" ]; then - sleep 1 - touch -c "$bmpack" 2>/dev/null || : -fi - -# ---- END DUPLICATED CODE SECTION ONE ---- - - -# This part creates a pack of all loose objects and hard-links it into any children -# This logic is no longer used by default but may be selected with the "--pack" option - -propagate_single_pack() { -# ---- BEGIN DUPLICATED CODE SECTION TWO ---- - -if has_forks_with_alternates "$proj"; then - # Pack up all the loose objects and copy (actually hard link) them into all the forks - progress "~ [$proj] 
creating pack of loose objects for forks" - lpacks="$(find objects/$octet -maxdepth 1 -type f -name "$octet19*" -print 2>/dev/null | - LC_ALL=C awk -F / '{print $2 $3}' | - run_combine_packs --objects --names $packopts --incremental --all-progress-implied $quiet --non-empty)" || { - #>.gc_failed - exit 1 - } - # We have to update the lastparentgc time in the child forks even if they do not get any - # new "loose objects" pack(s) because they need to run gc just in case the parent now has - # some objects that used to only be in the child so they can be removed from the child. - # For example, a "patch" might be developed first in a fork and then later accepted into - # the parent in which case the objects making up the patch in the child fork are now - # redundant (since they're now in the parent as well) and need to be removed from the - # child fork which can only happen if the child fork runs gc. - forkdir="$proj" - # It is enough to copy objects just one level down and get_repo_list - # takes a regular expression (which is automatically prefixed with '^') - # so we can easily match forks exactly one level down from this project - get_repo_list "$forkdir/[^/]*:" | - while read fork; do - # Ignore forks that do not exist or are symbolic links - ! [ -L "$cfg_reporoot/$fork.git" ] && [ -d "$cfg_reporoot/$fork.git" ] || - continue - # Or do not have a non-zero length alternates file - [ -s "$cfg_reporoot/$fork.git/objects/info/alternates" ] || - continue - if [ -n "$lpacks" ]; then - # Install the "loose objects" pack(s) into the fork - [ -d "$cfg_reporoot/$fork.git/objects/pack" ] || ( - cd "$cfg_reporoot/$fork.git" && - mkdir -p objects/pack - ) - for lpack in $lpacks; do - ln -f objects/pack/"pack-$lpack.pack" objects/pack/"pack-$lpack.idx" \ - "$cfg_reporoot/$fork.git/objects/pack/" || : - done - if ! 
[ -e "$cfg_reporoot/$fork.git/.needsgc" ]; then - # Trigger a mini gc in the fork if it now has too many packs - packs="$(list_packs --quiet --count --exclude-no-idx --exclude-keep "$cfg_reporoot/$fork.git/objects/pack")" || : - if [ -n "$packs" ] && [ "$packs" -ge 20 ]; then - >"$cfg_reporoot/$fork.git/.needsgc" - fi - fi - git --git-dir="$cfg_reporoot/$fork.git" update-server-info - fi - # Update the fork's lastparentgc date (must be current, not $gcstart) - git --git-dir="$cfg_reporoot/$fork.git" config \ - gitweb.lastparentgc "$(date "$datefmt")" - done - if [ -n "$lpacks" ]; then - # Remove the "loose objects" pack(s) from the parent - for lpack in $lpacks; do - rm -f objects/pack/"pack-$lpack.idx" objects/pack/"pack-$lpack.pack" - done - fi -fi - -# ---- END DUPLICATED CODE SECTION TWO ---- -} - - -# This part hard-links all loose objects into any children - -propagate_objects() { -# ---- BEGIN DUPLICATED CODE SECTION THREE ---- - -if has_forks_with_alternates "$proj"; then - progress "~ [$proj] hard-linking loose objects${packstoo:+and packs } into immediate child forks" - # We have to update the lastparentgc time in the child forks even if they do not get any - # new "loose objects" because they need to run gc just in case the parent now has some - # objects that used to only be in the child so they can be removed from the child. - # For example, a "patch" might be developed first in a fork and then later accepted into - # the parent in which case the objects making up the patch in the child fork are now - # redundant (since they're now in the parent as well) and need to be removed from the - # child fork which can only happen if the child fork runs gc. 
- forkdir="$proj" - # It is enough to copy objects just one level down and get_repo_list - # takes a regular expression (which is automatically prefixed with '^') - # so we can easily match forks exactly one level down from this project - get_repo_list "$forkdir/[^/]*:" | - while read fork; do - # Ignore forks that do not exist or are symbolic links - ! [ -L "$cfg_reporoot/$fork.git" ] && [ -d "$cfg_reporoot/$fork.git" ] || - continue - # Or do not have a non-zero length alternates file - [ -s "$cfg_reporoot/$fork.git/objects/info/alternates" ] || - continue - # Match objects in parent project - for d in objects/$octet; do - [ "$d" != "objects/$octet" ] || continue - mkdir -p "$cfg_reporoot/$fork.git/$d" - find "$d" -maxdepth 1 -type f -name "$octet19*" -exec \ - "$var_sh_bin" -c 'ln -f "$@" '"'$cfg_reporoot/$fork.git/$d/'" sh '{}' + || : - done - # Match packs in parent project if --include-packs given - if [ -n "$packstoo" ]; then - mkdir -p "$cfg_reporoot/$fork.git/objects/pack" - list_packs --all --exclude-no-idx objects/pack | LC_All=C sed 'p;s/\.pack$/.idx/' | - xargs "$var_sh_bin" -c 'ln -f "$@" '"'$cfg_reporoot/$fork.git/objects/pack/'" sh || : - if ! 
[ -e "$cfg_reporoot/$fork.git/.needsgc" ]; then - # Trigger a mini gc in the fork if it now has too many packs - packs="$(list_packs --quiet --count --exclude-no-idx --exclude-keep "$cfg_reporoot/$fork.git/objects/pack")" || : - if [ -n "$packs" ] && [ "$packs" -ge 20 ]; then - >"$cfg_reporoot/$fork.git/.needsgc" - fi - fi - git --git-dir="$cfg_reporoot/$fork.git" update-server-info - fi - # Update the fork's lastparentgc date (must be current, not $gcstart) - git --git-dir="$cfg_reporoot/$fork.git" config \ - gitweb.lastparentgc "$(date "$datefmt")" - done -fi - -# ---- END DUPLICATED CODE SECTION THREE ---- -} - - -if [ -n "$singlepack" ]; then - propagate_single_pack -else - propagate_objects -fi - -trap - EXIT -echo "loose objects ${packstoo:+and packs }for $proj have now been ${singlepack:+packed and }linked into child forks (if any)" +#!/bin/sh + +# Perform pre-gc linking of packs and objects to forks + +# It may, under extremely unusual circumstances, be desirable to run git gc +# manually. However, running git gc on a project that has forks is dangerous +# as it can reap objects not in use by the project itself but which are still +# in use by one or more forks which do not have their own copy since they use +# an alternates file to refer to them. +# +# Running this script on a project BEFORE manually running git gc on that +# project prevents this problem from occurring. +# +# Note that a .nogc file should really be created during the manual gc +# operation on a project! +# +# Alternatively, if a project is to be removed but its forks are to be kept +# then this script MUST be run before removing the project so as not to corrupt +# the forks that will be kept. +# +# Before the new order of gc came along this script did various things to +# try and optimize what it hard-linked down into the forks. Now, however, +# it just hard-links all packs and loose objects down since with the new order +# both packs and objects could end up going away. 
+# +# This technique is not as optimal as the prior version, but with the advent +# of the new order for running gc and the ability to trigger/force a Girocco gc +# on a project using the "projtool.pl" command there's really no legitimate +# reason to use this script anymore other than when keeping the forks of a +# project while discarding the project itself. +# +# In that case all packs and objects must be hard-linked down to the child +# fork(s). That functionality is all that remains in this script. It accepts +# and ignores the previous options (for backwards compatibility) and just +# always does that now. +# +# The single mode of operation makes maintenance of this script easier too. + +set -e + +. @basedir@/shlib.sh + +umask 002 + +force= +while case "$1" in + --help|-h) + cat < + --force Run even though no .nogc or .bypass file present + --single-pack Ignored for backwards compatibility + --include-packs Ignored for backwards compatibility + Name of project (e.g. "git" or "git/fork" etc.) +Always hard-links all packs and all loose objects down to forks. +EOT + --force) + force=1;; + --single-pack|--include-packs) + ;; + --) + shift; break;; + -?*) + echo "Unknown option: $1" >&2; exit 1;; + *) + ! :;; +esac; do shift; done + +proj="${1%.git}" +if [ "$#" -ne 1 ] || [ -z "$proj" ]; then + echo "I need a project name (e.g. \"$(basename "$0") example\")" + echo "(See also help -- \"$(basename "$0") --help\")" + exit 1 +fi +if ! cd "$cfg_reporoot/$proj.git"; then + echo "no such directory: $cfg_reporoot/$proj.git" + exit 1 +fi +apid= +ahost= +{ read -r apid ahost ajunk /dev/null 2>&1 || : +if [ -n "$apid" ] && [ -n "$ahost" ]; then + echo "ERROR: refusing to run, $cfg_reporoot/$proj.git/gc.pid file exists" + echo "ERROR: is gc already running on machine '$ahost' pid '$apid'?" + exit 1 +fi + +if [ -z "$force" ] && ! [ -e .nogc ] && ! 
[ -e .bypass ]; then + echo "WARNING: no .nogc or .bypass file found in $cfg_reporoot/$proj.git" + echo "WARNING: jobd.pl could run gc.sh while you're fussing with $proj" + echo "WARNING: either create one of those files or re-run with --force" + echo "WARNING: (e.g. \"$(basename "$0") --force $proj\") to bypass this warning" + echo "WARNING: please remember to remove the file after you're done fussing" + exit 1 +fi + +# date -R is linux-only, POSIX equivalent is '+%a, %d %b %Y %T %z' +datefmt='+%a, %d %b %Y %T %z' + +trap 'echo "hard-linking failed" >&2; exit 1' EXIT + +if has_forks_with_alternates "$proj"; then + + # We have to update the lastparentgc time in the child forks even if they do not get any + # new "loose objects" because they need to run gc just in case the parent now has some + # objects that used to only be in the child so they can be removed from the child. + # For example, a "patch" might be developed first in a fork and then later accepted into + # the parent in which case the objects making up the patch in the child fork are now + # redundant (since they're now in the parent as well) and need to be removed from the + # child fork which can only happen if the child fork runs gc. + + # It is enough to copy objects just one level down and get_repo_list + # takes a regular expression (which is automatically prefixed with '^') + # so we can easily match forks exactly one level down from this project + + forkdir="$proj" + get_repo_list "$forkdir/[^/:][^/:]*:" | + while read fork; do + # Ignore forks that do not exist or are symbolic links + ! 
[ -L "$cfg_reporoot/$fork.git" ] && [ -d "$cfg_reporoot/$fork.git" ] || + continue + # Or do not have a non-zero length alternates file + [ -s "$cfg_reporoot/$fork.git/objects/info/alternates" ] || + continue + # Match objects in parent project + for d in objects/$octet; do + [ "$d" != "objects/$octet" ] || continue + mkdir -p "$cfg_reporoot/$fork.git/$d" + find "$d" -maxdepth 1 -type f -name "$octet19*" -exec \ + "$var_sh_bin" -c 'ln -f "$@" '"'$cfg_reporoot/$fork.git/$d/'" sh '{}' + || : + done + # Match packs in parent project + mkdir -p "$cfg_reporoot/$fork.git/objects/pack" + list_packs --all --exclude-no-idx objects/pack | LC_ALL=C sed 'p;s/\.pack$/.idx/' | + xargs "$var_sh_bin" -c 'ln -f "$@" '"'$cfg_reporoot/$fork.git/objects/pack/'" sh || : + if ! [ -e "$cfg_reporoot/$fork.git/.needsgc" ]; then + # Trigger a mini gc in the fork if it now has too many packs + packs="$(list_packs --quiet --count --exclude-no-idx --exclude-keep "$cfg_reporoot/$fork.git/objects/pack")" || : + if [ -n "$packs" ] && [ "$packs" -ge 20 ]; then + >"$cfg_reporoot/$fork.git/.needsgc" + fi + fi + git --git-dir="$cfg_reporoot/$fork.git" update-server-info + # Update the fork's lastparentgc date + git --git-dir="$cfg_reporoot/$fork.git" config gitweb.lastparentgc "$(date "$datefmt")" + done +fi + +trap - EXIT +echo "packs and loose objects for $proj have now been linked into child forks (if any)" -- 2.11.4.GIT
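
The safety of the whole pre-gc linking step rests on hard-link semantics: after `ln -f`, the fork's directory entry points at the same inode as the parent's, so a later prune in the parent cannot take the object away from the fork. The following is a standalone sketch of that idiom using hypothetical temp-dir paths (not Girocco code); it mirrors the `find ... -exec sh -c 'ln -f "$@" ...'` shape the script uses for each `objects/??` fan-out directory.

```shell
#!/bin/sh
# Sketch only: "parent" and "fork" stand in for $cfg_reporoot/$proj.git
# and a child fork's git dir; "2b/deadbeef" is a made-up loose object.
set -e
parent="$(mktemp -d)"
fork="$(mktemp -d)"
mkdir -p "$parent/objects/2b" "$fork/objects/2b"
printf 'loose object payload' >"$parent/objects/2b/deadbeef"
# Hard-link every loose object in this fan-out dir into the fork,
# same quoting shape as the script: the dest dir is spliced into the -c text
find "$parent/objects/2b" -maxdepth 1 -type f -exec \
	sh -c 'ln -f "$@" '"'$fork/objects/2b/'" sh '{}' + || :
# Pruning the parent's copy does not disturb the fork's hard link
rm -f "$parent/objects/2b/deadbeef"
cat "$fork/objects/2b/deadbeef"
```

Running this prints the payload from the fork's copy even though the parent's copy is already gone, which is exactly why the script can be run before a manual `git gc` on the parent.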