From 1613083b89521fb366b254295ea8c8e4c59b727c Mon Sep 17 00:00:00 2001
From: "Kyle J. McKay"
Date: Fri, 2 Sep 2016 18:08:24 -0700
Subject: [PATCH] forks: revert to hard-linking objects during gc

In the past we used to hard-link all currently existing packs and
loose objects down to all immediate child forks just before beginning
garbage collection.

This was necessary because we used "git repack -a -d -l" and otherwise
objects no longer referred to by the parent but still reachable by the
child could be lost resulting in child corruption. And we can't abide
corrupt children.

Subsequently we switched to "git repack -A -d -l" to make sure mail.sh
has time to send out its notifications before any unreachable objects
disappear and to avoid inherent race conditions with simultaneous
pruning and pushing.

Later we took advantage of this change to pack up any remaining (and
therefore unreachable) loose objects after the repack into a single
pack and just hard-link that into all the child forks before running
"git prune".

This is very efficient and works extremely well with one small
exception.

Eventually the child forks will run "git repack -A -d -l" themselves.
Once the parent has finally pruned its unreachable loose objects that
were in the single pack that was hard-linked into all the forks, a
fork running "git repack -A -d -l" will then have any of those
unreachable objects that are also unreachable from the fork pop out of
the pack to become loose objects in the fork. And they will not be
hard-linked anywhere else when that happens meaning the disk space
used could then be multiplied by the number of forks present.

This is an unacceptable situation and so we must revert to
hard-linking the loose objects themselves into the child forks.

We do, however, skip hard-linking the packs this time since we are now
running repack with the "-A" option.
We perform the hard-linking just after the repack but before the
"git prune" so we are guaranteed that anything that could potentially
be pruned has already been hard-linked into the child forks before
that happens.

The hard-linking code has also been upgraded to be impervious to
command line length overflow problems when an excessive number of
loose objects are present.

Signed-off-by: Kyle J. McKay
---
 jobd/gc.sh                        | 51 ++++++------------------
 toolbox/perform-pre-gc-linking.sh | 83 +++++++++++++++++++++++++++++++++------
 2 files changed, 84 insertions(+), 50 deletions(-)

diff --git a/jobd/gc.sh b/jobd/gc.sh
index d4d7ccb..3c60c23 100755
--- a/jobd/gc.sh
+++ b/jobd/gc.sh
@@ -599,9 +599,8 @@ fi
 ## 1. Run "git repack -A -d -l" in the parent BEFORE doing anything about
 ##    child forks.
 ##
-## 2. Collect all remaining existing loose objects in the parent into a
-##    single pack BEFORE running "git prune" and if it's not empty then
-##    hard-link that single pack into the immediate children.
+## 2. Hard-link all remaining existing loose objects in the parent into the
+##    immediate child forks.
 ##
 ## 3. Now run "git prune" in the parent.
 ##
@@ -675,21 +674,15 @@ git update-server-info
 { find objects/$octet -type f -name "$octet19" -print0 | xargs -0 chmod ug+w || :; } 2>/dev/null

 if has_forks "$proj"; then
-	# Pack up all the loose objects and copy (actually hard link) them into all the forks
-	progress "~ [$proj] creating pack of loose objects for forks"
-	lpacks="$(find objects/$octet -maxdepth 1 -type f -name "$octet19" -print 2>/dev/null |
-		LC_ALL=C awk -F / '{print $2 $3}' |
-		run_combine_packs --objects --names $packopts --incremental --all-progress-implied $quiet --non-empty)" || {
-		>.gc_failed
-		exit 1
-	}
+	progress "~ [$proj] hard-linking loose objects into immediate child forks"
 	# We have to update the lastparentgc time in the child forks even if they do not get any
-	# new "loose objects" pack(s) because they need to run gc just in case the parent now has
-	# some objects that used to only be in the child so they can be removed from the child.
+	# new "loose objects" because they need to run gc just in case the parent now has some
+	# objects that used to only be in the child so they can be removed from the child.
 	# For example, a "patch" might be developed first in a fork and then later accepted into
 	# the parent in which case the objects making up the patch in the child fork are now
 	# redundant (since they're now in the parent as well) and need to be removed from the
 	# child fork which can only happen if the child fork runs gc.
+	shbin="${cfg_posix_sh_bin:-/bin/sh}"
 	forkdir="$proj"
 	# It is enough to copy objects just one level down and get_repo_list
 	# takes a regular expression (which is automatically prefixed with '^')
@@ -702,35 +695,17 @@ if has_forks "$proj"; then
 		# Or do not have a non-zero length alternates file
 		[ -s "$cfg_reporoot/$fork.git/objects/info/alternates" ] || \
 			continue
-		if [ -n "$lpacks" ]; then
-			# Install the "loose objects" pack(s) into the fork
-			[ -d "$cfg_reporoot/$fork.git/objects/pack" ] || (
-				cd "$cfg_reporoot/$fork.git" && \
-				mkdir -p objects/pack
-			)
-			for lpack in $lpacks; do
-				ln -f objects/pack/"pack-$lpack.pack" objects/pack/"pack-$lpack.idx" \
-					"$cfg_reporoot/$fork.git/objects/pack/" || :
-			done
-			if ! [ -e "$cfg_reporoot/$fork.git/.needsgc" ]; then
-				# Trigger a mini gc in the fork if it now has too many packs
-				packs="$(list_packs --quiet --count --exclude-no-idx "$cfg_reporoot/$fork.git/objects/pack" || :)"
-				if [ -n "$packs" ] && [ "$packs" -ge 20 ]; then
-					>"$cfg_reporoot/$fork.git/.needsgc"
-				fi
-			fi
-			git --git-dir="$cfg_reporoot/$fork.git" update-server-info
-		fi
+		# Match objects in parent project
+		for d in objects/$octet; do
+			[ "$d" != "objects/$octet" ] || continue
+			mkdir -p "$cfg_reporoot/$fork.git/$d"
+			find "$d" -maxdepth 1 -type f -name "$octet19" -print0 |
+			xargs -0 "$shbin" -c 'ln -f "$@" '"'$cfg_reporoot/$fork.git/$d/'" sh || :
+		done
 		# Update the fork's lastparentgc date (must be current, not $gcstart)
 		git --git-dir="$cfg_reporoot/$fork.git" config \
 			gitweb.lastparentgc "$(date "$datefmt")"
 	done
-	if [ -n "$lpacks" ]; then
-		# Remove the "loose objects" pack(s) from the parent
-		for lpack in $lpacks; do
-			rm -f objects/pack/"pack-$lpack.idx" objects/pack/"pack-$lpack.pack"
-		done
-	fi
 fi

 # The git prune command does not take a -q or --quiet but started outputting
diff --git a/toolbox/perform-pre-gc-linking.sh b/toolbox/perform-pre-gc-linking.sh
index dc95234..af827bc 100755
--- a/toolbox/perform-pre-gc-linking.sh
+++ b/toolbox/perform-pre-gc-linking.sh
@@ -25,11 +25,13 @@
 # It is enough to run this script before a "git gc" rather than in the
 # middle (i.e. after "git repack -A -d -l" but before "git prune") although
 # that could result in forks referring to now-loosened objects in the parent.
-# This is not a terrible thing and those objects will be packed up and linked
-# into the forks before any future "git prune" so nothing will be lost, but
-# the forks will not be quite as efficient as they could be with everything
-# located in packs. The simple solution is to just run this script again
-# after running "git gc". In other words do this:
+# This is not a terrible thing and those objects will be propagated into the
+# forks before any future "git prune" so nothing will be lost.
+#
+# However, having any child forks depend on loose objects only available via
+# their alternates as unreachable loose objects in those alternates can easily
+# be avoided by simply running this script again after running "git gc". In
+# other words do this:
 #
 # 1. Run this script on the project
 # 2. Run "git gc" with any options AVOIDING any than cause immediate pruning
@@ -37,7 +39,11 @@
 #
 # Note that running this script WILL make all the project's child forks eligible
 # for gc at their next interval (i.e. they will not skip running gc even if it
-# ends up not actually doing anything)
+# ends up not actually doing anything).
+#
+# Loose objects are normally just hard-linked into the child forks, but if the
+# "--single-pack" option is used they will instead be combined into a single
+# pack and that will be hard-linked into the child forks instead.

 set -e

@@ -46,10 +52,10 @@ set -e
 umask 002

 force=
-if [ "$1" = "--force" ]; then
-	force=1
-	shift
-fi
+singlepack=
+[ "$1" != "--force" ] || { force=1; shift; }
+[ "$1" != "--single-pack" ] || { singlepack=1; shift; }
+[ "$1" != "--force" ] || { force=1; shift; }

 proj="${1%.git}"
 if [ "$#" -ne 1 ] || [ -z "$proj" ]; then
@@ -73,7 +79,7 @@ if [ -z "$force" ] && ! [ -e .nogc -o -e .bypass ]; then
 	echo "WARNING: no .nogc or .bypass file found in $cfg_reporoot/$proj.git"
 	echo "WARNING: jobd.pl could run gc.sh while you're fussing with $proj"
 	echo "WARNING: either create one of those files or re-run with --force"
-	echo "WARNING: (e.g. \"$(basename "$0") --force $proj\") to bypass this warning"
+	echo "WARNING: (e.g. \"$(basename "$0") --force ${singlepack:+--single-pack }$proj\") to bypass this warning"
 	echo "WARNING: please remember to remove the file after you're done fussing"
 	exit 1
 fi
@@ -110,7 +116,9 @@ fi


 # This part creates a pack of all loose objects and hard-links it into any children
+# This logic is no longer used by default but may be selected with the "--single-pack" option

+propagate_single_pack() {
 # ---- BEGIN DUPLICATED CODE SECTION TWO ----

 if has_forks "$proj"; then
@@ -173,7 +181,58 @@ if has_forks "$proj"; then
 fi

 # ---- END DUPLICATED CODE SECTION TWO ----
+}
+
+# This part hard-links all loose objects into any children
+
+propagate_objects() {
+# ---- BEGIN DUPLICATED CODE SECTION THREE ----
+
+if has_forks "$proj"; then
+	progress "~ [$proj] hard-linking loose objects into immediate child forks"
+	# We have to update the lastparentgc time in the child forks even if they do not get any
+	# new "loose objects" because they need to run gc just in case the parent now has some
+	# objects that used to only be in the child so they can be removed from the child.
+	# For example, a "patch" might be developed first in a fork and then later accepted into
+	# the parent in which case the objects making up the patch in the child fork are now
+	# redundant (since they're now in the parent as well) and need to be removed from the
+	# child fork which can only happen if the child fork runs gc.
+	shbin="${cfg_posix_sh_bin:-/bin/sh}"
+	forkdir="$proj"
+	# It is enough to copy objects just one level down and get_repo_list
+	# takes a regular expression (which is automatically prefixed with '^')
+	# so we can easily match forks exactly one level down from this project
+	get_repo_list "$forkdir/[^/]*:" |
+	while read fork; do
+		# Ignore forks that do not exist or are symbolic links
+		[ ! -L "$cfg_reporoot/$fork.git" -a -d "$cfg_reporoot/$fork.git" ] || \
+			continue
+		# Or do not have a non-zero length alternates file
+		[ -s "$cfg_reporoot/$fork.git/objects/info/alternates" ] || \
+			continue
+		# Match objects in parent project
+		for d in objects/$octet; do
+			[ "$d" != "objects/$octet" ] || continue
+			mkdir -p "$cfg_reporoot/$fork.git/$d"
+			find "$d" -maxdepth 1 -type f -name "$octet19" -print0 |
+			xargs -0 "$shbin" -c 'ln -f "$@" '"'$cfg_reporoot/$fork.git/$d/'" sh || :
+		done
+		# Update the fork's lastparentgc date (must be current, not $gcstart)
+		git --git-dir="$cfg_reporoot/$fork.git" config \
+			gitweb.lastparentgc "$(date "$datefmt")"
+	done
+fi
+
+# ---- END DUPLICATED CODE SECTION THREE ----
+}
+
+
+if [ -n "$singlepack" ]; then
+	propagate_single_pack
+else
+	propagate_objects
+fi

 trap - EXIT

-echo "loose objects for $proj have now been packed and linked into child forks (if any)"
+echo "loose objects for $proj have now been ${singlepack:+packed and }linked into child forks (if any)"
-- 
2.11.4.GIT