ja/09-git-internals/01-chapter9.markdown

   1 # Git Internals #
   2
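   3 You may have skipped ahead to this chapter from an earlier one, or you may have arrived here after reading the rest of the book. Either way, this is where you’ll trace the inner workings and implementation of Git. Learning this material is fundamentally important to understanding why Git is so useful and effective, but others have argued that it can be confusing and unnecessarily complex for beginners. For that reason, this discussion is the last chapter of the book, so you can read it early or late in your learning process. When to read it is up to you.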
   4
   5 Now that you’re here, let’s get started. First, Git is fundamentally a content-addressable filesystem with a VCS user interface written on top of it. You’ll learn more about what this means in a bit.
   6
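   7 In the early days of Git (mostly before 1.5), the user interface was much more complex than it is today, because it emphasized the filesystem rather than a polished VCS. Over the last few years, the UI has been refined until it is as clean and easy to use as any system out there, but the stereotype of Git as complex and hard to learn often lingers from that early UI.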
   8
   9
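  10 The content-addressable filesystem layer is amazingly cool, so this chapter covers it first. After that, you’ll learn about the transport mechanisms and the repository maintenance tasks that you may eventually have to deal with.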
  11
  12 ## Plumbing and Porcelain ##
  13
  14 This book covers how to use Git with 30 or so verbs such as `checkout`, `branch`, `remote`, and so on. But because Git was initially a toolkit for a VCS rather than a full user-friendly VCS, it has a bunch of verbs that do low-level work and were designed to be chained together UNIX style or called from scripts. These commands are generally referred to as "plumbing" commands, and the more user-friendly commands are called "porcelain" commands.
  15
  16 The book’s first eight chapters deal almost exclusively with porcelain commands. But in this chapter, you’ll be dealing mostly with the lower-level plumbing commands, because they give you access to the inner workings of Git and help demonstrate how and why Git does what it does. These commands aren’t meant to be used manually on the command line, but rather to be used as building blocks for new tools and custom scripts.
  17
  18 When you run `git init` in a new or existing directory, Git creates the `.git` directory, which is where almost everything that Git stores and manipulates is located. If you want to back up or clone your repository, copying this single directory elsewhere gives you nearly everything you need. This entire chapter basically deals with the stuff in this directory. Here’s what it looks like:
  19
  20         $ ls -F1 .git
  21         HEAD
  22         branches/
  23         config
  24         description
  25         hooks/
  26         index
  27         info/
  28         objects/
  29         refs/
  30
  31 You may see some other files in there, but this is a fresh `git init` repository, so it’s what you see by default. The `branches` directory isn’t used by newer Git versions, and the `description` file is only used by the GitWeb program, so don’t worry about those. The `config` file contains your project-specific configuration options, and the `info` directory keeps a global exclude file for ignored patterns that you don’t want to track in a `.gitignore` file. The `hooks` directory contains your client- or server-side hook scripts, which are discussed in detail in Chapter 7.
  32
  33 This leaves four important entries: the `HEAD` and `index` files and the `objects` and `refs` directories. These are the core parts of Git. The `objects` directory stores all the content for your database, the `refs` directory stores pointers into commit objects in that data (branches), the `HEAD` file points to the branch you currently have checked out, and the `index` file is where Git stores your staging area information. You’ll now look at each of these sections in detail to see how Git operates.
  34
  35 ## Git Objects ##
  36
  37 Git is a content-addressable filesystem. Great. What does that mean?
  38 It means that at the core of Git is a simple key-value data store. You can insert any kind of content into it, and it will give you back a key that you can use to retrieve the content again at any time. To demonstrate, you can use the plumbing command `hash-object`, which takes some data, stores it in your `.git` directory, and gives you back the key the data is stored as. First, you initialize a new Git repository and verify that there is nothing in the `objects` directory:
  39
  40         $ mkdir test
  41         $ cd test
  42         $ git init
  43         Initialized empty Git repository in /tmp/test/.git/
  44         $ find .git/objects
  45         .git/objects
  46         .git/objects/info
  47         .git/objects/pack
  48         $ find .git/objects -type f
  49         $
  50
  51 Git has initialized the `objects` directory and created `pack` and `info` subdirectories in it, but there are no regular files. Now, store some text in your Git database:
  52
  53         $ echo 'test content' | git hash-object -w --stdin
  54         d670460b4b4aece5915caf5c68d12f560a9fe3e4
  55
  56 The `-w` tells `hash-object` to store the object; otherwise, the command simply tells you what the key would be. `--stdin` tells the command to read the content from stdin; if you don’t specify this, `hash-object` expects the path to a file. The output from the command is a 40-character checksum hash. This is the SHA-1 hash — a checksum of the content you’re storing plus a header, which you’ll learn about in a bit. Now you can see how Git has stored your data:
  57
  58         $ find .git/objects -type f
  59         .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
  60
  61 You can see a file in the `objects` directory. This is how Git stores the content initially — as a single file per piece of content, named with the SHA-1 checksum of the content and its header. The subdirectory is named with the first 2 characters of the SHA, and the filename is the remaining 38 characters.
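The naming rule just described can be sketched in a few lines of Ruby. This is only an illustration: `loose_object_path` is a hypothetical helper name, and the default `.git` directory is an assumption.

```ruby
# Map a 40-character SHA-1 hex string to the path of its loose object:
# the first 2 characters name the subdirectory, the remaining 38 the file.
def loose_object_path(sha1, git_dir = ".git")
  unless sha1 =~ /\A[0-9a-f]{40}\z/
    raise ArgumentError, "expected 40 lowercase hex characters"
  end
  File.join(git_dir, "objects", sha1[0, 2], sha1[2, 38])
end

puts loose_object_path("d670460b4b4aece5915caf5c68d12f560a9fe3e4")
# prints .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
```

Splitting on the first two characters keeps any single directory from accumulating every object in the repository.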
  62
  63 You can pull the content back out of Git with the `cat-file` command. This command is sort of a Swiss army knife for inspecting Git objects. Passing `-p` to it instructs the `cat-file` command to figure out the type of content and display it nicely for you:
  64
  65         $ git cat-file -p d670460b4b4aece5915caf5c68d12f560a9fe3e4
  66         test content
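To make the key-value idea concrete, here is a minimal in-memory sketch of a content-addressable store in Ruby. `ToyObjectStore` is a hypothetical class, not part of Git. The key is computed the way Git computes blob keys, using the header format covered in the Object Storage section, but nothing is compressed or written to disk.

```ruby
require 'digest/sha1'

# A toy content-addressable store: the key is derived from the content itself.
class ToyObjectStore
  def initialize
    @objects = {}
  end

  # Store content and return its key (computed like a Git blob key).
  def put(content)
    key = Digest::SHA1.hexdigest("blob #{content.bytesize}\0" + content)
    @objects[key] = content.dup
    key
  end

  # Retrieve content by key; raises KeyError for an unknown key.
  def get(key)
    @objects.fetch(key)
  end
end

store = ToyObjectStore.new
key = store.put("test content\n")   # echo appends the trailing newline
puts key                            # prints d670460b4b4aece5915caf5c68d12f560a9fe3e4
print store.get(key)                # prints test content
```

Because the key depends only on the content, storing the same bytes twice yields the same key and no duplicate entry.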
  67
  68 Now, you can add content to Git and pull it back out again. You can also do this with content in files. For example, you can do some simple version control on a file. First, create a new file and save its contents in your database:
  69
  70         $ echo 'version 1' > test.txt
  71         $ git hash-object -w test.txt
  72         83baae61804e65cc73a7201a7252750c76066a30
  73
  74 Then, write some new content to the file, and save it again:
  75
  76         $ echo 'version 2' > test.txt
  77         $ git hash-object -w test.txt
  78         1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
  79
  80 Your database contains the two new versions of the file as well as the first content you stored there:
  81
  82         $ find .git/objects -type f
  83         .git/objects/1f/7a7a472abf3dd9643fd615f6da379c4acb3e3a
  84         .git/objects/83/baae61804e65cc73a7201a7252750c76066a30
  85         .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
  86
  87 Now you can revert the file back to the first version
  88
  89         $ git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30 > test.txt
  90         $ cat test.txt
  91         version 1
  92
  93 or the second version:
  94
  95         $ git cat-file -p 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a > test.txt
  96         $ cat test.txt
  97         version 2
  98
  99 But remembering the SHA-1 key for each version of your file isn’t practical; plus, you aren’t storing the filename in your system, just the content. An object that holds only file content like this is called a blob. You can have Git tell you the object type of any object in Git, given its SHA-1 key, with `cat-file -t`:
 100
 101         $ git cat-file -t 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
 102         blob
 103
 104 ### Tree Objects ###
 105
 106 The next type you’ll look at is the tree object, which solves the problem of storing the filename and also allows you to store a group of files together. Git stores content in a manner similar to a UNIX filesystem, but a bit simplified. All the content is stored as tree and blob objects, with trees corresponding to UNIX directory entries and blobs corresponding more or less to inodes or file contents. A single tree object contains one or more tree entries, each of which contains a SHA-1 pointer to a blob or subtree with its associated mode, type, and filename. For example, the most recent tree in the simplegit project may look something like this:
 107
 108         $ git cat-file -p master^{tree}
 109         100644 blob a906cb2a4a904a152e80877d4088654daad0c859      README
 110         100644 blob 8f94139338f9404f26296befa88755fc2598c289      Rakefile
 111         040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0      lib
 112
 113 The `master^{tree}` syntax specifies the tree object that is pointed to by the last commit on your `master` branch. Notice that the `lib` subdirectory isn’t a blob but a pointer to another tree:
 114
 115         $ git cat-file -p 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0
 116         100644 blob 47c6340d6459e05787f644c2447d2595f5d3a54b      simplegit.rb
 117
 118 Conceptually, the data that Git is storing is something like Figure 9-1.
 119
 120 Insert 18333fig0901.png
 121 Figure 9-1. Simple version of the Git data model.
 122
 123 You can create your own tree. Git normally creates a tree by taking the state of your staging area or index and writing a tree object from it. So, to create a tree object, you first have to set up an index by staging some files. To create an index with a single entry (the first version of your test.txt file), you can use the plumbing command `update-index`. You use this command to artificially add the earlier version of the test.txt file to a new staging area. You must pass it the `--add` option because the file doesn’t yet exist in your staging area (you don’t even have a staging area set up yet) and `--cacheinfo` because the file you’re adding isn’t in your directory but is in your database. Then, you specify the mode, SHA-1, and filename:
 124
 125         $ git update-index --add --cacheinfo 100644 \
 126           83baae61804e65cc73a7201a7252750c76066a30 test.txt
 127
 128 In this case, you’re specifying a mode of `100644`, which means it’s a normal file. Other options are `100755`, which means it’s an executable file; and `120000`, which specifies a symbolic link. The mode is taken from normal UNIX modes but is much less flexible — these three modes are the only ones that are valid for files (blobs) in Git (although other modes are used for directories and submodules).
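The mode strings can be decoded like ordinary UNIX modes: the high bits give the kind of entry and the low bits the permissions. The sketch below is illustrative only. `describe_mode` is a hypothetical helper, and the `160000` entry for submodules comes from general Git knowledge rather than from this chapter.

```ruby
# High (file-type) bits of a tree-entry mode, mapped to a description.
KIND_BY_TYPE_BITS = {
  0o100000 => "regular file",
  0o120000 => "symbolic link",
  0o040000 => "directory (tree)",
  0o160000 => "submodule (gitlink)"   # assumption: not shown in this chapter
}

def describe_mode(mode_string)
  mode = mode_string.to_i(8)                      # parse the octal string
  kind = KIND_BY_TYPE_BITS.fetch(mode & 0o170000, "unknown")
  executable = (mode & 0o111) != 0                # any execute bit set?
  kind == "regular file" && executable ? "executable file" : kind
end

%w[100644 100755 120000 040000].each do |m|
  puts "#{m}: #{describe_mode(m)}"
end
```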
 129
 130 Now, you can use the `write-tree` command to write the staging area out to a tree object. No `-w` option is needed — calling `write-tree` automatically creates a tree object from the state of the index if that tree doesn’t yet exist:
 131
 132         $ git write-tree
 133         d8329fc1cc938780ffdd9f94e0d364e0ea74f579
 134         $ git cat-file -p d8329fc1cc938780ffdd9f94e0d364e0ea74f579
 135         100644 blob 83baae61804e65cc73a7201a7252750c76066a30      test.txt
 136
 137 You can also verify that this is a tree object:
 138
 139         $ git cat-file -t d8329fc1cc938780ffdd9f94e0d364e0ea74f579
 140         tree
 141
 142 You’ll now create a new tree with the second version of test.txt and a new file as well:
 143
 144         $ echo 'new file' > new.txt
 145         $ git update-index test.txt
 146         $ git update-index --add new.txt
 147
 148 Your staging area now has the new version of test.txt as well as the new file new.txt. Write out that tree (recording the state of the staging area or index to a tree object) and see what it looks like:
 149
 150         $ git write-tree
 151         0155eb4229851634a0f03eb265b69f5a2d56f341
 152         $ git cat-file -p 0155eb4229851634a0f03eb265b69f5a2d56f341
 153         100644 blob fa49b077972391ad58037050f2a75f74e3671e92      new.txt
 154         100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a      test.txt
 155
 156 Notice that this tree has both file entries and also that the test.txt SHA is the "version 2" SHA from earlier (`1f7a7a`). Just for fun, you’ll add the first tree as a subdirectory into this one. You can read trees into your staging area by calling `read-tree`. In this case, you can read an existing tree into your staging area as a subtree by using the `--prefix` option to `read-tree`:
 157
 158         $ git read-tree --prefix=bak d8329fc1cc938780ffdd9f94e0d364e0ea74f579
 159         $ git write-tree
 160         3c4e9cd789d88d8d89c1073707c3585e41b0e614
 161         $ git cat-file -p 3c4e9cd789d88d8d89c1073707c3585e41b0e614
 162         040000 tree d8329fc1cc938780ffdd9f94e0d364e0ea74f579      bak
 163         100644 blob fa49b077972391ad58037050f2a75f74e3671e92      new.txt
 164         100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a      test.txt
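For readers curious how a tree's SHA-1 comes about, the following Ruby sketch serializes tree entries and hashes them. It rests on assumptions this chapter does not spell out: each entry is stored as `<mode> <name>`, a null byte, and the 20 raw bytes of the SHA-1; directory modes are written without the leading zero (`40000`); and entries are sorted by name. Git's real ordering treats directory names specially, so plain sorting is only adequate for simple cases like this one. `tree_sha1` is a hypothetical helper.

```ruby
require 'digest/sha1'

# Compute the SHA-1 of a tree object from a list of entries.
# Each entry is a hash with :mode, :name and :sha1 (40 hex characters).
def tree_sha1(entries)
  body = "".b
  entries.sort_by { |e| e[:name] }.each do |e|
    # "<mode> <name>\0" followed by the SHA-1 as 20 raw bytes
    body << "#{e[:mode]} #{e[:name]}\0".b << [e[:sha1]].pack("H*")
  end
  Digest::SHA1.hexdigest("tree #{body.bytesize}\0".b + body)
end

first_tree = tree_sha1([
  { mode: "100644", name: "test.txt", sha1: "83baae61804e65cc73a7201a7252750c76066a30" }
])
puts first_tree

third_tree = tree_sha1([
  { mode: "40000",  name: "bak",      sha1: first_tree },
  { mode: "100644", name: "new.txt",  sha1: "fa49b077972391ad58037050f2a75f74e3671e92" },
  { mode: "100644", name: "test.txt", sha1: "1f7a7a472abf3dd9643fd615f6da379c4acb3e3a" }
])
puts third_tree
```

If the format assumptions hold, the two printed values should correspond to the first tree and the `bak`-containing tree written with `write-tree` in this section.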
 165
 166 If you created a working directory from the new tree you just wrote, you would get the two files in the top level of the working directory and a subdirectory named `bak` that contained the first version of the test.txt file. You can think of the data that Git contains for these structures as being like Figure 9-2.
 167
 168 Insert 18333fig0902.png
 169 Figure 9-2. The content structure of your current Git data.
 170
 171 ### Commit Objects ###
 172
 173 You have three trees that specify the different snapshots of your project that you want to track, but the earlier problem remains: you must remember all three SHA-1 values in order to recall the snapshots. You also don’t have any information about who saved the snapshots, when they were saved, or why they were saved. This is the basic information that the commit object stores for you.
 174
 175 To create a commit object, you call `commit-tree` and specify a single tree SHA-1 and which commit objects, if any, directly preceded it. Start with the first tree you wrote:
 176
 177         $ echo 'first commit' | git commit-tree d8329f
 178         fdf4fc3344e67ab068f836878b6c4951e3b15f3d
 179
 180 Now you can look at your new commit object with `cat-file`:
 181
 182         $ git cat-file -p fdf4fc3
 183         tree d8329fc1cc938780ffdd9f94e0d364e0ea74f579
 184         author Scott Chacon <schacon@gmail.com> 1243040974 -0700
 185         committer Scott Chacon <schacon@gmail.com> 1243040974 -0700
 186
 187         first commit
 188
 189 The format for a commit object is simple: it specifies the top-level tree for the snapshot of the project at that point; the author/committer information pulled from your `user.name` and `user.email` configuration settings, with the current timestamp; a blank line, and then the commit message.
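The commit format described here is plain text, so it is easy to assemble by hand. The Ruby sketch below builds that text and hashes it with the same `type size\0` header used for other objects. `commit_content` and `object_sha1` are hypothetical helper names, and the optional `parent` lines follow the layout Git uses for commits that have predecessors.

```ruby
require 'digest/sha1'

# Build the text of a commit object. The message should end with a newline,
# as it does when piped in from echo.
def commit_content(tree:, author:, message:, parents: [], committer: author)
  lines = ["tree #{tree}"]
  parents.each { |parent| lines << "parent #{parent}" }
  lines << "author #{author}"
  lines << "committer #{committer}"
  lines.join("\n") + "\n\n" + message
end

# Hash any object the way Git does: header, null byte, then content.
def object_sha1(type, content)
  Digest::SHA1.hexdigest("#{type} #{content.bytesize}\0" + content)
end

content = commit_content(
  tree: "d8329fc1cc938780ffdd9f94e0d364e0ea74f579",
  author: "Scott Chacon <schacon@gmail.com> 1243040974 -0700",
  message: "first commit\n"
)
puts content
puts object_sha1("commit", content)
```

Because the author, committer, and timestamps are part of the hashed text, the same tree committed by someone else, or a second later, gets a different SHA-1.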
 190
 191 Next, you’ll write the other two commit objects, each referencing the commit that came directly before it:
 192
 193         $ echo 'second commit' | git commit-tree 0155eb -p fdf4fc3
 194         cac0cab538b970a37ea1e769cbbde608743bc96d
 195         $ echo 'third commit'  | git commit-tree 3c4e9c -p cac0cab
 196         1a410efbd13591db07496601ebc7a059dd55cfe9
 197
 198 Each of the three commit objects points to one of the three snapshot trees you created. Oddly enough, you have a real Git history now that you can view with the `git log` command, if you run it on the last commit SHA-1:
 199
 200         $ git log --stat 1a410e
 201         commit 1a410efbd13591db07496601ebc7a059dd55cfe9
 202         Author: Scott Chacon <schacon@gmail.com>
 203         Date:   Fri May 22 18:15:24 2009 -0700
 204
 205             third commit
 206
 207          bak/test.txt |    1 +
 208          1 files changed, 1 insertions(+), 0 deletions(-)
 209
 210         commit cac0cab538b970a37ea1e769cbbde608743bc96d
 211         Author: Scott Chacon <schacon@gmail.com>
 212         Date:   Fri May 22 18:14:29 2009 -0700
 213
 214             second commit
 215
 216          new.txt  |    1 +
 217          test.txt |    2 +-
 218          2 files changed, 2 insertions(+), 1 deletions(-)
 219
 220         commit fdf4fc3344e67ab068f836878b6c4951e3b15f3d
 221         Author: Scott Chacon <schacon@gmail.com>
 222         Date:   Fri May 22 18:09:34 2009 -0700
 223
 224             first commit
 225
 226          test.txt |    1 +
 227          1 files changed, 1 insertions(+), 0 deletions(-)
 228
 229 Amazing. You’ve just done the low-level operations to build up a Git history without using any of the front ends. This is essentially what Git does when you run the `git add` and `git commit` commands — it stores blobs for the files that have changed, updates the index, writes out trees, and writes commit objects that reference the top-level trees and the commits that came immediately before them. These three main Git objects — the blob, the tree, and the commit — are initially stored as separate files in your `.git/objects` directory. Here are all the objects in the example directory now, commented with what they store:
 230
 231         $ find .git/objects -type f
 232         .git/objects/01/55eb4229851634a0f03eb265b69f5a2d56f341 # tree 2
 233         .git/objects/1a/410efbd13591db07496601ebc7a059dd55cfe9 # commit 3
 234         .git/objects/1f/7a7a472abf3dd9643fd615f6da379c4acb3e3a # test.txt v2
 235         .git/objects/3c/4e9cd789d88d8d89c1073707c3585e41b0e614 # tree 3
 236         .git/objects/83/baae61804e65cc73a7201a7252750c76066a30 # test.txt v1
 237         .git/objects/ca/c0cab538b970a37ea1e769cbbde608743bc96d # commit 2
 238         .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4 # 'test content'
 239         .git/objects/d8/329fc1cc938780ffdd9f94e0d364e0ea74f579 # tree 1
 240         .git/objects/fa/49b077972391ad58037050f2a75f74e3671e92 # new.txt
 241         .git/objects/fd/f4fc3344e67ab068f836878b6c4951e3b15f3d # commit 1
 242
 243 If you follow all the internal pointers, you get an object graph something like Figure 9-3.
 244
 245 Insert 18333fig0903.png
 246 Figure 9-3. All the objects in your Git directory.
 247
 248 ### Object Storage ###
 249
 250 I mentioned earlier that a header is stored with the content. Let’s take a minute to look at how Git stores its objects. You’ll see how to store a blob object — in this case, the string "what is up, doc?" — interactively in the Ruby scripting language. You can start up interactive Ruby mode with the `irb` command:
 251
 252         $ irb
 253         >> content = "what is up, doc?"
 254         => "what is up, doc?"
 255
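 256 Git constructs a header that starts with the type of the object, in this case a blob. Then, it adds a space followed by the size of the content in bytes and finally a null byte. For this ASCII string, `content.length` equals the byte count; for multibyte content in Ruby 1.9 or later, `content.bytesize` is the safe choice: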
 257
 258         >> header = "blob #{content.length}\0"
 259         => "blob 16\000"
 260
 261 Git concatenates the header and the original content and then calculates the SHA-1 checksum of that new content. You can calculate the SHA-1 value of a string in Ruby by including the SHA1 digest library with the `require` command and then calling `Digest::SHA1.hexdigest()` with the string:
 262
 263         >> store = header + content
 264         => "blob 16\000what is up, doc?"
 265         >> require 'digest/sha1'
 266         => true
 267         >> sha1 = Digest::SHA1.hexdigest(store)
 268         => "bd9dbf5aae1a3862dd1526723246b20206e5fc37"
 269
 270 Git compresses the new content with zlib, which you can do in Ruby with the zlib library. First, you need to require the library and then run `Zlib::Deflate.deflate()` on the content:
 271
 272         >> require 'zlib'
 273         => true
 274         >> zlib_content = Zlib::Deflate.deflate(store)
 275         => "x\234K\312\311OR04c(\317H,Q\310,V(-\320QH\311O\266\a\000_\034\a\235"
 276
 277 Finally, you’ll write your zlib-deflated content to an object on disk. You’ll determine the path of the object you want to write out (the first two characters of the SHA-1 value being the subdirectory name, and the last 38 characters being the filename within that directory). In Ruby, you can use the `FileUtils.mkdir_p()` function to create the subdirectory if it doesn’t exist. Then, open the file with `File.open()` and write out the previously zlib-compressed content to the file with a `write()` call on the resulting file handle:
 278
 279         >> path = '.git/objects/' + sha1[0,2] + '/' + sha1[2,38]
 280         => ".git/objects/bd/9dbf5aae1a3862dd1526723246b20206e5fc37"
 281         >> require 'fileutils'
 282         => true
 283         >> FileUtils.mkdir_p(File.dirname(path))
 284         => ".git/objects/bd"
 285         >> File.open(path, 'w') { |f| f.write zlib_content }
 286         => 32
 287
 288 That’s it — you’ve created a valid Git blob object. All Git objects are stored the same way, just with different types — instead of the string blob, the header will begin with commit or tree. Also, although the blob content can be nearly anything, the commit and tree content are very specifically formatted.
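The interactive session can be condensed into a pair of helpers, together with the reverse operation: inflating a loose object and splitting off its header. This is a sketch, not Git's implementation. `write_loose_object` and `read_loose_object` are hypothetical names, and the demo writes into a temporary directory rather than a real repository.

```ruby
require 'digest/sha1'
require 'zlib'
require 'fileutils'
require 'tmpdir'

# Write content as a loose object under git_dir and return its SHA-1.
def write_loose_object(git_dir, type, content)
  store = "#{type} #{content.bytesize}\0".b + content.b
  sha1 = Digest::SHA1.hexdigest(store)
  path = File.join(git_dir, "objects", sha1[0, 2], sha1[2, 38])
  FileUtils.mkdir_p(File.dirname(path))
  File.binwrite(path, Zlib::Deflate.deflate(store))
  sha1
end

# Read a loose object back and return [type, content].
def read_loose_object(git_dir, sha1)
  path = File.join(git_dir, "objects", sha1[0, 2], sha1[2, 38])
  store = Zlib::Inflate.inflate(File.binread(path))
  header, content = store.split("\0", 2)      # header ends at the first null byte
  type, size = header.split(" ")
  raise "size mismatch" unless size.to_i == content.bytesize
  [type, content]
end

Dir.mktmpdir do |dir|
  sha1 = write_loose_object(dir, "blob", "what is up, doc?")
  type, content = read_loose_object(dir, sha1)
  puts sha1
  puts "#{type}: #{content}"
end
```

Checking the recorded size against the inflated content is a cheap way to catch a truncated or corrupted object file.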
 289
 290 ## Git References ##
 291
 292 You can run something like `git log 1a410e` to look through your whole history, but you still have to remember that `1a410e` is the last commit in order to walk that history to find all those objects. You need a file in which you can store the SHA-1 value under a simple name so you can use that pointer rather than the raw SHA-1 value.
 293
 294 In Git, these are called "references" or "refs"; you can find the files that contain the SHA-1 values in the `.git/refs` directory. In the current project, this directory contains no files, but it does contain a simple structure:
 295
 296         $ find .git/refs
 297         .git/refs
 298         .git/refs/heads
 299         .git/refs/tags
 300         $ find .git/refs -type f
 301         $
 302
 303 To create a new reference that will help you remember where your latest commit is, you can technically do something as simple as this:
 304
 305         $ echo "1a410efbd13591db07496601ebc7a059dd55cfe9" > .git/refs/heads/master
 306
 307 Now, you can use the head reference you just created instead of the SHA-1 value in your Git commands:
 308
 309         $ git log --pretty=oneline  master
 310         1a410efbd13591db07496601ebc7a059dd55cfe9 third commit
 311         cac0cab538b970a37ea1e769cbbde608743bc96d second commit
 312         fdf4fc3344e67ab068f836878b6c4951e3b15f3d first commit
 313
 314 You aren’t encouraged to edit the reference files directly. If you want to update a reference, Git provides a safer command to do so, called `update-ref`:
 315
 316         $ git update-ref refs/heads/master 1a410efbd13591db07496601ebc7a059dd55cfe9
 317
 318 That’s basically what a branch in Git is: a simple pointer or reference to the head of a line of work. To create a branch back at the second commit, you can do this:
 319
 320         $ git update-ref refs/heads/test cac0ca
 321
 322 Your branch will contain only work from that commit down:
 323
 324         $ git log --pretty=oneline test
 325         cac0cab538b970a37ea1e769cbbde608743bc96d second commit
 326         fdf4fc3344e67ab068f836878b6c4951e3b15f3d first commit
 327
 328 Now, your Git database conceptually looks something like Figure 9-4.
 329
 330 Insert 18333fig0904.png
 331 Figure 9-4. Git directory objects with branch head references included.
 332
 333 When you run commands like `git branch (branchname)`, Git basically runs that `update-ref` command to add the SHA-1 of the last commit of the branch you’re on into whatever new reference you want to create.
 334
 335 ### The HEAD ###
 336
 337 The question now is, when you run `git branch (branchname)`, how does Git know the SHA-1 of the last commit? The answer is the HEAD file. The HEAD file is a symbolic reference to the branch you’re currently on. By symbolic reference, I mean that unlike a normal reference, it doesn’t generally contain a SHA-1 value but rather a pointer to another reference. If you look at the file, you’ll normally see something like this:
 338
 339         $ cat .git/HEAD
 340         ref: refs/heads/master
 341
 342 If you run `git checkout test`, Git updates the file to look like this:
 343
 344         $ cat .git/HEAD
 345         ref: refs/heads/test
 346
 347 When you run `git commit`, it creates the commit object, specifying the parent of that commit object to be whatever SHA-1 value the reference in HEAD points to.
 348
 349 You can also manually edit this file, but again a safer command exists to do so: `symbolic-ref`. You can read the value of your HEAD via this command:
 350
 351         $ git symbolic-ref HEAD
 352         refs/heads/master
 353
 354 You can also set the value of HEAD:
 355
 356         $ git symbolic-ref HEAD refs/heads/test
 357         $ cat .git/HEAD
 358         ref: refs/heads/test
 359
 360 You can’t set a symbolic reference outside of the refs style:
 361
 362         $ git symbolic-ref HEAD test
 363         fatal: Refusing to point HEAD outside of refs/
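Resolving HEAD to a commit amounts to following `ref:` pointers until a SHA-1 turns up. The Ruby sketch below does this against plain files only. `resolve_ref` is a hypothetical helper; it deliberately ignores packed refs and the other lookup rules real Git applies, and the demo uses a temporary directory as a stand-in for `.git`.

```ruby
require 'fileutils'
require 'tmpdir'

# Follow symbolic references ("ref: <name>") until a SHA-1 is found.
# Returns nil if the reference file does not exist (e.g. an unborn branch).
def resolve_ref(git_dir, name, depth = 0)
  raise "too many levels of symbolic refs" if depth > 5
  path = File.join(git_dir, name)
  return nil unless File.file?(path)
  value = File.read(path).strip
  if value.start_with?("ref: ")
    resolve_ref(git_dir, value.sub("ref: ", ""), depth + 1)
  else
    value
  end
end

Dir.mktmpdir do |git_dir|
  FileUtils.mkdir_p(File.join(git_dir, "refs", "heads"))
  File.write(File.join(git_dir, "refs", "heads", "test"),
             "cac0cab538b970a37ea1e769cbbde608743bc96d\n")
  File.write(File.join(git_dir, "HEAD"), "ref: refs/heads/test\n")
  puts resolve_ref(git_dir, "HEAD")
  # prints cac0cab538b970a37ea1e769cbbde608743bc96d
end
```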
 364
 365 ### Tags ###
 366
 367 You’ve just gone over Git’s three main object types, but there is a fourth. The tag object is very much like a commit object: it contains a tagger, a date, a message, and a pointer. The main difference is that a tag object generally points to a commit rather than a tree. It’s like a branch reference, but it never moves: it always points to the same commit but gives it a friendlier name.
 368
 369 As discussed in Chapter 2, there are two types of tags: annotated and lightweight. You can make a lightweight tag by running something like this:
 370
 371         $ git update-ref refs/tags/v1.0 cac0cab538b970a37ea1e769cbbde608743bc96d
 372
 373 That is all a lightweight tag is — a branch that never moves. An annotated tag is more complex, however. If you create an annotated tag, Git creates a tag object and then writes a reference to point to it rather than directly to the commit. You can see this by creating an annotated tag (`-a` specifies that it’s an annotated tag):
 374
 375         $ git tag -a v1.1 1a410efbd13591db07496601ebc7a059dd55cfe9 -m 'test tag'
 376
 377 Here’s the object SHA-1 value it created:
 378
 379         $ cat .git/refs/tags/v1.1
 380         9585191f37f7b0fb9444f35a9bf50de191beadc2
 381
 382 Now, run the `cat-file` command on that SHA-1 value:
 383
 384         $ git cat-file -p 9585191f37f7b0fb9444f35a9bf50de191beadc2
 385         object 1a410efbd13591db07496601ebc7a059dd55cfe9
 386         type commit
 387         tag v1.1
 388         tagger Scott Chacon <schacon@gmail.com> Sat May 23 16:48:58 2009 -0700
 389
 390         test tag
 391
 392 Notice that the object entry points to the commit SHA-1 value that you tagged. Also notice that it doesn’t need to point to a commit; you can tag any Git object. In the Git source code, for example, the maintainer has added their GPG public key as a blob object and then tagged it. You can view the public key by running
 393
 394         $ git cat-file blob junio-gpg-pub
 395
 396 in the Git source code. The Linux kernel also has a non-commit-pointing tag object — the first tag created points to the initial tree of the import of the source code.
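Tag and commit objects share the same overall layout: header lines, a blank line, then a message. A small Ruby sketch can split the `cat-file -p` output shown earlier in this section into its parts. `parse_object_text` is a hypothetical helper, and it works on the pretty-printed text rather than on the raw object.

```ruby
# Split commit- or tag-style text into header pairs and a message.
# Headers are kept as an array of [key, value] pairs because some keys
# (such as "parent" in a merge commit) may appear more than once.
def parse_object_text(text)
  head, message = text.split("\n\n", 2)
  headers = head.lines.map { |line| line.chomp.split(" ", 2) }
  { headers: headers, message: message.to_s }
end

tag_text = <<~TAG
  object 1a410efbd13591db07496601ebc7a059dd55cfe9
  type commit
  tag v1.1
  tagger Scott Chacon <schacon@gmail.com> Sat May 23 16:48:58 2009 -0700

  test tag
TAG

parsed = parse_object_text(tag_text)
puts parsed[:headers].assoc("object")[1]   # the tagged object's SHA-1
puts parsed[:headers].assoc("type")[1]     # the tagged object's type
puts parsed[:message]
```

The `type` header is what allows a tag to point at something other than a commit, such as a blob or a tree.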
 397
 398 ### Remotes ###
 399
 400 The third type of reference that you’ll see is a remote reference. If you add a remote and push to it, Git stores the value you last pushed to that remote for each branch in the `refs/remotes` directory. For instance, you can add a remote called `origin` and push your `master` branch to it:
 401
 402         $ git remote add origin git@github.com:schacon/simplegit-progit.git
 403         $ git push origin master
 404         Counting objects: 11, done.
 405         Compressing objects: 100% (5/5), done.
 406         Writing objects: 100% (7/7), 716 bytes, done.
 407         Total 7 (delta 2), reused 4 (delta 1)
 408         To git@github.com:schacon/simplegit-progit.git
 409            a11bef0..ca82a6d  master -> master
 410
 411 Then, you can see what the `master` branch on the `origin` remote was the last time you communicated with the server, by checking the `refs/remotes/origin/master` file:
 412
 413         $ cat .git/refs/remotes/origin/master
 414         ca82a6dff817ec66f44342007202690a93763949
 415
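 416 Remote references differ from branches (`refs/heads` references) mainly in that they’re treated as read-only. You can `git checkout` one, but Git won’t point HEAD at it as a branch (you end up on a detached HEAD), so a `commit` never moves it. Git manages them as bookmarks to the last known state of where those branches were on those servers.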
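The three kinds of reference covered so far are distinguished purely by where they live under `refs/`. As a rough illustration, a hypothetical `ref_kind` helper in Ruby can classify a full reference name by its prefix:

```ruby
# Classify a full reference name by its location under refs/.
def ref_kind(ref)
  case ref
  when %r{\Arefs/heads/(.+)\z}           then [:branch, $1]
  when %r{\Arefs/tags/(.+)\z}            then [:tag, $1]
  when %r{\Arefs/remotes/([^/]+)/(.+)\z} then [:remote_branch, "#{$1}/#{$2}"]
  else [:other, ref]
  end
end

p ref_kind("refs/heads/master")            # prints [:branch, "master"]
p ref_kind("refs/tags/v1.0")               # prints [:tag, "v1.0"]
p ref_kind("refs/remotes/origin/master")   # prints [:remote_branch, "origin/master"]
```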
 417
 418 ## Packfiles ##
 419
 420 Let’s go back to the objects database for your test Git repository. At this point, you have 11 objects — 4 blobs, 3 trees, 3 commits, and 1 tag:
 421
 422         $ find .git/objects -type f
 423         .git/objects/01/55eb4229851634a0f03eb265b69f5a2d56f341 # tree 2
 424         .git/objects/1a/410efbd13591db07496601ebc7a059dd55cfe9 # commit 3
 425         .git/objects/1f/7a7a472abf3dd9643fd615f6da379c4acb3e3a # test.txt v2
 426         .git/objects/3c/4e9cd789d88d8d89c1073707c3585e41b0e614 # tree 3
 427         .git/objects/83/baae61804e65cc73a7201a7252750c76066a30 # test.txt v1
 428         .git/objects/95/85191f37f7b0fb9444f35a9bf50de191beadc2 # tag
 429         .git/objects/ca/c0cab538b970a37ea1e769cbbde608743bc96d # commit 2
 430         .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4 # 'test content'
 431         .git/objects/d8/329fc1cc938780ffdd9f94e0d364e0ea74f579 # tree 1
 432         .git/objects/fa/49b077972391ad58037050f2a75f74e3671e92 # new.txt
 433         .git/objects/fd/f4fc3344e67ab068f836878b6c4951e3b15f3d # commit 1
 434
 435 Git compresses the contents of these files with zlib, and you’re not storing much, so all these files collectively take up only 925 bytes. You’ll add some larger content to the repository to demonstrate an interesting feature of Git. Add the repo.rb file from the Grit library you worked with earlier — this is about a 12K source code file:
 436
 437         $ curl http://github.com/mojombo/grit/raw/master/lib/grit/repo.rb > repo.rb
 438         $ git add repo.rb
 439         $ git commit -m 'added repo.rb'
 440         [master 484a592] added repo.rb
 441          3 files changed, 459 insertions(+), 2 deletions(-)
 442          delete mode 100644 bak/test.txt
 443          create mode 100644 repo.rb
 444          rewrite test.txt (100%)
 445
 446 If you look at the resulting tree, you can see the SHA-1 value your repo.rb file got for the blob object:
 447
 448         $ git cat-file -p master^{tree}
 449         100644 blob fa49b077972391ad58037050f2a75f74e3671e92      new.txt
 450         100644 blob 9bc1dc421dcd51b4ac296e3e5b6e2a99cf44391e      repo.rb
 451         100644 blob e3f094f522629ae358806b17daf78246c27c007b      test.txt
 452
 453 You can then use `git cat-file` to see how big that object is:
 454
 455         $ git cat-file -s 9bc1dc421dcd51b4ac296e3e5b6e2a99cf44391e
 456         12898
 457
 458 Now, modify that file a little, and see what happens:
 459
 460         $ echo '# testing' >> repo.rb
 461         $ git commit -am 'modified repo a bit'
 462         [master ab1afef] modified repo a bit
 463          1 files changed, 1 insertions(+), 0 deletions(-)
 464
 465 Check the tree created by that commit, and you see something interesting:
 466
 467         $ git cat-file -p master^{tree}
 468         100644 blob fa49b077972391ad58037050f2a75f74e3671e92      new.txt
 469         100644 blob 05408d195263d853f09dca71d55116663690c27c      repo.rb
 470         100644 blob e3f094f522629ae358806b17daf78246c27c007b      test.txt
 471
 472 The blob is now a different blob, which means that although you added only a single line to the end of a 400-line file, Git stored that new content as a completely new object:
 473
 474         $ git cat-file -s 05408d195263d853f09dca71d55116663690c27c
 475         12908
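The effect can be reproduced without a repository. The Ruby sketch below uses a made-up 400-line stand-in for repo.rb (an assumption, since the real file is not reproduced here) and shows that appending ten bytes yields a different blob ID, while each version is still compressed and stored in full as a loose object. `blob_sha1` and `compare_versions` are hypothetical helpers.

```ruby
require 'digest/sha1'
require 'zlib'

def blob_sha1(content)
  Digest::SHA1.hexdigest("blob #{content.bytesize}\0" + content)
end

# Compare two versions of a file that differ only by an appended suffix.
def compare_versions(original, appended)
  modified = original + appended
  {
    distinct_ids: blob_sha1(original) != blob_sha1(modified),
    # total bytes if each version is zlib-compressed on its own
    loose_bytes: [original, modified].sum { |c| Zlib::Deflate.deflate(c).bytesize },
    # how many bytes actually changed between the versions
    changed_bytes: modified.bytesize - original.bytesize
  }
end

stand_in = (1..400).map { |i| "puts 'line #{i}'\n" }.join   # hypothetical file
p compare_versions(stand_in, "# testing\n")
```

The gap between `loose_bytes` and `changed_bytes` is the redundancy that storing whole objects leaves on disk.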
 476
 477 You have two nearly identical 12K objects on your disk. Wouldn’t it be nice if Git could store one of them in full but then the second object only as the delta between it and the first?
 478
 479 It turns out that it can. The initial format in which Git saves objects on disk is called a loose object format. However, occasionally Git packs up several of these objects into a single binary file called a packfile in order to save space and be more efficient. Git does this if you have too many loose objects around, if you run the `git gc` command manually, or if you push to a remote server. To see what happens, you can manually ask Git to pack up the objects by calling the `git gc` command:
 480
 481         $ git gc
 482         Counting objects: 17, done.
 483         Delta compression using 2 threads.
 484         Compressing objects: 100% (13/13), done.
 485         Writing objects: 100% (17/17), done.
 486         Total 17 (delta 1), reused 10 (delta 0)
 487
 488 If you look in your objects directory, you’ll find that most of your objects are gone, and a new pair of files has appeared:
 489
 490         $ find .git/objects -type f
 491         .git/objects/71/08f7ecb345ee9d0084193f147cdad4d2998293
 492         .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
 493         .git/objects/info/packs
 494         .git/objects/pack/pack-7a16e4488ae40c7d2bc56ea2bd43e25212a66c45.idx
 495         .git/objects/pack/pack-7a16e4488ae40c7d2bc56ea2bd43e25212a66c45.pack
 496
 497 The objects that remain are the blobs that aren’t pointed to by any commit — in this case, the "what is up, doc?" example and the "test content" example blobs you created earlier. Because you never added them to any commits, they’re considered dangling and aren’t packed up in your new packfile.
 498
 499 The other files are your new packfile and an index. The packfile is a single file containing the contents of all the objects that were removed from your filesystem. The index is a file that contains offsets into that packfile so you can quickly seek to a specific object. What is cool is that although the objects on disk before you ran the `gc` were collectively about 12K in size, the new packfile is only 6K. You’ve halved your disk usage by packing your objects.
 500
 501 How does Git do this? When Git packs objects, it looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next. You can look into the packfile and see what Git did to save space. The `git verify-pack` plumbing command allows you to see what was packed up:
 502
        $ git verify-pack -v \
          .git/objects/pack/pack-7a16e4488ae40c7d2bc56ea2bd43e25212a66c45.idx
        0155eb4229851634a0f03eb265b69f5a2d56f341 tree   71 76 5400
        05408d195263d853f09dca71d55116663690c27c blob   12908 3478 874
        09f01cea547666f58d6a8d809583841a7c6f0130 tree   106 107 5086
        1a410efbd13591db07496601ebc7a059dd55cfe9 commit 225 151 322
        1f7a7a472abf3dd9643fd615f6da379c4acb3e3a blob   10 19 5381
        3c4e9cd789d88d8d89c1073707c3585e41b0e614 tree   101 105 5211
        484a59275031909e19aadb7c92262719cfcdf19a commit 226 153 169
        83baae61804e65cc73a7201a7252750c76066a30 blob   10 19 5362
        9585191f37f7b0fb9444f35a9bf50de191beadc2 tag    136 127 5476
        9bc1dc421dcd51b4ac296e3e5b6e2a99cf44391e blob   7 18 5193 1 \
          05408d195263d853f09dca71d55116663690c27c
        ab1afef80fac8e34258ff41fc1b867c702daa24b commit 232 157 12
        cac0cab538b970a37ea1e769cbbde608743bc96d commit 226 154 473
        d8329fc1cc938780ffdd9f94e0d364e0ea74f579 tree   36 46 5316
        e3f094f522629ae358806b17daf78246c27c007b blob   1486 734 4352
        f8f51d7d8a1760462eca26eebafde32087499533 tree   106 107 749
        fa49b077972391ad58037050f2a75f74e3671e92 blob   9 18 856
        fdf4fc3344e67ab068f836878b6c4951e3b15f3d commit 177 122 627
        chain length = 1: 1 object
        pack-7a16e4488ae40c7d2bc56ea2bd43e25212a66c45.pack: ok
 525
Here, the `9bc1d` blob, which if you remember was the first version of your repo.rb file, is referencing the `05408` blob, which was the second version of the file. The third column in the output is the size of the object’s data (for a delta, the size of the delta itself), and the fourth column is the space the object occupies in the packfile. You can see that `05408` holds about 12K of content, whereas `9bc1d` needs only 7 bytes. What is also interesting is that the second version of the file is the one that is stored intact, whereas the original version is stored as a delta. This is because you’re most likely to need faster access to the most recent version of the file.
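
If you want to pick the deltified objects out of `git verify-pack -v` output programmatically, the column layout described above is easy to parse. The following is a minimal sketch in Python, not part of Git itself; the names `PackEntry` and `parse_verify_pack` are made up for this illustration, and it assumes each object is on a single line (without the `\` wrapping used in the listing above):

```python
from typing import List, NamedTuple, Optional


class PackEntry(NamedTuple):
    sha: str
    kind: str
    size: int                     # third column
    size_in_pack: int             # fourth column
    offset: int                   # fifth column
    depth: Optional[int] = None   # only present for deltified objects
    base: Optional[str] = None    # SHA of the object the delta is based on


def parse_verify_pack(text: str) -> List[PackEntry]:
    """Parse the object lines of `git verify-pack -v` output."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Object lines start with a 40-character SHA; skip summary lines
        # such as "chain length = 1: 1 object".
        if len(fields) < 5 or len(fields[0]) != 40:
            continue
        sha, kind, size, size_in_pack, offset = fields[:5]
        depth = int(fields[5]) if len(fields) > 5 else None
        base = fields[6] if len(fields) > 6 else None
        entries.append(PackEntry(sha, kind, int(size), int(size_in_pack),
                                 int(offset), depth, base))
    return entries


sample = """\
05408d195263d853f09dca71d55116663690c27c blob   12908 3478 874
9bc1dc421dcd51b4ac296e3e5b6e2a99cf44391e blob   7 18 5193 1 05408d195263d853f09dca71d55116663690c27c
chain length = 1: 1 object
"""
entries = parse_verify_pack(sample)
deltas = [entry for entry in entries if entry.base is not None]
```

With the two sample lines, `deltas` holds only the `9bc1d` entry, whose `base` is the `05408` blob.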
 527
 528 The really nice thing about this is that it can be repacked at any time. Git will occasionally repack your database automatically, always trying to save more space. You can also manually repack at any time by running `git gc` by hand.
 529
 530 ## The Refspec ##
 531
 532 Throughout this book, you’ve used simple mappings from remote branches to local references; but they can be more complex.
 533 Suppose you add a remote like this:
 534
 535         $ git remote add origin git@github.com:schacon/simplegit-progit.git
 536
 537 It adds a section to your `.git/config` file, specifying the name of the remote (`origin`), the URL of the remote repository, and the refspec for fetching:
 538
 539         [remote "origin"]
 540                url = git@github.com:schacon/simplegit-progit.git
 541                fetch = +refs/heads/*:refs/remotes/origin/*
 542
 543 The format of the refspec is an optional `+`, followed by `<src>:<dst>`, where `<src>` is the pattern for references on the remote side and `<dst>` is where those references will be written locally. The `+` tells Git to update the reference even if it isn’t a fast-forward.
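
To make the `<src>:<dst>` mapping concrete, here is a small Python sketch of how a refspec could be split and applied to a remote reference name. It is an illustration only: `Refspec`, `parse_refspec`, and `map_ref` are hypothetical names, and the sketch handles only full reference names and a trailing `/*` glob, whereas Git itself also expands short names such as `master`.

```python
from typing import NamedTuple, Optional


class Refspec(NamedTuple):
    force: bool   # True when the refspec starts with "+"
    src: str      # pattern on the remote side
    dst: str      # where matching references are written locally


def parse_refspec(spec: str) -> Refspec:
    """Split "[+]<src>:<dst>" into its three parts."""
    force = spec.startswith("+")
    if force:
        spec = spec[1:]
    src, _, dst = spec.partition(":")
    return Refspec(force, src, dst)


def map_ref(refspec: Refspec, remote_ref: str) -> Optional[str]:
    """Return the local name for remote_ref, or None if it does not match."""
    if refspec.src.endswith("/*") and refspec.dst.endswith("/*"):
        prefix = refspec.src[:-1]   # drop "*", keep the trailing slash
        if remote_ref.startswith(prefix):
            return refspec.dst[:-1] + remote_ref[len(prefix):]
        return None
    return refspec.dst if remote_ref == refspec.src else None


default = parse_refspec("+refs/heads/*:refs/remotes/origin/*")
local_name = map_ref(default, "refs/heads/master")
```

Here `local_name` is `refs/remotes/origin/master`, and a reference outside `refs/heads/` (a tag, for example) maps to `None`.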
 544
 545 In the default case that is automatically written by a `git remote add` command, Git fetches all the references under `refs/heads/` on the server and writes them to `refs/remotes/origin/` locally. So, if there is a `master` branch on the server, you can access the log of that branch locally via
 546
 547         $ git log origin/master
 548         $ git log remotes/origin/master
 549         $ git log refs/remotes/origin/master
 550
 551 They’re all equivalent, because Git expands each of them to `refs/remotes/origin/master`.
 552
 553 If you want Git instead to pull down only the `master` branch each time, and not every other branch on the remote server, you can change the fetch line to
 554
 555         fetch = +refs/heads/master:refs/remotes/origin/master
 556
 557 This is just the default refspec for `git fetch` for that remote. If you want to do something one time, you can specify the refspec on the command line, too. To pull the `master` branch on the remote down to `origin/mymaster` locally, you can run
 558
 559         $ git fetch origin master:refs/remotes/origin/mymaster
 560
 561 You can also specify multiple refspecs. On the command line, you can pull down several branches like so:
 562
 563         $ git fetch origin master:refs/remotes/origin/mymaster \
 564            topic:refs/remotes/origin/topic
 565         From git@github.com:schacon/simplegit
 566          ! [rejected]        master     -> origin/mymaster  (non fast forward)
 567          * [new branch]      topic      -> origin/topic
 568
In this case, the master branch pull was rejected because it wasn’t a fast-forward reference. You can override that by specifying the `+` in front of the refspec.
 570
 571 You can also specify multiple refspecs for fetching in your configuration file. If you want to always fetch the master and experiment branches, add two lines:
 572
 573         [remote "origin"]
 574                url = git@github.com:schacon/simplegit-progit.git
 575                fetch = +refs/heads/master:refs/remotes/origin/master
 576                fetch = +refs/heads/experiment:refs/remotes/origin/experiment
 577
 578 You can’t use partial globs in the pattern, so this would be invalid:
 579
 580         fetch = +refs/heads/qa*:refs/remotes/origin/qa*
 581
 582 However, you can use namespacing to accomplish something like that. If you have a QA team that pushes a series of branches, and you want to get the master branch and any of the QA team’s branches but nothing else, you can use a config section like this:
 583
 584         [remote "origin"]
 585                url = git@github.com:schacon/simplegit-progit.git
 586                fetch = +refs/heads/master:refs/remotes/origin/master
 587                fetch = +refs/heads/qa/*:refs/remotes/origin/qa/*
 588
 589 If you have a complex workflow process that has a QA team pushing branches, developers pushing branches, and integration teams pushing and collaborating on remote branches, you can namespace them easily this way.
 590
 591 ### Pushing Refspecs ###
 592
 593 It’s nice that you can fetch namespaced references that way, but how does the QA team get their branches into a `qa/` namespace in the first place? You accomplish that by using refspecs to push.
 594
 595 If the QA team wants to push their `master` branch to `qa/master` on the remote server, they can run
 596
 597         $ git push origin master:refs/heads/qa/master
 598
 599 If they want Git to do that automatically each time they run `git push origin`, they can add a `push` value to their config file:
 600
 601         [remote "origin"]
 602                url = git@github.com:schacon/simplegit-progit.git
 603                fetch = +refs/heads/*:refs/remotes/origin/*
 604                push = refs/heads/master:refs/heads/qa/master
 605
 606 Again, this will cause a `git push origin` to push the local `master` branch to the remote `qa/master` branch by default.
 607
 608 ### Deleting References ###
 609
 610 You can also use the refspec to delete references from the remote server by running something like this:
 611
 612         $ git push origin :topic
 613
 614 Because the refspec is `<src>:<dst>`, by leaving off the `<src>` part, this basically says to make the topic branch on the remote nothing, which deletes it.
 615
 616 ## Transfer Protocols ##
 617
 618 Git can transfer data between two repositories in two major ways: over HTTP and via the so-called smart protocols used in the `file://`, `ssh://`, and `git://` transports. This section will quickly cover how these two main protocols operate.
 619
 620 ### The Dumb Protocol ###
 621
 622 Git transport over HTTP is often referred to as the dumb protocol because it requires no Git-specific code on the server side during the transport process. The fetch process is a series of GET requests, where the client can assume the layout of the Git repository on the server. Let’s follow the `http-fetch` process for the simplegit library:
 623
 624         $ git clone http://github.com/schacon/simplegit-progit.git
 625
 626 The first thing this command does is pull down the `info/refs` file. This file is written by the `update-server-info` command, which is why you need to enable that as a `post-receive` hook in order for the HTTP transport to work properly:
 627
 628         => GET info/refs
 629         ca82a6dff817ec66f44342007202690a93763949     refs/heads/master
 630
 631 Now you have a list of the remote references and SHAs. Next, you look for what the HEAD reference is so you know what to check out when you’re finished:
 632
 633         => GET HEAD
 634         ref: refs/heads/master
 635
 636 You need to check out the `master` branch when you’ve completed the process.
 637 At this point, you’re ready to start the walking process. Because your starting point is the `ca82a6` commit object you saw in the `info/refs` file, you start by fetching that:
 638
 639         => GET objects/ca/82a6dff817ec66f44342007202690a93763949
 640         (179 bytes of binary data)
 641
 642 You get an object back — that object is in loose format on the server, and you fetched it over a static HTTP GET request. You can zlib-uncompress it, strip off the header, and look at the commit content:
 643
 644         $ git cat-file -p ca82a6dff817ec66f44342007202690a93763949
 645         tree cfda3bf379e4f8dba8717dee55aab78aef7f4daf
 646         parent 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
 647         author Scott Chacon <schacon@gmail.com> 1205815931 -0700
 648         committer Scott Chacon <schacon@gmail.com> 1240030591 -0700
 649
 650         changed the version number
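
The two steps just described (building the URL path from the SHA, then uncompressing the data and stripping the header) can be sketched in a few lines of Python. This is only an illustration of the loose object layout; `loose_object_path` and `decode_loose_object` are hypothetical helper names, and the sample object is built locally rather than fetched from a server.

```python
import zlib


def loose_object_path(sha: str) -> str:
    """Path of a loose object: first two hex digits form the directory."""
    return f"objects/{sha[:2]}/{sha[2:]}"


def decode_loose_object(raw: bytes):
    """Uncompress a loose object and split "<type> <size>\\0" from the body."""
    data = zlib.decompress(raw)
    header, _, body = data.partition(b"\0")
    kind, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return kind.decode("ascii"), body


# Build a stand-in for what the server would return, then decode it.
content = b"tree cfda3bf379e4f8dba8717dee55aab78aef7f4daf\n"
raw = zlib.compress(b"commit %d\0" % len(content) + content)
kind, body = decode_loose_object(raw)
path = loose_object_path("ca82a6dff817ec66f44342007202690a93763949")
```

The computed `path` matches the GET request shown earlier, and `kind` comes back as `commit`.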
 651
 652 Next, you have two more objects to retrieve — `cfda3b`, which is the tree of content that the commit we just retrieved points to; and `085bb3`, which is the parent commit:
 653
 654         => GET objects/08/5bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
 655         (179 bytes of data)
 656
 657 That gives you your next commit object. Grab the tree object:
 658
 659         => GET objects/cf/da3bf379e4f8dba8717dee55aab78aef7f4daf
 660         (404 - Not Found)
 661
 662 Oops — it looks like that tree object isn’t in loose format on the server, so you get a 404 response back. There are a couple of reasons for this — the object could be in an alternate repository, or it could be in a packfile in this repository. Git checks for any listed alternates first:
 663
 664         => GET objects/info/http-alternates
 665         (empty file)
 666
 667 If this comes back with a list of alternate URLs, Git checks for loose files and packfiles there — this is a nice mechanism for projects that are forks of one another to share objects on disk. However, because no alternates are listed in this case, your object must be in a packfile. To see what packfiles are available on this server, you need to get the `objects/info/packs` file, which contains a listing of them (also generated by `update-server-info`):
 668
 669         => GET objects/info/packs
 670         P pack-816a9b2334da9953e530f27bcac22082a9f5b835.pack
 671
 672 There is only one packfile on the server, so your object is obviously in there, but you’ll check the index file to make sure. This is also useful if you have multiple packfiles on the server, so you can see which packfile contains the object you need:
 673
 674         => GET objects/pack/pack-816a9b2334da9953e530f27bcac22082a9f5b835.idx
 675         (4k of binary data)
 676
 677 Now that you have the packfile index, you can see if your object is in it — because the index lists the SHAs of the objects contained in the packfile and the offsets to those objects. Your object is there, so go ahead and get the whole packfile:
 678
 679         => GET objects/pack/pack-816a9b2334da9953e530f27bcac22082a9f5b835.pack
 680         (13k of binary data)
 681
 682 You have your tree object, so you continue walking your commits. They’re all also within the packfile you just downloaded, so you don’t have to do any more requests to your server. Git checks out a working copy of the `master` branch that was pointed to by the HEAD reference you downloaded at the beginning.
 683
 684 The entire output of this process looks like this:
 685
 686         $ git clone http://github.com/schacon/simplegit-progit.git
 687         Initialized empty Git repository in /private/tmp/simplegit-progit/.git/
 688         got ca82a6dff817ec66f44342007202690a93763949
 689         walk ca82a6dff817ec66f44342007202690a93763949
 690         got 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
 691         Getting alternates list for http://github.com/schacon/simplegit-progit.git
 692         Getting pack list for http://github.com/schacon/simplegit-progit.git
 693         Getting index for pack 816a9b2334da9953e530f27bcac22082a9f5b835
 694         Getting pack 816a9b2334da9953e530f27bcac22082a9f5b835
 695          which contains cfda3bf379e4f8dba8717dee55aab78aef7f4daf
 696         walk 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
 697         walk a11bef06a3f659402fe7563abf99ad00de2209e6
 698
 699 ### The Smart Protocol ###
 700
 701 The HTTP method is simple but a bit inefficient. Using smart protocols is a more common method of transferring data. These protocols have a process on the remote end that is intelligent about Git — it can read local data and figure out what the client has or needs and generate custom data for it. There are two sets of processes for transferring data: a pair for uploading data and a pair for downloading data.
 702
 703 #### Uploading Data ####
 704
 705 To upload data to a remote process, Git uses the `send-pack` and `receive-pack` processes. The `send-pack` process runs on the client and connects to a `receive-pack` process on the remote side.
 706
 707 For example, say you run `git push origin master` in your project, and `origin` is defined as a URL that uses the SSH protocol. Git fires up the `send-pack` process, which initiates a connection over SSH to your server. It tries to run a command on the remote server via an SSH call that looks something like this:
 708
        $ ssh -x git@github.com "git-receive-pack 'schacon/simplegit-progit.git'"
        0059ca82a6dff817ec66f44342007202690a93763949 refs/heads/master report-status delete-refs
        003e085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7 refs/heads/topic
        0000

The `git-receive-pack` command immediately responds with one line for each reference it currently has: in this case, the `master` and `topic` branches and their SHAs. The first line also has a list of the server’s capabilities (here, `report-status` and `delete-refs`).

Each line starts with a 4-character hex value specifying how long the line is, counting those four characters and the trailing newline. Your first line starts with 0059, which is hexadecimal for 89, meaning the whole line is 89 bytes long. The next line starts with 003e, which is 62, so after the length prefix you read the remaining 58 bytes. The next line is 0000, meaning the server is done with its references listing.
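
This length-prefixed framing is known as pkt-line, and in Git's protocol the four hex digits count the whole line, themselves included. A minimal Python sketch of writing and reading such lines might look like this (`pkt_line`, `read_pkt_lines`, and `FLUSH` are names chosen for the example, not Git APIs):

```python
FLUSH = b"0000"   # special packet that ends a list


def pkt_line(payload: bytes) -> bytes:
    """Prefix payload with its total length (payload + 4) as four hex digits."""
    return b"%04x" % (len(payload) + 4) + payload


def read_pkt_lines(stream: bytes):
    """Yield each payload in stream, stopping at the first flush packet."""
    pos = 0
    while pos < len(stream):
        length = int(stream[pos:pos + 4], 16)
        if length == 0:          # "0000": end of the listing
            return
        yield stream[pos + 4:pos + length]
        pos += length


payload = b"085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7 refs/heads/topic\n"
line = pkt_line(payload)
decoded = list(read_pkt_lines(line + FLUSH))
```

The payload here is 58 bytes long, so the encoded line begins with `003e` (62), and decoding the stream returns the original payload.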
 717
 718 Now that it knows the server’s state, your `send-pack` process determines what commits it has that the server doesn’t. For each reference that this push will update, the `send-pack` process tells the `receive-pack` process that information. For instance, if you’re updating the `master` branch and adding an `experiment` branch, the `send-pack` response may look something like this:
 719
        0076ca82a6dff817ec66f44342007202690a93763949 15027957951b64cf874c3557a0f3547bd83b3ff6 refs/heads/master report-status
        006c0000000000000000000000000000000000000000 cdfdb42577e2506715f8cfeacdbabc092bf63e8d refs/heads/experiment
        0000
 723
 724 The SHA-1 value of all '0's means that nothing was there before — because you’re adding the experiment reference. If you were deleting a reference, you would see the opposite: all '0's on the right side.
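
The rule about all-zero SHAs can be expressed as a tiny classifier. The sketch below is illustrative Python, with `classify_update` and `ZERO_SHA` as made-up names; it takes the text of one update command after the 4-character length prefix has been removed.

```python
ZERO_SHA = "0" * 40


def classify_update(command: str):
    """Split an "<old> <new> <ref>" command and say what it does to the ref."""
    # Anything after the third field (such as capabilities) is ignored.
    old, new, ref = command.split()[:3]
    if old == ZERO_SHA:
        action = "create"     # nothing was there before
    elif new == ZERO_SHA:
        action = "delete"     # nothing will be there afterwards
    else:
        action = "update"
    return action, ref


commands = [
    "ca82a6dff817ec66f44342007202690a93763949 "
    "15027957951b64cf874c3557a0f3547bd83b3ff6 refs/heads/master report-status",
    ZERO_SHA + " cdfdb42577e2506715f8cfeacdbabc092bf63e8d refs/heads/experiment",
]
results = [classify_update(command) for command in commands]
```

For the two commands from the example, `results` is an update of `refs/heads/master` followed by a create of `refs/heads/experiment`.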
 725
 726 Git sends a line for each reference you’re updating with the old SHA, the new SHA, and the reference that is being updated. The first line also has the client’s capabilities. Next, the client uploads a packfile of all the objects the server doesn’t have yet. Finally, the server responds with a success (or failure) indication:
 727
        000eunpack ok
 729
 730 #### Downloading Data ####
 731
 732 When you download data, the `fetch-pack` and `upload-pack` processes are involved. The client initiates a `fetch-pack` process that connects to an `upload-pack` process on the remote side to negotiate what data will be transferred down.
 733
There are different ways to initiate the `upload-pack` process on the remote repository. You can run it via SSH in the same manner as the `receive-pack` process. You can also initiate the process via the Git daemon, which listens on a server on port 9418 by default. The `fetch-pack` process sends data that looks like this to the daemon after connecting:
 735
        0043git-upload-pack schacon/simplegit-progit.git\0host=myserver.com\0

It starts with 4 hex characters giving the total length of the request (those 4 characters included), then the command to run followed by a null byte, and then the server’s hostname followed by a final null byte. The Git daemon checks that the command can be run and that the repository exists and has public permissions. If everything is cool, it fires up the `upload-pack` process and hands off the request to it.
 739
 740 If you’re doing the fetch over SSH, `fetch-pack` instead runs something like this:
 741
 742         $ ssh -x git@github.com "git-upload-pack 'schacon/simplegit-progit.git'"
 743
 744 In either case, after `fetch-pack` connects, `upload-pack` sends back something like this:
 745
 746         0088ca82a6dff817ec66f44342007202690a93763949 HEAD\0multi_ack thin-pack \
 747           side-band side-band-64k ofs-delta shallow no-progress include-tag
 748         003fca82a6dff817ec66f44342007202690a93763949 refs/heads/master
 749         003e085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7 refs/heads/topic
 750         0000
 751
 752 This is very similar to what `receive-pack` responds with, but the capabilities are different. In addition, it sends back the HEAD reference so the client knows what to check out if this is a clone.
 753
At this point, the `fetch-pack` process looks at what objects it has and responds with the objects that it needs by sending "want" and then the SHA it wants. It sends all the objects it already has with "have" and then the SHA. At the end of this list, it writes "done" to prompt the `upload-pack` process to begin sending the packfile of the data it needs:

        003cwant ca82a6dff817ec66f44342007202690a93763949 ofs-delta
        0032have 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
        0000
        0009done
 760
 761 That is a very basic case of the transfer protocols. In more complex cases, the client supports `multi_ack` or `side-band` capabilities; but this example shows you the basic back and forth used by the smart protocol processes.
 762
 763 ## Maintenance and Data Recovery ##
 764
 765 Occasionally, you may have to do some cleanup — make a repository more compact, clean up an imported repository, or recover lost work. This section will cover some of these scenarios.
 766
 767 ### Maintenance ###
 768
 769 Occasionally, Git automatically runs a command called "auto gc". Most of the time, this command does nothing. However, if there are too many loose objects (objects not in a packfile) or too many packfiles, Git launches a full-fledged `git gc` command. The `gc` stands for garbage collect, and the command does a number of things: it gathers up all the loose objects and places them in packfiles, it consolidates packfiles into one big packfile, and it removes objects that aren’t reachable from any commit and are a few months old.
 770
 771 You can run auto gc manually as follows:
 772
 773         $ git gc --auto
 774
 775 Again, this generally does nothing. You must have around 7,000 loose objects or more than 50 packfiles for Git to fire up a real gc command. You can modify these limits with the `gc.auto` and `gc.autopacklimit` config settings, respectively.
 776
 777 The other thing `gc` will do is pack up your references into a single file. Suppose your repository contains the following branches and tags:
 778
 779         $ find .git/refs -type f
 780         .git/refs/heads/experiment
 781         .git/refs/heads/master
 782         .git/refs/tags/v1.0
 783         .git/refs/tags/v1.1
 784
 785 If you run `git gc`, you’ll no longer have these files in the `refs` directory. Git will move them for the sake of efficiency into a file named `.git/packed-refs` that looks like this:
 786
 787         $ cat .git/packed-refs
 788         # pack-refs with: peeled
 789         cac0cab538b970a37ea1e769cbbde608743bc96d refs/heads/experiment
 790         ab1afef80fac8e34258ff41fc1b867c702daa24b refs/heads/master
 791         cac0cab538b970a37ea1e769cbbde608743bc96d refs/tags/v1.0
 792         9585191f37f7b0fb9444f35a9bf50de191beadc2 refs/tags/v1.1
 793         ^1a410efbd13591db07496601ebc7a059dd55cfe9
 794
If you update a reference, Git doesn’t edit this file but instead writes a new file to `refs/heads`. To get the appropriate SHA for a given reference, Git checks for that reference in the `refs` directory and then checks the `packed-refs` file as a fallback. So, if you can’t find a reference in the `refs` directory, it’s probably in your `packed-refs` file.
 796
 797 Notice the last line of the file, which begins with a `^`. This means the tag directly above is an annotated tag and that line is the commit that the annotated tag points to.
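
As an illustration of this layout, here is a rough Python sketch that reads a `packed-refs` file and applies the lookup order described above (loose files first, packed entries as the fallback). The function names are hypothetical, and the loose references are passed in as a plain dictionary rather than read from disk.

```python
from typing import Dict, Optional, Tuple


def parse_packed_refs(text: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (refs, peeled): ref name -> SHA, and tag name -> commit SHA."""
    refs: Dict[str, str] = {}
    peeled: Dict[str, str] = {}
    last_ref = None
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue                      # header comment or blank line
        if line.startswith("^"):
            if last_ref is not None:      # applies to the entry directly above
                peeled[last_ref] = line[1:]
            continue
        sha, name = line.split(" ", 1)
        refs[name] = sha
        last_ref = name
    return refs, peeled


def resolve(name: str, loose: Dict[str, str],
            packed: Dict[str, str]) -> Optional[str]:
    """A loose ref wins; packed-refs is consulted only as a fallback."""
    return loose.get(name, packed.get(name))


packed_text = """\
# pack-refs with: peeled
cac0cab538b970a37ea1e769cbbde608743bc96d refs/heads/experiment
ab1afef80fac8e34258ff41fc1b867c702daa24b refs/heads/master
cac0cab538b970a37ea1e769cbbde608743bc96d refs/tags/v1.0
9585191f37f7b0fb9444f35a9bf50de191beadc2 refs/tags/v1.1
^1a410efbd13591db07496601ebc7a059dd55cfe9
"""
refs, peeled = parse_packed_refs(packed_text)
```

With the sample file, `peeled` contains a single entry mapping `refs/tags/v1.1` to the `1a410e` commit.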
 798
 799 ### Data Recovery ###
 800
 801 At some point in your Git journey, you may accidentally lose a commit. Generally, this happens because you force-delete a branch that had work on it, and it turns out you wanted the branch after all; or you hard-reset a branch, thus abandoning commits that you wanted something from. Assuming this happens, how can you get your commits back?
 802
 803 Here’s an example that hard-resets the master branch in your test repository to an older commit and then recovers the lost commits. First, let’s review where your repository is at this point:
 804
 805         $ git log --pretty=oneline
 806         ab1afef80fac8e34258ff41fc1b867c702daa24b modified repo a bit
 807         484a59275031909e19aadb7c92262719cfcdf19a added repo.rb
 808         1a410efbd13591db07496601ebc7a059dd55cfe9 third commit
 809         cac0cab538b970a37ea1e769cbbde608743bc96d second commit
 810         fdf4fc3344e67ab068f836878b6c4951e3b15f3d first commit
 811
 812 Now, move the `master` branch back to the middle commit:
 813
 814         $ git reset --hard 1a410efbd13591db07496601ebc7a059dd55cfe9
 815         HEAD is now at 1a410ef third commit
 816         $ git log --pretty=oneline
 817         1a410efbd13591db07496601ebc7a059dd55cfe9 third commit
 818         cac0cab538b970a37ea1e769cbbde608743bc96d second commit
 819         fdf4fc3344e67ab068f836878b6c4951e3b15f3d first commit
 820
 821 You’ve effectively lost the top two commits — you have no branch from which those commits are reachable. You need to find the latest commit SHA and then add a branch that points to it. The trick is finding that latest commit SHA — it’s not like you’ve memorized it, right?
 822
Often, the quickest way is to use a tool called `git reflog`. As you’re working, Git silently records what your HEAD is every time you change it. Each time you commit or change branches, the reflog is updated. The reflog is also updated by the `git update-ref` command, which is another reason to use it instead of just writing the SHA value to your ref files, as we covered in the "Git References" section earlier in this chapter. You can see where you’ve been at any time by running `git reflog`:
 824
 825         $ git reflog
 826         1a410ef HEAD@{0}: 1a410efbd13591db07496601ebc7a059dd55cfe9: updating HEAD
 827         ab1afef HEAD@{1}: ab1afef80fac8e34258ff41fc1b867c702daa24b: updating HEAD
 828
Here you can see the two commits that you have had checked out, but there is not much information. To see the same information in a much more useful way, you can run `git log -g`, which gives you normal log output for your reflog:
 830
 831         $ git log -g
 832         commit 1a410efbd13591db07496601ebc7a059dd55cfe9
 833         Reflog: HEAD@{0} (Scott Chacon <schacon@gmail.com>)
 834         Reflog message: updating HEAD
 835         Author: Scott Chacon <schacon@gmail.com>
 836         Date:   Fri May 22 18:22:37 2009 -0700
 837
 838             third commit
 839
 840         commit ab1afef80fac8e34258ff41fc1b867c702daa24b
 841         Reflog: HEAD@{1} (Scott Chacon <schacon@gmail.com>)
 842         Reflog message: updating HEAD
 843         Author: Scott Chacon <schacon@gmail.com>
 844         Date:   Fri May 22 18:15:24 2009 -0700
 845
 846              modified repo a bit
 847
 848 It looks like the bottom commit is the one you lost, so you can recover it by creating a new branch at that commit. For example, you can start a branch named `recover-branch` at that commit (ab1afef):
 849
 850         $ git branch recover-branch ab1afef
 851         $ git log --pretty=oneline recover-branch
 852         ab1afef80fac8e34258ff41fc1b867c702daa24b modified repo a bit
 853         484a59275031909e19aadb7c92262719cfcdf19a added repo.rb
 854         1a410efbd13591db07496601ebc7a059dd55cfe9 third commit
 855         cac0cab538b970a37ea1e769cbbde608743bc96d second commit
 856         fdf4fc3344e67ab068f836878b6c4951e3b15f3d first commit
 857
Cool. Now you have a branch named `recover-branch` that is where your `master` branch used to be, making the top two commits reachable again.
Next, suppose your loss was for some reason not in the reflog. You can simulate that by removing `recover-branch` and deleting the reflog. Now the top two commits aren’t reachable by anything:
 860
        $ git branch -D recover-branch
        $ rm -Rf .git/logs/
 863
 864 Because the reflog data is kept in the `.git/logs/` directory, you effectively have no reflog. How can you recover that commit at this point? One way is to use the `git fsck` utility, which checks your database for integrity. If you run it with the `--full` option, it shows you all objects that aren’t pointed to by another object:
 865
 866         $ git fsck --full
 867         dangling blob d670460b4b4aece5915caf5c68d12f560a9fe3e4
 868         dangling commit ab1afef80fac8e34258ff41fc1b867c702daa24b
 869         dangling tree aea790b9a58f6cf6f2804eeac9f0abbe9631e4c9
 870         dangling blob 7108f7ecb345ee9d0084193f147cdad4d2998293
 871
In this case, you can see your missing commit after the string `dangling commit`. You can recover it the same way, by adding a branch that points to that SHA.
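
If the `fsck` output is long, you may prefer to filter it with a script. The following Python sketch (with a made-up function name, `dangling_commits`) pulls out only the SHAs reported as dangling commits; it assumes the three-field line format shown in the output above.

```python
from typing import List


def dangling_commits(fsck_output: str) -> List[str]:
    """Return the SHAs from lines of the form "dangling commit <sha>"."""
    shas = []
    for line in fsck_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[:2] == ["dangling", "commit"]:
            shas.append(parts[2])
    return shas


fsck_output = """\
dangling blob d670460b4b4aece5915caf5c68d12f560a9fe3e4
dangling commit ab1afef80fac8e34258ff41fc1b867c702daa24b
dangling tree aea790b9a58f6cf6f2804eeac9f0abbe9631e4c9
dangling blob 7108f7ecb345ee9d0084193f147cdad4d2998293
"""
candidates = dangling_commits(fsck_output)
```

For this output, `candidates` contains just the `ab1afef` commit, which is the one to point a new branch at.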
 873
 874 ### Removing Objects ###
 875
 876 There are a lot of great things about Git, but one feature that can cause issues is the fact that a `git clone` downloads the entire history of the project, including every version of every file. This is fine if the whole thing is source code, because Git is highly optimized to compress that data efficiently. However, if someone at any point in the history of your project added a single huge file, every clone for all time will be forced to download that large file, even if it was removed from the project in the very next commit. Because it’s reachable from the history, it will always be there.
 877
 878 This can be a huge problem when you’re converting Subversion or Perforce repositories into Git. Because you don’t download the whole history in those systems, this type of addition carries few consequences. If you did an import from another system or otherwise find that your repository is much larger than it should be, here is how you can find and remove large objects.
 879
 880 Be warned: this technique is destructive to your commit history. It rewrites every commit object downstream from the earliest tree you have to modify to remove a large file reference. If you do this immediately after an import, before anyone has started to base work on the commit, you’re fine — otherwise, you have to notify all contributors that they must rebase their work onto your new commits.
 881
 882 To demonstrate, you’ll add a large file into your test repository, remove it in the next commit, find it, and remove it permanently from the repository. First, add a large object to your history:
 883
 884         $ curl http://kernel.org/pub/software/scm/git/git-1.6.3.1.tar.bz2 > git.tbz2
 885         $ git add git.tbz2
 886         $ git commit -am 'added git tarball'
 887         [master 6df7640] added git tarball
 888          1 files changed, 0 insertions(+), 0 deletions(-)
 889          create mode 100644 git.tbz2
 890
 891 Oops — you didn’t want to add a huge tarball to your project. Better get rid of it:
 892
 893         $ git rm git.tbz2
 894         rm 'git.tbz2'
 895         $ git commit -m 'oops - removed large tarball'
 896         [master da3f30d] oops - removed large tarball
 897          1 files changed, 0 insertions(+), 0 deletions(-)
 898          delete mode 100644 git.tbz2
 899
 900 Now, `gc` your database and see how much space you’re using:
 901
 902         $ git gc
 903         Counting objects: 21, done.
 904         Delta compression using 2 threads.
 905         Compressing objects: 100% (16/16), done.
 906         Writing objects: 100% (21/21), done.
 907         Total 21 (delta 3), reused 15 (delta 1)
 908
 909 You can run the `count-objects` command to quickly see how much space you’re using:
 910
 911         $ git count-objects -v
 912         count: 4
 913         size: 16
 914         in-pack: 21
 915         packs: 1
 916         size-pack: 2016
 917         prune-packable: 0
 918         garbage: 0
 919
 920 The `size-pack` entry is the size of your packfiles in kilobytes, so you’re using 2MB. Before you added the tarball, you were using closer to 2K; clearly, removing the file in a later commit didn’t remove it from your history. Every time anyone clones this repository, they will have to clone all 2MB just to get this tiny project, because you accidentally added a big file. Let’s get rid of it.
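
As an aside, if you want to track this figure from a script, the `size-pack` field can be read straight out of the `count-objects` output. The following is a minimal sketch; the function name `pack_size_kib` is made up for illustration and is not part of Git:

```shell
# Hypothetical helper: print the combined size of all packfiles in KiB
# by reading the size-pack field of `git count-objects -v`.
pack_size_kib() {
    git count-objects -v | awk '$1 == "size-pack:" { print $2 }'
}
```

Run inside a repository, `pack_size_kib` prints a single number, which you could compare before and after a cleanup.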
 921
 922 First you have to find it. In this case, you already know what file it is. But suppose you didn’t; how would you identify what file or files were taking up so much space? Once you have run `git gc`, all the objects are in a packfile; you can identify the big objects by running another plumbing command called `git verify-pack` and sorting on the third field in the output, which is the object size. You can also pipe it through the `tail` command because you’re only interested in the few largest objects:
 923
 924         $ git verify-pack -v .git/objects/pack/pack-3f8c0...bb.idx | sort -k 3 -n | tail -3
 925         e3f094f522629ae358806b17daf78246c27c007b blob   1486 734 4667
 926         05408d195263d853f09dca71d55116663690c27c blob   12908 3478 1189
 927         7a9eb2fba2b1811321254ac360970fc169ba2330 blob   2056716 2056872 5401
 928
 929 The big object is at the bottom: 2MB. To find out what file it is, you’ll use the `rev-list` command, which you used briefly in Chapter 7. If you pass `--objects` to `rev-list`, it lists all the commit SHAs, followed by the tree and blob SHAs with the file paths associated with them. You can use this to find your blob’s name:
 930
 931         $ git rev-list --objects --all | grep 7a9eb2fb
 932         7a9eb2fba2b1811321254ac360970fc169ba2330 git.tbz2
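
The two lookups above (find the big object, then find its path) can also be combined into one pipeline. This is only a sketch of one possible approach, not the method used in this chapter: it feeds the `rev-list --objects` output to `git cat-file --batch-check`, which reports each object's type and uncompressed size and passes the path through as `%(rest)`. The function name `largest_blobs` is hypothetical:

```shell
# Hypothetical helper: list the N largest blobs reachable from any ref
# as "<sha> <size-in-bytes> <path>", largest last (default N = 3).
largest_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        grep '^blob ' |       # keep blobs, drop commits and trees
        cut -d' ' -f2- |      # strip the leading object type
        sort -k2,2n |         # sort numerically on the size column
        tail -n "${1:-3}"
}
```

Unlike `verify-pack`, this does not need the name of a particular packfile, and it works whether or not the objects have been packed yet.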
 933
 934 Now, you need to remove this file from all trees in your past. You can easily see what commits modified this file:
 935
 936         $ git log --pretty=oneline -- git.tbz2
 937         da3f30d019005479c99eb4c3406225613985a1db oops - removed large tarball
 938         6df764092f3e7c8f5f94cbe08ee5cf42e92a0289 added git tarball
 939
 940 You must rewrite all the commits downstream from `6df76` to fully remove this file from your Git history. To do so, you use `filter-branch`, which you used in Chapter 6:
 941
 942         $ git filter-branch --index-filter \
 943            'git rm --cached --ignore-unmatch git.tbz2' -- 6df7640^..
 944         Rewrite 6df764092f3e7c8f5f94cbe08ee5cf42e92a0289 (1/2)rm 'git.tbz2'
 945         Rewrite da3f30d019005479c99eb4c3406225613985a1db (2/2)
 946         Ref 'refs/heads/master' was rewritten
 947
 948 The `--index-filter` option is similar to the `--tree-filter` option used in Chapter 6, except that instead of passing a command that modifies files checked out on disk, you’re modifying your staging area or index each time. Rather than remove a specific file with something like `rm file`, you have to remove it with `git rm --cached` — you must remove it from the index, not from disk. The reason to do it this way is speed — because Git doesn’t have to check out each revision to disk before running your filter, the process can be much, much faster. You can accomplish the same task with `--tree-filter` if you want. The `--ignore-unmatch` option to `git rm` tells it not to error out if the pattern you’re trying to remove isn’t there. Finally, you ask `filter-branch` to rewrite your history only from the `6df7640` commit up, because you know that is where this problem started. Otherwise, it will start from the beginning and will unnecessarily take longer.
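
For comparison, here is a sketch of the slower `--tree-filter` form of the same rewrite. The wrapper function `purge_with_tree_filter` and the `TARGET_PATH` variable are hypothetical names used only for this illustration; the filter itself is a plain `rm -f`, run in a temporary checkout of each commit:

```shell
# Hypothetical wrapper around the --tree-filter variant.
#   $1: path to drop from every rewritten commit
#   $2: revision range to rewrite, e.g. 6df7640^..
purge_with_tree_filter() {
    # TARGET_PATH is passed through the environment so the filter can see it.
    # `rm -f` succeeds even in commits where the file does not exist.
    TARGET_PATH="$1" git filter-branch --tree-filter 'rm -f "$TARGET_PATH"' -- "$2"
}
```

For the example in this section, the call would be `purge_with_tree_filter git.tbz2 '6df7640^..'`. Because every commit is checked out to disk before the filter runs, expect this to be noticeably slower than `--index-filter` on a long history.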
 949
 950 Your history no longer contains a reference to that file. However, your reflog and a new set of refs that Git added when you did the `filter-branch` under `.git/refs/original` still do, so you have to remove them and then repack the database. You need to get rid of anything that has a pointer to those old commits before you repack:
 951
 952         $ rm -Rf .git/refs/original
 953         $ rm -Rf .git/logs/
 954         $ git gc
 955         Counting objects: 19, done.
 956         Delta compression using 2 threads.
 957         Compressing objects: 100% (14/14), done.
 958         Writing objects: 100% (19/19), done.
 959         Total 19 (delta 3), reused 16 (delta 1)
 960
 961 Let’s see how much space you saved.
 962
 963         $ git count-objects -v
 964         count: 8
 965         size: 2040
 966         in-pack: 19
 967         packs: 1
 968         size-pack: 7
 969         prune-packable: 0
 970         garbage: 0
 971
 972 The packed repository size is down to 7K, which is much better than 2MB. You can see from the `size` value that the big object is still in your loose objects, so it’s not gone; but it won’t be transferred on a push or subsequent clone, which is what is important. If you really wanted to, you could remove the object completely by running `git prune --expire now` (the `--expire` option requires a time argument).
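
If you would rather not delete files under `.git` by hand, the same cleanup can be sketched with Git commands alone. This is an alternative under stated assumptions, not the procedure used above: it deletes the backup refs with `update-ref`, empties the reflogs with `reflog expire`, and asks `gc` to prune unreachable objects with no grace period. The function names `purge_unreachable` and `object_exists` are hypothetical:

```shell
# Hypothetical helper: drop every remaining pointer to the pre-rewrite
# commits, then delete unreachable objects immediately.
purge_unreachable() {
    # Delete the backup refs that filter-branch left under refs/original/.
    git for-each-ref --format='%(refname)' refs/original/ |
        while read -r ref; do
            git update-ref -d "$ref"
        done
    # Expire all reflog entries instead of removing .git/logs by hand.
    git reflog expire --expire=now --all
    # Repack and prune unreachable objects without the usual grace period.
    git gc --quiet --prune=now
}

# Hypothetical helper: succeeds only if the object is still in the database.
object_exists() {
    git cat-file -e "$1" 2>/dev/null
}
```

After `purge_unreachable`, a check such as `object_exists 7a9eb2fb` should fail for the tarball blob in this example. Like the manual cleanup, this is irreversible, so run it only once you are sure nothing else needs the old commits.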
 973
 974 ## Summary ##
 975
 976 You should have a pretty good understanding of what Git does in the background and, to some degree, how it’s implemented. This chapter has covered a number of plumbing commands: commands that are lower level and simpler than the porcelain commands you’ve learned about in the rest of the book. Understanding how Git works at a lower level should make it easier to understand why it’s doing what it’s doing, and also to write your own tools and helper scripts that make your specific workflow work for you.
 977
 978 Git as a content-addressable filesystem is a very powerful tool that you can easily use as more than just a VCS. I hope you can use your newfound knowledge of Git internals to implement your own cool application of this technology and feel more comfortable using Git in more advanced ways.