== Secrets Revealed ==

We take a peek under the hood and explain how Git performs its miracles. I will skimp on details. For in-depth descriptions refer to http://www.kernel.org/pub/software/scm/git/docs/user-manual.html[the user manual].

=== Invisibility ===

How can Git be so unobtrusive? Aside from occasional commits and merges, you can work as if you were unaware that version control exists. That is, until you need it, and that's when you're glad Git was watching over you the whole time.

Other version control systems don't let you forget about them. Files may be read-only unless you explicitly tell the server which ones you intend to edit. The central server might be keeping track of who's checked out which code, and when. When the network goes down, you'll soon suffer. Developers constantly struggle with virtual red tape and bureaucracy.

The secret is the `.git` directory in your working directory. Git keeps the history of your project here. The initial "." stops it from showing up in plain `ls` listings. Except when you're pushing and pulling changes, all version control operations take place within this directory.

You have total control over the fate of your files because Git doesn't care what you do to them. Git can easily recreate a saved state from `.git` at any time.

=== Integrity ===

Most people associate cryptography with keeping information secret, but another equally important goal is keeping information safe. Proper use of cryptographic hash functions can detect accidental or malicious data corruption.

A SHA1 hash can be thought of as a unique 160-bit ID number for every string of bytes you'll encounter in your life. Actually more than that: every string of bytes that any human will ever use over many lifetimes.

As a SHA1 hash is itself a string of bytes, we can hash strings of bytes containing other hashes. This simple observation is surprisingly useful: look up 'hash chains'. We'll later see how Git uses it to efficiently guarantee data integrity.
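To make the idea concrete, here is a minimal Python sketch of a hash chain. The `chain` function and the sample records are invented for illustration and are not how Git lays out its data; they only show that each ID also vouches for everything hashed before it.

```python
import hashlib

def chain(records):
    """Return one SHA1 hex digest per record; each digest also covers its predecessor."""
    previous = b""
    digests = []
    for record in records:
        # Hash the previous digest together with the new record.
        previous = hashlib.sha1(previous + record).digest()
        digests.append(previous.hex())
    return digests

original = chain([b"first", b"second", b"third"])
tampered = chain([b"first", b"SECOND", b"third"])

# Compare the two chains link by link.
print([a == b for a, b in zip(original, tampered)])
```

The first link still matches, but every link from the altered record onward differs, so checking only the last ID is enough to notice a change anywhere in the sequence.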

Briefly, Git keeps your data in the `.git/objects` subdirectory, where instead of normal filenames, you'll find only IDs. By using IDs as filenames, as well as a few lockfiles and timestamping tricks, Git transforms any humble filesystem into an efficient and robust database.

=== Intelligence ===

How does Git know you renamed a file, even though you never mentioned the fact explicitly? Sure, you may have run *git mv*, but that is exactly the same as a *git rm* followed by a *git add*.

Git heuristically ferrets out renames and copies between successive versions. In fact, it can detect chunks of code being moved or copied around between files! Though it cannot cover all cases, it does a decent job, and this feature is always improving. If it fails to work for you, try options enabling more expensive copy detection (such as `-C` or `\--find-copies-harder` with *git diff* and *git log*), and consider upgrading.
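The flavour of such a heuristic can be sketched in Python. This is an illustration under my own assumptions, not Git's actual algorithm: the `detect_renames` function, the 0.5 threshold and the use of `difflib` as a similarity score are all made up for the example. Identical content is paired first by hash, and the leftovers are paired by similarity.

```python
import difflib
import hashlib

def detect_renames(old, new, threshold=0.5):
    """Pair each vanished path with the most similar newly added path.

    old and new map path -> bytes for two successive snapshots.
    Returns {old_path: (new_path, similarity)}.
    """
    removed = {p: d for p, d in old.items() if p not in new}
    added = {p: d for p, d in new.items() if p not in old}
    renames = {}

    # Pass 1: identical content has an identical hash, so pure renames are cheap.
    by_hash = {}
    for path, data in added.items():
        by_hash.setdefault(hashlib.sha1(data).hexdigest(), []).append(path)
    for path, data in list(removed.items()):
        candidates = by_hash.get(hashlib.sha1(data).hexdigest())
        if candidates:
            target = candidates.pop(0)
            renames[path] = (target, 1.0)
            del removed[path]
            del added[target]

    # Pass 2: score what is left pairwise and keep the best match above the threshold.
    for path, data in removed.items():
        best = None
        for target, other in added.items():
            score = difflib.SequenceMatcher(None, data, other).ratio()
            if score >= threshold and (best is None or score > best[1]):
                best = (target, score)
        if best:
            renames[path] = best
            del added[best[0]]
    return renames

before = {"a.txt": b"line one\nline two\nline three\n", "keep.txt": b"same\n"}
after = {"b.txt": b"line one\nline two\nline 3\n", "keep.txt": b"same\n"}
print(detect_renames(before, after))
```

The second pass compares every leftover pair, which is why the more thorough options cost more time on large trees.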

=== Indexing ===

For every tracked file, Git records information such as its size, last status-change time (ctime) and last modification time (mtime) in a file known as the 'index'. To determine whether a file has changed, Git compares its current stats with those held in the index. If they match, then Git can skip reading the file again.

Since stat calls are considerably faster than file reads, if you only edit a
few files, Git can update its state in almost no time.
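A rough Python sketch of the same trick follows. The `snapshot` and `changed_paths` helpers and the dictionary standing in for the index are hypothetical; the real index is a binary file holding more fields than these three.

```python
import os
import tempfile

def snapshot(path):
    """Record cheap stat fields that stand in for the file's content."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns, st.st_ctime_ns)

def changed_paths(index):
    """Return paths whose current stats differ from the cached ones, without reading any file."""
    return [path for path, cached in index.items() if snapshot(path) != cached]

with tempfile.TemporaryDirectory() as workdir:
    paths = [os.path.join(workdir, name) for name in ("a.txt", "b.txt")]
    for path in paths:
        with open(path, "w") as f:
            f.write("hello\n")

    index = {path: snapshot(path) for path in paths}  # toy stand-in for the index
    unchanged = changed_paths(index)

    with open(paths[0], "a") as f:  # edit one file; its size now differs
        f.write("more\n")
    modified = [os.path.basename(p) for p in changed_paths(index)]

print(unchanged, modified)
```

Only files whose stats differ need to be read and hashed again; everything else is skipped.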

=== Bare Repositories ===

You may have been wondering what format those online Git repositories use.
They're plain Git repositories, just like your `.git` directory, except they've got names like `proj.git`, and they have no working directory associated with them.

Most Git commands expect the Git index to live in `.git`, and will fail on these bare repositories. Fix this by setting the `GIT_DIR` environment variable to the path of the bare repository, or running Git within the directory itself with the `\--bare` option.

=== Git's Origins ===

This http://lkml.org/lkml/2005/4/6/121[Linux Kernel Mailing List post] describes the chain of events that led to Git. The entire thread is a fascinating archaeological site for Git historians.

=== The Object Database ===

Are you a systems programmer? Then here's how to write a Git-like
system from scratch in a few hours.

==== Blobs ====

First, a magic trick. Pick a filename, any filename. In an empty directory:

 $ echo sweet > YOUR_FILENAME
 $ git init
 $ git add .
 $ find .git/objects -type f

You'll see +.git/objects/aa/823728ea7d592acc69b36875a482cdf3fd5c8d+.

How do I know this despite not knowing the filename you chose? It's because the
SHA1 hash of:

 "blob" SP "6" NUL "sweet" LF

is aa823728ea7d592acc69b36875a482cdf3fd5c8d,
where SP is a space, NUL is a zero byte and LF is a linefeed. You can verify
this by typing:

 $ echo "blob 6"$'\001'"sweet" | tr '\001' '\000' | sha1sum

This is written with the bash shell in mind; other shells may be able to handle
NUL on the command line, obviating the need for the *tr* workaround.
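The same check can be sketched in Python, where a NUL byte needs no workaround. The `git_object_id` helper is a name I made up; it simply hashes the type, a space, the length, a NUL and the contents, as laid out above.

```python
import hashlib

def git_object_id(kind, body):
    """SHA1 of '<kind> <length>' NUL <body>."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

blob_id = git_object_id("blob", b"sweet\n")
print(blob_id)
```

If the layout is as described, the printed ID should be the one quoted above.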

Git is 'content-addressable': files are not stored according to their filename,
but rather by the hash of the data they contain, in a file we call a 'blob
object'. We can think of the hash as a unique ID for a file's contents, so
in a sense we are addressing files by their content.

The initial "blob 6" is just a header denoting the type of the
object and the length of its contents in bytes, to simplify internal
bookkeeping. This is how I knew what you would see. The file's name is
irrelevant: only the data inside is used to construct the blob object.

You may be wondering: what happens with identical files? Try adding copies of
your file, with any filenames whatsoever. The contents of +.git/objects+ stay
the same no matter how many copies you add. Git only stores the data once.

By the way, the files within +.git/objects+ are compressed with zlib so you
should not stare at them directly. Filter them through
http://www.zlib.net/zpipe.c[zpipe -d], or type:

 $ git cat-file -p aa823728ea7d592acc69b36875a482cdf3fd5c8d

which pretty-prints the given object.
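Here is a toy version of such a store in Python, as a sketch rather than Git's real code. The helper names are mine, and a temporary directory stands in for +.git/objects+. It covers loose objects only and ignores the packing tricks mentioned later.

```python
import hashlib
import os
import tempfile
import zlib

def object_path(objects_dir, object_id):
    """First two hex digits name the subdirectory, the rest name the file."""
    return os.path.join(objects_dir, object_id[:2], object_id[2:])

def write_object(objects_dir, kind, body):
    """Store header + body under its own SHA1; do nothing if it is already there."""
    raw = f"{kind} {len(body)}".encode("ascii") + b"\x00" + body
    object_id = hashlib.sha1(raw).hexdigest()
    path = object_path(objects_dir, object_id)
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(raw))
    return object_id

def read_object(objects_dir, object_id):
    """Inflate an object file and split it into (kind, body)."""
    with open(object_path(objects_dir, object_id), "rb") as f:
        raw = zlib.decompress(f.read())
    header, _, body = raw.partition(b"\x00")
    kind, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("length in header does not match body")
    return kind, body

with tempfile.TemporaryDirectory() as objects_dir:  # stand-in for .git/objects
    first = write_object(objects_dir, "blob", b"sweet\n")
    second = write_object(objects_dir, "blob", b"sweet\n")  # a copy under another name
    file_count = sum(len(files) for _, _, files in os.walk(objects_dir))
    kind, body = read_object(objects_dir, first)

print(first == second, file_count, kind, body)
```

Both writes return the same ID and leave a single file behind, which is the deduplication described above falling out of the naming scheme for free.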

==== Trees ====

But where are the filenames? They must be stored somewhere at some stage.
Git gets around to the filenames during a commit:

 $ git commit
 $ find .git/objects -type f

You should now see 3 objects. This time I cannot tell you what the 2 new files are, as it partly depends on the filename you picked. We'll proceed assuming you chose "rose". If you didn't, you can rewrite history to make it look like you did:

 $ git filter-branch --tree-filter 'mv YOUR_FILENAME rose'
 $ find .git/objects -type f

Now you should see +.git/objects/05/b217bb859794d08bb9e4f7f04cbda4b207fbe9+,
because this is the SHA1 hash of:

 "tree" SP "32" NUL "100644 rose" NUL 0xaa823728ea7d592acc69b36875a482cdf3fd5c8d

The final hash is written as 20 raw bytes, not 40 hexadecimal digits, which is
why the length is 32. Check that this file does indeed contain the above by typing:

 $ echo 05b217bb859794d08bb9e4f7f04cbda4b207fbe9 | git cat-file --batch

With zpipe, it's easy to verify the hash:

 $ zpipe -d < .git/objects/05/b217bb859794d08bb9e4f7f04cbda4b207fbe9 | sha1sum

Hash verification is trickier via cat-file because its output contains more
than the raw uncompressed object file.

This file is a 'tree' object: a list of tuples consisting of a file
type, a filename, and a hash. In our example, the file type is "100644", which
means "rose" is a normal file, and the hash identifies the blob object that contains
the contents of "rose". Other possible file types are executables, symlinks or
directories. In the last case, the hash points to a tree object.
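A short Python sketch shows how the bytes of our one-entry tree fit together. The `tree_entry` and `tree_id` helpers are hypothetical names, and the sketch assumes a single entry; a tree with several entries would also need them in the order Git expects.

```python
import hashlib

def tree_entry(mode, name, object_id):
    """One tuple: mode SP name NUL, then the hash as 20 raw bytes."""
    return (mode.encode("ascii") + b" " + name.encode("utf-8")
            + b"\x00" + bytes.fromhex(object_id))

def tree_id(entries):
    """SHA1 of 'tree <length>' NUL followed by the concatenated entries."""
    body = b"".join(entries)
    raw = b"tree " + str(len(body)).encode("ascii") + b"\x00" + body
    return hashlib.sha1(raw).hexdigest()

entry = tree_entry("100644", "rose", "aa823728ea7d592acc69b36875a482cdf3fd5c8d")
print(len(entry))        # 6 + 1 + 4 + 1 + 20 bytes
print(tree_id([entry]))
```

The first number printed is the 32 that appears in the header, and the second line should match the tree ID shown above if the construction is right.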

If you ran filter-branch, you'll have old objects you no longer need. Although
they will be jettisoned automatically once the grace period expires, we'll
delete them now to make our toy example easier to follow:

 $ rm -r .git/refs/original
 $ git reflog expire --expire=now --all
 $ git prune

For real projects you should typically avoid commands like this, as you are
destroying backups. If you want a clean repository, it is usually best to make
a fresh clone. Also, take care if you directly manipulate +.git+: what if a Git
command is running at the same time, or a sudden power outage occurs?
Ideally, refs should be deleted with *git update-ref -d*,
though usually it's safe to remove +refs/original+ by hand.

==== Commits ====

We've explained 2 of the 3 objects. The third is a 'commit' object. Its
contents depend on the commit message, the author and committer, and the date
and time it was created. To match what we have here, we'll have to tweak it a little:

 $ git commit --amend -m Shakespeare  # Change the commit message.
 $ git filter-branch --env-filter 'export
     GIT_AUTHOR_DATE="Fri, 13 Feb 2009 15:31:30 -0800"
     GIT_AUTHOR_NAME="Alice"
     GIT_AUTHOR_EMAIL="alice@example.com"
     GIT_COMMITTER_DATE="Fri, 13 Feb 2009 15:31:30 -0800"
     GIT_COMMITTER_NAME="Bob"
     GIT_COMMITTER_EMAIL="bob@example.com"'  # Rig timestamps and authors.
 $ find .git/objects -type f

You should now see
+.git/objects/49/993fe130c4b3bf24857a15d7969c396b7bc187+
which is the SHA1 hash of its contents:

 "commit 158" NUL
 "tree 05b217bb859794d08bb9e4f7f04cbda4b207fbe9" LF
 "author Alice <alice@example.com> 1234567890 -0800" LF
 "committer Bob <bob@example.com> 1234567890 -0800" LF
 LF
 "Shakespeare" LF

As before, you can run zpipe or cat-file to see for yourself.

This is the first commit, so there are no parent commits, but later commits
will always have at least one parent.
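The commit contents above can be assembled in Python as follows. This is a sketch with a made-up `commit_object` helper; the `parents` argument reflects my understanding that later commits list each parent on its own "parent" line after the tree line.

```python
import hashlib

def commit_object(tree, author, committer, message, parents=()):
    """Build the commit body and return (body length, SHA1 of header + body)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {parent}" for parent in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = ("\n".join(lines) + "\n").encode("utf-8")
    raw = b"commit " + str(len(body)).encode("ascii") + b"\x00" + body
    return len(body), hashlib.sha1(raw).hexdigest()

size, commit_id = commit_object(
    tree="05b217bb859794d08bb9e4f7f04cbda4b207fbe9",
    author="Alice <alice@example.com> 1234567890 -0800",
    committer="Bob <bob@example.com> 1234567890 -0800",
    message="Shakespeare",
)
print(size)       # the length that goes in the header
print(commit_id)
```

The size comes out as the 158 in the header, and the ID should match the commit shown above provided every byte, including the rigged timestamps, is the same.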

==== Indistinguishable From Magic ====

There's little else to say. We have just exposed the secret behind Git's
powers. It seems too simple: it looks like you could mix together a few shell
scripts and add a dash of C code to cook up the above in a matter of hours. In
fact, this accurately describes the earliest versions of Git. Nonetheless,
apart from ingenious packing tricks to save space, and ingenious indexing
tricks to save time, we now know how Git deftly changes a filesystem into a
database perfect for version control.

For example, if any file within the object database is corrupted by a disk
error, then its hash will no longer match, alerting us to the problem. By
hashing hashes of other objects, we maintain integrity at all levels. Commits
are atomic, that is, a commit can never record changes only partially: we can
only compute the hash of a commit and store it in the database after we have
already stored all relevant trees, blobs and parent commits. As a result, the
object database is immune to unexpected interruptions such as power outages.
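The corruption check is easy to sketch in Python: rehash every object and compare the result with its filename. The `put` and `find_corrupt` helpers and the temporary directory are stand-ins of my own; Git's real checker does considerably more.

```python
import hashlib
import os
import tempfile
import zlib

def put(objects_dir, raw):
    """Store raw object bytes, zlib-compressed, under their SHA1."""
    object_id = hashlib.sha1(raw).hexdigest()
    directory = os.path.join(objects_dir, object_id[:2])
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, object_id[2:]), "wb") as f:
        f.write(zlib.compress(raw))
    return object_id

def find_corrupt(objects_dir):
    """Return the IDs whose file contents no longer hash to the file's name."""
    bad = []
    for prefix in sorted(os.listdir(objects_dir)):
        for rest in sorted(os.listdir(os.path.join(objects_dir, prefix))):
            with open(os.path.join(objects_dir, prefix, rest), "rb") as f:
                data = f.read()
            try:
                actual = hashlib.sha1(zlib.decompress(data)).hexdigest()
            except zlib.error:  # damaged so badly it will not even inflate
                actual = None
            if actual != prefix + rest:
                bad.append(prefix + rest)
    return bad

with tempfile.TemporaryDirectory() as objects_dir:
    put(objects_dir, b"blob 6\x00sweet\n")
    victim = put(objects_dir, b"blob 5\x00rose\n")
    before = find_corrupt(objects_dir)
    # Simulate a disk error: same filename, different bytes inside.
    with open(os.path.join(objects_dir, victim[:2], victim[2:]), "wb") as f:
        f.write(zlib.compress(b"blob 5\x00nose\n"))
    after = find_corrupt(objects_dir)

print(before, after == [victim])
```

Nothing is reported for the healthy store, and only the damaged object is flagged afterwards.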

We defeat even the most devious adversaries. Suppose somebody attempts to
stealthily modify the contents of a file in an ancient version of a project. To
keep the object database looking healthy, they must also change the hash of the
corresponding blob object since it's now a different string of bytes. This
means they'll have to change the hash of any tree object referencing the file,
and in turn change the hash of all commit objects involving such a tree, in
addition to the hashes of all the descendants of these commits. This implies the
hash of the official head differs from that of the bad repository. By
following the trail of mismatching hashes we can pinpoint the mutilated file,
as well as the commit where it was first corrupted.

In short, so long as the 20 bytes representing the last commit are safe,
it's infeasible to tamper with a Git repository without being detected.
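The ripple effect can be watched in a small Python sketch. Everything here is simplified for illustration: `snapshot_ids` is a made-up helper, each tree holds a single file named "rose", and the commits omit the author and committer lines that real commits carry.

```python
import hashlib

def object_id(kind, body):
    """SHA1 of '<kind> <length>' NUL <body>."""
    raw = f"{kind} {len(body)}".encode("ascii") + b"\x00" + body
    return hashlib.sha1(raw).hexdigest()

def snapshot_ids(content, parent=None):
    """Return (blob, tree, commit) IDs for a one-file project in a simplified format."""
    blob = object_id("blob", content)
    tree = object_id("tree", b"100644 rose\x00" + bytes.fromhex(blob))
    lines = [f"tree {tree}"]
    if parent:
        lines.append(f"parent {parent}")
    lines += ["", "message"]
    commit = object_id("commit", ("\n".join(lines) + "\n").encode("ascii"))
    return blob, tree, commit

# Honest history: two commits, the second building on the first.
honest_first = snapshot_ids(b"sweet\n")
honest_second = snapshot_ids(b"sweeter\n", parent=honest_first[2])

# Forged history: the old file is altered, the newest file is left alone.
forged_first = snapshot_ids(b"sour\n")
forged_second = snapshot_ids(b"sweeter\n", parent=forged_first[2])

print(honest_second[:2] == forged_second[:2])  # newest blob and tree are untouched
print(honest_second[2] == forged_second[2])    # yet the newest commit ID differs
```

Even though the newest blob and tree are identical in both histories, the newest commit ID changes because it hashes its parent's ID, so comparing that one ID exposes the forgery.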

What about Git's famous features? Branching? Merging? Tags?

Mere details. The current head is kept in the file +.git/HEAD+,
which usually names the current branch (for example `ref: refs/heads/master`)
and otherwise contains the hash of a commit object directly. It gets updated
during a commit as well as by many other commands. Branches are almost the same:
they are files in +.git/refs/heads+, each holding the hash of a commit. Tags too:
they live in +.git/refs/tags+ but they are updated by a different set of commands.
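Following these files by hand takes only a few lines of Python. This is a sketch that assumes every ref is a plain file under the Git directory; refs that Git has packed into +.git/packed-refs+ are not handled. The `resolve_head` helper and the fake directory are invented for the example.

```python
import os
import tempfile

def resolve_head(git_dir):
    """Follow HEAD through any 'ref: ...' indirection to a commit hash."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        value = f.read().strip()
    while value.startswith("ref: "):
        ref_path = os.path.join(git_dir, *value[len("ref: "):].split("/"))
        with open(ref_path) as f:
            value = f.read().strip()
    return value

commit = "49993fe130c4b3bf24857a15d7969c396b7bc187"

with tempfile.TemporaryDirectory() as git_dir:  # a fake .git, for illustration
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
        f.write(commit + "\n")

    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/master\n")  # HEAD names a branch
    on_branch = resolve_head(git_dir)

    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write(commit + "\n")               # HEAD holds a hash directly
    detached = resolve_head(git_dir)

print(on_branch == commit, detached == commit)
```

Either way the trail ends at a commit hash, and from there the tree and blob objects described earlier take over.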