secrets.txt

   1 == Secrets Revealed ==
   2
   3 We take a peek under the hood and explain how Git performs its miracles. I will skimp over details. For in-depth descriptions refer to http://www.kernel.org/pub/software/scm/git/docs/user-manual.html[the user manual].
   4
   5 === Invisibility ===
   6
   7 How can Git be so unobtrusive? Aside from occasional commits and merges, you can work as if you were unaware that version control exists. That is, until you need it, and that's when you're glad Git was watching over you the whole time.
   8
   9 Other version control systems don't let you forget about them. Permissions of files may be read-only unless you explicitly tell the server which files you intend to edit. The central server might be keeping track of who's checked out which code, and when. When the network goes down, you'll soon suffer. Developers constantly struggle with virtual red tape and bureaucracy.
  10
  11 The secret is the `.git` directory in your working directory. Git keeps the history of your project here. The initial "." stops it showing up in `ls` listings. Except when you're pushing and pulling changes, all version control operations operate within this directory.
  12
  13 You have total control over the fate of your files because Git doesn't care what you do to them. Git can easily recreate a saved state from `.git` at any time.
  14
  15 === Integrity ===
  16
  17 Most people associate cryptography with keeping information secret, but another equally important goal is keeping information safe. Proper use of cryptographic hash functions can prevent accidental or malicious data corruption.
  18
  19 A SHA1 hash can be thought of as a unique 160-bit ID number for every string of bytes you'll encounter in your life. Actually more than that: every string of bytes that any human will ever use over many lifetimes. The hash of the whole contents of a file can be viewed as a unique ID number for that file.
  20
  21 An important observation is that a SHA1 hash is itself a string of bytes, so we can hash strings of bytes containing other hashes.
  22
  23 Roughly speaking, all files handled by Git are referred to by their unique ID, not by their filename. All data resides in files in the ".git/objects" subdirectory, where you won't find any normal filenames. The contents of files are strings of bytes we call ''blobs'' and they are divorced from their filenames.
  24
  25 The filenames are recorded somewhere though. They live in ''tree'' objects, which are lists of filenames along with the IDs of their contents. Since the tree itself is a string of bytes, it too has a unique ID, which is how it is stored in the ".git/objects" subdirectory. Trees can appear on the lists of other trees, hence a directory tree and all the files within may be represented by trees and blobs.
  26
  27 Lastly, a ''commit'' contains a message, a few tree IDs and information on how they are related to each other. A commit is also a string of bytes, hence it too has a unique ID.
  28
  29 You can see for yourself: take any hash you see in the `.git/objects` directory, and type
  30
  31  $ git cat-file -p SHA1_HASH
  32
  33 Now suppose somebody tries to rewrite history and attempts to change the contents of a file in an ancient version. Then the ID of the file will change since it's now a different string of bytes. This changes the ID of any tree object referencing this file, which in turn changes the ID of all commit objects involving this tree. The corruption in the bad repository is exposed when everyone realizes all the commits since the mutilated file have the wrong IDs.
  34
  35 I've ignored details such as file permissions and signatures. But in short, so long as the 20 bytes representing the last commit are safe, it's impossible to tamper with a Git repository.
  36
  37 === Intelligence ===
  38
  39 How does Git know you renamed a file, even though you never mentioned the fact explicitly? Sure, you may have run *git mv*, but that is exactly the same as a *git rm* followed by a *git add*.
  40
  41 Git heuristically ferrets out renames and copies between successive versions. In fact, it can detect chunks of code being moved or copied around between files! Though it cannot cover all cases, it does a decent job, and this feature is always improving. If it fails to work for you, try options enabling more expensive copy detection, and consider upgrading.
  42
  43 === Indexing ===
  44
  45 For every tracked file, Git records information such as its size, creation time and last modification time in a file known as the *index*. To determine whether a file has changed, Git compares its current stats with that held the index. If they match, then Git can skip reading the file again.
  46
  47 Since stat calls are vastly cheaper than reading file contents, if you only edit a few files, Git can update its state in almost no time.
  48
  49 === Bare Repositories ===
  50
  51 You may have been wondering what format those online Git repositories use.
  52 They're plain Git repositories, just like your `.git` directory, except they've got names like `proj.git`, and they have no working directory associated with them.
  53
  54 Most Git commands expect the Git index to live in `.git`, and will fail on these bare repositories. Fix this by setting the `GIT_DIR` environment variable to the path of the bare repository, or running Git within the directory itself with the `\--bare` option.
  55
  56 === Git's Origins ===
  57
  58 This http://lkml.org/lkml/2005/4/6/121[Linux Kernel Mailing List post] describes the chain of events that led to Git. The entire thread is a fascinating archaeological site for Git historians.