From d63b42802664074dcf85430775a69037fc2ff52a Mon Sep 17 00:00:00 2001
From: Ben Lynn
Date: Sun, 3 May 2009 21:20:46 -0700
Subject: [PATCH] How the object database works, with an example.

---
 basic.txt       |   2 +-
 clone.txt       |   2 +-
 grandmaster.txt |   7 +--
 history.txt     |   4 +-
 secrets.txt     | 182 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 5 files changed, 175 insertions(+), 22 deletions(-)

diff --git a/basic.txt b/basic.txt
index 7541e08..afae343 100644
--- a/basic.txt
+++ b/basic.txt
@@ -170,7 +170,7 @@ Let A, B, C, D be four successive commits where B is the same as A except some f
 There are at least three solutions. Assuming we are at D:

 1. The difference between A and B are the removed files. We can create a patch representing this difference and apply it:
-  
+
  $ git diff B A | git apply

 2. Since we saved the files back at A, we can retrieve them:
diff --git a/clone.txt b/clone.txt
index d0affa3..43fbc33 100644
--- a/clone.txt
+++ b/clone.txt
@@ -93,7 +93,7 @@ Are you working on a project that uses some other version control system, and yo
  $ git add .
  $ git commit -m "Initial commit"

-then clone it, at light speed:
+then clone it:

  $ git clone . /some/new/directory

diff --git a/grandmaster.txt b/grandmaster.txt
index c9b848f..c5399e8 100644
--- a/grandmaster.txt
+++ b/grandmaster.txt
@@ -77,10 +77,9 @@ and create new refresher bundles with:

 Patches are text representations of your changes that can be easily understood
 by computers and humans alike. This gives them universal appeal. You can email a
-patch to developers no matter what version control system they prefer. As long
+patch to developers no matter what version control system they're using. As long
 as your audience can read their email, they can see your edits. Similarly, on
-your side, all you require is an email account: there's no need to setup a Git
-repository online somewhere.
+your side, all you require is an email account: there's no need to set up an online Git repository.

 Recall from the first chapter:

@@ -179,7 +178,7 @@ The HEAD tag is like a cursor that normally points at the latest commit, advanci

  $ git reset HEAD~3

-will move the HEAD three commits backwards in time. Thus all Git commands now act as if you hadn't made those last three commits, while your files remain in the present. See the git reset man page for some applications.
+will move the HEAD three commits back. Thus all Git commands now act as if you hadn't made those last three commits, while your files remain in the present. See the help page for some applications.

 But how can you go back to the future? The past commits know nothing of the future.

diff --git a/history.txt b/history.txt
index 585cf6a..ffa180b 100644
--- a/history.txt
+++ b/history.txt
@@ -85,7 +85,9 @@ See *git help filter-branch*, which discusses this example and gives a faster
 method. In general, *filter-branch* lets you alter large sections of history
 with a single command.

-Afterwards, you must replace clones of your project with your revised version if you wish to interact with them later.
+Afterwards, the +.git/refs/original+ directory describes the state of affairs before the operation. Check the filter-branch command did what you wanted, then delete this directory if you wish to run more filter-branch commands.
+
+Lastly, replace clones of your project with your revised version if you want to interact with them later.

 === Making History ===

diff --git a/secrets.txt b/secrets.txt
index 56ead1d..b2f9345 100644
--- a/secrets.txt
+++ b/secrets.txt
@@ -16,23 +16,11 @@ You have total control over the fate of your files because Git doesn't care what

 Most people associate cryptography with keeping information secret, but another equally important goal is keeping information safe. Proper use of cryptographic hash functions can prevent accidental or malicious data corruption.

-A SHA1 hash can be thought of as a unique 160-bit ID number for every string of bytes you'll encounter in your life. Actually more than that: every string of bytes that any human will ever use over many lifetimes. The hash of the whole contents of a file can be viewed as a unique ID number for that file.
+A SHA1 hash can be thought of as a unique 160-bit ID number for every string of bytes you'll encounter in your life. Actually more than that: every string of bytes that any human will ever use over many lifetimes.

-An important observation is that a SHA1 hash is itself a string of bytes, so we can hash strings of bytes containing other hashes.
+As a SHA1 hash is itself a string of bytes, we can hash strings of bytes containing other hashes. This simple observation is surprisingly useful: look up 'hash chains'. We'll later see how Git uses it to efficiently guarantee data integrity.

-Roughly speaking, all files handled by Git are referred to by their unique ID, not by their filename. All data resides in files in the ".git/objects" subdirectory, where you won't find any normal filenames. The contents of files are strings of bytes we call ''blobs'' and they are divorced from their filenames.
-
-The filenames are recorded somewhere though. They live in ''tree'' objects, which are lists of filenames along with the IDs of their contents. Since the tree itself is a string of bytes, it too has a unique ID, which is how it is stored in the ".git/objects" subdirectory. Trees can appear on the lists of other trees, hence a directory tree and all the files within may be represented by trees and blobs.
-
-Lastly, a ''commit'' contains a message, a few tree IDs and information on how they are related to each other. A commit is also a string of bytes, hence it too has a unique ID.
-
-You can see for yourself: take any hash you see in the `.git/objects` directory, and type
-
- $ git cat-file -p SHA1_HASH
-
-Now suppose somebody tries to rewrite history and attempts to change the contents of a file in an ancient version. Then the ID of the file will change since it's now a different string of bytes. This changes the ID of any tree object referencing this file, which in turn changes the ID of all commit objects involving this tree. The corruption in the bad repository is exposed when everyone realizes all the commits since the mutilated file have the wrong IDs.
-
-I've ignored details such as file permissions and signatures. But in short, so long as the 20 bytes representing the last commit are safe, it's impossible to tamper with a Git repository.
+Briefly, Git keeps your data in the ".git/objects" subdirectory, where instead of normal filenames, you'll find only IDs. By using IDs as filenames, as well as a few lockfiles and timestamping tricks, Git transforms any humble filesystem into an efficient and robust database.

 === Intelligence ===

@@ -56,3 +44,167 @@ Most Git commands expect the Git index to live in `.git`, and will fail on these
 === Git's Origins ===

 This http://lkml.org/lkml/2005/4/6/121[Linux Kernel Mailing List post] describes the chain of events that led to Git. The entire thread is a fascinating archaeological site for Git historians.
+
+=== The Object Database ===
+
+Are you a systems programmer? Then here's how to write a Git-like
+system from scratch in a few hours.
+
+==== Blobs ====
+
+First, a magic trick. Pick a filename, any filename. In an empty directory:
+
+ $ echo foo > YOUR_FILENAME
+ $ git init
+ $ git add .
+ $ find .git/objects -type f
+
+You'll see +.git/objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99+.
+
+How do I know this despite not knowing the filename you chose? It's because the
+SHA1 hash of:
+
+ "blob" SP "4" NUL "foo" LF
+
+is 257cc5642cb1a054f08cc83f2d943e56fd3ebe99,
+where SP is a space, NUL is a zero byte and LF is a linefeed. You can verify
+this by typing:
+
+ $ echo "blob 4"$'\001'"foo" | tr '\001' '\000' | sha1sum
+
+This is written with the bash shell in mind; other shells may be able to handle
+NUL on the command line, obviating the need for the *tr* workaround.
+
+Git is 'content-addressable': files are not stored according to their filename,
+but rather by the hash of the data they contain, in a file we call a 'blob
+object'. We can think of the hash as a unique ID for a file's contents, so
+in a sense we are addressing files by their content.
+
+The initial "blob 4" is just a header denoting the type of the
+object and the length of its contents in bytes, to simplify internal
+bookkeeping. This is how I knew what you would see. The filename is irrelevant:
+only the data inside is used to construct the blob object.
+
+Thus for identical files, Git only stores the data once as the same blob. Indeed, try adding copies of your file, with any filenames whatsoever. The contents of +.git/objects+ stay the same no matter how many copies you add.
+
+By the way, the files within +.git/objects+ are compressed with zlib so you
+should not stare at them directly. Filter them through
+http://www.zlib.net/zpipe.c[zpipe -d], or type:
+
+ $ git cat-file -p 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
+
+which pretty-prints the given object.
+
+==== Trees ====
+
+But where are the filenames? They must be stored somewhere at some stage.
+Git gets around to the filenames during a commit:
+
+ $ git commit
+ $ find .git/objects -type f
+
+You should now see 3 objects. This time I cannot tell you what the 2 new files are, as it partly depends on the filename you picked. We'll proceed assuming you chose "bar". If you didn't, you can rewrite history to make it look like you did:
+
+ $ git filter-branch --tree-filter 'mv YOUR_FILENAME bar'
+ $ find .git/objects -type f
+
+Now you should see +.git/objects/ef/bc17e61e746dad5c834bcb94869ba66b6264f9+, because this is the SHA1 hash of:
+
+ "tree" SP "31" NUL "100644 bar" NUL 0x257cc5642cb1a054f08cc83f2d943e56fd3ebe99
+
+Check this file does indeed contain this by typing:
+
+ $ echo efbc17e61e746dad5c834bcb94869ba66b6264f9 | git cat-file --batch
+
+With zpipe, it's easy to verify the hash:
+
+ $ zpipe -d < .git/objects/ef/bc17e61e746dad5c834bcb94869ba66b6264f9 | sha1sum
+
+Hash verification is trickier via cat-file because its output contains more
+than the raw uncompressed object file.
+
+This file is a 'tree' object. All filenames are kept in tree objects, where
+they are mapped to SHA1 hashes describing their contents. The string "100644"
+specifies the file type: normal file, executable, or symlink. The hash can be a
+blob object, or another tree object, allowing directory hierarchies to be
+represented.
+
+If you ran filter-branch, you'll now have old objects you no longer need. Although they will be jettisoned automatically once the grace period expires, we'll
+delete them now to make our toy example easier to follow:
+
+ $ rm -r .git/refs/original
+ $ git reflog expire --expire=now --all
+ $ git prune
+
+For real projects you should typically avoid commands like this as you are destroying backups. If you want a clean repository, it is usually best to make a fresh clone. Also, take care if you directly manipulate +.git+: what if a Git command is running at the same time? For a serious project, ideally delete the original refs with *git update-ref -d*.
+
+==== Commits ====
+
+We've explained 2 of the 3 objects. The third is a 'commit' object. Its
+contents depend on the commit message as well as the date and time it was
+created. To match what we have here, we'll have to tweak it a little:
+
+ $ git commit --amend -m baz # Change the commit message.
+ $ git filter-branch --env-filter 'export
+     GIT_AUTHOR_DATE="Fri, 13 Feb 2009 15:31:30 -0800"
+     GIT_AUTHOR_NAME="Alice"
+     GIT_AUTHOR_EMAIL="alice@example.com"
+     GIT_COMMITTER_DATE="Fri, 13 Feb 2009 15:31:30 -0800"
+     GIT_COMMITTER_NAME="Bob"
+     GIT_COMMITTER_EMAIL="bob@example.com"' # Rig timestamps and authors.
+ $ find .git/objects -type f
+
+You should now see
++.git/objects/f0/92611fe90e213cd76b35ce165fc00b6e311b4f+ which is the SHA1 hash
+of its contents:
+
+ "commit 160" NUL
+ "tree efbc17e61e746dad5c834bcb94869ba66b6264f9" LF
+ "author Alice 1234567890 -0800" LF
+ "committer Bob 1234567890 -0800" LF
+ LF
+ "baz" LF
+
+As before, you can run zpipe or cat-file to see for yourself.
+
+This is the first commit, so there are no parent commits, but later commits
+will always have at least one parent.
+
+==== Indistinguishable From Magic ====
+
+There's little else to say. We have just exposed the secret behind Git's
+powers. It seems too simple: it looks like you could mix together a few shell
+scripts and add a dash of C code to cook up the above in a matter of hours. In
+fact, this accurately describes the earliest versions of Git. Nonetheless,
+apart from ingenious packing tricks to save space, and ingenious indexing
+tricks to save time, we now know how Git deftly changes a filesystem into a
+database perfect for version control.
+
+For example, if any file within the object database is corrupted by a disk
+error, then its hash will no longer match. Commits are atomic, that is, a
+commit can never only partially record changes: we cannot compute the hash of a
+commit and store it in the database until we already have stored all relevant
+trees, blobs and parent commits. The object database is immune to
+unexpected interruptions such as power outages.
+
+We defeat even the most devious adversaries. Suppose somebody attempts to
+stealthily modify the contents of a file in an ancient version of a project. To
+keep the object database looking healthy, they must also change the hash of the
+corresponding blob object since it's now a different string of bytes. This
+means they'll have to change the hash of any tree object referencing the file,
+and in turn change the hash of all commit objects involving such a tree, in
+addition to the hashes of all the descendants of these commits. This implies the
+hash of the official current head differs from that of the bad repository. By
+following the trail of mismatching hashes we can pinpoint the mutilated file,
+as well as the commit where it was first corrupted.
+
+In short, so long as the 20 bytes representing the last commit are safe,
+it's impossible to tamper with a Git repository.
+
+What about Git's famous features? Branching? Merging? Tags?
+
+Mere details. The current head is kept in the file +.git/HEAD+,
+which contains a hash of a commit object. The hash gets updated during a commit
+as well as many other commands. Branches are almost the same: they are files in
++.git/refs/heads+. Tags too: they live in +.git/refs/tags+ but they
+are updated by a different set of commands.
-- 
2.11.4.GIT
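The object IDs quoted in the secrets.txt hunk can be cross-checked without Git. What follows is a minimal Python sketch and not part of the patch. The helper names `git_object_id` and `tree_entry` are hypothetical, and the sketch assumes the object layout the text describes: type, space, length in bytes, NUL, then the payload. Only the blob ID is stated as an expected value; the tree ID is printed so it can be compared against the one in the text.

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Hash `type SP length NUL payload` with SHA1, as the text describes."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def tree_entry(mode: str, name: str, object_id_hex: str) -> bytes:
    """Encode one tree entry: `mode SP name NUL` followed by the 20 raw hash bytes."""
    return f"{mode} {name}".encode("utf-8") + b"\x00" + bytes.fromhex(object_id_hex)


# Blob for a file containing "foo" plus a linefeed (what `echo foo` writes).
blob_id = git_object_id("blob", b"foo\n")
print(blob_id)  # 257cc5642cb1a054f08cc83f2d943e56fd3ebe99

# Tree with a single regular file named "bar" pointing at that blob.
tree_payload = tree_entry("100644", "bar", blob_id)
print(len(tree_payload))  # 31, the length recorded in the tree header
print(git_object_id("tree", tree_payload))  # compare with the tree ID in the text
```

The filename never enters the blob computation, which is why the text can predict the blob ID before knowing what the file is called. The name only appears in the tree payload.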