Q: What does git do to ensure commit hashes are always unique, even for the exact same operations on exactly the same contents?
Does git simply use something like uuidgen to generate a unique id for the commit object, or does it do something different based on a combination of a timestamp, your MAC address, your wifi signals, etc. per person?
If I create a file foo with touch foo and then run shasum foo, it displays da39a3ee5e6b4b0d3255bfef95601890afd80709.
No matter how often I run shasum foo, or on which computer I run it, it will always print da39a3ee5e6b4b0d3255bfef95601890afd80709 because, yep, it's the SHA-1 of exactly the same contents. Empty contents in this case :)
However, if I do the following steps:
cd /some/where
mkdir demo
cd demo
git init
touch foo
git add -A
git commit -m "adding foo"
...and remember the SHA key of the commit (e.g. 959c363ed4cf147725360532454bc258c964c744).
Now, when I delete demo and repeat the exact same steps, the commit SHA key will still be different. That is great, and it's important to ensure identity.
What I would like to know, though, is what exactly git does to ensure commit hashes are always unique, even for the exact same operations on exactly the same contents.
Nothing. If you create the same contents, you get the same SHA-1.
First, however, you need to realize that "same contents" of a commit means that (provided you don't get an accidental SHA-1 collision[1] or find a way to break SHA-1) you must create the same complete repository history leading up to and including the commit itself, including all the same trees, author names, time stamps, and so on.
This is because the contents of a commit are what you see if you run git cat-file -p <sha-1> on a commit (plus the type-and-size header that says "this object is of type commit", so that there are no trivial ways to break things by creating a blob with the same contents as a previous commit). Here's one as an example:
$ git cat-file -p 996b0fdbb4ff63bfd880b3901f054139c95611cf
tree e760f781f2c997fd1d26f2779ac00d42ca93f534
parent 6da748a7cebe3911448fabf9426f81c9df9ec54f
parent 740c281d21ef5b27f6f1b942a4f2fc20f51e8c7e
author Junio C Hamano <gitster@pobox.com> 1406140600 -0700
committer Junio C Hamano <gitster@pobox.com> 1406140600 -0700

Sync with v2.0.3

* maint:
  Git 2.0.3
  .mailmap: combine Stefan Beller's emails
  git.1: switch homepage for stats
Note that this string includes the tree and its SHA-1, both of this commit's parent SHA-1s, the author and timestamp, the committer and timestamp, and the message. If you change even a single bit of this (such as by trying to change the underlying tree, or using some different parent commit(s)), you will get a new, different SHA-1, rather than 996b0fdbb4ff63bfd880b3901f054139c95611cf.
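To make the hashing concrete, here is a minimal Python sketch of how a SHA-1 repository derives object IDs: the hash covers a short header (the object type, a space, the body length in bytes, and a NUL byte) followed by the body. The helper names git_object_id and commit_body and the identity "Me <me@example.com>" are made up for illustration; the commit layout mirrors what git cat-file -p prints, with a blank line before the message.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash a body the way a SHA-1 git repository does: header, NUL, body."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree_id: str, ident: str, when: str, message: str) -> bytes:
    """Build the text of a parentless commit (illustrative helper)."""
    return (
        f"tree {tree_id}\n"
        f"author {ident} {when}\n"
        f"committer {ident} {when}\n"
        f"\n"
        f"{message}\n"
    ).encode()


# shasum hashes the raw bytes; git hashes "blob 0\0" + bytes, so the IDs differ.
plain_sha1 = hashlib.sha1(b"").hexdigest()
empty_blob = git_object_id("blob", b"")

# A tree with one regular file "foo": mode, space, name, NUL, raw 20-byte blob ID.
tree_id = git_object_id("tree", b"100644 foo\0" + bytes.fromhex(empty_blob))

ident = "Me <me@example.com>"  # hypothetical identity
first = git_object_id("commit", commit_body(tree_id, ident, "1406140600 -0700", "adding foo"))
again = git_object_id("commit", commit_body(tree_id, ident, "1406140600 -0700", "adding foo"))
later = git_object_id("commit", commit_body(tree_id, ident, "1406140601 -0700", "adding foo"))

print(plain_sha1)      # da39a3ee5e6b4b0d3255bfef95601890afd80709
print(empty_blob)      # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(first == again)  # True: identical contents, identical ID
print(first == later)  # False: one second later, different ID
```

Nothing here is random: the only reason two commits made by repeating the same steps differ is that the timestamps inside the commit text differ.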
So the answer to this:
So in theory if me and you do exactly the same steps at exactly the same time with exactly the same configured author, email etc, we would actually get the same commit SHA key?
is "yes". However ... you must start with the same staging area (this is what will become the tree), and the same parent commits. If you then configure your author, email, etc., exactly the same as the other guy, and both of you create a new commit at the same second (or use git's environment variables[2] to force the time stamps), you both get the same new commit.
Which is precisely what we want. It doesn't matter if you create it, when you're named "me", or I create it, when I'm named "me", if all the rest of the contents are the same. Because whoever creates it, the other "me" can clone it, and then we both have the same thing that way too.
(If I want to be sure that the "me" that creates something is not confused with the real me, I need to add something unique, that I know and the other me doesn't. Of course, if I publish this thing somewhere, the other me now knows it. But this is what signed, annotated tags are for. They can contain a GPG signature.)
[1] The chances of an accidental hash collision (for any pair of objects; chances rise with more objects) are 1 out of 2^160, which is ... very small. :-) The rise is actually very rapid, so that by the time you have a million objects, it's about 1 out of 2^121. The formula I use here is:
1 - exp(-(n * (n - 1)) / (2 * r))
where r = 2^160 and n is the number of objects. Without the subtraction from 1, the equation calculates the "safety margin", as it were: the chance that we won't have an accidental hash collision. If we want to keep the collision chance in the same range as the chance that a disk drive reads back the wrong contents for a file (or at least, the chance that disk-makers claim), we need to keep it around 10^-18, which means we need to avoid putting more than about 1.7 quadrillion (1.7E15) objects in our git databases.
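The two figures quoted in this footnote can be checked numerically. The following is a rough Python sketch; the function names collision_probability and max_objects are made up for illustration, and the second one treats n * (n - 1) as n^2, which is harmless at these sizes.

```python
import math

R = 2 ** 160  # number of possible SHA-1 values


def collision_probability(n: int, r: int = R) -> float:
    """Approximate chance of at least one accidental collision among n objects."""
    x = n * (n - 1) / (2 * r)
    # -expm1(-x) equals 1 - exp(-x) but keeps precision when x is tiny;
    # the naive form rounds to 0.0 in double precision for these inputs.
    return -math.expm1(-x)


def max_objects(p: float, r: int = R) -> float:
    """Approximate object count that keeps the collision chance at p."""
    # Solve 1 - exp(-n^2 / (2r)) = p for n.
    return math.sqrt(-2 * r * math.log1p(-p))


p_million = collision_probability(10 ** 6)
print(math.log2(p_million))  # about -121.1, i.e. roughly 1 out of 2^121
print(max_objects(1e-18))    # about 1.7e15 objects
```

Evaluating the formula literally as 1 - exp(...) in ordinary floating point returns 0.0 for a million objects, which is why the sketch uses expm1 and log1p.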
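[2] There are many git environment variables that you can set to override various defaults. The ones for the author and committer, including date and email, are:
GIT_AUTHOR_NAME
GIT_AUTHOR_EMAIL
GIT_AUTHOR_DATE
GIT_COMMITTER_NAME
GIT_COMMITTER_EMAIL
GIT_COMMITTER_DATE
as described in the git commit-tree documentation.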
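As a rough end-to-end check, the Python sketch below repeats the steps from the question in two throwaway repositories while pinning the identity and dates through GIT_AUTHOR_NAME, GIT_AUTHOR_EMAIL, GIT_AUTHOR_DATE, GIT_COMMITTER_NAME, GIT_COMMITTER_EMAIL and GIT_COMMITTER_DATE. It assumes git is installed and on PATH; the helper commit_foo, the identity and the dates are made up for illustration, and HOME is redirected only so that personal configuration (for example commit signing) cannot change the result.

```python
import os
import shutil
import subprocess
import tempfile

# Hypothetical identity and date, in git's "<unix timestamp> <tz offset>" format.
FIXED = {
    "GIT_AUTHOR_NAME": "Me",
    "GIT_AUTHOR_EMAIL": "me@example.com",
    "GIT_AUTHOR_DATE": "1406140600 -0700",
    "GIT_COMMITTER_NAME": "Me",
    "GIT_COMMITTER_EMAIL": "me@example.com",
    "GIT_COMMITTER_DATE": "1406140600 -0700",
}


def commit_foo(overrides: dict) -> str:
    """Create a fresh repo, commit an empty file foo, and return the commit ID."""
    with tempfile.TemporaryDirectory() as home, tempfile.TemporaryDirectory() as repo:
        # Ignore user and system configuration so only the pinned values matter.
        env = dict(os.environ, HOME=home, XDG_CONFIG_HOME=home, GIT_CONFIG_NOSYSTEM="1")
        env.update(overrides)

        def git(*args: str) -> str:
            done = subprocess.run(
                ["git", *args], cwd=repo, env=env,
                check=True, capture_output=True, text=True,
            )
            return done.stdout.strip()

        git("init", "-q")
        open(os.path.join(repo, "foo"), "w").close()  # same as: touch foo
        git("add", "-A")
        git("commit", "-q", "-m", "adding foo")
        return git("rev-parse", "HEAD")


if shutil.which("git"):
    same_a = commit_foo(FIXED)
    same_b = commit_foo(FIXED)
    moved = commit_foo({**FIXED, "GIT_COMMITTER_DATE": "1406140601 -0700"})
    print(same_a == same_b)  # expected True: every input to the hash matches
    print(same_a == moved)   # expected False: committer date differs by one second
```

With the dates left unpinned, the two runs would normally land in different seconds, which is the whole reason the question's repeated steps produced different hashes.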