git hash name-clash

Is a 10-hex-digit Git hash abbreviation enough?


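How many possible hash values does one need to avoid clashes among N items? If you recall the birthday paradox, the answer is much larger than N: clashes become likely once the number of items approaches the square root of the number of possible hash values.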

Let's reverse the question: with 16^10 (about 1.1 × 10^12) possible hash values, which corresponds to 10 hex digits of an abbreviated Git revision ID, at how many revisions does the probability of a hash coincidence rise to 50%? A direct calculation shows that with about 1,234,603 revisions, the probability that two of them share the same 10-digit hash is 50%.
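For anyone who wants to check that figure, here is a minimal sketch of the calculation. It uses the standard birthday-paradox approximation P(clash) ≈ 1 - exp(-n(n-1)/(2M)) for n items and M equally likely hash values, and it assumes that abbreviated hashes are uniformly distributed. The function names are made up for this illustration.

```python
import math


def clash_probability(n_items: int, hex_digits: int) -> float:
    """Approximate probability that at least two of n_items share
    the same hex_digits-long abbreviated hash (uniform hashes assumed)."""
    space = 16 ** hex_digits  # number of distinct abbreviations
    # Birthday approximation: P(no clash) ~ exp(-n(n-1) / (2 * space)).
    return -math.expm1(-n_items * (n_items - 1) / (2 * space))


def items_for_probability(p: float, hex_digits: int) -> int:
    """Approximate number of items at which the clash probability reaches p."""
    space = 16 ** hex_digits
    # Invert the approximation: n ~ sqrt(2 * space * ln(1 / (1 - p))).
    return math.ceil(math.sqrt(2 * space * math.log(1 / (1 - p))))


if __name__ == "__main__":
    print(items_for_probability(0.5, 10))   # roughly 1.23 million
    print(clash_probability(1_234_603, 10)) # close to 0.5
```

The same functions make it easy to see how quickly the threshold grows: each extra hex digit multiplies the number of possible values by 16, so it multiplies the 50% threshold by 4.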

Now, a million or so revisions is not unheard of in large, active repositories. Has anybody here experienced a Git hash clash in their work? Theoretically speaking, it ought to have happened.


Solution

  • Git automatically scales the length of abbreviated hashes as the number of objects increases, so this is usually not an issue. In addition, if an abbreviated hash would be ambiguous at the normal length, Git will automatically produce a longer, unambiguous value. Some commands let you control the length of abbreviations with an option named --abbrev if you want a specific value, and the core.abbrev configuration option can override the default.

    However, these names are necessarily only unique at the moment they're created, so if you're producing tools that need to work with revisions, they should always operate on the full object IDs. Note also that there is work underway to switch to using SHA-256, so you should not assume anything about the length of a particular full object ID when writing tools.
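To illustrate the two points above, here is a rough sketch of "lengthen the abbreviation until it is unambiguous", and of why such a name is only unique at the moment it is created. This is not Git's actual implementation. The function name, the `min_len` default, and the shortened object IDs are all hypothetical and chosen only for readability.

```python
from typing import Iterable


def shortest_unique_abbrev(full_id: str, all_ids: Iterable[str], min_len: int = 7) -> str:
    """Return the shortest prefix of full_id (at least min_len characters)
    that no other ID in all_ids starts with."""
    others = [oid for oid in all_ids if oid != full_id]
    for length in range(min_len, len(full_id) + 1):
        prefix = full_id[:length]
        if not any(other.startswith(prefix) for other in others):
            return prefix
    return full_id  # only the full ID is unambiguous


if __name__ == "__main__":
    # Hypothetical, shortened object IDs (real ones are much longer).
    ids = ["abcdef0123456789", "abcdef01ffff0000", "1234567890abcdef"]
    target = ids[0]

    print(shortest_unique_abbrev(target, ids))  # needs more than min_len characters

    # A later object can make an older abbreviation ambiguous,
    # which is why tools should store and compare full object IDs.
    ids.append("abcdef012aaa0000")
    print(shortest_unique_abbrev(target, ids))  # now needs to be longer still
```

The second call returns a longer prefix than the first for the same object, which is exactly the situation a tool runs into if it stores abbreviated names instead of full IDs.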