git git-hash

Git objects SHA-1 are file contents or file names?


I am confused about how a file's actual contents are stored in .git.

For example, "Version 1" is the actual text content of test.txt. When I commit it (first commit) to the repo, Git returns a SHA-1 for that file, which is located at .git\objects\0c\15af113a95643d7c244332b0e0b287184cd049.

When I open the file 15af113a95643d7c244332b0e0b287184cd049 in a text editor, it's all garbage, something like this:

x+)JMU074f040031QÐKÏ,ÉLÏË/Je¨}ºõw[Éœ„ÇR­ ñ·Î}úyGª*±8#³¨,1%>9?¯$5¯D¯¤¢„áôÏ3%³þú>š~}Ž÷*ë²-¶ç¡êÊòR“KâKòãs+‹sô

But I'm not sure whether this garbage is the encrypted form of the text "Version 1", or whether that text is represented by the SHA-1 15af113a95643d7c244332b0e0b287184cd049.


Solution

  • The correct answer to the question in the subject line:

    Git objects SHA-1 are file contents or file names?

    is probably "neither", since you were referring to the contents of the loose object file, rather than the original file—and even if you were referring to the original file, that's still not quite right.

    A loose object, in Git, is a plain file. The name of the file is constructed from the object's hash ID. The object's hash ID, in turn, is constructed by computing a hash of the object's contents with a prefix header attached.

    The prefixed header depends on the object type. There are four types: blob, commit, tag, and tree. The header is a zero-terminated byte string: the type name as an ASCII (or equivalently, UTF-8) byte string, followed by a space, followed by the decimal representation of the size of the object in bytes, followed by an ASCII NUL (b'\x00' in Python, if you prefer modern Python notation, or '\0' if you prefer C).

    After the header come the actual object contents. So, for a file containing the byte string b'hello\n', the data to be hashed consist of b'blob 6\0hello\n':

    $ echo 'hello' | git hash-object -t blob --stdin
    ce013625030ba8dba906f756967f9e9ca394464a
    $ python3
    [...]
    >>> import hashlib
    >>> s = b'blob 6\0hello\n'
    >>> hashlib.sha1(s).hexdigest()
    'ce013625030ba8dba906f756967f9e9ca394464a'
    

    Hence, the file name that would be used to store this file is (derived from) ce013625030ba8dba906f756967f9e9ca394464a. As a loose object, it becomes .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a.
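
    The header-plus-contents rule and the loose-object path can be wrapped in two small helpers. This is only a sketch: the function names git_object_id and loose_object_path are made up for illustration, and it assumes a repository using SHA-1 object names (not the newer SHA-256 object format).

```python
import hashlib


def git_object_id(obj_type: str, data: bytes) -> str:
    """Hash b'<type> <size>\\0<data>' with SHA-1, as described above."""
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def loose_object_path(object_id: str) -> str:
    """First two hex digits name the directory, the rest name the file."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


oid = git_object_id("blob", b"hello\n")
print(oid)
print(loose_object_path(oid))
```

    For b'hello\n' this reproduces the hash-object result shown above, and the path splits the 40 hex digits into 2 + 38.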

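    The contents of that file, however, are the zlib-compressed form of b'blob 6\0hello\n'. Git apparently uses compression level 1 here, which is consistent with the git-config documentation naming 1 (best speed) as the fallback default for core.looseCompression. Python's zlib default is currently 6, and the result does not match at that level. It's not clear whether Git's zlib deflate always matches Python's exactly, but using level 1 did work here: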

    $ echo 'hello' | git hash-object -w -t blob --stdin
    ce013625030ba8dba906f756967f9e9ca394464a
    $ vis .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
    x\^AK\M-J\M-IOR0c\M-HH\M-M\M-I\M-I\M-g\^B\000\^]\M-E\^D\^T$
    

    (note that the final $ is the shell prompt again; now back to Python3)

    >>> import zlib
    >>> zlib.compress(s, 1)
    b'x\x01K\xca\xc9OR0c\xc8H\xcd\xc9\xc9\xe7\x02\x00\x1d\xc5\x04\x14'
    >>> import vis
    >>> print(vis.vis(zlib.compress(s, 1)))
    x\^AK\M-J\M-IOR0c\M-HH\M-M\M-I\M-I\M-g\^B\^@\^]\M-E\^D\^T
    

    where vis.py is:

    def vischr(byte):
        "encode characters the way vis(1) does by default"
        if byte in b' \t\n':
            return chr(byte)
        # control chars: \^X; del: \^?
        if byte < 32 or byte == 127:
            return r'\^' + chr(byte ^ 64)
        # printable characters, 32..126
        if byte < 128:
            return chr(byte)
        # meta characters: prefix with \M^ or \M-
        byte -= 128
        if byte < 32 or byte == 127:
            return r'\M^' + chr(byte ^ 64)
        return r'\M-' + chr(byte)
    
    def vis(bytestr):
        "same as vis(1)"
        return ''.join(vischr(c) for c in bytestr)
    

    (vis produces an invertible but printable encoding of binary files; it was my 1993-ish answer to problems with cat -v).
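
    Going the other way answers the original question directly: the "garbage" is not encrypted, just zlib-compressed, so decompressing it recovers the header and the contents. Below is a minimal sketch; parse_loose_object is a hypothetical helper name, and the sample bytes are produced with zlib.compress to stand in for a real loose object file.

```python
import zlib


def parse_loose_object(raw: bytes):
    """Decompress a loose object and split it into (type, size, content)."""
    data = zlib.decompress(raw)
    # The header ends at the first NUL byte; everything after is content.
    header, _, content = data.partition(b"\0")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type.decode("ascii"), int(size), content


# Stand-in for the bytes read from a loose object file.
raw = zlib.compress(b"blob 6\0hello\n", 1)
print(parse_loose_object(raw))
```

    To try it on a real object, read the file in binary mode, e.g. open(path, 'rb').read(), and pass the result in. In everyday use, git cat-file -t <hash> and git cat-file -p <hash> show an object's type and contents without any of this.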

    Note that the names of files stored in a Git repository (under a commit) appear only as path name components stored in individual tree objects. Computing the hash ID of a tree object is nontrivial; I have Python code that does this in my public "scripts" repository under githash.py.
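
    To give a feel for where the names live, here is a rough sketch of how a tree object's hash could be computed. It is not the githash.py code mentioned above. It assumes the usual tree layout (each entry is the mode in ASCII, a space, the name, a NUL, then the 20 raw bytes of the entry's SHA-1), that entries are sorted byte-wise with directory names compared as if they ended in "/", and that the hash IDs of all entries are already known. The name tree_object_id and the tuple layout are illustrative only.

```python
import hashlib


def tree_object_id(entries):
    """entries: iterable of (mode, name, sha1_hex), e.g. ("100644", "test.txt", "ce01...")."""

    def sort_key(entry):
        mode, name, _ = entry
        # Assumption: directories (mode "40000") sort as if named "name/".
        suffix = b"/" if mode == "40000" else b""
        return name.encode("utf-8") + suffix

    body = b""
    for mode, name, sha1_hex in sorted(entries, key=sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(sha1_hex)  # raw 20-byte hash, not hex text

    # Same header scheme as for blobs, with type "tree".
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# A tree holding one file whose blob hash was computed earlier.
print(tree_object_id([("100644", "test.txt", "ce013625030ba8dba906f756967f9e9ca394464a")]))
```

    The file name appears only inside the tree's body; the blob itself is the same object no matter what the file is called.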