
How does bup (git-based image backup) compute hashes of stored objects?


bup (https://github.com/bup/bup) is a backup program that borrows some ideas and some functions from the git version control system to store virtual machine images compactly.

bup has a bup ls subcommand which, when the -s option is passed, shows SHA-1-like hashes (hex strings of the same length) for the objects stored inside the backup. The man page bup-ls says only "-s, --hash : show hash for each file/directory." However, this SHA-1-like hash is not equal to the sha1sum output for the original file.

Git itself computes the SHA-1 hash of file data by prefixing the data with the string "blob NNN\0", where NNN is the size of the object in bytes written in decimal, according to "How does git compute file hashes?" and https://stackoverflow.com/a/28881708/
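For reference, the git scheme described above can be sketched in a few lines of Python. This is only an illustration of the prefixing rule; the function name git_blob_hash is made up for this example and is not part of git or bup.

```python
import hashlib


def git_blob_hash(data: bytes) -> str:
    """Return the SHA-1 that git assigns to a blob with this content."""
    # Header is 'blob <decimal size>' followed by a single NUL byte.
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# Same value as: printf 'hello\n' | git hash-object --stdin
print(git_blob_hash(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

Plain sha1sum hashes only the data, without the header, which is why its output differs from git's even for the same file.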

I tried the "blob NNN\0" prefix and still did not get the same SHA-1 sum.

What method does bup use to compute the hash of a file? Is it a linear SHA-1 or some tree-like variant such as a Merkle tree? What is the hash of a directory?

The source of bup's ls command is https://github.com/bup/bup/blob/master/lib/bup/ls.py, where the hash is simply printed in hex, but where is the hash generated?

def node_info(n, name, 
    ''' ....
    if show_hash:
        result += "%s " % n.hash.encode('hex')

Is that hash generated when the backup is created (when a file is put into the backup by the bup index + bup save commands) and just printed by bup ls, or is it recomputed on every bup ls, so that it can be used as an integrity test of the backup?


Solution

  • bup stores all data in a bare git repository (which by default is located at ~/.bup). Therefore bup's hash computation method exactly replicates the one used by git.

    However, an important difference from git is that bup may split files into chunks. If bup decides to split a file into chunks, then the file is represented in the repository as a tree rather than as a blob. In that case bup's hash of the file coincides with git's hash of the corresponding tree.

    The following script demonstrates that:

    bup_hash_test

    #!/bin/bash
    
    bup init
    BUPTEST=/tmp/bup_test
    function test_bup_hash()
    {
        bup index $BUPTEST &> /dev/null
        bup save -n buptest $BUPTEST &> /dev/null
        local buphash=$(bup ls -s buptest/latest$BUPTEST|cut -d' ' -f 1)
        echo "bup's hash: $buphash"
        echo "git's hash: $(git hash-object $BUPTEST)"
        echo git --git-dir \~/.bup cat-file -p $buphash
        git --git-dir ~/.bup cat-file -p $buphash
    }
    
    cat > $BUPTEST <<'END'
        http://pkgsrc.se/sysutils/bup
        http://cvsweb.netbsd.org/bsdweb.cgi/pkgsrc/sysutils/bup/
    END
    
    test_bup_hash
    
    echo
    echo
    
    echo " -1" >> $BUPTEST
    
    echo "After appending ' -1' line:"
    test_bup_hash
    
    echo
    echo
    
    echo "After replacing '-' with '#':"
    sed -i 's/-/#/' $BUPTEST
    test_bup_hash
    

    Output:

    $ ./bup_hash_test
    Initialized empty Git repository in ~/.bup/
    bup's hash: b52baef90c17a508115ce05680bbb91d1d7bfd8d
    git's hash: b52baef90c17a508115ce05680bbb91d1d7bfd8d
    git --git-dir ~/.bup cat-file -p b52baef90c17a508115ce05680bbb91d1d7bfd8d
        http://pkgsrc.se/sysutils/bup
        http://cvsweb.netbsd.org/bsdweb.cgi/pkgsrc/sysutils/bup/
    
    
    After appending ' -1' line:
    bup's hash: c95b4a1fe1956418cb0e58e0a2c519622d8ce767
    git's hash: b5bc4094328634ce6e2f4c41458514bab5f5cd7e
    git --git-dir ~/.bup cat-file -p c95b4a1fe1956418cb0e58e0a2c519622d8ce767
    100644 blob aa7770f6a52237f29a5d10b350fe877bf4626bd6    00
    100644 blob d00491fd7e5bb6fa28c517a0bb32b8b506539d4d    61
    
    
    After replacing '-' with '#':
    bup's hash: cda9a69f1cbe66ff44ea6530330e51528563e32a
    git's hash: cda9a69f1cbe66ff44ea6530330e51528563e32a
    git --git-dir ~/.bup cat-file -p cda9a69f1cbe66ff44ea6530330e51528563e32a
        http://pkgsrc.se/sysutils/bup
        http://cvsweb.netbsd.org/bsdweb.cgi/pkgsrc/sysutils/bup/
     #1
    

    As we can see, when bup's and git's hashes match, the corresponding object in the bup repository is a blob with the expected contents. When bup's and git's hashes do NOT match, the object with bup's hash is a tree. The contents of the blobs in that tree correspond to fragments of the full file:

    $ git --git-dir ~/.bup cat-file -p aa7770f6a52237f29a5d10b350fe877bf4626bd6
        http://pkgsrc.se/sysutils/bup
        http://cvsweb.netbsd.org/bsdweb.cgi/pkgsrc/sysutils/bup/
     -$ git --git-dir ~/.bup cat-file -p d00491fd7e5bb6fa28c517a0bb32b8b506539d4d
    1
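The tree case can also be sketched in Python. This is a rough illustration, not bup's actual code: the helper names git_object_hash and chunk_tree_hash are invented here, and the naming rule for tree entries (byte offset of each chunk in zero-padded hex, mode 100644) is an assumption inferred from the names 00 and 61 in the listing above. It is at least consistent with the output, since the first fragment is 0x61 = 97 bytes long.

```python
import hashlib
from typing import Sequence


def git_object_hash(kind: str, body: bytes) -> str:
    """SHA-1 of a git object: '<kind> <size>' + NUL header, then the body."""
    header = f"{kind} {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


def chunk_tree_hash(chunks: Sequence[bytes]) -> str:
    """Hash of a git tree whose entries are the given chunks, in order.

    Assumption: each entry has mode 100644 and is named by the chunk's
    byte offset in hex, zero-padded to the width of the total size.
    """
    total = sum(len(chunk) for chunk in chunks)
    width = len(f"{total:x}")
    body = b""
    offset = 0
    for chunk in chunks:
        name = f"{offset:0{width}x}".encode("ascii")
        blob_sha = bytes.fromhex(git_object_hash("blob", chunk))
        # A tree entry is '<mode> <name>' + NUL + the raw 20-byte SHA-1.
        # Equal-width hex offsets are already in the sorted order git expects.
        body += b"100644 " + name + b"\0" + blob_sha
        offset += len(chunk)
    return git_object_hash("tree", body)


# The two fragments shown by 'git cat-file -p' above.
first = (b"    http://pkgsrc.se/sysutils/bup\n"
         b"    http://cvsweb.netbsd.org/bsdweb.cgi/pkgsrc/sysutils/bup/\n"
         b" -")
second = b"1\n"

assert len(first) == 0x61  # matches the name of the second tree entry
print("whole file as blob:", git_object_hash("blob", first + second))
print("file as chunk tree:", chunk_tree_hash([first, second]))
```

If the assumptions hold, the first printed value can be compared with "git's hash" and the second with "bup's hash" from the middle test case in the output above.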