
Find commit hash from individual files checked out locally


In my Go project, I have a local copy of https://github.com/HouzuoGuo/tiedot. It was probably made manually (or with go get) a couple of years ago.

I cannot tell which version or tag was checked out, since that was not recorded anywhere.

Is there any way for me to find the commit hash from the hashes of individual files? For example, some of the hashes are below:

github.com/HouzuoGuo/tiedot/db> shasum *.go
79b42b7af9784255b39b4307950709880df4a86f  col.go
b5f5a127c990229e8ac085eb8e7c72d0e6617e1c  col_test.go
be45a7eae65803df2dc31e23db7eb27bcffa17cc  db.go
290c32d11498aacb0456117f2bffa8e7ab74ccd8  db_test.go
3d0e0dc06fbd8191b5d68b32b4ac4200444e98f2  doc.go
f15745867ccfcb8609194b617cc6e8911174dad9  doc_test.go
40fcd698a680b39bd8405b9bc62d0f4b99411cbf  idx_test.go
d1c481d7d75140b229440819bb21eb64095a7b35  query.go
c83114227dc59100de953ffceb4398e4d8a6075b  query_test.go

Once I have the commit hash, I can add it to my go.mod file using something like go get github.com/HouzuoGuo/tiedot@<hash>

Based on suggestions from @torek below, I checked out the code from GitHub and wrote a sample script that reads all the commits and checks whether the hash of one of the files matches. It does not work, though. What am I missing?

COMMITS=$(git rev-list --all)

for COMMIT_HASH in $COMMITS
do
    TREE_HASH=$(git cat-file -p $COMMIT_HASH | grep tree | cut -d' ' -f2)
    if [[ -z "$TREE_HASH" ]]; then
        echo "Tree hash is empty"
        continue
    fi

    DB_DIR_HASH=$(git cat-file -p $TREE_HASH | grep '[[:space:]]db$' | awk '{print $3}')
    if [[ -z "$DB_DIR_HASH" ]]; then
        echo "db dir hash is empty"
        continue
    fi

    DBGO_HASH=$(git cat-file -p $DB_DIR_HASH | grep db.go | awk '{print $3}')
    if [[ -z "$DBGO_HASH" ]]; then
        echo "db.go hash is empty"
        continue
    fi

    if [[ "$DBGO_HASH" == "be45a7eae65803df2dc31e23db7eb27bcffa17cc" ]]; then
        echo "db.go hash matched!!!   Commit $COMMIT_HASH"
    fi
done

Solution

  • Is there any way for me to find the commit hash from hash of individual files?

    The bad news: no, because the commit hash depends not only on the files themselves but also on the commit's metadata.

    The good news: you don't need to do that, as you can simply go the other direction, from commit hash to files. That is, with a clone of the repository, walk the commit graph. For each commit you find in the process, compare the saved source snapshot to the set of files you care about.

    Edit 2: Make sure the checksum you're using is the one Git would use, not the one produced by running shasum or any similar command. That is, use the git hash-object command to compute the hash IDs of the objects for which you will search. (The default is to compute a blob hash ID so you can just run git hash-object db/db.go for instance.)
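To illustrate why shasum and Git disagree, here is a minimal sketch that reproduces a blob ID by hand. It assumes a SHA-1 repository (the default) and no line-ending or other content filters; git_blob_hash is a hypothetical helper name, not a Git command.

```shell
#!/bin/sh
# Sketch: reproduce a Git blob ID without Git (assumes a SHA-1 repository).
# git_blob_hash is a hypothetical helper name, not a Git command.
git_blob_hash() {
    # Size of the file in bytes (tr strips the padding some wc versions add).
    size=$(wc -c < "$1" | tr -d ' ')
    # Use sha1sum where available (Linux), otherwise fall back to shasum (macOS).
    if command -v sha1sum >/dev/null 2>&1; then sha=sha1sum; else sha=shasum; fi
    # Git hashes "blob <size>" + NUL byte + file content, not the content alone.
    { printf 'blob %s\0' "$size"; cat "$1"; } | "$sha" | cut -d' ' -f1
}
```

Under those assumptions, git_blob_hash db/db.go should print the same ID as git hash-object db/db.go, while plain shasum db/db.go prints a different value because it omits the header.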

    You may find more than one match (which is why this is not invertible): for instance, perhaps v2.4.2 and v2.4.4 both match because v2.4.3 was broken and the bug was reverted to make v2.4.4. But that's not important, as long as the result works for you.

    To compare the hashes of the sources you care about, use git ls-tree -r on the commit in question. Use git rev-list to enumerate commit hash IDs. If you have a full tree, you can speed things up by computing the tree hash and comparing the result of git rev-parse $commit^{tree} for each $commit value, rather than comparing all the file hashes of some known subset of files, but either way this should go pretty fast.
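The git ls-tree -r comparison could be sketched roughly as follows. This is only one possible approach: rank_commits is a hypothetical helper, it assumes it is run from inside a clone of the upstream repository, and it assumes the local copy holds plain *.go files that correspond to one subdirectory of the repository. Rather than demanding an exact match, it counts how many local files match in each commit, which still helps if a file was edited locally.

```shell
#!/bin/sh
# Sketch: rank commits by how many local files they contain unchanged.
# Usage (inside a clone): rank_commits LOCAL_DIR REPO_SUBDIR
# rank_commits is a hypothetical helper name, not a Git command.
rank_commits() {
    local_dir=$1
    subdir=$2
    want=$(mktemp)
    # One "blob-id<TAB>path" line per local file, hashed the way Git hashes blobs.
    for f in "$local_dir"/*.go; do
        printf '%s\t%s\n' "$(git hash-object "$f")" "$subdir/$(basename "$f")"
    done | sort > "$want"
    git rev-list --all | while read -r commit; do
        # ls-tree -r prints "<mode> blob <id><TAB><path>"; reduce it to "<id><TAB><path>"
        # and count the lines that also appear in the local list.
        n=$(git ls-tree -r --full-tree "$commit" "$subdir" |
            sed 's/^[0-9]* blob //' | sort | comm -12 - "$want" | wc -l | tr -d ' ')
        echo "$n $commit"
    done | sort -rn | head
    rm -f "$want"
}
```

For the layout in the question, a call might look like rank_commits /path/to/local/tiedot/db db (the local path is a placeholder). The output is at most ten lines of "<match count> <commit>", best matches first.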

    Edit: I'm not sure what is going wrong with your script, but here is a much simpler variant:

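    # The ID to look for must come from `git hash-object db/db.go` run on your local
    # copy; shasum output will not match. The value below is the one from the question.
    want=be45a7eae65803df2dc31e23db7eb27bcffa17cc
    git rev-list --branches |
    while read -r commit; do
        # Resolve the blob ID of db/db.go in this commit; skip commits without the file.
        h=$(git rev-parse --quiet --verify "$commit:db/db.go") || continue
        if [ "$h" = "$want" ]; then
            echo "db/db.go hash matched in commit $commit"
        fi
    done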
    

    Note that the file may be in many commits! When I ran a variant of this on the Git repository for Git, looking for hash ID d2632690d5107b53ee8a7ac4832cd85eb8c7bfc1 of levenshtein.c, I got 18132 commits matched (which took about ten minutes, scanning through just over 60000 commits). It is also possible that the hash ID is in no commit at all. A fast way to check is the option from jthill's comment: git log --find-object=<hash> (with --all or --branches or whatever). If this turns up at least one match, then at least one commit has the object; the script will then find all commits that have it.
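As a quick pre-check before running the slower loop, the --find-object test could be wrapped like this. It is a sketch that assumes Git 2.16 or later (where --find-object was introduced) and that it runs inside the clone; blob_in_history is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: succeed if some commit on any ref adds or removes the blob
# that FILE would hash to. blob_in_history is a hypothetical helper name.
blob_in_history() {
    # Compute the ID the way Git does, not with shasum.
    id=$(git hash-object "$1") || return 1
    # --find-object lists commits where the number of occurrences of the object changes;
    # any output at all means the blob exists somewhere in history.
    [ -n "$(git log --all --format=%H --find-object="$id" | head -n 1)" ]
}
```

For example, blob_in_history /path/to/local/tiedot/db/db.go && echo "worth scanning" (the path is a placeholder). If it fails, the local file was probably modified, or it comes from history that the clone does not have.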

    Using git rev-list --tags --no-walk found 181 commits in about 8 seconds:

    $ time git rev-list --tags --no-walk | while read commit; do h=$(git rev-parse --quiet --verify $commit:levenshtein.c) || continue; test $h = d2632690d5107b53ee8a7ac4832cd85eb8c7bfc1 && echo "found in $commit"; done | wc -l
         181
    
    real    0m7.810s
    user    0m2.449s
    sys     0m3.434s
    

    Running git rev-list --tags --no-walk on its own lists 772 tagged commits in 0.046s, so nearly all of the time is spent in the loop: this script fragment handles about 100 commits per second (772 commits in about 7.8 seconds) on my old Mac laptop. (I used this to back-estimate the 10 minutes above: I know it was slow!)