python, python-3.x, sha1, sha1sum

Python 3.9 - unable to get the correct SHA1 hash for multiple files in a loop


Following the code from the solution in the link at the bottom, I am not getting the correct SHA1 hash for the second file onwards in the loop. I say the hashes are incorrect because sha1sum reports the same hash for two copies of the same file, while my loop produces two different hashes (details below).

Please advise whether anything in this code needs to be modified, or whether I should take a different approach.

Code written by referring to the link given at the bottom:

import glob
import hashlib
import os

path = input("Please provide path to search for file pattern (search will be in this path sub-directories also: ")
filepattern = input("Please provide the file pattern to search in given path. Example *.jar, *abc*.jar.: ")
assert os.path.exists(path), "I did not find the path " + str(path)
path = path.rstrip("/")
tocheck = (f'{path}/**/{filepattern}')
hash_obj = hashlib.sha1()

searched_file_list = glob.iglob(tocheck, recursive=True)
for file in searched_file_list:
    print(f'{file}')
    try:
        checksum = ""
        file_for_sha1 = ""
        file_for_sha1 = open(file, 'rb')
        hash_obj.update(file_for_sha1.read())
        checksum = hash_obj.hexdigest()
        print(f'sha1 for file ({file})= {checksum}')
    finally:
        file_for_sha1.close()

Example file: abc.txt, created at /home/test/git/reader/cabin/ with the following text: "Hi This is to test SHA1 code."

This file was then copied to one more location, i.e. /home/test/git/reader/check/cabin/.

Linux console output showing the same SHA1 for both files:

:~/git/reader/check/cabin$ sha1sum abc.txt
fc4db67f46711b2c18bd133abd67965649edfffc  abc.txt
:~/git/reader/check/cabin$ cd ../..
:~/git/reader$ cd cabin/
:~/git/reader/cabin$ sha1sum abc.txt
fc4db67f46711b2c18bd133abd67965649edfffc  abc.txt

The loop, in a single execution, generates two different SHA1 hashes for this abc.txt file from the two locations.

When the code is executed twice for the same file, giving the respective location each time (i.e. one file at a time), it generates the same, correct SHA1 hash.

Referred code link: Generating one MD5/SHA1 checksum of multiple files in Python


Solution

  • To quote the docs on the update method

    Repeated calls are equivalent to a single call with the concatenation of all the arguments: m.update(a); m.update(b) is equivalent to m.update(a+b).

    So instead of finding the hash of each file separately, you're finding the hash of both files concatenated. That is what the question you've linked is doing - a single hash for multiple files. You want a hash for each file, so instead of calling update multiple times on the same hash_obj instance, create a new instance for each file (a short demonstration and a complete sketch are given after the snippets below), so

    hash_obj = hashlib.sha1()
    searched_file_list = glob.iglob(tocheck, recursive=True)
    for file in searched_file_list:
        print(f'{file}')
        try:
            ...
            hash_obj.update(file_for_sha1.read())
    

    will become

    searched_file_list = glob.iglob(tocheck, recursive=True)
    for file in searched_file_list:
        print(f'{file}')
        try:
            hash_obj = hashlib.sha1()
            ...
            hash_obj.update(file_for_sha1.read())
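
    To make the cumulative behaviour of update concrete, here is a small self-contained check; the byte strings reuse the question's example text, and the split point is arbitrary:

    import hashlib

    data_a = b"Hi This "
    data_b = b"is to test SHA1 code."

    # Two updates on one object...
    h1 = hashlib.sha1()
    h1.update(data_a)
    h1.update(data_b)

    # ...give the same digest as a single update with the concatenation.
    h2 = hashlib.sha1()
    h2.update(data_a + data_b)

    assert h1.hexdigest() == h2.hexdigest()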
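
    Putting it together, here is a minimal sketch of the question's loop with that one change applied; the with block and the chunked read are optional tidy-ups added here, not part of the original code:

    import glob
    import hashlib
    import os

    path = input("Please provide path to search for file pattern: ")
    filepattern = input("Please provide the file pattern to search, e.g. *.jar: ")
    assert os.path.exists(path), "I did not find the path " + str(path)
    path = path.rstrip("/")
    tocheck = f'{path}/**/{filepattern}'

    for file in glob.iglob(tocheck, recursive=True):
        print(f'{file}')
        # A fresh hash object per file, so each digest covers only that file's bytes.
        hash_obj = hashlib.sha1()
        with open(file, 'rb') as file_for_sha1:
            # Reading in fixed-size chunks keeps memory use bounded for large files;
            # a single read() also works for small files.
            for chunk in iter(lambda: file_for_sha1.read(65536), b''):
                hash_obj.update(chunk)
        checksum = hash_obj.hexdigest()
        print(f'sha1 for file ({file})= {checksum}')

    With a fresh hashlib.sha1() per iteration, the two copies of abc.txt produce the identical digest that sha1sum reports on the console.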