
Python 3.7: Hashing a binary file


I am trying to generate a hash for a given file. In this case, the hash function reached a binary file (a .tgz file) and raised an error. Is there a way to read a binary file and generate an MD5 hash of it?

The error I am receiving is:

buffer = buffer.decode('UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 10: invalid start byte

The source code is:

import hashlib

def HashFile(filename, readBlockSize = 4096):
    hash = hashlib.md5()

    with open(filename, 'rb') as fileHandle:

        while True:
            buffer = fileHandle.read(readBlockSize)

            if not buffer:
                break

            buffer = buffer.decode('UTF-8')                
            hash.update(hashlib.md5(buffer).hexdigest())

    return

I am using Python 3.7 on Linux.


Solution

  • There are a couple of things you can tweak here.

    You don't need to decode the bytes returned by .read(), because md5() is expecting bytes in the first place, not str:

    >>> import hashlib
    >>> h = hashlib.md5(open('dump.rdb', 'rb').read()).hexdigest()
    >>> h
    '9a7bf9d3fd725e8b26eee3c31025b18e'
    

    This means you can remove the line buffer = buffer.decode('UTF-8') from your function.

    You'll also need to return something, such as hash.hexdigest(), if you want to use the result of the function; a bare return gives the caller None.

    Lastly, you need to pass the raw block of bytes to .update(), not its hex digest (which is a str); see the docs' example.

    Putting it all together:

    def hash_file(filename: str, blocksize: int = 4096) -> str:
        hsh = hashlib.md5()
        with open(filename, "rb") as f:
            while True:
                buf = f.read(blocksize)
                if not buf:
                    break
                hsh.update(buf)
        return hsh.hexdigest()
    

    (The interactive example above uses a Redis .rdb dump file as the binary input.)
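
To check the two claims above (that .update() wants bytes rather than str, and that feeding a file in blocks gives the same digest as hashing it in one go), here is a minimal sketch that should also run on Python 3.7. The names hash_file_chunked, update_rejects_str and demo are made up for illustration, and a temporary file filled with non-UTF-8 bytes stands in for the .tgz file.

```python
import hashlib
import os
import tempfile


def hash_file_chunked(path, blocksize=4096):
    """Return the MD5 hex digest of a file, reading it in fixed-size blocks."""
    hsh = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (EOF).
        for block in iter(lambda: f.read(blocksize), b""):
            hsh.update(block)  # raw bytes, no decoding
    return hsh.hexdigest()


def update_rejects_str():
    """Return True if hashlib refuses a str argument to update()."""
    try:
        hashlib.md5().update("not bytes")
    except TypeError:
        return True
    return False


def demo():
    """Hash a temporary binary file in blocks and in one shot (hypothetical helper)."""
    # Bytes 0x80-0xFF make this payload invalid as UTF-8, like most archives.
    payload = bytes(range(256)) * 100
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        chunked = hash_file_chunked(path, blocksize=1000)
    finally:
        os.remove(path)
    one_shot = hashlib.md5(payload).hexdigest()
    return chunked, one_shot


if __name__ == "__main__":
    chunked, one_shot = demo()
    print(chunked == one_shot)
    print(update_rejects_str())
```

The block size only affects how much is held in memory at once, not the resulting digest, so the block-wise version is the safer choice for large files.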