I am trying to generate a hash for a given file. When the hash function reached a binary file (a .tgz file), it raised an error. Is there a way to read a binary file and generate an MD5 hash of it?
The error I am receiving is:
buffer = buffer.decode('UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 10: invalid start byte
The source code is:
import hashlib

def HashFile(filename, readBlockSize = 4096):
    hash = hashlib.md5()
    with open(filename, 'rb') as fileHandle:
        while True:
            buffer = fileHandle.read(readBlockSize)
            if not buffer:
                break
            buffer = buffer.decode('UTF-8')
            hash.update(hashlib.md5(buffer).hexdigest())
    return
I am using Python 3.7 on Linux.
There are a couple of things you can tweak here.
You don't need to decode the bytes returned by .read(), because md5() is expecting bytes in the first place, not str:
>>> import hashlib
>>> h = hashlib.md5(open('dump.rdb', 'rb').read()).hexdigest()
>>> h
'9a7bf9d3fd725e8b26eee3c31025b18e'
This means you can remove the line buffer = buffer.decode('UTF-8') from your function.
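To illustrate why the decode step is both unnecessary and the source of the error, here is a small sketch. The sample bytes are made up for illustration; they only reuse the 0xbc byte mentioned in the traceback.

```python
import hashlib

# Hypothetical sample data: 0xbc is a UTF-8 continuation byte, so it cannot
# start a character, much like bytes found in a compressed .tgz file.
raw = b"header\xbc\x00binary"

# Hashing the bytes directly works.
digest = hashlib.md5(raw).hexdigest()

# Decoding arbitrary binary data as UTF-8 fails.
try:
    raw.decode("UTF-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Passing a str to md5() is rejected with a TypeError.
try:
    hashlib.md5("already a str")
    str_rejected = False
except TypeError:
    str_rejected = True
```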
You'll also need to return the hash (for example hash.hexdigest()) rather than use a bare return, if you want to use the results of the function.
Lastly, you need to pass the raw block of bytes to .update(), not its hex digest (which is a str); see the docs' example.
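As a quick sanity check of that point, feeding raw blocks to .update() should give the same digest as hashing all the data in one call. The payload and block size below are arbitrary choices for the sketch, not values from the question.

```python
import hashlib

data = bytes(range(256)) * 40  # arbitrary binary payload (10240 bytes)
blocksize = 4096

# Feed the data to the hash object one block at a time.
incremental = hashlib.md5()
for start in range(0, len(data), blocksize):
    incremental.update(data[start:start + blocksize])

# Hash the same data in a single call for comparison.
one_shot = hashlib.md5(data)
same = incremental.hexdigest() == one_shot.hexdigest()
print(same)
```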
Putting it all together:
import hashlib

def hash_file(filename: str, blocksize: int = 4096) -> str:
    hsh = hashlib.md5()
    with open(filename, "rb") as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            # Feed the raw bytes to the hash; no decoding needed.
            hsh.update(buf)
    return hsh.hexdigest()
(The interactive example above uses a Redis .rdb dump, which is a binary file.)
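If you prefer to avoid the explicit while True loop, one possible variant uses iter() with a sentinel, which also works on Python 3.7. The name hash_file_iter and the temporary-file check are only illustrative assumptions, not part of the original function.

```python
import hashlib
import os
import tempfile
from functools import partial


def hash_file_iter(filename, blocksize=4096):
    """Hypothetical variant of the block-wise hash using iter() with a sentinel."""
    hsh = hashlib.md5()
    with open(filename, "rb") as f:
        # iter(callable, sentinel) calls f.read(blocksize) until it returns b"".
        for buf in iter(partial(f.read, blocksize), b""):
            hsh.update(buf)
    return hsh.hexdigest()


# Write random bytes to a temporary file and hash it block by block.
payload = os.urandom(10000)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name
try:
    file_digest = hash_file_iter(path)
finally:
    os.remove(path)

# The block-wise digest should match hashing the payload in one call.
expected = hashlib.md5(payload).hexdigest()
```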