python, md5, hashlib

Get the MD5 hash of big files in Python


I have used hashlib (which replaces the md5 module in Python 2.6/3.0), and it worked fine when I opened a file and passed its contents to hashlib.md5().

The problem is with very big files whose size can exceed the amount of RAM available.

How can I get the MD5 hash of a file without loading the whole file into memory?


Solution

  • Break the file into 8192-byte chunks (or some other multiple of 64 bytes) and feed them to MD5 consecutively using update().

    This lines up with MD5's 64-byte (512-bit) input block size (8192 is 64×128). Since you're not reading the entire file into memory, this won't use much more than 8192 bytes of memory at a time.

    In Python 3.8+ you can do (a pre-3.8 variant is sketched after this example)

    import hashlib
    with open("your_filename.txt", "rb") as f:
        file_hash = hashlib.md5()
        while chunk := f.read(8192):  # read and hash 8 KiB at a time until EOF
            file_hash.update(chunk)
    print(file_hash.digest())
    print(file_hash.hexdigest())  # to get a printable str instead of bytes
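
    If you need to support Python versions before 3.8 (no walrus operator), the same chunked approach works with iter() and an empty-bytes sentinel. This is a minimal sketch; the md5_of_file name and chunk_size parameter are illustrative, not part of the original answer.

    import hashlib

    def md5_of_file(path, chunk_size=8192):
        """Hash a file in fixed-size chunks so memory use stays small."""
        file_hash = hashlib.md5()
        with open(path, "rb") as f:
            # iter() keeps calling f.read(chunk_size) until it returns b"" (end of file)
            for chunk in iter(lambda: f.read(chunk_size), b""):
                file_hash.update(chunk)
        return file_hash.hexdigest()

    print(md5_of_file("your_filename.txt"))

    On Python 3.11 and newer there is also hashlib.file_digest(), which does the chunked reading for you, e.g. hashlib.file_digest(f, "md5").hexdigest() on a file opened in binary mode.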