Tags: java, netty, md5, file, hash

File MD5 hash changes when chunking it (for Netty transfer)


Question at the bottom

I'm using Netty to transfer a file to another server. I limit my file chunks to 1024*64 bytes (64 KB) because of the WebSocket protocol. The following method is a local example of what happens to the file:

public static void rechunck(File file1, File file2) {

    FileInputStream is = null;
    FileOutputStream os = null;

    try {

        byte[] buf = new byte[1024*64];

        is = new FileInputStream(file1);
        os = new FileOutputStream(file2);

        while(is.read(buf) > 0) {
            os.write(buf);
        }

    } catch (IOException e) {
        Controller.handleException(Thread.currentThread(), e);
    } finally {

        try {

            if(is != null && os != null) {
                is.close();
                os.close();
            }

        } catch (IOException e) {
            Controller.handleException(Thread.currentThread(), e);
        }

    }

}

The file is read by the InputStream into a byte array and written directly to the OutputStream. The content of the file cannot change during this process.

To get the MD5 hashes of the files I wrote the following method:

public static String checksum(File file) {

    InputStream is = null;

    try {

        is = new FileInputStream(file);
        MessageDigest digest = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        int read = 0;

        while((read = is.read(buffer)) > 0) {
            digest.update(buffer, 0, read);
        }

        return new BigInteger(1, digest.digest()).toString(16);

    } catch(IOException | NoSuchAlgorithmException e) {
        Controller.handleException(Thread.currentThread(), e);
    } finally {

        try {
            is.close();
        } catch(IOException e) {
            Controller.handleException(Thread.currentThread(), e);
        }

    }

    return null;

}

So, in theory, it should produce the same hash, shouldn't it? The problem is that it produces two different hashes, and they do not change from run to run; the file size stays the same, and so does the content. When I run the method once with in: file-1, out: file-2, and again with in: file-2, out: file-3, the hashes of file-2 and file-3 are the same! This means the method changes the file in the same way every time. The hashes of file-1, file-2, and file-3:

1. 58a4a9fbe349a9e0af172f9cf3e6050a
2. 7b3f343fa1b8c4e1160add4c48322373
3. 7b3f343fa1b8c4e1160add4c48322373

Here is a little test that compares the buffers to check whether they are equivalent. The test passes, so there are no differences.

File file1 = new File("controller/templates/Example.zip");
File file2 = new File("controller/templates2/Example.zip");

try {

    byte[] buf1 = new byte[1024*64];
    byte[] buf2 = new byte[1024*64];

    FileInputStream is1 = new FileInputStream(file1);
    FileInputStream is2 = new FileInputStream(file2);

    boolean run = true;
    while(run) {

        int read1 = is1.read(buf1), read2 = is2.read(buf2);
        String result1 = Arrays.toString(buf1), result2 = Arrays.toString(buf2);
        boolean test = result1.equals(result2);

        System.out.println("1: " + result1);
        System.out.println("2: " + result2);
        System.out.println("--- TEST RESULT: " + test + " ----------------------------------------------------");

        if(!(read1 > 0 && read2 > 0) || !test) run = false;

    }

} catch (IOException e) {
    e.printStackTrace();
}
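
A related quick check is to compare the file lengths directly (a minimal sketch reusing file1 and file2 from above; File.length() returns the size in bytes, so both values should match if nothing extra was written):

// Quick size comparison - a hypothetical addition, not part of the original test
System.out.println("file-1: " + file1.length() + " bytes");
System.out.println("file-2: " + file2.length() + " bytes");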

Question: Can you help me chunk the file without changing the hash?


Solution

  • while(is.read(buf) > 0) {
        os.write(buf);
    }
    

    The read() method with the array argument returns the number of bytes read from the stream. When the file length is not an exact multiple of the byte array length, the last call returns fewer bytes than the array length, because you have reached the end of the file.

    However, your os.write(buf); call writes the whole byte array to the stream, including the leftover bytes beyond what was last read. This means the written file ends up bigger, which is why the hash changes.

    Interestingly, you didn't make this mistake when you updated the message digest:

    while((read = is.read(buffer)) > 0) {
        digest.update(buffer, 0, read);
    }
    

    You just have to do the same when you "rechunk" your files.
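
    For illustration, a corrected copy loop could look like this (a minimal sketch that keeps the variable names from the question; opening and closing of the streams stays as before):

    int read;
    while((read = is.read(buf)) > 0) {
        os.write(buf, 0, read);
    }

    This writes only the bytes that were actually read, so the copy stays byte-for-byte identical to the source and both files produce the same MD5 hash.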