Tags: python, linux, bash, statistics, hexdump

What information describes the quantitative difference between two large files of the same size?


Usually, to find out how two binary files differ, I use the diff and hexdump tools. But in some situations, given two large binary files of the same size, I would like to see only their quantitative differences, such as the number of differing regions and the cumulative difference.

Example: two files, A and B. They have 2 differing regions, and their cumulative difference is (6c-a3) + (6c-11) + (6f-6e) + (20-22).

File A = 48 65 6c 6c 6f 2c 20 57
File B = 48 65 a3 11 6e 2c 22 57
              |--------|  |--|
                 reg 1   reg 2

How can I get this information using standard GNU tools and Bash, or would a simple Python script be better? Other statistics about how the two files differ could also be useful, but I don't know what else can be measured, or how. Entropy difference? Variance difference?


Solution

  • For everything except the regions, you can use NumPy. Something like this (untested):

    import numpy as np
    a = np.fromfile("file A", dtype="uint8")
    b = np.fromfile("file B", dtype="uint8")
    
    # Compute the number of bytes that are different
    different_bytes = np.sum(a != b)
    
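    # Compute the sum of the differences. Cast to a signed type first:
    # uint8 subtraction wraps around modulo 256 (0x20 - 0x22 would give 254).
    signed_diff = a.astype(np.int16) - b.astype(np.int16)
    difference = np.sum(signed_diff, dtype=np.int64)
    
    # Compute the sum of the absolute value of the differences
    absolute_difference = np.sum(np.abs(signed_diff), dtype=np.int64)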
    
    # In some cases, the number of bits that have changed is a better
    # measurement of change. To compute it we make a lookup array where 
    # bitcount_lookup[byte] == number_of_1_bits_in_byte (so
    # bitcount_lookup[0:16] == [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4])
    bitcount_lookup = np.array(
        [bin(i).count("1") for i in range(256)], dtype="uint8")
    
    # Numpy allows using an array as an index. ^ computes the XOR of
    # each pair of bytes. The result is a byte with a 1 bit where the
    # bits of the input differed, and a 0 bit otherwise.
    bit_diff_count = np.sum(bitcount_lookup[a ^ b])
    

    I couldn't find a NumPy function for computing the regions, but you can write your own using a != b as input; it shouldn't be hard. See this question for inspiration.
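
One possible way to count the regions is to look for the edges of the runs in the boolean mask `a != b`. The sketch below is only an illustration under that assumption, not a tested tool: the helper name `diff_regions` is hypothetical, and the example bytes from the question are built in memory rather than read with `np.fromfile`.

```python
import numpy as np

def diff_regions(a, b):
    """Return (start, end) index pairs, end exclusive, of runs where a != b.

    Hypothetical helper; assumes a and b are 1-D arrays of equal length.
    """
    mask = a != b
    # Pad with False on both sides so every run has a rising and a falling edge.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)   # first differing index of each run
    ends = np.flatnonzero(edges == -1)    # one past the last differing index
    return list(zip(starts.tolist(), ends.tolist()))

# The two example files from the question, as in-memory byte arrays.
a = np.frombuffer(bytes.fromhex("48656c6c6f2c2057"), dtype=np.uint8)
b = np.frombuffer(bytes.fromhex("4865a3116e2c2257"), dtype=np.uint8)

regions = diff_regions(a, b)
print(len(regions), regions)  # 2 [(2, 5), (6, 7)]
```

For the example, this reports two regions: bytes 2 to 4 and byte 6, matching "reg 1" and "reg 2" in the question. The region lengths (`end - start`) could be used as a further statistic.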