python comparison bytecode pyc

Compare whether two Python files result in the same bytecode (are code-wise identical)


We're doing some code cleanup. The cleanup only affects formatting. (If it matters, let's even assume that line numbers don't change, though ideally I'd like to ignore line number changes as well.)

To be sure that there is no accidental code change, I'd like to find a simple, fast way to compare the two source files.

So let's assume that I have file1.py and file2.py.

What works is to use py_compile.compile(filename) to create .pyc files, run uncompyle6 on each .pyc file, strip the comments, and compare the results. But this is overkill and very slow.

Another approach I imagined is to copy file1.py to, for example, file.py, run py_compile.compile("file.py"), and save the .pyc file.

Then copy file2.py to file.py, run py_compile.compile("file.py") again, save that .pyc file, and finally compare the two generated .pyc files.

Would this work reliably with all (current) Python versions >= 3.6?

If I remember correctly, at least for Python 2 the .pyc files could contain timestamps or absolute paths, which could make the comparison fail (at least if the .pyc files were generated on two different machines).

Is there a clean way to compare the bytecode of two .py files?

As a bonus feature (if possible), I'd like to create a hash of each file's bytecode that I could store for future reference.


Solution

  • You might try Python's built-in compile function, which can compile from a string (read in from a file in your case). The example below compiles two equivalent programs and one almost-equivalent program, compares the resulting code objects, and then, purely for demonstration (something you would not want to do in practice), executes a couple of the code objects:

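    import hashlib
    import marshal


    def compute_hash(code):
        # Serialize the code object to bytes, then hash those bytes.
        code_bytes = marshal.dumps(code)
        code_hash = hashlib.sha1(code_bytes).hexdigest()
        return code_hash


    # Two equivalent programs that differ only in formatting ...
    source1 = """x = 3
    y = 4
    z = x * y
    print(z)
    """
    source2 = "x=3;y=4;z=x*y;print(z)"

    # ... and an almost equivalent one that renames a variable.
    source3 = "a=3;y=4;z=a*y;print(z)"

    # Compile from strings; no .py copies or .pyc files are involved.
    obj1 = compile(source=source1, filename='<string>', mode='exec', dont_inherit=True)
    obj2 = compile(source=source2, filename='<string>', mode='exec', dont_inherit=True)
    obj3 = compile(source=source3, filename='<string>', mode='exec', dont_inherit=True)

    # Compare the code objects directly.
    print(obj1 == obj2)
    print(obj1 == obj3)

    # For demonstration only: run two of the code objects.
    exec(obj1)
    exec(obj3)
    print(compute_hash(obj1))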
    

    Prints:

    True
    False
    12
    12
    48632a1b64357e9d09d19e765d3dc6863ee67ab9
    

    This saves you from having to copy .py files, create .pyc files, compare .pyc files, and so on.
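Applied to files on disk, the same idea could look like the sketch below. The helper names load_code and same_code are made up for illustration, and passing one fixed dummy filename to compile is an assumption of this sketch, meant to keep the real paths out of the code objects. Whether a pure line-number shift still compares equal depends on the Python version, so a False result is best treated as a prompt to look closer rather than as proof of a code change.

```python
from pathlib import Path


def load_code(path):
    """Compile the source file at `path` into a code object (no .pyc is written)."""
    # Reading bytes lets compile() honor an encoding declaration in the file.
    source = Path(path).read_bytes()
    # "<compare>" is a hypothetical placeholder filename used for both files.
    return compile(source, "<compare>", "exec", dont_inherit=True)


def same_code(path1, path2):
    """Return True if both files compile to equal code objects."""
    return load_code(path1) == load_code(path2)
```

With the files from the question, this would be called as same_code("file1.py", "file2.py").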

    Note:

    The compute_hash function is there in case you need a hash that is repeatable, i.e. one that returns the same value for the same code object in successive program runs.
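Because marshal.dumps serializes the whole code object, including its line-number information, the hash above changes as soon as a statement moves to another line. If line numbers should be ignored, one option is to hash only selected fields of the code object. The following is a minimal sketch under that assumption: fingerprint and stable_hash are hypothetical names, the set of fields and the use of SHA-256 are choices made here, and it has not been checked against every Python version. The compiler may also emit slightly different instructions depending on how an expression is split across lines, so an occasional false alarm is possible.

```python
import hashlib
import types


def fingerprint(code):
    """Describe `code` as a nested tuple that contains no line-number data."""
    consts = []
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            # Nested functions, classes, and comprehensions are code objects too.
            consts.append(fingerprint(const))
        elif isinstance(const, frozenset):
            # Sort the members so the result does not depend on set ordering.
            consts.append(("frozenset", tuple(sorted(repr(item) for item in const))))
        else:
            # Keep the type name so that e.g. 1 and 1.0 stay distinct.
            consts.append((type(const).__name__, repr(const)))
    return (
        code.co_name,
        code.co_argcount,
        code.co_kwonlyargcount,
        code.co_flags,
        code.co_code,
        code.co_names,
        code.co_varnames,
        code.co_freevars,
        code.co_cellvars,
        tuple(consts),
    )


def stable_hash(source):
    """Return a hex digest for `source` that ignores line numbers and the filename."""
    code = compile(source, "<string>", "exec", dont_inherit=True)
    return hashlib.sha256(repr(fingerprint(code)).encode("utf-8")).hexdigest()
```

Two sources that differ only in layout, such as source1 and source2 above, should then produce the same digest, while source3 should not. Docstrings are stored as constants, so reformatting a docstring still counts as a change.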