Tags: python-3.x, pandas, bytecode, pyc, binary-reproducibility

How to make the compilation of Python source code reproducible


After installing jsonpickle on my machine (pip install jsonpickle==1.4.1 --no-compile), I noticed that compiling the pandas.py file in the ext subfolder does not always produce the same output.

In the ext subfolder I executed the following bash code to compile all .py files to .pyc files:

python -m compileall -d somereldir --invalidation-mode checked-hash

This created a pandas.cpython-37.pyc file in the __pycache__ subdirectory. In the __pycache__ subdirectory, I then executed: xxd pandas.cpython-37.pyc > hex1.hex

When I repeat the steps above and write the hexdump to hex2.hex, two lines do not match:

diff hex1.hex hex2.hex
288,289c288,289
< 000011f0: 0029 013e 0200 0000 723f 0000 00da 056e  .).>....r?.....n
< 00001200: 616d 6573 7213 0000 0029 0372 3300 0000  amesr....).r3...
---
> 000011f0: 0029 013e 0200 0000 da05 6e61 6d65 7372  .).>......namesr
> 00001200: 3f00 0000 7213 0000 0029 0372 3300 0000  ?...r....).r3...

I repeated this several times, and there appear to be two "versions" of the .pyc file: sometimes two runs produce matching files, sometimes they don't.
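The same comparison can be sketched in Python instead of xxd and diff. This is only a minimal illustration: the helper name first_difference and the file names in the usage note are hypothetical, and the function just reports the offset of the first differing byte.

```python
from pathlib import Path


def first_difference(path_a, path_b):
    """Return the offset of the first differing byte, or None if the files are identical."""
    a = Path(path_a).read_bytes()
    b = Path(path_b).read_bytes()
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # No mismatch in the common prefix: the files differ only if one is longer.
    return None if len(a) == len(b) else min(len(a), len(b))
```

For example, after saving the results of two compilation runs as first.pyc and second.pyc (hypothetical names), first_difference("first.pyc", "second.pyc") returns None when the runs match and a byte offset when they do not.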

Because of this, I have several questions:

  1. Why is there a difference in the .pyc files?
  2. How can I make sure that the compiled .pyc file is always the same?
  3. I checked some other python libraries and all of them produced reproducible .pyc files, so what is different for this pandas.py file?

Solution

  • After splitting the pandas.py file into smaller parts and compiling these, I was able to determine the location of the problem on line 135:

    name_bundle = {k: v for k, v in meta.items() if k in {'name', 'names'}}

    which answers the questions:

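    1. Line 135 contains a set literal ({'name', 'names'}). Sets are unordered: dictionaries preserve insertion order as of Python 3.7, but I could not find any such guarantee for sets in Python 3.7. The iteration order of a set of strings depends on the string hashes, which are randomized per interpreter process by default, so the two elements can be written to the .pyc file in a different order from one run to the next. This is consistent with the diff above, where the entry for 'names' and the entry next to it merely swap places.
    2. Set the environment variable PYTHONHASHSEED to a fixed value (for example PYTHONHASHSEED=0) before compiling.
    3. It is possible that these libraries do not contain any set literal of this kind.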
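The effect of point 2 can be checked with a small experiment. The sketch below is an assumption-laden illustration, not the method compileall uses internally: it compiles the line from above in a fresh interpreter with a chosen PYTHONHASHSEED and returns the marshalled code object as hex. The names SOURCE, CHILD and compiled_hex are made up for this example.

```python
import os
import subprocess
import sys

# The line reported above; it is only compiled, never executed,
# so the undefined name `meta` does not matter.
SOURCE = "name_bundle = {k: v for k, v in meta.items() if k in {'name', 'names'}}"

# Program run in the child interpreter: compile the snippet passed as
# argv[1] and print the marshalled code object as a hex string.
CHILD = (
    "import marshal, sys; "
    "code = compile(sys.argv[1], 'pandas.py', 'exec'); "
    "print(marshal.dumps(code).hex())"
)


def compiled_hex(seed):
    """Compile SOURCE in a new interpreter process with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD, SOURCE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Same seed twice: the output is expected to be identical.
    print(compiled_hex(0) == compiled_hex(0))
```

With the same seed, repeated calls should return the same bytes. With different seeds the output may or may not differ, depending on the Python version and on how the two strings happen to hash.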