python git pyspark ruff

How to get git commit hash in a secure way in Python


I'm building a PySpark DataFrame in my project. For traceability, this DataFrame needs a column that contains the latest commit hash, and the hash has to be obtained from within Python code. I found a post that solved the problem. One of the approaches there was to use the GitPython library, which works fine, but I'd like to avoid adding a new dependency to the project for a single use case. So I tried the other suggested approach:

import subprocess

def get_git_revision_hash() -> str:
    return subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()

def get_git_revision_short_hash() -> str:
    return subprocess.check_output(['git', 'rev-parse', '--short', 'HEAD']).decode('ascii').strip()

This approach works as well. However, my project uses Ruff as its linter, and it flags this code as a potential security risk (see Ruff rules S603 and S607).

Now I'm wondering: is there a secure way to obtain the commit hash with Python without relying on external dependencies?


Solution

  • You could check the content of .git/HEAD. If you are in a detached HEAD state, the file contains the commit ID of whatever you have checked out. Otherwise, it contains a reference telling you where to look. So, if you have branch X checked out, you would get something like:

    ref: refs/heads/X

    Then you can read the file .git/refs/heads/X, which should contain the commit ID.
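The lookup described above can be sketched in Python using only the standard library, so no subprocess call is involved. This is a minimal sketch, not a full reimplementation of Git's ref resolution: the function name read_head_commit and its repo_root parameter are made up for illustration, and it assumes .git is an ordinary directory (in linked worktrees and submodules .git is a file, which this sketch does not follow). It also falls back to .git/packed-refs, because Git can store a branch there instead of as a loose file under .git/refs/heads.

```python
from pathlib import Path
from typing import Optional


def read_head_commit(repo_root: str = ".") -> Optional[str]:
    """Return the commit hash HEAD points to, or None if it cannot be resolved.

    Hypothetical helper: assumes <repo_root>/.git is a regular directory.
    """
    git_dir = Path(repo_root) / ".git"
    head_file = git_dir / "HEAD"
    if not head_file.is_file():
        return None

    head = head_file.read_text(encoding="utf-8").strip()

    # Detached HEAD: the file holds the commit hash itself.
    if not head.startswith("ref:"):
        return head

    # Otherwise HEAD holds a reference such as "ref: refs/heads/X".
    ref = head[len("ref:"):].strip()
    ref_file = git_dir / ref
    if ref_file.is_file():
        return ref_file.read_text(encoding="utf-8").strip()

    # The ref may have been packed into .git/packed-refs instead.
    packed = git_dir / "packed-refs"
    if packed.is_file():
        for line in packed.read_text(encoding="utf-8").splitlines():
            # Skip the header comment and "^" lines (peeled tag targets).
            if line.startswith(("#", "^")):
                continue
            sha, _, name = line.partition(" ")
            if name.strip() == ref:
                return sha
    return None
```

Calling read_head_commit(".") from the repository root should then give the same value as git rev-parse HEAD for a normal checkout. A short hash can be taken by slicing the result, though unlike git rev-parse --short a fixed-length slice is not guaranteed to be unambiguous.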