I am using git rev-list --all --format="%H%n%B" to retrieve all (reachable) commits of a git repository.
I need to parse the resulting output into separate fields for the commit hash and the raw body.
-> Is there a robust way to format the output so that it can be parsed?
While the commit hash is of fixed length, the raw body has an unknown number of lines, which introduces the need for some kind of delimiter. I thought about wrapping the output in XML-like tags, e.g. --format="<record>%H%n%B</record>", but this has the obvious disadvantage that the string </record>, if it appears in the raw body, will break the parser. Of course I could make the delimiters more complex to reduce the risk of someone inserting them into a commit message, but what I really need is a character that technically cannot be part of the raw body. I tried to use an ASCII separator control character, "\x1F" (strictly the unit separator; the record separator is "\x1E"). However, it is not inserted into the output as a control character, but printed literally.
Based on the reply from torek (thank you!) I was able to create a small Python function:
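from subprocess import Popen, PIPE
from codecs import decode

def get_commits(directory='/path/to/git/repo'):
    # rev-list emits one hash per line; cat-file --batch answers each with
    # "<hash> <type> <size-in-bytes>" LF <bytes> LF
    git_rev_list = Popen(['git', '-C', directory, 'rev-list', '--all'], stdout=PIPE)
    git_cat_file = Popen(['git', '-C', directory, 'cat-file', '--batch'],
                         stdin=git_rev_list.stdout, stdout=PIPE)
    while True:
        line = git_cat_file.stdout.readline()
        try:
            hash_, type_, bytes_ = map(decode, line.split())
        except ValueError:
            break  # empty line at end of output (or an unexpected header)
        # Read exactly the announced number of bytes, so the content
        # itself never has to be scanned for a delimiter.
        content = decode(git_cat_file.stdout.read(int(bytes_)))
        if type_ == 'commit':
            # _get_commit() is a helper that is not shown here
            yield _get_commit(hash_, content)
        git_cat_file.stdout.readline()  # consume the LF that follows each object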
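To insert an ASCII control character via the format, use %x followed by two hex digits: %x1F, not \x1F. (Strictly speaking, 0x1F is the unit separator, US; the record separator, RS, is 0x1E, i.e. %x1E.)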
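As a rough sketch of what parsing such delimiter-framed output could look like: the format string below ends each field with %x1F and each record with %x1E, and the names parse_records and read_commits are made up for illustration. It uses git log rather than git rev-list on the assumption that rev-list prints an extra "commit <hash>" line before each formatted entry, which would only complicate the parsing. Note that this still breaks if a commit message happens to contain either byte.

```python
import subprocess

FIELD_SEP = b"\x1f"   # emitted by %x1F in the format string
RECORD_SEP = b"\x1e"  # emitted by %x1E in the format string


def parse_records(raw):
    """Split delimiter-framed output into (hash, body) pairs."""
    commits = []
    for record in raw.split(RECORD_SEP):
        record = record.lstrip(b"\n")  # newline git prints after each entry
        if not record:
            continue  # trailing remainder after the last record
        commit_hash, _, body = record.partition(FIELD_SEP)
        commits.append((commit_hash.decode("ascii"),
                        body.decode("utf-8", "replace")))
    return commits


def read_commits(directory):
    """Hypothetical helper: run git log in `directory` and parse its output."""
    raw = subprocess.run(
        ["git", "-C", directory, "log", "--all", "--format=%H%x1F%B%x1E"],
        stdout=subprocess.PIPE, check=True,
    ).stdout
    return parse_records(raw)
```

Keeping the parsing separate from the subprocess call means parse_records can be exercised on canned bytes without a repository.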
In general, your best bet is to do the body retrieval separately, since %B can literally expand to anything and there's no protection available. It's usually easy enough to run git log --no-walk --pretty=format:%B on each commit, one at a time; it's just slow.
To speed it up you can use git cat-file --batch or similar, which does provide a simple way to parse the data in a program: each object is preceded by its size. Commit objects are pretty easy to parse as well, since the %B equivalent is just "everything after the first two adjacent newlines". Thus, instead of:
git rev-list --all --format=something-tricky | ...
you can use:
git rev-list --all | git cat-file --batch | ...
and modify the consumer to expect a sequence of <hash> <type> <size-in-bytes> LF <bytes> LF. Or, add format directives to the git cat-file command to ditch the object type (but I'd keep it, since it lets you tell commits apart from annotated tags).