git awk diff unified-diff

How do I identify & list unique hunks in a git commit?


I have a commit with a large number (hundreds) of similar hunks, and I'd like to list each unique hunk in the commit in order to compare them.

I wrote the following GNU awk script, which writes each hunk to a unique file (hunk-[md5-of-hunk].txt):

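BEGIN {
  hunk = ""       # text of the hunk currently being collected
  buildhunk = 0   # 1 while we are inside a hunk body
}

# Write the collected hunk to hunk-<md5>.txt unless an identical hunk was already written.
function writeHunk() {
  if (length(hunk) > 0) {
    # Hash the hunk by round-tripping it through a temporary file.
    print hunk > "hunk.tmp"
    close("hunk.tmp")
    cmd = "cat hunk.tmp | md5"
    cmd | getline md5
    close(cmd)
    if (!(md5 in hunkfiles)) {
      hunkfilename = "hunk-" md5 ".txt"
      print hunk > hunkfilename
      close(hunkfilename)   # avoid running out of open files on large commits
      hunkfiles[md5] = hunkfilename
    }
  }
}

# A hunk header or a new file header ends the previous hunk.
/^@@|^diff/ {
  writeHunk()
  hunk = ""
  buildhunk = ($1 == "@@") ? 1 : 0
}

# Context, added, and removed lines make up the hunk body.
/^[ +-]/ {
  if (buildhunk) {
    hunk = hunk $0 "\n"
  }
}

END {
  writeHunk()
  system("rm -f hunk.tmp")   # -f: no error if the diff contained no hunks
  for (md5 in hunkfiles) {
    print hunkfiles[md5]
  }
}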

I then run this with git show [commit-SHA] | awk -f my_script.awk, which creates and lists the resulting files. It works for my purposes, but is there a way to do this more efficiently using git's plumbing commands?

Example

Suppose the commit's patch looks like this (reduced to 1 line of context below for clarity's sake):

diff --git a/file1.txt b/file1.txt
index a3fb2ed..4d6f587 100644
--- a/file1.txt
+++ b/file1.txt
@@ -3,2 +3,3 @@ context
 context
+added line
 context
@@ -7,2 +8,3 @@ context
 context
+added line
 context
@@ -11,2 +13,3 @@ context
 context
+added line
 context
@@ -15,2 +18,3 @@ context
 context
+different added line
 context
@@ -19,2 +23,3 @@ context
 context
+different added line
 context
@@ -23,2 +28,3 @@ context
 context
+different added line
 context
@@ -27,2 +33,3 @@ context
 context
+even more different added line
 context
@@ -31,2 +38,3 @@ context
 context
+even more different added line
 context

I want to be able to identify that there are only 3 unique hunks, and see what they are. Namely:

Unique hunk 1:

 context
+added line
 context

Unique hunk 2:

 context
+different added line
 context

Unique hunk 3:

 context
+even more different added line
 context

Solution

  • Commits are snapshots, and as such, they don't have diff hunks.

    Diffs, of course, do have diff hunks. So if you have just one commit, you cannot do this at all. You need two commits. You then simply diff them and do what you are doing.

    Note that the patch shown by git show <commit-hash> is effectively git diff <parent or parents of commit> <commit-hash>. If the specified commit is a merge commit, this produces a combined diff, which is probably not useful for your purposes since combined diffs intentionally omit many changes entirely. You might want to run an explicit diff against the commit's first parent only, for example git diff <commit-hash>^1 <commit-hash>, to view only the changes brought in as part of the merge.

    There are some parts of Git that internally do something like what you're doing, for git rerere and git patch-id. However, they don't do exactly what you're doing: for rerere they record only diff hunks where there was a merge conflict, and match up those diff hunks (saved by hash ID and file name) with resolutions recorded later. For patch-id they strip off line numbers and white-space but accumulate the entire set of changes from a commit into one big piece. It might be nice if Git had a bit of plumbing that did the git patch-id part hunk by hunk, independent of computing the overall patch ID for the commit, but it doesn't.
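Since Git has no per-hunk plumbing, the de-duplication has to stay outside Git, but it can be done without the temporary file and the md5 call. The following is a minimal sketch, not a definitive implementation: it uses the hunk body itself as the awk array key. The function name unique_hunks and the "Unique hunk N:" output format are assumptions made for illustration. As in the question's script, hunks are compared by body only (the @@ header is ignored) and "\ No newline at end of file" markers are skipped.

```shell
# unique_hunks: read a unified diff on stdin and print each distinct hunk body once.
# (Hypothetical helper name; the output format is chosen for illustration only.)
unique_hunks() {
  awk '
    # Print the hunk collected so far if its body has not been seen before.
    function emit() {
      if (hunk != "" && !(hunk in seen)) {
        seen[hunk] = 1
        n++
        printf "Unique hunk %d:\n%s\n", n, hunk
      }
      hunk = ""
    }
    # A new file header ends the current hunk; skip the file header lines that follow.
    /^diff / { emit(); inhunk = 0; next }
    # A hunk header ends the previous hunk and starts a new one.
    /^@@/    { emit(); inhunk = 1; next }
    # Context, added, and removed lines form the hunk body.
    inhunk && /^[ +-]/ { hunk = hunk $0 "\n" }
    END { emit() }
  '
}
```

For a non-merge commit this could be run as git diff <commit-hash>^1 <commit-hash> | unique_hunks. Because context lines are part of the key, two hunks with the same change but different surrounding lines count as distinct; passing -U1 or -U0 to git diff narrows the comparison toward the changed lines themselves.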