
Git word diff on non-English text


How can I get a word-by-word diff of human-language text (in Chinese)?

I have some plain text in Chinese in a git repository. The text has been edited, and I'd like to see which words have been added or removed. Each line in the file holds a whole paragraph of text, so a plain git diff is not enough: it shows that certain paragraphs have changed, but not which sentences or words changed within them.

To make matters worse, as I said, the text is in Chinese. Unlike English and other Indo-European languages, Chinese does not use spaces as a word delimiter. The whole paragraph, together with its Chinese punctuation marks, forms a single block without any spaces. Since git diff --word-diff splits words on whitespace by default, it treats the whole paragraph as one word and does not help at all.

Is there a way to have a human-readable diff between two versions of such a text in Chinese? Is there an equivalent of --word-diff for each character?


Solution

  • I am posting this as an answer to my own question, but it contains only part of the solution: a pointer in the right direction. Something is still missing.

    From "How can I visualize per-character differences in a unified diff file?": try either command:

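    # Treat every character as its own "word" (the regex . matches any single character)
    git diff --word-diff-regex=.
    # Same word regex, with changes shown by color only
    git diff --color-words=.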
    

    Either of the two commands above gets me very close. However, I have two problems. If I simply run the command and look at the output in the console, I am only shown the beginning of each paragraph. The whole line does not fit in the console, and the end (i.e. most of it!) is truncated.

    Alternatively, if I redirect the output to a file:

    git diff --color-words=. > diff.patch
    

    and then open the file in vim, I get a scrambled mess that looks more like binary data than anything human-readable.

    Update:
    I finally used this solution:

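    # Download the diff-highlight script from git's contrib directory (--no-check-certificate skips TLS verification)
    wget https://git.kernel.org/cgit/git/git.git/plain/contrib/diff-highlight/diff-highlight --no-check-certificate
    # Make the script executable
    chmod u+x diff-highlight
    # Highlight the changed part within each changed line, then page the result with colors rendered
    git diff --color=always | ./diff-highlight | less -R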
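
The scrambled file described above is probably the result of color escape codes being written into diff.patch, although that is an assumption and not something confirmed in the post. A minimal sketch of a variant that avoids color altogether is below: it uses --word-diff=plain, which marks changes with [-removed-] and {+added+} in plain text. The temporary repository, the file name text.txt, the sample sentences, and the demo identity are all hypothetical. It also assumes a UTF-8 locale, so that the regex . matches whole characters and not single bytes.

```shell
#!/bin/sh
# Hypothetical throwaway repository; the path, file name, and sample text are illustrative only.
repo=$(mktemp -d)
git -C "$repo" init -q

# First version: a single line stands in for a whole paragraph.
printf '%s\n' '我喜欢猫。' > "$repo/text.txt"
git -C "$repo" add text.txt
git -C "$repo" -c user.name=demo -c user.email=demo@example.com commit -q -m 'first version'

# Second version: one character is replaced.
printf '%s\n' '我喜欢狗。' > "$repo/text.txt"

# --word-diff=plain writes [-removed-] and {+added+} markers and no color codes,
# so the output stays readable when redirected to a file or opened in an editor.
out=$(git -C "$repo" --no-pager diff --word-diff=plain --word-diff-regex=.)
printf '%s\n' "$out"
```

The same output can be redirected to a file (for example with > diff.txt) and opened in vim without escape sequences getting in the way. The trade-off is that the markers are ordinary text, so they can be confused with brackets or braces that already occur in the document.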