How can I get a word-by-word diff of natural-language text (in Chinese)?
I have some plain text in Chinese in a git repository. The text has been edited and I'd like to see which words have been added or removed. Each line in the file holds a whole paragraph of text, so a plain git diff is not enough: it tells us that a certain number of paragraphs have changed, but not which sentences or words changed within them.
To make matters worse, as I said, the text is in Chinese. Unlike English and other Indo-European languages, Chinese does not use spaces as word delimiters. The whole paragraph, Chinese punctuation marks included, forms a single block without any spaces. As a result, git diff --word-diff treats the entire paragraph as one word and does not help at all.
Is there a way to have a human-readable diff between two versions of such a text in Chinese? Is there an equivalent of --word-diff for each character?
I am posting this as an answer to my own question, but it contains only part of the solution: a pointer in the right direction. Something is still missing.
From "How can I visualize per-character differences in a unified diff file?", try either command:
git diff --word-diff-regex=.
git diff --color-words=.
Either of the two commands above gets me very close. However, I have two problems. If I simply type the command and look at the output in the console, I am only shown the beginning of each paragraph. The whole line does not fit in the console and git truncates the end (i.e. most of it!).
Or if I try to redirect to a file:
git diff --color-words=. > diff.patch
and then use vim to view the file, I get a scrambled mess that looks more like binary data than anything human-readable.
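A possible explanation for the scrambled file: --color-words is shorthand for --word-diff=color, which implies colored output, so the redirected file probably contains ANSI escape sequences that vim displays literally. The sketch below tries the plain word-diff mode instead, which marks changes with [-removed-] and {+added+} and no colors. The throwaway repository, the file name text.txt and the sample sentences are made up for illustration, and per-character splitting assumes a UTF-8 locale.

```shell
#!/bin/sh
# Sketch only: throwaway repository with a made-up file and sample text.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example"

# First version: one paragraph on a single line.
printf '%s\n' '我喜欢吃苹果。' > text.txt
git add text.txt
git commit -q -m "first version"

# Edited version of the same paragraph.
printf '%s\n' '我喜欢吃香蕉。' > text.txt

# --word-diff-regex=. treats each character as a "word"
# (assumption: a UTF-8 locale, so that '.' matches a whole Chinese character).
# --word-diff=plain uses [-...-] and {+...+} markers instead of colors,
# and --no-color keeps escape sequences out of the redirected file.
git diff --no-color --word-diff=plain --word-diff-regex=. > diff.patch
cat diff.patch
```

The resulting diff.patch should open cleanly in an editor. For on-screen viewing, piping the output to less -R without the -S option wraps long lines instead of chopping them.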
Update:
I finally used this solution:
wget https://git.kernel.org/cgit/git/git.git/plain/contrib/diff-highlight/diff-highlight --no-check-certificate
chmod u+x diff-highlight
git diff --color=always | ./diff-highlight | less -R