
What are the default regexps for `git diff --word-diff`?


git diff has an option, --word-diff-regex=<...>, that defines what counts as a word. There are special default values for some languages (as mentioned in man 5 gitattributes). But what are they? The docs don't describe them, and I looked through the git sources but couldn't find them there either.

Any ideas?

EDIT: I'm on git 1.9.1, but I'll accept answers for any version.


Solution

  • The default word regexes are in the userdiff.c file of the git sources. The PATTERNS and IPATTERN macros take the base word regex as their third parameter and append "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" to it. This suffix makes sure that every non-whitespace character that isn't part of a larger word is treated as a word by itself and, assuming UTF-8, that multi-byte characters are not split up. For example, in:

    PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
             "\\\\[a-zA-Z@]+|\\\\.|[a-zA-Z0-9\x80-\xff]+"),
    

    the full word regex, written as a C string literal, is "\\\\[a-zA-Z@]+|\\\\.|[a-zA-Z0-9\x80-\xff]+|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+".

    In this case, the |[\xc0-\xff][\x80-\xbf]+ happens not to have any benefit, as everything covered by [\xc0-\xff][\x80-\xbf]+ is already covered by [a-zA-Z0-9\x80-\xff]+, but it doesn't cause any harm either.
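
To make the suffix handling concrete, here is a minimal sketch of how such a macro can append the suffix through C string-literal concatenation, and of how the resulting regex splits a line into words. This is not git's code: WORD_REGEX, tex_word_regex and split_words are hypothetical names made up for illustration, and compiling with REG_EXTENDED | REG_NEWLINE is an assumption about how the pattern would be used.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical stand-in for the suffix handling described above:
 * adjacent C string literals are concatenated at compile time. */
#define WORD_REGEX(base) base "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+"

/* The "tex" base word regex quoted above, with the suffix appended. */
static const char tex_word_regex[] =
	WORD_REGEX("\\\\[a-zA-Z@]+|\\\\.|[a-zA-Z0-9\x80-\xff]+");

/*
 * Split `text` into words using `pattern` and store the byte length of
 * up to `max` words in `lens`. Returns the number of words stored, or
 * -1 if the pattern does not compile.
 */
static int split_words(const char *pattern, const char *text,
		       size_t *lens, int max)
{
	regex_t re;
	regmatch_t m;
	int n = 0;

	/* Flags are an assumption made for this sketch. */
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NEWLINE) != 0)
		return -1;

	/* Repeatedly take the leftmost match; text between matches
	 * (whitespace, for this regex) is skipped. */
	while (n < max && *text && regexec(&re, text, 1, &m, 0) == 0) {
		if (m.rm_eo == m.rm_so)
			break; /* guard against empty matches */
		lens[n++] = (size_t)(m.rm_eo - m.rm_so);
		text += m.rm_eo;
	}
	regfree(&re);
	return n;
}
```

Called from a main function as split_words(tex_word_regex, "\\section{Intro} x1", lens, 8), this should find five words: \section, {, Intro, }, and x1. The braces are picked up only by the appended [^[:space:]] alternative, which is the role of the suffix.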