Tags: bash, ubuntu, encoding, locale, diff

Make diff ignore the case of umlauts


I need to make diff ignore the case of my inputs. Both inputs contain German umlauts like ä and Ä. Option -i successfully makes diff ignore the case of my input for other characters like a and A, but not for umlauts:

$ diff -i <(echo ä) <(echo Ä)
1c1
< ä
---
> Ä

The output should be empty, as ä and Ä should be seen as the same letter if case is ignored. If I try this instead:

$ diff -i <(echo a) <(echo A)

Then it works as expected (no output).

I also tried to set the environment variable LANG to make diff use the correct locale, but this didn’t seem to have any influence:

$ LANG=de_DE.UTF-8 diff -i <(echo ä) <(echo Ä)

I tried various values for LANG.

Is there a way to make diff ignore the case of German umlauts?

(I’m on Ubuntu 22.04 FWIW.)


Solution

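  • Compare normalized strings (see Unicode normalization forms). In the decomposed form NFD, ä becomes a followed by the combining diaeresis U+0308, and Ä becomes A followed by the same mark, so -i only has to fold the ASCII base letter: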

     diff -i <(echo ä | uconv -x Any-NFD) <(echo Ä | uconv -x Any-NFD)
    

    Note: uconv comes from the icu-devtools package (sudo apt install icu-devtools).
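
For whole files rather than one-off strings, the same normalize-then-compare idea can be wrapped in a small function. This is only a sketch: the name diff_nfd_i is made up for illustration, the inputs are assumed to be UTF-8, and it uses a temporary file instead of process substitution so that it also runs under plain sh.

```shell
# Hypothetical helper (not part of diffutils): run diff -i on NFD-normalized
# copies of two files. Assumes both files are UTF-8 encoded.
diff_nfd_i() {
    if ! command -v uconv >/dev/null 2>&1; then
        echo "diff_nfd_i: uconv not found (sudo apt install icu-devtools)" >&2
        return 2
    fi
    nfd_tmp=$(mktemp) || return 2
    # Decompose the first file into the temporary file ...
    uconv -f utf-8 -t utf-8 -x Any-NFD < "$1" > "$nfd_tmp"
    # ... and feed the decomposed second file to diff on stdin ("-").
    uconv -f utf-8 -t utf-8 -x Any-NFD < "$2" | diff -i "$nfd_tmp" -
    nfd_status=$?
    rm -f "$nfd_tmp"
    return "$nfd_status"
}
```

Called as diff_nfd_i a.txt b.txt, it should return the usual diff exit status (0 for no differences, 1 for differences). Letters that do not decompose into an ASCII base letter plus a combining mark are not helped by this approach.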

    FYI, the two-character string äÄ in each normalization form (StrLen counts code points):

    Form   String StrLen Unicode
    ----   ------ ------ -------
    NFC    äÄ          2 \u00e4\u00c4
    NFD    äÄ          4 \u0061\u0308\u0041\u0308
    NFKC   äÄ          2 \u00e4\u00c4
    NFKD   äÄ          4 \u0061\u0308\u0041\u0308
    

    Update:

    From info diff:

    18.1.1 Handling Multibyte and Varying-Width Characters

    ‘diff’, ‘diff3’ and ‘sdiff’ treat each line of input as a string of unibyte characters. This can mishandle multibyte characters in some cases. For example, when asked to ignore spaces, ‘diff’ does not properly ignore a multibyte space character.

    Also, ‘diff’ currently assumes that each byte is one column wide, and this assumption is incorrect in some locales, e.g., locales that use UTF-8 encoding. This causes problems with the ‘-y’ or ‘--side-by-side’ option of ‘diff’.

    These problems need to be fixed without unduly affecting the performance of the utilities in unibyte environments.

    The IBM GNU/Linux Technology Center Internationalization Team has proposed patches to support internationalized ‘diff’ (http://oss.software.ibm.com/developer/opensource/linux/patches/i18n/diffutils-2.7.2-i18n-0.1.patch.gz). Unfortunately, these patches are incomplete and are to an older version of ‘diff’, so more work needs to be done in this area.

    Tested on Ubuntu 24.04 LTS (GNU/Linux 5.15.153.1-microsoft-standard-WSL2 x86_64).
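
To see what the unibyte handling described in the manual excerpt means for the original example, it can help to look at the raw UTF-8 bytes. The sketch below uses a small helper, hex_bytes (a name invented here), built only on od and tr. The characters are written as octal escapes so the snippet does not depend on the encoding of the script file.

```shell
# Hypothetical helper: print stdin as a contiguous lowercase hex string.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
    echo
}

# a / A: a single byte each, the pair that diff -i folds successfully.
printf 'a' | hex_bytes          # 61
printf 'A' | hex_bytes          # 41

# ä (U+00E4) / Ä (U+00C4): two bytes each in UTF-8, differing in the
# second byte, which is not an ASCII letter.
printf '\303\244' | hex_bytes   # c3a4
printf '\303\204' | hex_bytes   # c384
```

This matches the observed behaviour: when each byte is compared on its own, a4 and 84 are not an upper/lower pair, whereas after NFD both strings end in the identical bytes cc 88 (U+0308) and differ only in a versus A.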