unicode unicode-normalization

Where can I get examples of Unicode that normalizes differently?


I'm adding yet another Unicode normalization question because I've spent quite a bit of time looking and can't find what I need. I need to normalize Unicode to check whether strings are equivalent, but I don't understand the consequences of choosing one normal form over another. I would like some valid Unicode input that normalizes differently under the different forms, so I can experiment with the options, but I don't know how to construct it or where to find it. This answer has some example data, but the examples seem to focus on malformed or invalid Unicode strings (I think? Maybe I don't know what I'm looking at). I need a set of strings that users will expect to be equivalent, that an interface will accept as valid, and that are not equal until normalized. Let's say UTF-8 to be specific, but I'd appreciate examples for multiple encodings. I'm working with Python in case answers depend on implementation, but I imagine others would appreciate answers that are not limited to Python.

Where can I get example Unicode strings that are equivalent under some normal forms and not others, preferably demonstrating how all the normalizations differ?


Solution

  • https://unicode.org/reports/tr15/#Norm_Forms (Unicode Standard Annex #15, "Unicode Normalization Forms") has many examples, with substantial explanation around them.
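
To experiment with pairs of that kind in Python, something like the sketch below may help. It uses only the standard library's `unicodedata.normalize`; the `PAIRS` list and the `equal_under` helper are names made up for this illustration, and the four pairs are a small hand-picked sample, not a complete test set. The first two pairs are canonically equivalent, so they should compare equal under all four forms; the last two are only compatibility equivalent, so they should compare equal under NFKC and NFKD but not under NFC or NFD.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Hand-picked sample pairs (illustrative, not exhaustive).
PAIRS = [
    ("\u00e9", "e\u0301"),  # precomposed e-acute vs. "e" + COMBINING ACUTE ACCENT
    ("\u212b", "\u00c5"),   # ANGSTROM SIGN vs. LATIN CAPITAL LETTER A WITH RING ABOVE
    ("\ufb01", "fi"),       # LATIN SMALL LIGATURE FI vs. the two letters "f", "i"
    ("\u00b2", "2"),        # SUPERSCRIPT TWO vs. DIGIT TWO
]


def equal_under(a, b):
    """Map each normal form to whether a and b compare equal after normalizing."""
    return {
        form: unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in FORMS
    }


for a, b in PAIRS:
    # Show the raw UTF-8 bytes too, since the unnormalized strings differ there.
    print(a.encode("utf-8"), b.encode("utf-8"), a == b, equal_under(a, b))
```

Since normalization operates on code points rather than bytes, the same pairs should work for other encodings as well: encode the strings with `.encode("utf-16")` or similar after building them.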