
What are the differences between the modules unicode and unicodedata?


I have a large dataset with over 2 million rows of textual data. Now I want to remove the accents from the strings.

In the link below, two different modules are described to remove the accents:

What is the best way to remove accents in a Python unicode string?

The modules described are unicode and unicodedata. It is not clear to me what the differences between the two are. Comparing them myself is hard, because few of my rows contain accents and I don't know which accents would be replaced and which would not.

Therefore, I would like to know what the differences are between the two and which one is recommended to use.


Solution

  • There is only one module here: unicodedata, which provides access to the Unicode Character Database, that is, the names and properties of Unicode code points.

    unicode was a built-in in Python 2 (a type, called like a function). It just converts byte strings to unicode strings, so it only deals with the encoding and has no need to store the character data. In Python 3 all strings are Unicode (with some particularities); the encoding now has to be specified explicitly when converting to or from bytes.

    In that answer you see only import unicodedata, so only one module is involved. To remove accents, the code points alone are not enough: you also need to know the type of each code point (whether it is a combining character), and that is what unicodedata provides.

    Maybe you mean unidecode. This is a separate module, outside the standard library. It can be useful for some purposes. The module is simple and produces only ASCII output. That can be fine in some cases, but it can cause problems outside the Latin writing system.

    On the other hand, unicodedata does nothing for you by itself. You need to understand Unicode and apply the right filter function yourself (and perhaps know how other languages work).

    So it depends on the case; maybe you just need a slug function (to create strings that need no escaping). When working with natural languages, take care not to overdo it (you may accidentally build an offensive word).
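As a rough illustration of the unicodedata approach described above, here is a minimal sketch that decomposes each character and then drops the combining marks. The function name strip_accents and the sample strings are my own, and the choice of NFD (rather than NFKD) is an assumption; adapt it to your data.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for non-combining characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Crème brûlée, señor, Zürich"))  # Creme brulee, senor, Zurich

# Letters with no Unicode decomposition are left untouched:
print(strip_accents("Søren, Łódź"))  # Søren, Łodz
```

Note that "ø" and "Ł" survive, because Unicode does not treat them as a base letter plus a combining mark. Whether that is acceptable, or whether you want an ASCII-only result from something like unidecode, depends on your data.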