I am trying to learn Python, so I thought I would start by querying IMDB to check my movie collection, which was going well 😊
What I am stuck on is how to handle special characters in names, and encode the name to something a URL will respect.
For example I have the movie Brüno
If I encode the string using urllib.parse.quote
I get - Bru%CC%88no
which means when I query IMDB using OMDBAPI it fails to find the movie. If I do the search via the OMDBAPI site, they encode the name as Br%C3%BCno
and this search works.
I am assuming that the encoding is using a different standard, but I can’t work out what I need to do.
Both results use the same encoding (UTF-8); what differs is the Unicode normalization form of the string being encoded.
>>> import unicodedata
>>> "Brüno".encode("utf-8")
b'Bru\xcc\x88no'
>>> unicodedata.normalize("NFC", "Brüno").encode("utf-8")
b'Br\xc3\xbcno'
Some graphemes (things you see as one "character"), especially those with diacritics, can be built from different sequences of code points. An "ü" can be either a "u" followed by a combining diaeresis, or the single precomposed character "ü". Precomposed forms don't exist for every combination of letter and diacritic, but they do for the commonly used ones (those that occur in widely used languages).
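To make the two spellings of "ü" visible, here is a small sketch that lists the code points in each one. The variable names are just for illustration, and the strings are written with escapes so the result doesn't depend on how this page happens to store the character.

```python
import unicodedata

precomposed = "\u00fc"   # single code point
decomposed = "u\u0308"   # "u" followed by a combining mark

for label, s in (("precomposed", precomposed), ("decomposed", decomposed)):
    # unicodedata.name() returns the official Unicode name of one code point
    names = [unicodedata.name(ch) for ch in s]
    print(label, len(s), names)

# The two strings render the same but do not compare equal...
print(precomposed == decomposed)
# ...until both are brought to the same normalization form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

The first form is one code point (LATIN SMALL LETTER U WITH DIAERESIS), the second is two (LATIN SMALL LETTER U plus COMBINING DIAERESIS), which is why they encode to different bytes.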
Unicode normalization converts every such sequence consistently into either its composed or its separate (decomposed) form. The normalization form "NFC", or Normalization Form Canonical Composition, combines characters wherever a precomposed form exists.
In comparison, the other main form, "NFD" (Normalization Form Canonical Decomposition), will produce your version:
>>> unicodedata.normalize("NFD", "Brüno").encode("utf-8")
b'Bru\xcc\x88no'
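So for the URL in the question, one option is to normalize the title to NFC before passing it to urllib.parse.quote. This is only a sketch: quote_title is a made-up helper name, and I'm assuming OMDb wants the composed form because that is what the site's own search produced for you.

```python
import unicodedata
from urllib.parse import quote

def quote_title(title: str) -> str:
    """Normalize to NFC, then percent-encode (hypothetical helper)."""
    # quote() encodes as UTF-8 by default
    return quote(unicodedata.normalize("NFC", title))

# The decomposed spelling, written with an escape so it is unambiguous
decomposed = "Bru\u0308no"

print(quote(decomposed))        # Bru%CC%88no
print(quote_title(decomposed))  # Br%C3%BCno
```

Normalizing is harmless for titles that are already in NFC, so it should be safe to apply to every name in the collection before building the query.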