python encoding webvtt

Weird encoding in a VTT file (Python)


I am trying to extract the text from a subtitle file (WebVTT format) as follows:

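import requests

r = requests.get('https://nogeovod-fy.atresmedia.com/vsg/sitemap/assets4/2022/09/26/C302281D-5C76-4710-A4FB-9AD7252B7F47/es.vtt')
print(r.encoding)  # encoding requests inferred from the response headers

# Switch to the encoding guessed from the response body
r.encoding = r.apparent_encoding

print(r.text)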

Some characters come out wrong, because the encoding that requests reports initially, ISO-8859-1, is not the right one. However, when I change it to utf-8, the accented characters are still garbled.


Solution

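  • The file appears to contain the following replaced characters (garbled character on the left, intended character on the right):

    Ć → á, Ž → é, Ð → í, Š → ó, ž → ñ, ë → ú, Č → ¡, č → ¿

    Given that, replacing these one-to-one should fix your problem. We still don't know which encoding this is, but the damage is quite limited.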

    fixed = r.text.replace("Ć", "á").replace("Ž", "é").replace(
      "Ð", "í").replace("Š", "ó").replace("ž", "ñ").replace(
      "ë", "ú").replace("Č", "¡").replace("č", "¿")