lua, tabletop-simulator

Replace accented characters in a string with standard characters in Lua


**This error looks like a bug in Unity. The code seems to work fine outside of Tabletop Simulator (the game I am modding).

I'm marking this as solved but leaving it for the mods to remove if needed, as the code might still be useful to other people googling.**

I'm trying to process a large string of a few lines, and I would like all the accented characters in it converted to standard characters. I have some code I got from the net for this, but there is a small bug in it and I do not understand how it works, so I need some help with this issue if you are able.

function stripChars(str)
    local tableAccents = {}
        tableAccents["à"] = "a"
        tableAccents["á"] = "a"
        tableAccents["â"] = "a"
        tableAccents["ã"] = "a"
        tableAccents["ä"] = "a"
        tableAccents["ç"] = "c"
        tableAccents["è"] = "e"
        tableAccents["é"] = "e"
        tableAccents["ê"] = "e"
        tableAccents["ë"] = "e"
        tableAccents["ì"] = "i"
        tableAccents["í"] = "i"
        tableAccents["î"] = "i"
        tableAccents["ï"] = "i"
        tableAccents["ñ"] = "n"
        tableAccents["ò"] = "o"
        tableAccents["ó"] = "o"
        tableAccents["ô"] = "o"
        tableAccents["õ"] = "o"
        tableAccents["ö"] = "o"
        tableAccents["ù"] = "u"
        tableAccents["ú"] = "u"
        tableAccents["û"] = "u"
        tableAccents["ü"] = "u"
        tableAccents["ý"] = "y"
        tableAccents["ÿ"] = "y"
        tableAccents["À"] = "A"
        tableAccents["Á"] = "A"
        tableAccents["Â"] = "A"
        tableAccents["Ã"] = "A"
        tableAccents["Ä"] = "A"
        tableAccents["Ç"] = "C"
        tableAccents["È"] = "E"
        tableAccents["É"] = "E"
        tableAccents["Ê"] = "E"
        tableAccents["Ë"] = "E"
        tableAccents["Ì"] = "I"
        tableAccents["Í"] = "I"
        tableAccents["Î"] = "I"
        tableAccents["Ï"] = "I"
        tableAccents["Ñ"] = "N"
        tableAccents["Ò"] = "O"
        tableAccents["Ó"] = "O"
        tableAccents["Ô"] = "O"
        tableAccents["Õ"] = "O"
        tableAccents["Ö"] = "O"
        tableAccents["Ù"] = "U"
        tableAccents["Ú"] = "U"
        tableAccents["Û"] = "U"
        tableAccents["Ü"] = "U"
        tableAccents["Ý"] = "Y"
    local normalizedString = ''

    for strChar in string.gmatch(str, "([%z\1-\127\194-\244][\128-\191]*)") do
        if tableAccents[strChar] ~= nil then
            normalizedString = normalizedString..tableAccents[strChar]
        else
            normalizedString = normalizedString..strChar
        end
    end
    return normalizedString
end

This code seems to work really well, but it doesn't work for the accented u characters:

local test = "ù, ú, û, ü"
print(stripChars(test)) -- Prints (,,,)
test = "à, á, â, ã, ä"
print(stripChars(test)) -- Prints (a, a, a, a, a)

Any ideas? I assume it has something to do with the pattern, but I do not see exactly how it works in the first place (see the bottom of the code block, under the large table of characters).


Solution

  • I don't know why the function would work on "à, á, â, ã, ä" but would delete characters when used on "ù, ú, û, ü". The function assumes that both strings are encoded in UTF-8. Perhaps it is an encoding issue, but then I would expect it to fail in both cases. For me, calling the function on "ù, ú, û, ü" gives "u, u, u, u", as expected.

    As Curtis F says, it might help to call print(string.byte(test, 1, -1)) on the string that is failing to find out how it is being encoded. I have the file encoded in UTF-8, so the values printed are 195 185 44 32 195 186 44 32 195 187 44 32 195 188.
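    To make that check easier to read, here is a minimal sketch that returns the bytes of a string as decimal numbers. The helper name dumpBytes is made up for this example, and the test string is written with byte escapes so that it is UTF-8 no matter how the script file itself is saved.

```lua
-- dumpBytes is a hypothetical helper name, not part of Lua or Tabletop Simulator.
-- Returns the bytes of str as space-separated decimal numbers.
local function dumpBytes(str)
    local bytes = {}
    for i = 1, #str do
        bytes[#bytes + 1] = tostring(string.byte(str, i))
    end
    return table.concat(bytes, " ")
end

-- "ù, ú" written as explicit UTF-8 bytes.
local test = "\195\185, \195\186"
print(dumpBytes(test)) -- 195 185 44 32 195 186
```

    If the failing string instead shows single bytes such as 249 to 252 (the Latin-1 values of ù, ú, û, ü), it is not UTF-8. Those values lie outside both byte ranges in the pattern, so the loop would skip them, whereas 224 to 228 (à to ä in Latin-1) fall inside \194-\244 and would still be matched. That would be consistent with the output in the question, but it is only a guess until the bytes are checked.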

    The function works because "[%z\1-\127\194-\244][\128-\191]*" is a pattern that matches a single character (codepoint) encoded in UTF-8. Each codepoint takes 1 to 4 bytes: the first character class matches a single-byte character or the lead byte of a multi-byte one, and the second class matches any continuation bytes that follow. The pattern, for instance, matches the single byte used to encode the comma character ("," is "\44") or the two bytes that are used to encode the accented letters ("ù" is "\195\185"). The for-loop looks up each character in the tableAccents table, where the keys are accented letters and the values are the corresponding unaccented ones (tableAccents["ù"] = "u"). If the character is a key in the table, the value for that key is added to the normalizedString. If the character is not a key in the table, it is added without being changed. Thus the accented letters are replaced with unaccented ones, while other characters are left alone.
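    As a small illustration of the matching described above, this sketch runs the same pattern over a short string and records how many bytes each match contains. The variable names are chosen for this example only, and the sample is written with byte escapes so the file encoding does not affect it.

```lua
-- The pattern from the question: one lead byte followed by any continuation bytes.
local utf8Char = "[%z\1-\127\194-\244][\128-\191]*"

-- "ù, a" written as explicit UTF-8 bytes.
local sample = "\195\185, a"

-- Collect the byte length of every character the pattern matches.
local lengths = {}
for ch in string.gmatch(sample, utf8Char) do
    lengths[#lengths + 1] = #ch
end
print(table.concat(lengths, " ")) -- 2 1 1 1
```

    The first match is the two-byte "ù"; the comma, the space and "a" are one byte each.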

    This is just a code cleanup suggestion: the for-loop could be simplified by using string.gsub:

    local normalizedString = str:gsub("[%z\1-\127\194-\244][\128-\191]*", tableAccents)
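    Put together, the simplified function might look like the sketch below. It uses a shortened table for brevity (the full table from the question works the same way) and writes the keys as UTF-8 byte escapes, which is an assumption made here so the example does not depend on how the file is saved.

```lua
-- Shortened accent table for illustration only.
-- Keys are UTF-8 byte sequences; the comment shows the letter each one encodes.
local tableAccents = {
    ["\195\160"] = "a", -- à
    ["\195\169"] = "e", -- é
    ["\195\185"] = "u", -- ù
    ["\195\188"] = "u", -- ü
}

local function stripChars(str)
    -- gsub looks up each matched character in the table and keeps the match
    -- unchanged when there is no entry for it. It returns the new string and
    -- the match count; the extra parentheses keep only the string.
    return (str:gsub("[%z\1-\127\194-\244][\128-\191]*", tableAccents))
end

print(stripChars("\195\185, \195\188, \195\169")) -- u, u, e
```

    The parentheses around the gsub call matter when the result is returned directly, since gsub's second return value (the number of matches) would otherwise be passed on to the caller as well.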