Lua - string.byte for non-ASCII characters


I want to convert characters to numerical codes, so I tried string.byte("å"). However, string.byte() returns 195 for this kind of character.

Is there any way to get the numerical code of non-ASCII characters such as these?

à,á,â,ã,ä,å

I'm using pure Lua.


Solution

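  • Lua treats a string as a sequence of bytes, but a Unicode character encoded as UTF-8 may span multiple bytes. string.byte("å") returns only the first byte of the string, which is why you see 195.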

    Assuming the string is valid UTF-8, you can use the pattern "[\0-\x7F\xC2-\xF4][\x80-\xBF]*" to match a single UTF-8 byte sequence (in Lua 5.1, use "[%z\1-\127\194-\244][\128-\191]*" instead), and then get the numerical codes of its bytes:

    local str = "à,á,â,ã,ä,å"
    
    for c in str:gmatch("[\0-\x7F\xC2-\xF4][\x80-\xBF]*") do
        print(c:byte(1, -1))
    end
    

    Output:

    195 160
    44
    195 161
    44
    195 162
    44
    195 163
    44
    195 164
    44
    195 165
    

    Note that 44 is the byte value of the comma that separates the characters.
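
If you want a single number per character (the Unicode code point) rather than the raw bytes, the byte values can be combined by hand. The following is only a minimal sketch in pure Lua, assuming the input is valid UTF-8; the helper name `codepoint` is made up for illustration and is not part of Lua's standard library. It uses the Lua 5.1 form of the pattern so that it should also run on older interpreters.

```lua
-- Hypothetical helper: turn one UTF-8 byte sequence into its code point.
-- Assumes `char` holds exactly one valid UTF-8 encoded character.
local function codepoint(char)
    local first = char:byte(1)
    if first < 0x80 then
        return first                -- plain ASCII, one byte
    end
    -- Drop the marker bits of the lead byte (110xxxxx, 1110xxxx, 11110xxx).
    local cp
    if first >= 0xF0 then
        cp = first - 0xF0
    elseif first >= 0xE0 then
        cp = first - 0xE0
    else
        cp = first - 0xC0
    end
    -- Each continuation byte (10xxxxxx) contributes six more bits.
    for i = 2, #char do
        cp = cp * 64 + (char:byte(i) - 0x80)
    end
    return cp
end

local str = "à,á,â,ã,ä,å"
for c in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
    print(c, codepoint(c))
end
```

With this sketch, "å" (bytes 195 165) comes out as 3 * 64 + 37 = 229, which is U+00E5. On Lua 5.3 or later, the built-in utf8.codepoint function covers the same need without a hand-written helper.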