I am trying to see whether the \dot operator can be detected from a symbol in Julia. Here is what I have tried.
The following two blocks return different results:
julia> [codepoint(i) for i in string(:ẋ)]
1-element Vector{UInt32}:
0x00001e8b
julia> [codepoint(i) for i in "ẋ"]
2-element Vector{UInt32}:
0x00000078
0x00000307
Ideally I would start from a symbol, not a string, so I need to use the first method. But that does not return 0x307, which is the Unicode code point of \dot, so it is hard to detect \dot.
So what is the mechanism behind the difference? Thank you.
Both results are equivalent.
Humans are complex, and so are languages, so Unicode needs complex rules.
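In your case you have two representations:
- the single precomposed character U+1E8B (LATIN SMALL LETTER X WITH DOT ABOVE)
- the letter x (U+0078) followed by U+0307 (COMBINING DOT ABOVE)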
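Both are considered equivalent in Unicode. Note: when comparing strings, it is good practice to normalize them first. Unfortunately there are two main normalization forms:
- NFC (canonical composition), which prefers precomposed characters such as U+1E8B
- NFD (canonical decomposition), which splits them into a base character plus combining marks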
See https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
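As a rough illustration of that equivalence, here is a minimal sketch using `Unicode.normalize` from Julia's standard library. The variable names are only illustrative, and the strings are written with escapes so that the editor cannot silently change their form.

```julia
using Unicode

composed   = "\u1e8b"    # one code point: LATIN SMALL LETTER X WITH DOT ABOVE
decomposed = "x\u0307"   # 'x' followed by COMBINING DOT ABOVE

# A raw comparison works code point by code point, so the two forms differ.
same_raw = composed == decomposed

# After normalizing both sides to the same form, they compare equal.
same_nfd = Unicode.normalize(composed, :NFD) == decomposed
same_nfc = Unicode.normalize(decomposed, :NFC) == composed
```

Here `same_raw` is `false`, while `same_nfd` and `same_nfc` are both `true`.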
The display engines (layout engine, text shaping, glyph display, font metadata) will probably render the same glyph for both (each font has its own preference for which normalization it expects, but the engine will then try to find a combined glyph).
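I think in your case you may have two different variants in the text file: one using two characters and one using a single character. This often happens when copying characters (some editors prefer one normalization over the other). Note also that Julia's parser normalizes identifiers, and hence symbol literals such as :ẋ, to NFC, while string literals are kept as typed, so the two forms can differ even if both were typed the same way.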
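In your case, I think you should normalize the string before inspecting it, e.g. with Unicode.normalize (see https://docs.julialang.org/en/v1/stdlib/Unicode/). Normalizing to NFD splits ẋ into x followed by U+0307, which makes the dot detectable.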
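One possible way to apply this to a symbol is sketched below. `has_dot` is a hypothetical helper name, not part of Julia or its standard library; it converts the symbol to a string, decomposes it with NFD, and looks for U+0307.

```julia
using Unicode

# Hypothetical helper: true if the symbol's name contains a combining
# dot above (U+0307) once the name has been decomposed to NFD.
has_dot(s::Symbol) = '\u0307' in Unicode.normalize(string(s), :NFD)

sym = Symbol("\u1e8b")   # the precomposed form, as in string(:ẋ)

# Decompose the name and list its code points.
decomposed = Unicode.normalize(string(sym), :NFD)
codepoints = [codepoint(c) for c in decomposed]
```

With this, `codepoints` holds `0x00000078` and `0x00000307`, `has_dot(sym)` is `true`, and `has_dot(:x)` is `false`. Because the check runs on the NFD form, it should not matter which representation the symbol started in.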
And this is only Latin script, which is the easy part of Unicode (except that it is one of the few scripts with upper and lower case).