I'm parsing a file that contains both plain ASCII strings and Unicode (UTF-8) strings containing IPA pronunciations.
I want to get the last character of a string, but sometimes that character is made up of two code points (a base letter plus a diacritic), e.g.
syl = 'tyl' # plain ascii
last_char = syl[-1]
# last char is 'l'
syl = 'tl̩' # contains IPA char
last_char = syl[-1]
# last char erroneously contains: '̩' which is a diacritical mark on the l
# want the whole character 'l̩'
If I try using .decode(), it fails with:
'str' object has no attribute 'decode'
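For context, here is a minimal sketch of what seems to be going on. It assumes Python 3 and that the syllabic mark under the l is U+0329; the variable names are only for illustration. A Python 3 str is already a sequence of decoded code points, so only bytes objects have .decode(), and the string above holds three code points rather than two:

```python
import unicodedata

# 't', 'l', and the combining mark written explicitly (assumed to be U+0329)
syl = "tl\u0329"

# One Unicode name per code point in the string
code_point_names = [unicodedata.name(ch) for ch in syl]

# str has no .decode() in Python 3; only bytes does
has_decode = hasattr(syl, "decode")

# Encoding to UTF-8 bytes and decoding again gives back the same str
round_trip = syl.encode("utf-8").decode("utf-8")

print(len(syl), code_point_names, has_decode)
```

So syl[-1] returns the combining mark on its own because indexing works per code point, not per visible character.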
How can I obtain the last character of a Unicode/UTF-8 string when I don't know in advance whether it is plain ASCII or contains multi-code-point characters?
I guess I could use a lookup table of known characters and, if the lookup fails, go back and grab syl[-2:]. Is there an easier way?
In response to some comments, here is the complete list of IPA characters I've collected so far:
a, b, d, e, f, f̩, g, h, i, i̩, i̬,
j, k, l, l̩, m, n, n̩, o, p, r, s,
s̩, t, t̩, t̬, u, v, w, x, z, æ, ð,
ŋ, ɑ, ɑ̃, ɒ, ɔ, ə, ɚ, ɛ, ɜ, ɜ˞, ɝ,
ɡ, ɪ, ɵ, ɹ, ɾ, ʃ, ʃ̩, ʊ, ʌ, ʒ, ʤ,
θ, ∅
Here's a solution that works, though it includes a hack to handle the rhotic hook:
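import regex  # third-party package: pip install regex

def get_last_character_and_length(s):
    # Any one character, then any run of combining diacritics (U+0300-U+036F)
    # or spacing modifier letters (U+02B0-U+02FF), then an optional rhotic hook
    matches = regex.findall(r'[\w\W][\u0300-\u036f\u02B0-\u02FF]*˞?', s)
    last_character = matches[-1] if matches else None
    return last_character, len(last_character) if last_character else 0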
Examples:
syl1 = 'tyl' # plain ascii
c, c_l = get_last_character_and_length(syl1)
assert(c == 'l')
assert(c_l == 1)
syl2 = 'tl̩' # contains IPA
c, c_l = get_last_character_and_length(syl2)
assert(c == 'l̩')
assert(c_l == 2)
syl3 = 'stɜ˞' # contains rhotic hook
c, c_l = get_last_character_and_length(syl3)
assert(c == 'ɜ˞')
assert(c_l == 2)
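An alternative that avoids the regular expression is sketched below. It uses only the standard library, and it is an assumption on my part that it covers every character in the list above, not something tested against the full file. The names last_grapheme and RHOTIC_HOOK are hypothetical. It walks backwards over code points with a non-zero canonical combining class (unicodedata.combining). The rhotic hook (U+02DE) is a spacing modifier with combining class 0, so it still needs to be special-cased:

```python
import unicodedata

# U+02DE MODIFIER LETTER RHOTIC HOOK is a spacing modifier, not a combining
# mark, so unicodedata.combining() returns 0 for it; handle it explicitly.
RHOTIC_HOOK = "\u02de"

def last_grapheme(s):
    """Return the last base character of s together with any trailing marks."""
    if not s:
        return None
    i = len(s) - 1
    # Step back while the current code point attaches to the one before it
    while i > 0 and (unicodedata.combining(s[i]) or s[i] == RHOTIC_HOOK):
        i -= 1
    return s[i:]

print(last_grapheme("tyl"))              # plain ASCII
print(last_grapheme("tl\u0329"))         # l with syllabic mark
print(last_grapheme("st\u025c\u02de"))   # open-mid central vowel with rhotic hook
```

The length of the returned string gives the number of code points, matching c_l in the examples above.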