On Stack Overflow there are a lot of answers about how to keep only alphabetic characters from a string; the most commonly accepted one is the famous regex '[^a-zA-Z]'. But this answer is wrong because it assumes everybody writes only English. I thought about downvoting all these answers, but finally decided it would be more constructive to ask the question again, because I can't find the answer.
Is there an easy (or not...) way in Python to keep only alphabetic characters from a string that works for all languages? I am thinking of a library that could do what XRegExp does in JavaScript. By all languages I mean English, but also French, Russian, Chinese, Greek, etc.
With Python 3, or with the re.UNICODE flag in Python 2, you could use [^\W\d_].
\W : If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.
So [^\W\d_] matches anything that is not non-alphanumeric, not a digit, and not an underscore. In other words, it's any alphabetic character. :)
>>> import re
>>> re.findall(r"[^\W\d_]", "jüste Ä tösté 1234 ßÜ א д", re.UNICODE)
['j', 'ü', 's', 't', 'e', 'Ä', 't', 'ö', 's', 't', 'é', 'ß', 'Ü', 'א', 'д']
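To get the cleaned string back rather than a list of characters, the matches can simply be joined. This is only a minimal sketch, assuming Python 3 (where str patterns are Unicode-aware by default, so the flag can be dropped); the helper name keep_letters is made up for illustration:

```python
import re

# Word characters minus digits and the underscore.
# LETTER and keep_letters are hypothetical names, not part of the re module.
LETTER = re.compile(r"[^\W\d_]")


def keep_letters(text):
    """Return text with every character not matched by LETTER dropped."""
    return "".join(LETTER.findall(text))


print(keep_letters("jüste Ä tösté 1234 ßÜ א д"))
```

Compiling the pattern once is just a convenience if the helper is called on many strings.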
To avoid this convoluted logic, you could also remove digits and underscores first, and then look for the remaining word characters:
>>> # flags must be passed by keyword: the fourth positional argument of re.sub is count
>>> without_digit = re.sub(r"[\d_]", "", "jüste Ä tösté 1234 ßÜ א д", flags=re.UNICODE)
>>> re.findall(r"\w", without_digit, re.UNICODE)
['j', 'ü', 's', 't', 'e', 'Ä', 't', 'ö', 's', 't', 'é', 'ß', 'Ü', 'א', 'д']
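It seems that the regex module could help, since it understands \p{L} or [\w--\d_] (the latter is a set difference, which as far as I know needs the module's version 1 behaviour, e.g. a (?V1) prefix).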
This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality.
>>> import regex as re
>>> re.findall(r"\p{L}", "jüste Ä tösté 1234 ßÜ א д", re.UNICODE)
['j', 'ü', 's', 't', 'e', 'Ä', 't', 'ö', 's', 't', 'é', 'ß', 'Ü', 'א', 'д']
(Tested with Anaconda Python 3.6)
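If installing a third-party module is not an option, something close to \p{L} can be approximated with the standard library alone. The sketch below is a swapped-in technique, not what the regex module does internally: it keeps every character whose Unicode general category starts with "L" (Lu, Ll, Lt, Lm, Lo), using unicodedata.category. The function name letters_only is hypothetical.

```python
import unicodedata


def letters_only(text):
    """Keep only characters whose Unicode general category is a letter (L*)."""
    # letters_only is an illustrative name, not a standard library function.
    return "".join(
        ch for ch in text if unicodedata.category(ch).startswith("L")
    )


print(letters_only("jüste Ä tösté 1234 ßÜ א д"))
print(letters_only("x² Ⅷ _"))  # superscript two (No) and Roman numeral (Nl) are dropped
```

One design point worth checking for your own data: this is stricter than \w-based patterns, which in Python's re also treat numeric characters outside the decimal digits, such as "²", as word characters.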