How to account for accent characters for regex in Python?


I currently use re.findall to find and isolate the words that follow the '#' character (hashtags) in a string:

hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)

It searches str1 and finds all the hashtags. This works, but it doesn't account for accented and other non-ASCII characters such as áéíóúñü¿.

If one of these characters is in str1, the match stops at the character before it. For example, #yogenfrüz would be saved as #yogenfr.
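
A minimal reproduction of the problem might look like the sketch below. The sample text in str1 is made up for illustration; only the pattern comes from my actual code.

```python
import re

# Hypothetical sample input, for illustration only.
str1 = "I love #yogenfrüz and #python3"

# ASCII-only character class: matching stops at the first non-ASCII letter.
hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)
print(hashtags)  # ['yogenfr', 'python3']
```

Note that findall returns only the captured group, so the '#' itself is not part of each result.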

I need to be able to account for all accented letters used in German, Dutch, French and Spanish so that I can save hashtags like #yogenfrüz.

How can I go about doing this?


Solution

  • I know this question is a little outdated, but you may also consider adding the range of accented characters from À (code point 192) to ÿ (code point 255) to your original regex.

    hashtags = re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1)
    

    which will return ['yogenfrüz'] for the hashtag #yogenfrüz (findall returns only the captured group, so the '#' is not included).

    Hope this helps someone else.
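
    One caveat worth knowing: the À-ÿ range also contains the symbols × (215) and ÷ (247), and it misses letters outside that block, such as œ. The sketch below compares the explicit range with \w, which in Python 3 matches Unicode letters, digits, and underscore when the pattern is a str. The \w variant is an alternative suggested here as an assumption about what you need, not part of the answer above, and the sample string is made up.

```python
import re

# Hypothetical sample input, for illustration only.
str1 = "Try #yogenfrüz, #señor_2 and #cœur today"

# Explicit range from the answer above: À (U+00C0) through ÿ (U+00FF).
by_range = re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1)

# Alternative (an assumption, not part of the original answer): in Python 3,
# \w on a str pattern matches Unicode letters, digits, and underscore.
by_word = re.findall(r'#(\w+)', str1)

print(by_range)  # ['yogenfrüz', 'señor_2', 'c']
print(by_word)   # ['yogenfrüz', 'señor_2', 'cœur']
```

    The explicit range is easier to reason about if you only want Western European letters; \w is broader and will also accept letters from any other script.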