python regex hashtag

Regex function to split words and numbers in a hashtag in a sentence


I need a regex function that recognizes a hashtag in a sentence, splits the words and numbers inside the hashtag, and inserts the word 'hashtag' right after the # sign. For example:

As you can see, the words need to be split before every capital letter and before every number. However, a number such as 2015 must stay intact and not become 2 0 1 5.

I already have the following:

document = re.sub(r"(#)([A-Za-z]*|\d*)", r" \1hashtag \2 ", document)

With output: #hashtag MainauDeclaration 2015 watch out guys.. This is HUGE!! #hashtag LindauNobel #hashtag SemST.


Solution

  • You can use

    import re

    text = "#MainauDeclaration2015 watch out guys.. This is HUGE!! #LindauNobel #SemST"
    # The outer re.sub finds each hashtag; the inner re.sub inserts spaces inside the captured text.
    result = re.sub(r'#(\w+)', lambda x: '#hashtag ' + re.sub(r'(?!^)(?=[A-Z])|(?<=\D)(?=\d)|(?<=\d)(?=\D)', ' ', x.group(1)), text)
    print(result)
    # => #hashtag Mainau Declaration 2015 watch out guys.. This is HUGE!! #hashtag Lindau Nobel #hashtag Sem S T
    

    The #(\w+) regex used with the first re.sub matches a # followed by one or more word characters (letters, digits, or underscores), which are captured into Group 1.
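To see what the outer pattern captures on its own, here is a minimal sketch using re.findall on the sample sentence from the answer. The variable name `captured` is only illustrative.

```python
import re

text = "#MainauDeclaration2015 watch out guys.. This is HUGE!! #LindauNobel #SemST"

# With a single capturing group, findall returns the Group 1 text of each match,
# i.e. the hashtag body without the leading '#'.
captured = re.findall(r'#(\w+)', text)
print(captured)
# => ['MainauDeclaration2015', 'LindauNobel', 'SemST']
```

Each of these strings is what the lambda receives as x.group(1).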

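    The re.sub(r'(?!^)(?=[A-Z])|(?<=\D)(?=\d)|(?<=\d)(?=\D)', ' ', x.group(1)) part takes the Group 1 value as input and inserts a space at three kinds of zero-width positions: before an uppercase letter that is not at the start of the string, between a non-digit and a digit, and between a digit and a non-digit. Because no position between two digits matches, 2015 stays intact. Because every non-initial capital is a split point, a run of capitals such as the ST in SemST is split into separate letters.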
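The inner substitution can also be tried in isolation. The sketch below wraps it in a helper so the split points are easier to inspect. The names `SPLIT_POINTS` and `split_tag` are hypothetical and not part of the original answer.

```python
import re

# Same zero-width pattern as in the answer, compiled once for reuse.
SPLIT_POINTS = re.compile(r'(?!^)(?=[A-Z])|(?<=\D)(?=\d)|(?<=\d)(?=\D)')

def split_tag(tag):
    """Insert a space at every split point inside a hashtag body."""
    return SPLIT_POINTS.sub(' ', tag)

for tag in ("MainauDeclaration2015", "LindauNobel", "SemST"):
    print(tag, "->", split_tag(tag))
# => MainauDeclaration2015 -> Mainau Declaration 2015
# => LindauNobel -> Lindau Nobel
# => SemST -> Sem S T
```

If acronyms should stay together, the first alternative would need a stricter condition. That is a separate design choice and is not covered by the answer above.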