python regex multimarkdown

Regex to Extract #hashtags from MMD metadata in Python


I'm trying to extract all the #hashtags from the "Tags: #tag1 #tag2" line of a multimarkdown plaintext file. (I'm in Python multiline mode.)

I've tried using lookaheads:

^(?=Tags:\s.*)#(\w+)\b

and lookbehinds:

#(\w+)\b(?<=Tags:^\s)

Plain vanilla #(\w+)\b works, except it picks up any #hashtag that might appear later in the document.

Any hints, help, instruction appreciated.


Solution

  • import re

    text = "\n\n#bogus\nTags: #foo #bar\n"
    

    First, get the Tags line (re.findall returns a list of matches):

    line = re.findall(r'Tags:.+\n', text)
    # line = ['Tags: #foo #bar\n']
    

    Then extract the tags from that line. Use a capture group to drop the leading #, or omit the group to keep it:

    tags = re.findall(r'#(\w+)', line[0])
    # tags = ['foo', 'bar']
    tags = re.findall(r'#\w+', line[0])
    # tags = ['#foo', '#bar']
    

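    A lookbehind won't work here: Python's re module only accepts fixed-width lookbehind patterns, and the text between "Tags:" and each hashtag varies in length. The lookahead attempt fails for a different reason: a lookahead consumes no characters, so after ^(?=Tags:\s.*) succeeds, the # still has to match at the very start of the line, where the "T" of "Tags:" is.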
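
The two steps can also be wrapped in one small helper. This is only a sketch: the function name extract_tags and the longer sample text are made up for illustration, and it assumes the metadata line starts at the beginning of a line with "Tags:". It uses re.search with re.MULTILINE instead of matching a trailing \n, so it also handles a Tags line that is the last line of the file.

```python
import re

def extract_tags(text):
    """Return the hashtags (without '#') from the first 'Tags:' line."""
    # With re.MULTILINE, ^ and $ anchor at line boundaries, so the
    # Tags line does not need a trailing newline to be found.
    match = re.search(r'^Tags:(.*)$', text, re.MULTILINE)
    if match is None:
        return []  # no Tags line in the document
    # Only the rest of the Tags line is searched, so hashtags
    # elsewhere in the document are ignored.
    return re.findall(r'#(\w+)', match.group(1))

text = "\n\n#bogus\nTags: #foo #bar\n\nBody with a #later hashtag\n"
print(extract_tags(text))  # ['foo', 'bar']
```

Returning an empty list when there is no Tags line avoids the IndexError that line[0] would raise in that case.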