python regex string positive-lookahead

How to use positive lookbehind assertions to extract substring from string following the word "named"


I have a pandas series of text from tweets about dogs. Some of the tweets contain the dog's name, which appears in this pattern: "...blah blah blah named name. blah blah blah...", with an unknown number of characters before and after the piece I need. I want to extract the name.

I believe I need to use positive lookbehind assertions and regex's search option. I've looked at the documentation for re.search as well as the following SO questions: How to extract the substring between two markers? and Regex captured groups with positive lookbehind (python), as well as this tutorial https://www.rexegg.com/regex-lookarounds.html. I still feel stuck.

These are the two ideas I have so far:

A)

import re

tweet = 'This is a Shotokon Macadamia mix named Cheryl. Sophisticated af.'
m = re.search('(?<=named)[A-Z][a-z]+', tweet)
m.group(0)

B)

s.str.extract(r'^named([A-Z][a-z])\.$')

According to the documentation, A) should return 'Cheryl', but I get an error instead: AttributeError: 'NoneType' object has no attribute 'group'.

B) only works on a series, and not every element in the tweet series contains the "... named name." structure. I am not sure how to incorporate that into the code so it returns Cheryl.


Solution

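  • Python says m is a 'NoneType' object because the regex did not match anything, so re.search returned None and there is no group to extract. The lookbehind (?<=named) only holds immediately after "named", where the next character is a space rather than a capital letter, so [A-Z] fails there. To get a match, add a space after "named" inside the lookbehind: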

    (?<=named )[A-Z][a-z]+
    

    See also https://regex101.com/r/nZiAFN/1
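
To make the difference concrete, here is a minimal sketch that runs both patterns on the tweet from the question. The helper name extract_name is only an illustration and is not part of the original post; it guards against the None that re.search returns when a tweet has no "named ..." part.

```python
import re

tweet = 'This is a Shotokon Macadamia mix named Cheryl. Sophisticated af.'

# Without the space, the lookbehind holds only right after "named",
# where the next character is a space, so [A-Z] cannot match.
no_space = re.search(r'(?<=named)[A-Z][a-z]+', tweet)

# With the space inside the lookbehind, the match starts at the "C".
with_space = re.search(r'(?<=named )[A-Z][a-z]+', tweet)


def extract_name(text):
    """Return the capitalized word after "named ", or None if absent.

    Hypothetical helper, added for illustration only.
    """
    m = re.search(r'(?<=named )[A-Z][a-z]+', text)
    return m.group(0) if m else None


print(no_space)                                # None
print(with_space.group(0))                     # Cheryl
print(extract_name('Just a very good dog.'))   # None
```

Checking the match object before calling .group() avoids the AttributeError for tweets that do not mention a name.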
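
For the pandas series in idea B), a capture group may be simpler than a lookbehind, since Series.str.extract returns whatever the group captures. The sketch below assumes a small made-up series in place of the real tweet data; the ^ and $ anchors are dropped because "named name" sits in the middle of the text, and [a-z]+ is used so the whole name is captured.

```python
import pandas as pd

# Hypothetical sample data standing in for the real tweet series.
s = pd.Series([
    'This is a Shotokon Macadamia mix named Cheryl. Sophisticated af.',
    'Just a very good dog.',
])

# No ^ or $ anchors: the pattern may occur anywhere in the tweet.
# expand=False returns a Series instead of a one-column DataFrame.
names = s.str.extract(r'named ([A-Z][a-z]+)', expand=False)

print(names.tolist())
```

Rows without a "named ..." part come back as missing values (NaN) rather than raising an error, so they can be filtered out afterwards with names.dropna().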