
Count the number of times a phrase is near another phrase, within N words of each other


I need to count the number of times a specific phrase occurs within 3 words of another specific phrase, per row of a dataframe's string column. Order does not matter.

To illustrate: with X = "black cat", Y = "is my", proximity distance = 3, and String = "The black cat is my black cat", the output count would be two (two unique pairs found). "The black cat by the window is my black cat" would also give two matches. However, "The black cat by the big window is my black cat" would give only one match.
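
To make "within 3 words" concrete, this is the word count I mean (plain slicing on the first occurrences only, just as an illustration):

s = "The black cat by the big window is my black cat"
x, y = "black cat", "is my"
between = s[s.index(x) + len(x) : s.index(y)]  # text between the first X and the Y
print(len(between.split()))                    # 4 -> this pair is more than 3 words apart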

Here is my example data, broken code, and desired output:

data = [['ABC123', 'test sentence here has these test words'],
        ['ABC456', 'test sentence here contains these test words in test sentence form'],
        ['ABC789', 'the third test sentence has no more additional test words']]
df = pd.DataFrame(data, columns=['Record ID', 'String'])
print(df)

Record ID | String
----------|-----------------------
ABC123    | test sentence here has these test words
ABC456    | test sentence here contains these test words in test sentence form
ABC789    | the third test sentence has no more additional test words


import pandas as pd

def phrase_finder(df, text_column, search_phrase, near_phrase, distance):
    results = 0
    for text in df[text_column]:
        for substring in text.split(search_phrase):
            words = substring.split()
            if len(words) <= distance + 1 and near_phrase in substring:
                results += 1
    return results if results else None

search_phrase = "test sentence"
near_phrase = "test words"
distance = 3

print(phrase_finder(df, 'String', search_phrase, near_phrase, distance))

ID        | Number of Matches
----------|-----------------------
ABC123    | 1
ABC456    | 2
ABC789    | 0

This is a direct follow-up to Find word near other word, within N# of words

I was instructed to create a separate question for this rather than posting it on the other one as a follow-up.


Solution

  • I believe O-O-O was somewhat right about regex - it is a major unsustainable PITA in your use case, IMHO. That said, the problem is quite tricky...

    What regex does well is string tokenization. I have applied a rather straightforward approach:

    1. Find all matches for substring 1
    2. Find all matches for substring 2
    3. Count words between these matches

    I am not sure what we are supposed to do if the substrings overlap (see the example right after the code). The code is as follows: just string slicing and word counting, no mind-boggling magic here (the less magic in production code, the better!):

    import re
    
    def phrase_finder(text: str, str1: str, str2: str, distance: int) -> int:
        results = 0
        for match1 in re.finditer(str1, text):
            for match2 in re.finditer(str2, text):
                if match1.end() < match2.start():
                    # str1 occurs first: count the words between the two matches
                    between_matches = text[match1.end():match2.start()]
                    if len(re.findall(r'\w+', between_matches)) <= distance:
                        results += 1
                elif match2.end() < match1.start():
                    # str2 occurs first: same check with the slice reversed
                    between_matches = text[match2.end():match1.start()]
                    if len(re.findall(r'\w+', between_matches)) <= distance:
                        results += 1
                else:
                    # the matches overlap - what do we do here?
                    pass
        return results
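
    As the code stands, overlapping matches fall into the final else branch and are simply not counted:

    phrase_finder('The black cat', 'black cat', 'cat', 3)
    # 0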
    

    Test cases:

    phrase_finder('The black cat is my black cat', 'black cat', 'is my', 3)
    # 2
    phrase_finder('The black cat by the window is my black cat', 'black cat', 'is my', 3)
    # 2
    phrase_finder('The black cat by the big window is my black cat', 'black cat', 'is my', 3)
    # 1
    
    import pandas as pd
    from functools import partial
    
    data = [
        ['', 0],
        ['A', 0],
        ['B', 0],
        ['A B', 1],
        ['B A', 1],
        ['A A B', 2],
        ['A B B', 2],
        ['A B C', 1],
        ['A C C C B', 1], 
        ['A C C C C B', 0], 
        ['A B A', 2], 
        ['A B A A', 3],
        ['A B A A A', 4],
        ['A B A B A', 6]
    ]
    df = pd.DataFrame(data, columns=['text', 'expected_output'])
    df['result'] = df['text'].apply(partial(phrase_finder, str1=r'A', str2=r'B', distance=3))
    df
    #       text    expected_output result
    # 0                 0               0
    # 1     A           0               0
    # 2     B           0               0
    # 3     A B         1               1
    # 4     B A         1               1
    # 5     A A B       2               2
    # 6     A B B       2               2
    # 7     A B C       1               1
    # 8     A C C C B   1               1
    # 9     A C C C C B 0               0
    # 10    A B A       2               2
    # 11    A B A A     3               3
    # 12    A B A A A   4               4
    # 13    A B A B A   6               6
    

    And it is symmetric as well.
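
    For instance, swapping the two phrases leaves the count unchanged:

    phrase_finder('The black cat by the window is my black cat', 'is my', 'black cat', 3)
    # 2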

    There is one notable pitfall here, however:

    phrase_finder(r'AA B A C AAA', r'A', r'B', 3)
    # -> 6
    

    This happens because every single 'A' character matches, including the ones inside 'AA' and 'AAA'. The correct way to call it in this case is by supplying word boundaries in the regexes (note the r prefix as well!):

    phrase_finder(r'AA B A C AAA', r'\bA\b', r'\bB\b', 3)
    # -> 1
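
    To produce the "Number of Matches" column from the question, one option is a sketch along these lines (assuming the phrases are meant as literal text rather than regex patterns; the bounded helper below is mine, purely for illustration): escape each phrase with re.escape, wrap it in \b word boundaries, and apply the function row-wise, as in the earlier test.

    import re
    import pandas as pd
    from functools import partial

    def bounded(phrase: str) -> str:
        # Treat the phrase as literal text and anchor it on word boundaries.
        return r'\b' + re.escape(phrase) + r'\b'

    data = [['ABC123', 'test sentence here has these test words'],
            ['ABC456', 'test sentence here contains these test words in test sentence form'],
            ['ABC789', 'the third test sentence has no more additional test words']]
    df = pd.DataFrame(data, columns=['Record ID', 'String'])

    df['Number of Matches'] = df['String'].apply(
        partial(phrase_finder,
                str1=bounded('test sentence'),
                str2=bounded('test words'),
                distance=3))
    df[['Record ID', 'Number of Matches']]
    #   Record ID  Number of Matches
    # 0    ABC123                  1
    # 1    ABC456                  2
    # 2    ABC789                  0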