I need to count the number of times a specific phrase occurs within 3 words of another specific phrase, for each string in a dataframe column. Order does not matter.
To illustrate: with X = "black cat", Y = "is my", proximity distance = 3, and String = "The black cat is my black cat", the output count would be two (two unique pairs found). "The black cat by the window is my black cat" would also give two matches, since "is my" is within 3 words of both occurrences of "black cat". However, "The black cat by the big window is my black cat" gives only one match, because four words separate the first "black cat" from "is my".
Here is my example data, broken code, and desired output:
import pandas as pd

data = [['ABC123', 'test sentence here has these test words'],
        ['ABC456', 'test sentence here contains these test words in test sentence form'],
        ['ABC789', 'the third test sentence has no more additional test words']]
df = pd.DataFrame(data, columns=['Record ID', 'String'])
print(df)
Record ID | String
----------|-----------------------
ABC123 | test sentence here has these test words
ABC456 | test sentence here contains these test words in test sentence form
ABC789 | the third test sentence has no more additional test words
def phrase_finder(df, text_column, search_phrase, near_phrase, distance):
    results = 0
    for text in df[text_column]:
        for substring in text.split(search_phrase):
            words = substring.split()
            if len(words) <= distance + 1 and near_phrase in substring:
                results += 1
    return results if results else None
search_phrase = "test sentence"
near_phrase = "test words"
distance = 3
print(phrase_finder(df, 'String', search_phrase, near_phrase, distance))
ID | Number of Matches
----------|-----------------------
ABC123 | 1
ABC456 | 2
ABC789 | 0
This is a direct follow-up to "Find word near other word, within N# of words".
I was instructed to create a separate question for this rather than posting it on the other one as a follow-up.
I believe O-O-O was somewhat right about regex - a pure regex solution is a major, unsustainable PITA in your use case, IMHO. That said, the problem is quite tricky...
What regex does do well is string tokenization, so I have applied a rather straightforward approach: find every occurrence of each phrase, then count the words between every pair of occurrences. It is just string slicing and word counting, no mind-boggling magic (the less magic in production code, the better!). I am not sure what we are supposed to do if the two matches overlap - see the comment in the code below.
import re

def phrase_finder(text: str, str1: str, str2: str, distance: int) -> int:
    results = 0
    # Compare every occurrence of str1 against every occurrence of str2.
    for match1 in re.finditer(str1, text):
        for match2 in re.finditer(str2, text):
            if match1.end() < match2.start():
                # str1 occurs first: count the words strictly between the matches.
                between_matches = text[match1.end():match2.start()]
                if len(re.findall(r'\w+', between_matches)) <= distance:
                    results += 1
            elif match2.end() < match1.start():
                # str2 occurs first: same check in the other direction.
                between_matches = text[match2.end():match1.start()]
                if len(re.findall(r'\w+', between_matches)) <= distance:
                    results += 1
            else:
                # The matches overlap - what do we do here?
                pass
    return results
Test cases:
phrase_finder('The black cat is my black cat', 'black cat', 'is my', 3)
# 2
phrase_finder('The black cat by the window is my black cat', 'black cat', 'is my', 3)
# 2
phrase_finder('The black cat by the big window is my black cat', 'black cat', 'is my', 3)
# 1
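For completeness, here is what the overlapping case from the else branch looks like (an example of my own, not from your question) - overlapping pairs are simply skipped:
phrase_finder('The black cat is my black cat', 'black cat', 'cat', 3)
# 2  (the 'cat' inside each 'black cat' overlaps its own 'black cat' match and is skipped;
#     the two cross pairs are within 3 words, so they are counted)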
import pandas as pd
from functools import partial
data = [
    ['', 0],
    ['A', 0],
    ['B', 0],
    ['A B', 1],
    ['B A', 1],
    ['A A B', 2],
    ['A B B', 2],
    ['A B C', 1],
    ['A C C C B', 1],
    ['A C C C C B', 0],
    ['A B A', 2],
    ['A B A A', 3],
    ['A B A A A', 4],
    ['A B A B A', 6]
]
df = pd.DataFrame(data, columns=['text', 'expected_output'])
df['result'] = df['text'].apply(partial(phrase_finder, str1=r'A', str2=r'B', distance=3))
df
#     text           expected_output  result
# 0                                 0       0
# 1   A                             0       0
# 2   B                             0       0
# 3   A B                           1       1
# 4   B A                           1       1
# 5   A A B                         2       2
# 6   A B B                         2       2
# 7   A B C                         1       1
# 8   A C C C B                     1       1
# 9   A C C C C B                   0       0
# 10  A B A                         2       2
# 11  A B A A                       3       3
# 12  A B A A A                     4       4
# 13  A B A B A                     6       6
And it is symmetric as well.
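Applied back to the data from your question, the per-row counts come out as desired. A minimal sketch, assuming the phrase_finder above; I rebuild the frame as df_q (a name I'm introducing here) so it does not clash with the test frame:
df_q = pd.DataFrame(
    [['ABC123', 'test sentence here has these test words'],
     ['ABC456', 'test sentence here contains these test words in test sentence form'],
     ['ABC789', 'the third test sentence has no more additional test words']],
    columns=['Record ID', 'String'])
df_q['Number of Matches'] = df_q['String'].apply(
    partial(phrase_finder, str1='test sentence', str2='test words', distance=3))
df_q[['Record ID', 'Number of Matches']]
#   Record ID  Number of Matches
# 0    ABC123                  1
# 1    ABC456                  2
# 2    ABC789                  0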
There is one notable pitfall here, however:
phrase_finder(r'AA B A C AAA', r'A', r'B', 3)
# -> 6
The correct way to call it in this case is to supply word boundaries in the regexes (note the r prefix as well!):
phrase_finder(r'AA B A C AAA', r'\bA\b', r'\bB\b', 3)
# -> 1
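If the phrases may contain regex metacharacters, one way to build such patterns (an illustrative helper I'm adding here, not something the function requires) is to escape the phrase before wrapping it in word boundaries:
def as_word_pattern(phrase: str) -> str:
    # Hypothetical helper: escape any regex metacharacters in the phrase and
    # anchor it at word boundaries, so 'A' no longer matches inside 'AA'.
    return r'\b' + re.escape(phrase) + r'\b'

phrase_finder(r'AA B A C AAA', as_word_pattern('A'), as_word_pattern('B'), 3)
# -> 1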