python-3.x findall

Calculate the frequency of all the keywords (single word and multi word) appearing in a document


I want to calculate the frequency of some keywords (single word or multi word) appearing in a document. I am using regex for this purpose. Below is my implementation:

import re

def calculate_keyword_frequency(keyword_list, text):
    frequency = {}
    for keyword in keyword_list:
        frequency[keyword] = len(re.findall(keyword, text))
    return frequency

keyword_list = ["your work", "bodily injury"]
text = "your work needs to be finished. before you leave, your work should be done!"

result = calculate_keyword_frequency(keyword_list, text)

# Print the frequency for each keyword
for keyword, frequency in result.items():
    print(f"{keyword} = {frequency}")

This uses re.findall() to count how many times each keyword appears in text.

I have 2 issues with the above approach:

1.) The logic returns the frequency of every keyword in the list, but only for exact matches. If a keyword appears in the text in a slightly different form, for example your, work or your work with an extra character between the words instead of exactly your work, the logic will not detect it. In short, the above logic is not robust enough.

2.) Is there any other way or library I can use to calculate the frequency of the keywords in the list? The workaround should be more robust than the above logic.

Thank you!

EDIT1: I know I can write a regex pattern that solves the first issue. But this brings up another problem: keyword_list can contain hundreds of keywords, and writing a regex pattern by hand for each of them is not feasible!


Solution

  • You can replace the whitespace in each keyword with the regex pattern \W+, so that any run of one or more non-word characters (spaces, commas, and so on) between the words will match instead of just a single space:

    import re
    
    def calculate_keyword_frequency(keyword_list, text):
        frequency = {}
        for keyword in keyword_list:
            frequency[keyword] = len(re.findall(keyword, text))
        return frequency
    
    keyword_list = ["your work", "bodily injury"]
    # Map each generated regex pattern back to its original keyword
    patterns = {r'\W+'.join(k.split()): k for k in keyword_list}
    text = "your  work needs to be finished. before you leave, your, work should be done!"
    
    # Iterating over the dict yields its keys, i.e. the regex patterns
    result = calculate_keyword_frequency(patterns, text)
    
    # Print the frequency for each keyword
    for keyword, frequency in result.items():
        print(f"{patterns[keyword]} = {frequency}")
    

    This outputs:

    your work = 2
    bodily injury = 0
    

    Demo: https://replit.com/@blhsing/RepentantCheerfulShoutcast
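
A possible hardening of the same idea, offered as a sketch rather than a definitive implementation: the approach above inserts each keyword into the pattern unescaped and without word boundaries, so a keyword containing regex metacharacters, or one that also occurs inside a longer word, may be miscounted. The version below assumes that matches should be whole words and case-insensitive (neither is stated in the question), and build_pattern is a hypothetical helper name.

```python
import re

def build_pattern(keyword):
    # Escape every word so that regex metacharacters in a keyword match literally
    words = [re.escape(word) for word in keyword.split()]
    # Allow any run of non-word characters between the words; the lookarounds
    # reject matches embedded in a longer word (e.g. "work" inside "homework")
    return re.compile(r"(?<!\w)" + r"\W+".join(words) + r"(?!\w)", re.IGNORECASE)

def calculate_keyword_frequency(keyword_list, text):
    # One compiled pattern per keyword, built automatically from the list
    return {k: len(build_pattern(k).findall(text)) for k in keyword_list}

keyword_list = ["your work", "bodily injury", "work"]
text = ("Your  work needs to be finished. Before you leave, "
        "your, work should be done! No homework.")

result = calculate_keyword_frequency(keyword_list, text)

# Print the frequency for each keyword
for keyword, frequency in result.items():
    print(f"{keyword} = {frequency}")
```

Because the patterns are generated from keyword_list, nothing has to be written by hand even when the list holds hundreds of keywords. Drop re.IGNORECASE if case should matter.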