python regex nested-lists text matching

Is there a better way to match text with regex against keywords stored as nested lists within a dictionary?


I am trying out a simple text-matching activity: I scraped the titles of blog posts and want to match each title to one of my pre-defined categories whenever it contains specific keywords.

So for example, the title of the blog post is

"Capture Perfect Night Shots with the Oppo Reno8 Series"

Once I ensure that "Oppo" is included in my categories, "Oppo" should match with my "phone" category like so:

categories = {"phone" : ['apple', 'oppo', 'xiaomi', 'samsung', 'huawei', 'nokia'],
              "postpaid" : ['signature', 'postpaid'],
              "prepaid" : ['power all', 'giga'],
              "sku" : ['data', 'smart bro'],
              "ewallet" : ['gigapay'],
              "event" : ['gigafest'],
              "software" : ['ios', 'android', 'macos', 'windows'],
              "subculture" : ['anime', 'korean', 'kpop', 'gaming', 'pop', 'culture', 'lgbtq', 'binge', 'netflix', 'games', 'ml', 'apple music'],
              "health" : ['workout', 'workouts', 'exercise', 'exercises'],
              "crypto" : ['axie', 'bitcoin', 'coin', 'crypto', 'cryptocurrency', 'nft'],
              "virtual" : ['metaverse', 'virtual']}

Then my dataframe would look like this

Fortunately, I found a reference on using regex to map text against nested dictionaries, but it doesn't seem to work beyond the first couple of words.

Reference is here

So once I use the code

def put_category(cats, text):

    regex = re.compile("(%s)" % "|".join(map(re.escape, categories.keys())))

    if regex.search(text):
        ret = regex.search(text)
        return ret[0]
    else:
        return 'general'

It usually falls back to "general" as the category, even when I lowercase the text, as seen here

I'd prefer to use the current method of inputting values inside the dictionary for this matching activity instead of running pure regex patterns and then putting it through fuzzy matching for the result.


Solution

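  • Your pattern is built from categories.keys(), that is, from the category names ("phone", "postpaid", ...), so it only matches when a title happens to contain one of those names; the keywords inside the lists are never searched for. You can create a reverse mapping that maps keywords to categories instead, so that you can efficiently return the corresponding category when a match is found:

    import re

    # Reverse mapping: each keyword points back to its category.
    mapping = {keyword: category for category, keywords in categories.items() for keyword in keywords}
    
    def put_category(mapping, text):
        # \b keeps matches on whole words; re.I makes the search case-insensitive.
        match = re.search(rf'\b(?:{"|".join(map(re.escape, mapping))})\b', text, re.I)
        if match:
            # The keywords are stored in lowercase, so lowercase the match before the lookup.
            return mapping[match[0].lower()]
        return 'general'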
    
    print(put_category(mapping, "Capture Perfect Night Shots with the Oppo Reno8 Series"))
    

    This outputs:

    phone
    

    Demo: https://replit.com/@blhsing/BlandAdoredParser
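
One caveat with the reverse mapping above: regex alternation tries the alternatives from left to right, so a short keyword such as 'apple' is matched before a longer phrase such as 'apple music' that starts with the same word. It also returns only the first match. The following is a minimal sketch of one possible variant, not part of the original answer: it sorts the keywords longest-first and collects every matched category. The trimmed dictionary and the helper name put_categories are assumptions made for illustration.

```python
import re

# Trimmed copy of the question's dictionary, for illustration only.
categories = {
    "phone": ["apple", "oppo", "xiaomi"],
    "prepaid": ["power all", "giga"],
    "ewallet": ["gigapay"],
    "subculture": ["kpop", "netflix", "apple music"],
}

# Reverse mapping: keyword -> category.
mapping = {kw: cat for cat, kws in categories.items() for kw in kws}

# Longest keywords first, so "apple music" is tried before "apple".
keywords = sorted(mapping, key=len, reverse=True)
pattern = re.compile(rf'\b(?:{"|".join(map(re.escape, keywords))})\b', re.I)

def put_categories(text):
    """Hypothetical helper: return every matched category, in order of first appearance."""
    found = []
    for match in pattern.finditer(text):
        category = mapping[match[0].lower()]
        if category not in found:
            found.append(category)
    # Fall back to "general" when no keyword matched.
    return found or ["general"]

print(put_categories("Stream Apple Music on your new Oppo"))
print(put_categories("Top up with GigaPay"))
print(put_categories("Weekend recipes"))
```

Compiling the pattern once at module level also avoids rebuilding it for every title, which may matter when the function is applied to a whole column of scraped titles.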