[SOLVED] filtering a pandas dataframe string column using the `str.contains` method

filtering a pandas dataframe string column using the `str.contains` method

I have dataframe looking like this, where long_category reflects the category of businesses in the rows:

df = pd.DataFrame({
 'long_category': {0: 'Doctors, Traditional Chinese Medicine, Naturopathic/Holistic, Acupuncture, Health & Medical, Nutritionists',
  1: 'Shipping Centers, Local Services, Notaries, Mailbox Centers, Printing Services',
  2: 'Department Stores, Shopping, Fashion, Home & Garden, Electronics, Furniture Stores',
  3: 'Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries',
  4: 'Brewpubs, Breweries, Food',
  5: 'Burgers, Fast Food, Sandwiches, Food, Ice Cream & Frozen Yogurt, Restaurants',
  6: 'Sporting Goods, Fashion, Shoe Stores, Shopping, Sports Wear, Accessories',
  7: 'Synagogues, Religious Organizations',
  8: 'Pubs, Restaurants, Italian, Bars, American (Traditional), Nightlife, Greek',
  9: 'Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food'}})

df:

My goal is to shorten those categories into the below known_categories based on whether or not a long category contains a keyword in the known_categories:

  known_categories = ['restaurant', 'beauty & spas', 'hotels', 'health & medical', 'shopping', 'coffee & tea','automotive', 'pets|veterinian', 'services', 'stores',  'grocery', 'ice cream']

So, I do:

df['short_category'] = np.nan
for cat in known_categories:
    excluded_cats = [x for x in known_categories if x!= cat]
    df['short_category'] [ ~(df.long_category.str.contains('|'.join(excluded_cats), regex = True, case = False, na = False)) & (df.long_category.str.contains(cat, regex = True, case = False, na = False))] = cat

It is important that the short categories are mutually exclusive. For example, row indexed 3 should be put in either "restaurant" or "coffee & tea" short category, hence the ~(df.long_category.str.contains('|'.join(excluded_cats), regex = True, case = False, na = False)) condition above.

But this is not working as can be seen below. For example, both of the last two rows have "restaurant" in their long category but only the first one's short category has captured that. I was expecting the last one to be either 'restaurant' or 'ice cream' because it has those keywords from short_categories. So, where have I gone wrong? As a side note, I would like to be able to impact the frequency of the short categories by moving the category further to the front or back of known_categories if possible. For example, with the current short_categories, I'd like the last row to have 'restaurant' as short category. But if I move 'ice cream' before 'restaurant' in short_categories, I would like the last row to show 'ice cream' and not 'restaurant'

Solution

Go through your categories list backwards (assuming more specific categories come later) and only set it as the short category if no previous category has been assigned:

df['short_category'] = ''

for cat in known_categories[::-1]:
    contains_cat = df['long_category'].str.contains(cat, case = False, regex = True)
    no_category = df['short_category'] == ''

    df.loc[contains_cat & no_category, 'short_category'] = cat