python, python-3.x, pandas, nltk, nltk-trainer

Finding matching words with ngrams


Dataset:

from nltk import ngrams, word_tokenize

df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df[['Id', 'bigram']]

Id       bigram
1952043  [(Swimming,Pool),(Pool,in),(in,the),(the,roof),(roof,top),
1918916  [(Luxury,Apartments),(Apartments,consisting),(consisting,11),
1645751  [(Flat,available),(available,sale),(sale,Medavakkam),
1270503  [(Toddler,Pool),(Pool,with),(with,Jogging),(Jogging,Tracks),
1495638  [(near,medavakkam),(medavakkam,junction),(junction,calm),
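
For reference, the shape of the bigram column can be reproduced without NLTK. The sketch below is only an illustration: it assumes plain whitespace splitting instead of word_tokenize, and the helper name bigrams is hypothetical, not part of the original code.

```python
def bigrams(text):
    """Return adjacent word pairs, splitting on whitespace (an assumption;
    the question itself uses nltk.word_tokenize and nltk.ngrams)."""
    tokens = text.split()
    # Pair each token with its successor: (t0, t1), (t1, t2), ...
    return list(zip(tokens, tokens[1:]))


print(bigrams("Swimming Pool in the roof top"))
# [('Swimming', 'Pool'), ('Pool', 'in'), ('in', 'the'), ('the', 'roof'), ('roof', 'top')]
```

Each row of the bigram column is therefore a list of 2-tuples of strings, not a list of strings.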

I have a Python file (Categories.py) containing the unsupervised classification of the property/land features.

category = [('Luxury Apartments', 'IN', 'Recreation_Ammenities'),
        ('Swimming Pool', 'IN','Recreation_Ammenities'),
        ('Toddler Pool', 'IN', 'Recreation_Ammenities'),
        ('Jogging Tracks', 'IN', 'Recreation_Ammenities')]
Recreation = [e1 for (e1, rel, e2) in category if e2=='Recreation_Ammenities']

To find the matching words between the bigram column and the category list:

tokens=pd.Series(df["bigram"])
Lid=pd.Series(df["Id"])
matches = tokens.apply(lambda x: pd.Series(x).str.extractall("|".join(["({})".format(cat) for cat in Categories.Recreation])))

While running the above code, I am getting this error:

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

Need help on this.

My desired output is:

 Id       bigram                                  Recreation_Amenities
1952043  [(Swimming,Pool),(Pool,in),(in,the),..   Swimming Pool
1918916  [(Luxury,Apartments),(Apartments,..      Luxury Apartments
1645751  [(Flat,available),(available,sale)..     
1270503  [(Toddler,Pool),(Jogging,Tracks)..      Toddler Pool,Jogging Tracks
1495638  [(near,medavakkam),..

Solution

  • Something along these lines should work for you:

    def match_bigrams(row):
        categories = []
    
        for bigram in row.bigram:
            joined = ' '.join(list(bigram))
            if joined in Recreation:
                categories.append(joined)
    
        return categories
    
    df['Recreation_Amenities'] = df.apply(match_bigrams, axis=1)
    print(df)
    
    
    Id  bigram  Recreation_Amenities
    0   1952043 [(Swimming, Pool), (Pool, in), (in, the), (the...   [Swimming Pool]
    1   1918916 [(Luxury, Apartments), (Apartments, consisting...   [Luxury Apartments]
    2   1645751 [(Flat, available), (available, sale), (sale, ...   []
    3   1270503 [(Toddler, Pool), (Pool, with), (with, Jogging...   [Toddler Pool, Jogging Tracks]
    4   1495638 [(near, medavakkam), (medavakkam, junction), (...   []
    

    Each bigram tuple is joined with a space so that the resulting string can be tested for membership in your list of categories (i.e. if joined in Recreation). If the list is imported from Categories.py, refer to it as Categories.Recreation.
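
If you want the comma-separated strings shown in the desired output rather than lists, a variant along the following lines may help. It is a minimal sketch under some assumptions: the sample frame is a shortened, made-up copy of the question's data, the category phrases are held in a set (named recreation here) for faster lookups, and the function is applied to the bigram column directly instead of row-wise.

```python
import pandas as pd

# Shortened sample data mirroring the question (an assumption for illustration)
df = pd.DataFrame({
    "Id": [1952043, 1645751, 1270503],
    "bigram": [
        [("Swimming", "Pool"), ("Pool", "in"), ("in", "the")],
        [("Flat", "available"), ("available", "sale")],
        [("Toddler", "Pool"), ("Pool", "with"),
         ("with", "Jogging"), ("Jogging", "Tracks")],
    ],
})

# Same phrases as Categories.Recreation, stored as a set for O(1) membership tests
recreation = {"Luxury Apartments", "Swimming Pool",
              "Toddler Pool", "Jogging Tracks"}


def match_bigrams(bigrams):
    """Join each bigram with a space and keep those found in `recreation`."""
    joined = (" ".join(pair) for pair in bigrams)
    return ",".join(phrase for phrase in joined if phrase in recreation)


df["Recreation_Amenities"] = df["bigram"].apply(match_bigrams)
print(df[["Id", "Recreation_Amenities"]])
```

Rows without a match get an empty string. The comparison is case-sensitive, so (near, medavakkam) style lowercase tokens would only match if the category phrases are lowercased the same way.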