python pandas nlp named-entity-extraction

Data Extraction in Python


I've been given a data set consisting of three columns. One column has transaction information, one has a store number, and one has sections. My goal is to extract the store number from the transaction information column for 300 different stores using entity extraction. My idea was to build something similar to how companies search resumes for keywords using a word bank, since I already have the store numbers in a separate column. I have read the .csv file into my program and stored the store numbers in their own array. Now I'm trying to figure out how to search the transaction information column for those store numbers.

Code so far:

import pandas as pd
import numpy as np

file = pd.read_csv(r'C:\Users\cspea\Desktop\assignment.csv')
print(file)

store_number_array = file['store_number'].to_numpy()
print(store_number_array)

Sample data set (in .csv format):

transaction_descriptor,store_number,dataset
DOLRTREE 2257 00022574 ROSWELL,2257,train
AUTOZONE #3547,3547,train
TGI FRIDAYS 1485 0000,1485,train
BUFFALO WILD WINGS 003,3,train
J. CREW #568 0,568,train

Any tips would be greatly appreciated. Thanks for your time and assistance in advance :)


Solution

  • Try this. For each row, it keeps the whitespace-separated tokens of transaction_descriptor that contain that row's store_number (df is the DataFrame read from the CSV, called file in the question):

    # pair each descriptor with its known store number, row by row
    df['c'] = [
        # keep only the tokens in which the store number appears
        [s for s in str(test).split() if str(test_) in s]
        for test, test_ in zip(df['transaction_descriptor'], df['store_number'])
    ]
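
The approach above relies on the store_number column being filled in for every row. If the goal is to pull the number out of the descriptor text alone and only use store_number to check the result, a regular expression is one option. The following is a minimal sketch, not a definitive solution: the function name extract_store_number and the "first run of digits, leading zeros dropped" rule are assumptions chosen because they happen to fit the five sample rows in the question, and they may well fail on other descriptors among the 300 stores.

```python
import io
import re

import pandas as pd

# Sample rows copied from the question; in practice use pd.read_csv on the real file.
csv_text = """transaction_descriptor,store_number,dataset
DOLRTREE 2257 00022574 ROSWELL,2257,train
AUTOZONE #3547,3547,train
TGI FRIDAYS 1485 0000,1485,train
BUFFALO WILD WINGS 003,3,train
J. CREW #568 0,568,train
"""
df = pd.read_csv(io.StringIO(csv_text))


def extract_store_number(descriptor):
    """Hypothetical heuristic: take the first run of digits in the descriptor.

    int() drops leading zeros, so '003' becomes 3.
    Returns None when the descriptor contains no digits at all.
    """
    match = re.search(r"\d+", str(descriptor))
    if match is None:
        return None
    return int(match.group())


# Predict a store number from the text only, then compare with the known column.
df["predicted"] = df["transaction_descriptor"].map(extract_store_number)
df["correct"] = df["predicted"] == df["store_number"]

print(df[["transaction_descriptor", "store_number", "predicted", "correct"]])
```

Rows where correct is False show where the heuristic breaks down, which is a reasonable starting point for refining the pattern (for example, preferring digits that follow a # sign).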