python pandas spacy named-entity-recognition named-entity-extraction

Extracting SpaCy DATE entities and adding to new pandas column


I have a collection of social media comments that I want to explore based on their references to dates. For this purpose, I am using SpaCy's Named Entity Recognizer to search for DATE entities. The comments are in a pandas dataframe called df_test, in the column comment. I would like to add a new column, dates, to this dataframe containing all the date entities found in each comment. Some comments will not contain any date entities, in which case None should be added instead. For example:

comment
'bla bla 21st century'
'bla 1999 bla bla 2022'
'bla bla bla'

Should be:

comment                        dates
'bla bla 21st century'         '21st century'
'bla 1999 bla bla 2022'        '1999', '2022'
'bla bla bla'                  'None'

Based on the question "Is their a way to add the new NER tag found in a new column?", I have tried a list approach:

date_label = ['DATE']
dates_list = []

def get_dates(row):
    comment = str(df_test.comment.tolist())
    doc = nlp(comment)
    for ent in doc.ents:
        if ent.label_ in date_label:
            dates_list.append([ent.text])
        else:
            dates_list.append(['None'])

df_test.apply(lambda row: get_dates(row))
date_df_test = pd.DataFrame(dates_list, columns=['dates'])

However, this produces a column that is longer than the original dataframe, like:

comment                        dates
'bla bla 21st century'         '21st century'
'bla 1999 bla bla 2022'        '1999'
'bla bla bla'                  '2022'
                               'None'

This doesn't work, since the entries of dates no longer match their corresponding comments. I understand that this happens because I am looping over all entities, but I don't know how to work around it. Is there any way to extract all date entities and keep them connected to the comment they were found in, for the purpose of later analysis? Any help is much appreciated!


Solution

  • I managed to find a solution to my own problem by using this function.

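    # Assumes `nlp` is an already-loaded spaCy pipeline and `df_test` is the dataframe from the question.
    date_label = ['DATE']
    
    def extract_dates(text):
        # Process one comment at a time so the result stays attached to its own row.
        doc = nlp(text)
        # Keep only DATE entities; a comment without any yields an empty list.
        results = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in date_label]
        return results
    
    # apply() calls extract_dates once per comment, so the new column matches the dataframe's length.
    df_test['dates'] = df_test['comment'].apply(extract_dates)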
    

    I hope this helps anyone who faces a similar issue.
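
The function above returns (text, label) tuples and an empty list when a comment has no dates. If the column should instead hold only the date strings, with None for comments without any, a variant along the following lines might work. This is only a sketch: the name extract_date_texts and its parameters are made up for illustration, and it assumes that nlp is a loaded spaCy pipeline, using nothing beyond nlp.pipe and each entity's text and label_ attributes.

```python
def extract_date_texts(nlp, comments, label="DATE"):
    """Return one entry per comment: a list of DATE entity strings, or None.

    `nlp` is assumed to be a loaded spaCy pipeline; `comments` is any iterable
    of comment texts. The function name and signature are illustrative.
    """
    results = []
    # nlp.pipe yields one Doc per input text, in the same order as the input,
    # so results[i] always belongs to the i-th comment.
    for doc in nlp.pipe(str(comment) for comment in comments):
        dates = [ent.text for ent in doc.ents if ent.label_ == label]
        # An empty list is falsy, so comments without dates are stored as None.
        results.append(dates or None)
    return results
```

With the dataframe from the question, this could be used as df_test['dates'] = pd.Series(extract_date_texts(nlp, df_test['comment']), index=df_test.index). Because exactly one entry is produced per comment, the new column has the same length as the dataframe, which is what the original list approach was missing.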