[SOLVED] Extracting dates from a sentence in spaCy

I have a string like so:

"The dates are from 30 June 2019 to 1 January 2022 inclusive"

I want to extract the dates from this string using spaCy.

Here is my function so far:

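import spacy

# Assumption: the small English pipeline is installed; any pipeline with an NER component should work.
nlp = spacy.load("en_core_web_sm")

def extract_dates_with_year(text):
    doc = nlp(text)
    dates_with_year = []
    for ent in doc.ents:
        if ent.label_ == "DATE":
            dates_with_year.append(ent.text)
    return dates_with_year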

This returns the following output:

['30 June 2019 to 1 January 2022']

However, I want output like:

['30 June 2019', '1 January 2022']

Solution

The issue is that spaCy's entity recognizer treats "to" as part of the date, so "30 June 2019 to 1 January 2022" is tagged as a single DATE entity. Your loop over doc.ents therefore runs only once for that range.

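As you don't want this behaviour, you can amend your function to split each DATE entity on the standalone word "to". A plain .split("to") would also cut "October" into "Oc" and "ber", so split on "to" surrounded by whitespace instead:

import re

def extract_dates_with_year(text):
    doc = nlp(text)
    dates_with_year = []
    for ent in doc.ents:
        if ent.label_ == "DATE":
            # Split on "to" as a whole word so month names like "October" stay intact.
            for ent_txt in re.split(r"\s+to\s+", ent.text):
                dates_with_year.append(ent_txt.strip())
    return dates_with_year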

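This handles ranges like the one above, single dates, and text containing several dates. Note that the function only splits on "to"; the pair joined by "until" below comes out as two items only because spaCy already tags those dates as separate entities: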

txt = """
     The dates are from 30 June 2019 to 1 January 2022 inclusive.
     And oddly also 5 January 2024.
     And exclude 21 July 2019 until 23 July 2019.
"""

extract_dates_with_year(txt)

# Output:
[
 '30 June 2019',
 '1 January 2022',
 '5 January 2024',
 '21 July 2019',
 '23 July 2019'
]
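
If you want the splitting step to cover more than "to", it can be pulled out into a small helper that works on plain strings, so you can try it without loading a spaCy model. This is only a sketch: split_date_range is a hypothetical name, not a spaCy function, and the separator list ("to", "until", "through") is an assumption you would adjust to your own data.

```python
import re

# Hypothetical helper, not part of spaCy.
# Assumption: ranges are joined by one of these whole words; extend as needed.
_RANGE_SEPARATOR = re.compile(r"\s+(?:to|until|through)\s+", flags=re.IGNORECASE)

def split_date_range(span_text):
    """Return the individual date strings found in span_text."""
    parts = _RANGE_SEPARATOR.split(span_text)
    # Drop empty pieces and surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]

print(split_date_range("30 June 2019 to 1 January 2022"))
# ['30 June 2019', '1 January 2022']

print(split_date_range("3 October 2019"))
# ['3 October 2019']

# For comparison, a plain substring split breaks the month name:
print("3 October 2019".split("to"))
# ['3 Oc', 'ber 2019']
```

Inside the loop you would then call split_date_range(ent.text) for each DATE entity and extend dates_with_year with the result.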