I have a string like so:
"The dates are from 30 June 2019 to 1 January 2022 inclusive"
I want to extract the dates from this string using spaCy.
Here is my function so far:
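import spacy

# Assumption: the small English pipeline; any English model with an NER component should work.
nlp = spacy.load("en_core_web_sm")

def extract_dates_with_year(text):
    doc = nlp(text)
    dates_with_year = []
    for ent in doc.ents:
        if ent.label_ == "DATE":
            dates_with_year.append(ent.text)
    return dates_with_year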
This returns the following output:
['30 June 2019 to 1 January 2022']
However, I want output like:
['30 June 2019', '1 January 2022']
The issue is that "to" is considered part of the date. So when you do for ent in doc.ents, your loop only has one iteration, as "30 June 2019 to 1 January 2022" is considered one entity.
As you don't want this behaviour, you can amend your function to split on "to":
import re

def extract_dates_with_year(text):
    doc = nlp(text)
    dates_with_year = []
    for ent in doc.ents:
        if ent.label_ == "DATE":
            # Split on "to" only as a standalone word, so "October" stays intact.
            for ent_txt in re.split(r"\s+to\s+", ent.text):
                dates_with_year.append(ent_txt.strip())
    return dates_with_year
This handles date ranges like the one above, as well as single dates and strings containing several dates:
txt = """
The dates are from 30 June 2019 to 1 January 2022 inclusive.
And oddly also 5 January 2024.
And exclude 21 July 2019 until 23 July 2019.
"""
extract_dates_with_year(txt)
# Output:
[
'30 June 2019',
'1 January 2022',
'5 January 2024',
'21 July 2019',
'23 July 2019'
]
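One caveat when splitting entity text: a bare substring split on "to" also cuts inside month names such as "October". Below is a minimal sketch of a whitespace-delimited split that can be tried without spaCy. The helper name split_date_range is hypothetical (it is not part of spaCy), and it assumes the range separator is the lowercase or uppercase word "to" surrounded by whitespace.

```python
import re

# Hypothetical helper, not a spaCy API: splits the text of a DATE entity
# on the standalone word "to" (whitespace on both sides), case-insensitively.
_RANGE_SEP = re.compile(r"\s+to\s+", flags=re.IGNORECASE)

def split_date_range(ent_text):
    # Drop empty fragments and trim stray whitespace around each part.
    return [part.strip() for part in _RANGE_SEP.split(ent_text) if part.strip()]

print(split_date_range("30 June 2019 to 1 January 2022"))
# ['30 June 2019', '1 January 2022']
print(split_date_range("1 October 2019 to 3 October 2019"))
# ['1 October 2019', '3 October 2019']

# For comparison, a plain substring split breaks the month name:
print("1 October 2019".split("to"))
# ['1 Oc', 'ber 2019']
```

Inside the loop over doc.ents, the helper could be called on ent.text for each entity labelled "DATE". Other separators (for example "until" or a dash) would need to be added to the pattern if the model ever merges such ranges into one entity.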