What I want to achieve
My goal is to extract a date that has at least a day and a month, but could also have a minute, hour, and year.
I want to avoid the parser finding integers and thinking that implies a day of the current month.
Furthermore, I also want the parser to find a date that is only a small part of a larger string.
Think:
'Today is the most wonderful 27th of March 2025' = datetime.datetime(2025, 3, 27, 0, 0)
'2gb of ram' != datetime.datetime(2025, 3, 2, 0, 0) (Assuming we are currently in March)
What I tried so far
Using the fuzzy=True argument in dateutil
from dateutil import parser
#OK: Correct Datetime is returned: datetime.datetime(2025, 3, 30, 0, 0)
parser.parse('Today is the most wonderful 30th March 2025', fuzzy=True)
#NOT OK: Integer is not ignored, Datetime is returned: datetime.datetime(2025, 3, 2, 0, 0)
parser.parse('2 is my lucky number', fuzzy=True)
Using the REQUIRE_PARTS setting in dateparser
from dateparser import parse
# OK: Returns correct datetime: datetime.datetime(2025, 3, 30, 0, 0)
parse('30 March', settings={'REQUIRE_PARTS': ['month', 'day']})
#OK: Integer is Ignored, no datetime returned
parse('30', settings={'REQUIRE_PARTS': ['month', 'day']})
#NOT OK: Datetime Should be Found
parse('Today is the most wonderful 30th of March', settings={'REQUIRE_PARTS': ['month', 'day']})
It would be great if I could combine the fuzzy=True argument from dateutil with the settings argument from dateparser, but since they are separate libraries, that is not possible.
Is there another way to achieve the same functionality?
Use search_dates, which looks for dates embedded in a longer string:
from dateparser.search import search_dates
Here's a quick function:
from dateparser import parse
from dateparser.search import search_dates
def extract_date(text: str,
                 exclusions: list = ['now', 'today', 'tomorrow', 'yesterday', 'hour', 'minute', 'seconds', 'month', 'months', 'year', 'years'],
                 required: list = ['month', 'day']):
    '''
    Check whether the text contains at least a day and a month to parse a date from.
    If yes, return a datetime object. If not, return None.
    - Inputs:
    * text: string to extract the date from
    * exclusions: list of words to exclude from parsing (e.g. ['today', 'tomorrow'])
    * required: list of required date components (e.g. ['day', 'month'])
    '''
    # Parse the date.
    # Only the first result is returned, if one is found.
    try:
        return search_dates(text.lower(),
                            settings={'REQUIRE_PARTS': required,
                                      'SKIP_TOKENS': exclusions})[0][1]
    # search_dates returns None when nothing is found, so indexing raises TypeError
    except (IndexError, TypeError):
        return None
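To try the function on several strings at once, a small helper along these lines could be used. This is only a sketch: report and its extractor parameter are hypothetical names, not part of dateparser. Any callable that maps a string to a datetime or None, such as extract_date above, can be passed in.

```python
from datetime import datetime
from typing import Callable, Iterable, List, Optional


def report(texts: Iterable[str],
           extractor: Callable[[str], Optional[datetime]]) -> List[str]:
    """Build one 'TEXT: ... || ** DATE PARSED: ...' line per input string.

    `extractor` is a hypothetical parameter: any function that takes a string
    and returns a datetime or None.
    """
    return [f"TEXT: {text} || ** DATE PARSED: {extractor(text)}" for text in texts]
```

For example, print("\n".join(report(samples, extract_date))) would print one line per entry in a list of sample strings.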
Testing it here, it worked OK:
Text | Date Extracted
-----------------------------
TEXT: Today is the most wonderful 30th March 2025 || ** DATE PARSED: 2025-03-30
TEXT: 2 is my lucky number || ** DATE PARSED: None
TEXT: I was born on 1990-01-01 || ** DATE PARSED: 1990-01-01 00:00:00
TEXT: I will go to Paris on 2025-01-01 || ** DATE PARSED: 2025-01-01 00:00:00
TEXT: I will go to Paris on 2040-09 missing day || ** DATE PARSED: None
TEXT: 25 thousand days || ** DATE PARSED: None
TEXT: It costs 25 dollars || ** DATE PARSED: None
TEXT: I will go to NYC in 25 days || ** DATE PARSED: 2025-04-21 16:47:31.137955
TEXT: I will go to Rome in 1 month || ** DATE PARSED: None
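Note that the last two rows are relative expressions, and "in 25 days" still resolves to a date. If that is unwanted, one option is a pre-check that runs before extract_date. This is not something dateparser provides. It is a plain regular-expression guard swapped in for illustration, and the names below are hypothetical. It assumes English month names only, and a word like "may" used as a verb can cause false positives.

```python
import re

# English month names and common abbreviations (an assumption of this sketch).
MONTH = (r"(?:january|february|march|april|may|june|july|august|september|"
         r"october|november|december|jan|feb|mar|apr|jun|jul|aug|sept|sep|oct|nov|dec)")
# A one- or two-digit day with an optional ordinal suffix; the digits are group 1.
DAY = r"(\d{1,2})(?:st|nd|rd|th)?"

PATTERNS = [
    re.compile(rf"\b{DAY}\s+(?:of\s+)?{MONTH}\b", re.IGNORECASE),  # 30th of March
    re.compile(rf"\b{MONTH}\s+{DAY}\b", re.IGNORECASE),            # March 30th
    re.compile(r"\b\d{4}-\d{1,2}-(\d{1,2})\b"),                     # 2025-03-30
]


def mentions_day_and_month(text: str) -> bool:
    """Return True if the text has a day next to a month name, or a Y-M-D date."""
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            # Group 1 is always the day; reject values outside 1..31.
            if 1 <= int(match.group(1)) <= 31:
                return True
    return False
```

A caller could then use extract_date(text) if mentions_day_and_month(text) else None, so that strings such as "in 25 days" are rejected before dateparser sees them.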