I have a basic question about parsing using Python's parsec.py library.
I would like to extract a date that appears somewhere inside a text. For example:
Lorem ipsum dolor sit amet. A number 42 is present here. But here is a date 11/05/2017. Can you extract this?
or
Lorem ipsum dolor sit amet.
A number 42 is present here.
But here is a date 11/05/2017. Can you extract this?
In both cases I want the parser to return 11/05/2017.
I only want to use the parsec.py parsing library and I don't want to use regex. parsec's built-in regex function is okay.
I tried something like
from parsec import *
ss = "Lorem ipsum dolor sit amet. A number 42 is present here. But here is a date 11/05/2017. Can you extract this?"
date_parser = regex(r'[0-9]{2}/[0-9]{2}/[0-9]{4}')
date = date_parser.parse(ss)
I get ParseError: expected [0-9]{2}/[0-9]{2}/[0-9]{4} at 0:0
Is there a way to ignore the text until the date_parser pattern is reached, without raising an error?
What you want is a parser that skips any unmatched characters and then parses the regex pattern that follows.
The date pattern can be defined with the regex parser:
date_pattern = regex(r'[0-9]{2}/[0-9]{2}/[0-9]{4}')
We first define a parser that consumes one arbitrary character (edit: this has since been included in the library, as of v3.9):
def any():
    '''Parse a random character.'''
    @Parser
    def any_parser(text, index=0):
        if index < len(text):
            return Value.success(index + 1, text[index])
        else:
            return Value.failure(index, 'a random char')
    return any_parser
To express the idea of "skip any characters, then match a pattern", we need to define a recursive parser such as
date_parser = date_pattern ^ (any() >> date_parser)
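But this does not work as a plain Python assignment, because date_parser refers to itself before it has been defined. Wrapping the recursive part in @generate defers the lookup until parse time: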
@generate
def date_with_prefix():
    matched = yield (any() >> date_parser)
    return matched

date_parser = date_pattern ^ date_with_prefix
(Here the combinator ^ means try_choice; you can find it in the docs.)
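To make the recursion easier to follow, here is a library-free model of what date_pattern ^ (any() >> date_parser) does. It is only a sketch: the helpers try_choice, skip_then and any_char, and the (ok, index, value) result tuples, are names made up for illustration and are not parsec.py's API or implementation. Python's re module stands in for parsec's regex parser.

```python
import re

# In this model a "parser" is a function (text, index) -> (ok, next_index, value).

def regex(pattern):
    """Stand-in for parsec's regex(): match the pattern exactly at `index`."""
    compiled = re.compile(pattern)
    def parser(text, index):
        m = compiled.match(text, index)
        if m:
            return True, m.end(), m.group(0)
        return False, index, None
    return parser

def any_char(text, index):
    """Consume exactly one character, whatever it is; fail at end of input."""
    if index < len(text):
        return True, index + 1, text[index]
    return False, index, None

def try_choice(first, second):
    """Model of `first ^ second`: if `first` fails, run `second` from the same index."""
    def parser(text, index):
        result = first(text, index)
        return result if result[0] else second(text, index)
    return parser

def skip_then(first, second):
    """Model of `first >> second`: run both in sequence, keep only the second value."""
    def parser(text, index):
        ok, next_index, _ = first(text, index)
        if not ok:
            return False, index, None
        return second(text, next_index)
    return parser

date_pattern = regex(r'[0-9]{2}/[0-9]{2}/[0-9]{4}')

def date_parser(text, index=0):
    # The name date_parser is looked up when this body runs, not when the
    # function is defined, which is what makes the self-reference legal.
    return try_choice(date_pattern, skip_then(any_char, date_parser))(text, index)

print(date_parser("But here is a date 11/05/2017. Can you extract this?"))
```

At every position the model first tries the date pattern; only if that fails does it consume one character and call itself on the rest, which is the same shape as the @generate version.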
It then works as expected: input without a date raises a ParseError, and input containing a date returns it:
>>> date_parser.parse("Lorem ipsum dolor sit amet.")
---------------------------------------------------------------------------
ParseError Traceback (most recent call last)
...
ParseError: expected date_with_prefix at 0:27
>>> date_parser.parse("A number 42 is present here.")
---------------------------------------------------------------------------
ParseError Traceback (most recent call last)
...
ParseError: expected date_with_prefix at 0:28
>>> date_parser.parse("But here is a date 11/05/2017. Can you extract this?")
'11/05/2017'
To avoid the exception on invalid input and return None instead, you can define it as an optional parser:
date_parser = optional(date_pattern ^ date_with_prefix)