parsinghaskelltrifecta

Skip everything until a successful parse


I'd like to parse all days from a text like this:

Ignore this
Also this

2019-09-05

More to ignore
2019-09-06
2019-09-07

Using Trifecta, I've defined a function to parse a day:

dayParser :: Parser Day
dayParser = do
  dayString <- tillEnd
  parseDay dayString

tillEnd :: Parser String
tillEnd = manyTill anyChar (try eof <|> eol)

parseDay :: String -> Parser Day
parseDay s = maybe failure return dayMaybe
 where
  dayMaybe = parseTime' dayFormat s
  failure = unexpected $ "Failed to parse date. Expected format: " ++ dayFormat
  -- %-m makes the parser accept months consisting of a single digit
  dayFormat = "%Y-%-m-%-d"

eol :: Parser ()
eol = char '\n' <|> char '\r' >> return ()

-- "%Y-%-m-%-d" for example
type TimeFormat = String

-- Given a time format and a string, parses the string to a time.
parseTime' :: (Monad m, ParseTime t) => TimeFormat -> String -> m t
-- True means that the parser tolerates whitespace before and after the date
parseTime' = parseTimeM True defaultTimeLocale

Parsing a day this way works. What I'm having trouble with is ignoring anything in the text that's not a day.

The following can't work since it assumes the number of text blocks that aren't a day:

daysParser :: Parser [Day]
daysParser = do
  -- Ignore everything that's not a day
  _    <- manyTill anyChar $ try dayParser
  days <- many $ token dayParser
  _    <- manyTill anyChar $ try dayParser
  -- There might be more days after this...
  return days

I reckon there's a straightforward way to express this with Trifecta but I can't seem to find it.


Here's the whole module including an example text to parse:

{-# LANGUAGE QuasiQuotes #-}
module DateParser where
import           Text.RawString.QQ
import           Data.Time
import           Text.Trifecta
import           Control.Applicative            ( (<|>) )

-- "%Y-%-m-%-d" for example
type TimeFormat = String

dayParser :: Parser Day
dayParser = do
  dayString <- tillEnd
  parseDay dayString

tillEnd :: Parser String
tillEnd = manyTill anyChar (try eof <|> eol)

parseDay :: String -> Parser Day
parseDay s = maybe failure return dayMaybe
 where
  dayMaybe = parseTime' dayFormat s
  failure = unexpected $ "Failed to parse date. Expected format: " ++ dayFormat
  -- %-m makes the parser accept months consisting of a single digit
  dayFormat = "%Y-%-m-%-d"

eol :: Parser ()
eol = char '\n' <|> char '\r' >> return ()

-- Given a time format and a string, parses the string to a time.
parseTime' :: (Monad m, ParseTime t) => TimeFormat -> String -> m t
-- True means that the parser tolerates whitespace before and after the date
parseTime' = parseTimeM True defaultTimeLocale

daysParser :: Parser [Day]
daysParser = do
  -- Ignore everything that's not a day
  _    <- manyTill anyChar $ try dayParser
  days <- many $ token dayParser
  _    <- manyTill anyChar $ try dayParser
  -- There might be more days after this...
  return days

test = parseString daysParser mempty text1

text1 = [r|
Ignore this
Also this

2019-09-05

More to ignore
2019-09-06
2019-09-07|]

Solution

  • There are three large problems here.

    First, the way you're defining dayParser, it's always trying to parse the rest of the text as a date. For example, if your input text is "2019-01-01 foo bar", then dayParser would first consume the whole string, so that dayString == "2019-01-01 foo bar", and then will try to parse that string as a date. Which, of course, would fail.

    In order to have a saner behavior, you could only bite off the beginning of the string that kinda looks like a date and try to parse that, like:

    dayParser =
      parseDay =<< many (digit <|> char '-')
    

    This implementation bites off the beginning of the input consisting of digits and dashes, and tries to parse that as a date.

    Note that this is a quick-n-dirty implementation. It is imprecise. For example, this implementation would accept input like "2019-01-0123456" and try to parse that as a date, and of course will fail. From your question, it is not clear whether you'd want to still parse 2019-01-01 and leave the rest, or whether you want to not consider that a proper date. If you wanted to be super-precise about this, you could specify the exact format as precisely as you want, e.g.:

    dayParser = do
      y <- count 4 digit
      void $ char '-'
      m <- try (count 2 digit) <|> count 1 digit
      void $ char '-'
      d <- try (count 2 digit) <|> count 1 digit
      parseDay $ y ++ "-" ++ m ++ "-" ++ d
    

    This implementation expects exactly the format of the date.

    Second, there is a logical problem: your daysParser tries to first parse some garbage, then parse many days, and then parse some garbage again. This logic does not admit a case where the many dates have some garbage between them.

    Third problem is much more tricky. You see, the way the try combinator works - if the parser fails, then try will roll back the input position, but if the parser succeeds, then the input remains consumed! This means that you cannot use try as a zero-consumption lookahead, the way you're trying to do in manyTill anyChar $ try dayParser. Such a parser will parse until it finds a date, and then it will consume the date, leaving nothing for the next parser and causing it to fail.

    I will illustrate with a simpler example. Consider this:

    > parseString (many (char 'a')) mempty "aaa"
    Success "aaa"
    

    Cool, it parses three 'a's. Now let's add a try at the beginning:

    > parseString (try (char 'b') *> many (char 'a')) mempty "aaa"
    Success "aaa"
    

    Awesome, this still works: the try fails, and then we parse three 'a's as before.

    Now let's change the try from 'b' to 'a':

    > parseString (try (char 'a') *> many (char 'a')) mempty "aaa"
    Success "aa"
    

    Look what happened: the try has consumed the first 'a', leaving only two to be parsed by many.

    We can even extend it to more fully resemble your approach:

    > p = manyTill anyChar (try (char 'a')) *> many (char 'a')
    
    > parseString p mempty "aaa"
    Success "aa"
    
    > parseString p mempty "cccaaa"
    Success "aa"
    

    See what happens? manyTill correctly skips all the 'c's up to the first 'a', but then it also consumes that first 'a'!


    There appears to be no sane way (that I see) to have a zero-consumption lookahead like this. You always have to consume the first successful hit.

    If I had this problem, I would probably resort to recursion: parsing chars one by one, at every step looking if I can get a day, and concatenating in a list. Something like this:

    data WhatsThis = AChar Char | ADay Day | EOF
    
    daysParser = do
      r <- (ADay <$> dayParser) <|> (AChar <$> anyChar) <|> (EOF <$ eof)
      case r of
        ADay d -> do
          rest <- daysParser
          pure $ d : rest
        AChar _ ->
          daysParser
        EOF ->
          pure []
    

    It tries to parse a day, and if that fails, just skips a char, unless there are no more chars. If day parsing succeeded, it calls itself recursively, then prepends the day to the result of the recursive call.

    Note that this approach is not very composable: it always consumes everything till the end of the input. If you want to compose it with something else, you may want consider replacing eof with a parameter:

    daysParser stop = do
      r <- (ADay <$> dayParser) <|> (AChar <$> anyChar) <|> (EOF <$ stop)
      ...