parsinghaskellrfc5322

Choosing the right ReadP parse result


I'm trying to parse a RFC5322 email address. My parser works in the sense that among the results, one of them is correct. However, how do I go about selecting the “correct” result?

Given the string Foo Bar <foo@bar.com>, my parser should produce a value of Address (Just "Foo Bar") "foo@bar.com".

Alternatively, given the string foo@bar.com, my parser should produce a value of Address Nothing "foo@bar.com".

The value with the name included is preferred.

My parser looks like this:

import           Control.Applicative
import           Data.Char
import qualified Data.Text                     as T
import           Text.ParserCombinators.ReadP

onlyEmail :: ReadP Address
onlyEmail = do
  skipSpaces
  email <- many1 $ satisfy isAscii
  skipSpaces
  return $ Address Nothing (T.pack email)

withName :: ReadP Address
withName = do
  skipSpaces
  name <- many1 (satisfy isAscii)
  skipSpaces
  email <- between (char '<') (char '>') (many1 $ satisfy isAscii)
  skipSpaces
  return $ Address (Just $ T.pack name) (T.pack email)

rfc5322 :: ReadP Address
rfc5322 = withName <|> onlyEmail

When I run the parser with readP_to_S rfc5322 "Foo Bar <foo@bar.com>", it produces the following results:

[ (Address {addressName = Nothing, addressEmail = "F"},"oo Bar <foo@bar.com>")
, (Address {addressName = Nothing, addressEmail = "Fo"},"o Bar <foo@bar.com>")
, (Address {addressName = Nothing, addressEmail = "Foo"},"Bar <foo@bar.com>")
, (Address {addressName = Nothing, addressEmail = "Foo "},"Bar <foo@bar.com>")
, (Address {addressName = Nothing, addressEmail = "Foo B"},"ar <foo@bar.com>")
, (Address {addressName = Nothing, addressEmail = "Foo Ba"},"r <foo@bar.com>")
, (Address {addressName = Nothing, addressEmail = "Foo Bar"},"<foo@bar.com>")
, (Address {addressName = Nothing, addressEmail = "Foo Bar "},"<foo@bar.com>")
, (Address {addressName = Nothing, addressEmail = "Foo Bar <"},"foo@bar.com>")
, (Address {addressName = Nothing, addressEmail = "Foo Bar <f"},"oo@bar.com>")
, (Address {addressName = Nothing, addressEmail = "Foo Bar <fo"},"o@bar.com>")
, (Address {addressName = Nothing, addressEmail = "Foo Bar <foo"},"@bar.com>")
, (Address {addressName = Nothing, addressEmail = "Foo Bar <foo@"},"bar.com>")
, (Address {addressName = Nothing, addressEmail = "Foo Bar <foo@b"},"ar.com>")
, (Address {addressName = Nothing, addressEmail = "Foo Bar <foo@ba"},"r.com>")
, (Address {addressName = Nothing, addressEmail = "Foo Bar <foo@bar"},".com>")
, (Address {addressName = Nothing, addressEmail = "Foo Bar <foo@bar."},"com>")
, (Address {addressName = Nothing, addressEmail = "Foo Bar <foo@bar.c"},"om>")
, (Address {addressName = Nothing, addressEmail = "Foo Bar <foo@bar.co"},"m>")
, (Address {addressName = Nothing, addressEmail = "Foo Bar <foo@bar.com"},">")
, (Address {addressName = Just "Foo Bar", addressEmail = "foo@bar.com"},"")
, (Address {addressName = Just "Foo Bar ", addressEmail = "foo@bar.com"},"")
, (Address {addressName = Nothing, addressEmail = "Foo Bar <foo@bar.com>"},"")
]

In this case, the result I actually want appears third-last in the list. How do I express that preference?


Solution

  • You should not to do preference. Your problem is that your partial parsers are accepting the more bigger string set than really need.

    For example, my solution:

    import           Control.Bool
    import           Control.Applicative
    import           Data.Char
    import qualified Data.Text                     as T
    import           Data.Text (Text)
    import           Text.ParserCombinators.ReadP
    
    email :: ReadP Text
    email = do
        l <- part
        a <- char '@'
        d <- part
        return . T.pack $ l ++ a:d
      where
        part = munch1 (isAscii <&&> (/='@') <&&> (/='<') <&&> (/='>'))
    
    name :: ReadP Text
    name = T.pack <$> chainr1 part sep
      where
        part = munch1 (isAlpha <||> isDigit <||> (=='\''))
        sep  = (\xs ys -> xs ++ ' ':ys) <$ munch1 (==' ')
    
    onlyEmail :: ReadP Address
    onlyEmail = Address Nothing <$> email
    
    withName :: ReadP Address
    withName = do
        n <- name
        skipSpaces
        e <- between (char '<') (char '>') email
        return $ Address (Just n) e
    
    address :: ReadP Address
    address = skipSpaces *> (withName <|> onlyEmail)
    
    main = print $ readP_to_S address "Foo Bar <foo@bar.com>"
    

    Will be printed:

    [(Address (Just "Foo Bar") "foo@bar.com","")]