
How to stop Haskell Parsec parser at EOF


I'm writing a small parser that extracts the content of every <td> tag with a specific class. For example, <td class="liste">some content</td> should yield Right "some content".

I will be parsing a large HTML file, but I don't care about the surrounding noise. The idea is to consume all characters until I reach <td class="liste">, then consume all characters (the content) until </td>, and return the content string.

This works fine if the last element in the file is my td.liste tag. But if any text follows the last tag, my parser consumes that text up to EOF and then fails with "unexpected end of input", as you can see by running parseMyTest test3.

See the end of test3 for the edge case.

Here is my code so far:

import Text.Parsec
import Text.Parsec.String

import Data.ByteString.Lazy (ByteString)
import Data.ByteString.Char8 (pack)

colOP :: Parser String
colOP = string "<td class=\"liste\">"

colCL :: Parser String
colCL = string "</td>"

col :: Parser String
col = do
  manyTill anyChar (try colOP)
  content <- manyTill anyChar $ try colCL
  return content

cols :: Parser [String]
cols = many col

test1 :: String
test1 = "<td class=\"liste\">Hello world!</td>"

test2 :: String
test2 = read $ show $ pack test1

test3 :: String
test3 = "\n\r<html>asdfasd\n\r<td class=\"liste\">Hello world 1!</td>\n<td class=\"liste\">Hello world 2!</td>\n\rasldjfasldjf<td class=\"liste\">Hello world 3!</td><td class=\"liste\">Hello world 4!</td>adsafasd"

parseMyTest :: String -> Either ParseError [String]
parseMyTest test = parse cols "test" test

btos :: ByteString -> String
btos = read . show

Solution

  • I created a combinator skipTill p end which applies p until end matches and then returns what end returns.

    By contrast, manyTill p end applies p until end matches and then returns the list of results produced by p, discarding the result of end.

    import Text.Parsec
    import Text.Parsec.String
    
    skipTill :: (Stream s m t) => ParsecT s u m a -> ParsecT s u m end -> ParsecT s u m end
    skipTill p end = scan
        where
          scan  = end  <|> do { p; scan }
    
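    -- Stand-in for the real tags: matches "(...)" and returns the inner text.
    td :: Parser String
    td = do
      string "("
      manyTill anyChar (try (string ")"))
    
    -- Collect every td, skipping the noise in front of each one.
    tds :: Parser [String]
    tds = do r <- many (try (skipTill anyChar (try td)))
             many anyChar -- discard whatever follows the last td
             return r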
    
    test1 = parse tds "" "111(abc)222(def)333" -- Right ["abc", "def"]
    
    test2 = parse tds "" "111"                 -- Right []
    
    test3 = parse tds "" "111(abc"             -- Right []
    
    test4 = parse tds "" "111(abc)222(de"      -- Right ["abc"]
    

    Update

    This also appears to work:

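    tds' :: Parser [String]
    tds' = scan
      where scan = (eof >> return [])                                 -- end of input: stop
                   <|> do { r <- try td; rs <- scan; return (r:rs) }  -- a td starts here: keep it
                   <|> do { anyChar; scan }                           -- otherwise skip one character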
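
Applied to the tags from the question, the second version might look like the sketch below. It reuses colOP and colCL from the question; the names cell, cells and parseCells are made up for this example. The try around cell matters because string can fail after it has already consumed part of a tag (for instance the "<" of "<html>"), and without backtracking the anyChar alternative would never be tried.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Opening and closing tags, as in the question.
colOP :: Parser String
colOP = string "<td class=\"liste\">"

colCL :: Parser String
colCL = string "</td>"

-- One cell: the opening tag, then every character up to the closing tag.
-- (Hypothetical name, not part of the original code.)
cell :: Parser String
cell = colOP *> manyTill anyChar (try colCL)

-- Same shape as tds': stop at eof, keep a cell, or skip one character.
cells :: Parser [String]
cells = scan
  where
    scan =  (eof >> return [])
        <|> ((:) <$> try cell <*> scan)
        <|> (anyChar >> scan)

parseCells :: String -> Either ParseError [String]
parseCells = parse cells "cells"
```

With this, trailing text after the last tag is skipped one character at a time until eof, so parseCells on test3 from the question should return the four "Hello world" strings instead of an error. A cell that is opened but never closed is skipped as noise, the same way test3 and test4 above behave.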