I'm writing a small parser that extracts the content of every <td> tag with a specific class. For example, <td class="liste">some content</td> should give Right "some content".
I will be parsing a large HTML file, but I don't care about all the noise. The idea was to consume all characters until I reach <td class="liste">, then consume all characters (the content) until </td>, and return the content string.
This works fine if the last element in the file is my td.liste tag. But if some text follows it before eof, my parser consumes that text and throws unexpected end of input, as you can see by executing parseMyTest test3.
See the end of test3 to understand the edge case.
Here is my code so far:
import Text.Parsec
import Text.Parsec.String
import Data.ByteString.Lazy (ByteString)
import Data.ByteString.Char8 (pack)
colOP :: Parser String
colOP = string "<td class=\"liste\">"
colCL :: Parser String
colCL = string "</td>"
col :: Parser String
col = do
    manyTill anyChar (try colOP)
    content <- manyTill anyChar $ try colCL
    return content
cols :: Parser [String]
cols = many col
test1 :: String
test1 = "<td class=\"liste\">Hello world!</td>"
test2 :: String
test2 = read $ show $ pack test1
test3 :: String
test3 = "\n\r<html>asdfasd\n\r<td class=\"liste\">Hello world 1!</td>\n<td class=\"liste\">Hello world 2!</td>\n\rasldjfasldjf<td class=\"liste\">Hello world 3!</td><td class=\"liste\">Hello world 4!</td>adsafasd"
parseMyTest :: String -> Either ParseError [String]
parseMyTest test = parse cols "test" test
btos :: ByteString -> String
btos = read . show
I created a combinator skipTill p end which applies p until end matches and then returns what end returns.
By contrast, manyTill p end applies p until end matches and then returns what the p parsers matched.
import Text.Parsec
import Text.Parsec.String
skipTill :: (Stream s m t) => ParsecT s u m a -> ParsecT s u m end -> ParsecT s u m end
skipTill p end = scan
  where
    scan = end <|> do { p; scan }
td :: Parser String
td = do
    string "("
    manyTill anyChar (try (string ")"))
tds = do r <- many (try (skipTill anyChar (try td)))
         many anyChar -- discard stuff at end
         return r
test1 = parse tds "" "111(abc)222(def)333" -- Right ["abc", "def"]
test2 = parse tds "" "111" -- Right []
test3 = parse tds "" "111(abc" -- Right []
test4 = parse tds "" "111(abc)222(de" -- Right ["abc"]
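The tests above use "(" and ")" as stand-ins for the real tags. As a rough sketch of how the same idea might carry over to the <td class="liste"> case from the question, the following swaps in the actual tag strings. The names liste, listes, sample and result are made up for this illustration, and the sample input is invented, not taken from the question's test3.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- skipTill as defined above: apply p until end matches, return end's result.
skipTill :: Stream s m t => ParsecT s u m a -> ParsecT s u m end -> ParsecT s u m end
skipTill p end = scan
  where
    scan = end <|> (p >> scan)

-- One <td class="liste">...</td> cell; returns the text between the tags.
-- (hypothetical name, not from the original code)
liste :: Parser String
liste = do
  _ <- string "<td class=\"liste\">"
  manyTill anyChar (try (string "</td>"))

-- All such cells in the input; whatever follows the last cell is discarded.
listes :: Parser [String]
listes = do
  r <- many (try (skipTill anyChar (try liste)))
  _ <- many anyChar
  return r

-- Invented sample input with trailing text after the last cell.
sample :: String
sample = "<html>x<td class=\"liste\">Hello 1</td>\n<td class=\"liste\">Hello 2</td>trailing"

result :: Either ParseError [String]
result = parse listes "" sample -- Right ["Hello 1","Hello 2"]
```

The outer try matters here: when skipTill runs into the end of input without finding another cell, it has already consumed characters, and try lets many stop cleanly instead of failing.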
Update
This also appears to work:
tds' :: Parser [String]
tds' = scan
  where scan = (eof >> return [])
               <|> do { r <- try td; rs <- scan; return (r:rs) }
               <|> do { anyChar; scan }
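One way to check the "appears to work" claim is to run this variant on the same four inputs used for tds. The sketch below repeats the toy td parser so it compiles on its own; inputs and results are names introduced only for this check.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Same toy cell parser as above: "(" content ")".
td :: Parser String
td = do
  _ <- string "("
  manyTill anyChar (try (string ")"))

-- At each position: stop at eof, else collect a cell, else skip one character.
tds' :: Parser [String]
tds' = scan
  where
    scan =   (eof >> return [])
         <|> do { r <- try td; rs <- scan; return (r : rs) }
         <|> do { _ <- anyChar; scan }

-- The four inputs used for tds above (hypothetical helper names).
inputs :: [String]
inputs = ["111(abc)222(def)333", "111", "111(abc", "111(abc)222(de"]

results :: [Either ParseError [String]]
results = map (parse tds' "") inputs
-- [Right ["abc","def"], Right [], Right [], Right ["abc"]]
```

On these inputs the results match the ones listed for tds. Because scan always has either eof or anyChar available, it needs no separate step to discard the text at the end.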