I am using Parsec library to parse a string. The problem I have is that can't differentiate some tokens because they are words with the same prefix. Simplifying the whole grammar (in my case it's not a regular one), say we have the following:
T0 := <empty> | tag0 T1 T0
T1 := tag1 | tag1 T1
So I may have strings like "tag0tag1tag1" or like "tag0tag1tag0tag1" etc., basically we have a "tag0" string followed by an arbitrary (not zero) number of "tag1" string, and all this could also be repeated any number of times.
So what I tried was something like:
wrongParser :: Parser String
wrongParser = do
string "tag0"
many $ string "tag1"
return "Ok"
And tested with
ghci> parse wrongParser "Error" "tag0tag1tag1tag0tag1"
Left "Error" (line 1, column 13):
unexpected "0"
expecting "tag1"
So what seems to happen here is that the parser read "tag" from "tag0", but it is expecting "tag1" instead (because is still reading many
of "tag1").
Is there a way to make the parser to take the tag string as a whole so that instead of failing it just assumes that all the many tag1 are already read and stop with no error (maybe another function than string
)? Or what is the correct way to handle this case?
This is a common misconception with Parsec
:
Parsec
)Parsec
is a parser which comsumes less output than expectedTherefore, your parser works like this
string to consume: "tag0tag1tag1tag0tag1"
string consumed: ""
-- *********************
-- Parser: string "tag0"
-- *********************
-- 1st token
string to consume: "ag0tag1tag1tag0tag1"
string consumed: "t"
-- 2nd token
string to consume: "g0tag1tag1tag0tag1"
string consumed: "ta"
-- 3th token
string to consume: "0tag1tag1tag0tag1"
string consumed: "tag"
-- 4th token
string to consume: "tag1tag1tag0tag1"
string consumed: "tag0"
-- ***********************
-- done with string "tag0"
-- ***********************
-- ****************************
-- Parser: many $ string "tag1"
-- ****************************
-- convince yourself you'll reach this point
string to consume: "tag0tag1"
string consumed: "tag0tag1tag1"
-- keep in mind we are executing (string "tag1")
-- because you put it inside many.
-- So you end up with
string to consume: "0tag1"
string consumed: "tag0tag1tag1tag"
-- after this, you consume a 0 and fail.
string to consume: "tag1"
string consumed: "tag0tag1tag1tag" 0 ???
Probably you want to say, "if you fail reading tag1, just pretend you haven't read it at all". This is called backtracking and you use the try
combinator to make it happen
rightParser :: Parser String
rightParser = do
string "tag0"
many1 $ try (string "tag1") -- many1 because tag0 must be followed by at least one tag1. try because you want to backtrack tag1 if failure.
return "Ok"
Now if you want to parse the whole string you use many rightParser
. For the sake of example, let's say you want to return all "tag1"s. Then
rightParser :: Parser [String]
rightParser = do
string "tag0"
many1 $ try (string "tag1") -- same as before, by returns a list of tag1 instead of "Ok"
wholeParser = many rightParser
main = print $ parse wholeParser "Error" "tag0tag1tag1tag0tag1"
> Right [["tag1","tag1"],["tag1"]]
Notice, that other libraries (ex: Attoparsec) do always backtrack on failure. This is a design decision made by each library. Also, backtracking can be an expensive operation, so you may want to write your parser differently (example: always parse "tag" and backtrack only on "0" or "1")