Using Attoparsec, I am trying to match strings containing exactly 1 'x', 1 'y', and 1 'z', and any number of 'a', 'b', or 'c', without any constraint on the order of each char.
For instance, "abbbzacyaaaxcba" and "abbbzacxaaaycba" should be a match but the following should not :
The best I've been able to do so far is this:
import qualified Data.Attoparsec.ByteString.Char8 as A8
import qualified Data.ByteString.Char8 as B8 (pack)
p ch = do
abcs <- A8.many' (A8.choice [A8.char 'a', A8.char 'b', A8.char 'c'])
x <- A8.char ch
return $ concat [[x],abcs]
parse = do
xyz1 <- A8.choice [p 'x', p 'y', p 'z']
xyz2 <- A8.choice [p 'x', p 'y', p 'z']
xyz3 <- A8.choice [p 'x', p 'y', p 'z']
final <- A8.manyTill (A8.choice [A8.char 'a', A8.char 'b', A8.char 'c']) $ A8.char '\n'
return (xyz1, xyz2, xyz3, final)
(arbitrarily, I chose to stop the parsing with '\n', but that's just to pick a simple example).
then trying in ghci:
Prelude> A8.parseTest parse $ B8.pack "abbbzacyaaaxcba\n"
Done "" ("zabbb","yac","xaaa", "cba")
Prelude> A8.parseTest parse $ B8.pack "abbbzacyaaacba\n"
Fail "aaacba\n" [] "Failed reading: empty"
Prelude> A8.parseTest parse $ B8.pack "abbbzacyaaaxcbxa\n"
Fail "xa\n" [] "Failed reading: empty"
But it looks very clunky, and it does not scale easily to a list of unique characters (eg I am given a givenchars :: [Char] list without duplicates and I want to match any string made of all the givenchars and of any 'a','b','c's in between, in any order).
Is there a better, more elegant, and scalable way to do this?
PS: I am not looking for a regex solution as it would not apply to my real-life problem. I need to use a parser.
There are a couple of issues with your code:
Firstly, it doesn't enforce exactly one x
, y
and z
in the string. For instance, it will accept xxx\n
Secondly, it's very inefficient. Consider how xyz1
is parsed when given the string aaaaaaaaaz
:
We first try p 'x'
. This starts by parsing all of the a
characters and then finds a z
. Since z
is not x
, the entire p 'x'
parser fails.
Then we try p 'y'
. This re-parses all of the a
characters and then finds the z
again. Since z
is not y
the entire p 'y'
parser fails.
On the third try we succeed, only after re-parsing all of the a
characters for the third time.
It would be better to write something like:
parse = do
s <- A8.takeWhile (\x -> elem x "abcxyz")
let xs = count 'x' s
ys = count 'y' s
zs = count 'z' s
guard $ xs == 1 && ys == 1 && zs == 1
return s
The function count
is from Data.ByteString.Char8
and guard
is from Control.Monad