haskellattoparsec

Parser: Enforcing unicity when order is not known


Using Attoparsec, I am trying to match strings containing exactly 1 'x', 1 'y', and 1 'z', and any number of 'a', 'b', or 'c', without any constraint on the order of each char.

For instance, "abbbzacyaaaxcba" and "abbbzacxaaaycba" should be a match but the following should not :

The best I've been able to do so far is this:

import qualified Data.Attoparsec.ByteString.Char8 as A8
import qualified Data.ByteString.Char8 as B8 (pack)

p ch = do
    abcs <- A8.many' (A8.choice [A8.char 'a', A8.char 'b', A8.char 'c'])
    x    <- A8.char ch
    return $ concat [[x],abcs]

parse = do
    xyz1 <- A8.choice [p 'x', p 'y', p 'z']
    xyz2 <- A8.choice [p 'x', p 'y', p 'z']
    xyz3 <- A8.choice [p 'x', p 'y', p 'z']
    final <- A8.manyTill (A8.choice [A8.char 'a', A8.char 'b', A8.char 'c']) $ A8.char '\n'
    return (xyz1, xyz2, xyz3, final)

(arbitrarily, I chose to stop the parsing with '\n', but that's just to pick a simple example).

then trying in ghci:

Prelude> A8.parseTest parse $ B8.pack "abbbzacyaaaxcba\n"
Done "" ("zabbb","yac","xaaa", "cba")
Prelude> A8.parseTest parse $ B8.pack "abbbzacyaaacba\n"
Fail "aaacba\n" [] "Failed reading: empty"
Prelude> A8.parseTest parse $ B8.pack "abbbzacyaaaxcbxa\n"
Fail "xa\n" [] "Failed reading: empty"

But it looks very clunky, and it does not scale easily to a list of unique characters (eg I am given a givenchars :: [Char] list without duplicates and I want to match any string made of all the givenchars and of any 'a','b','c's in between, in any order).

Is there a better, more elegant, and scalable way to do this?

PS: I am not looking for a regex solution as it would not apply to my real-life problem. I need to use a parser.


Solution

  • There are a couple of issues with your code:

    Firstly, it doesn't enforce exactly one x, y and z in the string. For instance, it will accept xxx\n

    Secondly, it's very inefficient. Consider how xyz1 is parsed when given the string aaaaaaaaaz:

    1. We first try p 'x'. This starts by parsing all of the a characters and then finds a z. Since z is not x, the entire p 'x' parser fails.

    2. Then we try p 'y'. This re-parses all of the a characters and then finds the z again. Since z is not y the entire p 'y' parser fails.

    3. On the third try we succeed, only after re-parsing all of the a characters for the third time.

    It would be better to write something like:

    parse = do
       s <- A8.takeWhile (\x -> elem x "abcxyz")
       let xs = count 'x' s
           ys = count 'y' s
           zs = count 'z' s
       guard $ xs == 1 && ys == 1 && zs == 1
       return s
    

    The function count is from Data.ByteString.Char8 and guard is from Control.Monad