I'm writing a parser to parse huge chunks of English text using attoparsec. Everything has been great so far, except for parsing this char "――". I know it is just 2 dashes together "--". The weird thing is, the parser catches it in this code:
wordSeparator :: Parser ()
wordSeparator = many1 (space <|> satisfy (inClass "――?!,:")) >> pure ()
but not in this case:
specialChars = ['――', '?', '!', ',', ':']
wordSeparator :: Parser ()
wordSeparator = many1 (space <|> satisfy (inClass specialChars)) >> pure ()
The reason I'm using the list specialChars is that I have a lot of characters to consider and I apply it in multiple cases. For the input, consider "I am ――Walt Whitman._"; the output is supposed to be {"I", "am", "Walt", "Whitman."}
I believe it's mostly because "――" is not a Char? How do I fix this?
A Char is one character, full stop. ―― is two characters, so it is two Chars. You can fit as many Chars as you want into a String, but you certainly cannot fit two Chars into one Char.
Since satisfy considers individual characters at a time, it probably isn’t what you want if you need to parse a sequence of two characters as a single unit. The inClass function just produces a predicate on characters (inClass partially applied to one argument produces a function of type Char -> Bool), so inClass "――" is the same as inClass ['―', '―'], which is just the same as inClass ['―'] since duplicates are irrelevant. That won’t help you much.
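To make the point about inClass concrete, here is a small sketch, assuming the Text flavour of attoparsec (Data.Attoparsec.Text). The name isBar is made up for illustration.

```haskell
import Data.Attoparsec.Text (inClass)

-- inClass turns a String into a predicate on single characters.
-- isBar is a hypothetical name used only for this illustration.
isBar :: Char -> Bool
isBar = inClass "――"   -- builds the same one-element set as inClass "―"

main :: IO ()
main = do
  print (length "――")     -- 2: the literal holds two Chars
  print (isBar '―')       -- True: each bar is tested on its own
  print (map isBar "―?")  -- [True,False]: the predicate only ever sees one Char
```

Because the predicate only ever receives one Char, there is no way for it to insist that two bars appear next to each other.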
Consider using string instead of or in combination with inClass, since it is designed to handle sequences of characters. For example, something like this might better suit your needs:
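import Data.Functor (void)
-- void gives every alternative the type Parser (): space and satisfy yield a Char, string yields the matched text
wordSeparator :: Parser ()
wordSeparator = many1 (void space <|> void (string "――") <|> void (satisfy (inClass "?!,:"))) >> pure ()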
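Since the question mentions keeping the separators in a reusable list, one possible arrangement is to keep two lists: multi-character separators matched with string, and single-character separators matched with inClass. The following is only a sketch under some assumptions: it uses Data.Attoparsec.Text with OverloadedStrings, the names multiCharSeparators, singleCharSeparators, word and splitWords are invented for illustration, and the sample input omits the trailing underscore from the question.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<|>))
import Data.Attoparsec.Text
  (Parser, choice, inClass, notInClass, option, parseOnly, satisfy,
   sepBy, skipMany1, space, string, takeWhile1)
import Data.Char (isSpace)
import Data.Functor (void)
import Data.Text (Text)

-- Separators made of several characters are Text values (hypothetical list).
multiCharSeparators :: [Text]
multiCharSeparators = ["――"]

-- Separators made of exactly one character form a class (hypothetical list).
singleCharSeparators :: String
singleCharSeparators = "?!,:"

-- One or more separators in a row; every alternative is a Parser ().
wordSeparator :: Parser ()
wordSeparator = skipMany1
  (   void space
  <|> choice (map (void . string) multiCharSeparators)
  <|> void (satisfy (inClass singleCharSeparators)) )

-- A word runs until whitespace, a bar, or a single-character separator.
word :: Parser Text
word = takeWhile1 (\c -> not (isSpace c) && c /= '―' && notInClass singleCharSeparators c)

-- Skip any leading separators, then collect the words between separators.
splitWords :: Text -> Either String [Text]
splitWords = parseOnly (option () wordSeparator *> (word `sepBy` wordSeparator))

main :: IO ()
main = print (splitWords "I am ――Walt Whitman.")
-- Right ["I","am","Walt","Whitman."]
```

A lone ― ends the current word here but is not accepted as a separator, so parsing stops at that point; whether that is the right behaviour depends on your text.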