haskell attoparsec

Error parsing a char (――) in Haskell


I'm writing a parser for huge chunks of English text using attoparsec. Everything has been great so far, except for parsing the sequence "――". I know it is just two dashes together. The weird thing is that the parser catches it in this code:

wordSeparator :: Parser ()
wordSeparator = many1 (space <|> satisfy (inClass "――?!,:")) >> pure () 

but not in this case:

specialChars = ['――', '?', '!', ',', ':']
wordSeparator :: Parser ()
wordSeparator = many1 (space <|> satisfy (inClass specialChars)) >> pure ()

The reason I'm using the list specialChars is that I have a lot of characters to consider and I apply it in multiple cases. For example, given the input "I am ――Walt Whitman._", the output is supposed to be {"I", "am", "Walt", "Whitman."}. I believe it's mostly because "――" is not a Char? How do I fix this?


Solution

  • A Char is one character, full stop. ―― is two characters, so it is two Chars. You can fit as many Chars as you want into a String, but you certainly cannot fit two Chars into one Char.

    Since satisfy considers one character at a time, it probably isn’t what you want if you need to parse a sequence of two characters as a single unit. The inClass function just produces a predicate on characters (inClass partially applied to one argument produces a function of type Char -> Bool), so inClass "――" is the same as inClass ['―', '―'], which is just the same as inClass ['―'] since duplicates are irrelevant. That is also why your first version appears to work: many1 repeats the single-character parser, so each ― is matched separately. The second version fails because '――' is not a valid Char literal to begin with.

    Consider using string instead of or in combination with inClass, since it is designed to handle sequences of characters. For example, something like this might better suit your needs:

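    -- space and satisfy return a Char, string returns Text; void gives every branch the type Parser ()
    import Data.Functor (void)
    wordSeparator :: Parser ()
    wordSeparator = many1 (void space <|> void (string "――") <|> void (satisfy (inClass "?!,:"))) >> pure ()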
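
To see how such a separator might fit into a complete word splitter, here is a minimal sketch. It assumes the Text flavour of attoparsec (Data.Attoparsec.Text) with OverloadedStrings enabled. The names word, isWordChar, skipSeps and splitWords are hypothetical helpers introduced only for this illustration, and specialChars here holds just the single-character separators. Excluding a lone ― from words is also an assumption of the sketch, not something the question specifies.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<|>))
import Data.Attoparsec.Text
  ( Parser, endOfInput, inClass, option, parseOnly
  , satisfy, sepBy, skipMany1, space, string, takeWhile1 )
import Data.Char (isSpace)
import Data.Functor (void)
import Data.Text (Text)

-- Single-character separators, kept in one reusable list as in the question.
specialChars :: [Char]
specialChars = "?!,:"

-- One or more separators: whitespace, the two-character "――", or a special char.
-- void discards each result so that all three branches have type Parser ().
wordSeparator :: Parser ()
wordSeparator =
  skipMany1 (void space <|> void (string "――") <|> void (satisfy (inClass specialChars)))

-- Assumption: a word is any run of characters that cannot start a separator.
-- A lone '―' is therefore neither a word nor a separator in this sketch.
isWordChar :: Char -> Bool
isWordChar c = not (isSpace c || inClass specialChars c || c == '―')

-- Hypothetical helper: a non-empty run of word characters.
word :: Parser Text
word = takeWhile1 isWordChar

-- Hypothetical helper: skip separators if there are any, succeed otherwise.
skipSeps :: Parser ()
skipSeps = option () wordSeparator

-- Split the whole input into words, allowing leading and trailing separators.
splitWords :: Text -> Either String [Text]
splitWords = parseOnly (skipSeps *> (word `sepBy` wordSeparator) <* skipSeps <* endOfInput)
```

With these definitions, splitWords "I am ――Walt Whitman." evaluates to Right ["I","am","Walt","Whitman."]. The period stays attached to the last word because it is not listed in specialChars.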