haskell attoparsec

Parsing the first occurrence of a word that is not preceded by whitespace


Setting

I need to find the first occurrence of a word in some .txt file that is not preceded by white space. Here are the possible cases:

-- * should succeed
t1 = "hello\t999\nworld\t\900"
t2 = "world\t\900\nhello\t999\n"
t3 = "world world\t\900\nhello\t999\n"

-- * should fail
t4 = "world\t\900\nhello world\t999\n"
t5 = "hello world\t999\nworld\t\900"
t6 = "world hello\t999\nworld\t\900"

Right now t6 is succeeding even though it should fail, because my parser will consume any character until it reaches hello. Here is my parser:

My Solution

{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative

import Data.Attoparsec.Text.Lazy
import Data.Attoparsec.Combinator
import Data.Text hiding (foldr)
import qualified Data.Text.Lazy as L (Text, pack)



-- * should succeed
t1 = L.pack "hello\t999\nworld\t\900"
t2 = L.pack "world\t\900\nhello\t999\n"
t3 = L.pack "world world\t\900\nhello\t999\n"

-- * should fail
t4 = L.pack "world\t\900\nhello world\t999\n"
t5 = L.pack "hello world\t999\nworld\t\900"
t6 = L.pack "world hello\t999\nworld\t\900"

p = occur "hello"    

-- * discard all text until word `w` occurs, and find its only field `n`
occur :: String -> Parser (String, Int)
occur w = do
    pUntil w
    string . pack $ w
    string "\t"
    n <- natural 
    string "\n"
    return (w, read n)


-- * Parse a natural number
natural :: Parser String
natural = many1' digit

-- * skip over all words in Text stream until the word we want
pUntil :: String -> Parser String 
pUntil = manyTill anyChar . lookAhead . string . pack 
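
To make the failure mode concrete, here is a minimal sketch of what a parser shaped like pUntil skips on the t6 input from the Setting list. It assumes strict Text and parseOnly rather than the lazy setup above, and the names skipUntil and skippedBeforeHello are made up for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text (Parser, anyChar, lookAhead, manyTill, parseOnly, string)
import Data.Text (Text)

-- Same shape as pUntil: skip characters until `w` is next in the input.
-- lookAhead checks for `w` without consuming it.
skipUntil :: Text -> Parser String
skipUntil w = manyTill anyChar (lookAhead (string w))

-- Hypothetical helper: the characters skipped before the first "hello".
skippedBeforeHello :: Text -> Either String String
skippedBeforeHello = parseOnly (skipUntil "hello")

-- skippedBeforeHello "world hello\t999\nworld\t\900" == Right "world "
```

The skipped prefix ends in a space, so the parser stops at a "hello" in the middle of a line. Nothing in it looks at the character before the word, which is why that input is accepted.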

Solution

  • Here's an approach to consider:

    {-# LANGUAGE OverloadedStrings #-}
    
    import Control.Applicative
    
    import Data.Attoparsec.Text.Lazy
    import Data.Attoparsec.Combinator
    import Data.Text hiding (foldr)
    import qualified Data.Text.Lazy as L (Text, pack)
    import Data.Monoid
    
    natural = many1' digit
    
    -- manyTill anyChar (try $ char c <* eof)
    
    pair0 w = do
      string (w <> "\t")
      n <- natural
      string "\n"
      return n
    
    pair1 w = do
      manyTill anyChar (try $ string ("\n" <> w <> "\t"))
      n <- natural
      string "\n"
      return n
    
    pair w = pair0 w <|> pair1 w
    
    t1 = "hello\t999\nworld\t\900"
    t2 = "world\t\900\nhello\t999\n"
    t3 = "world world\t\900\nhello\t999\n"
    
    -- * should fail
    t4 = "world\t\900\nhello world\t999\n"
    t5 = "hello world\t999\nworld\t\900"
    t6 = "world hello\t999\nworld\t\900"
    
    test t = parseTest (pair "hello") (L.pack t)
    
    main = do
      test t1; test t2; test t3
      test t4; test t5; test t6
    

    The idea is that pair0 matches a pair with the given key at the beginning of the input and pair1 matches the pair just after a newline.

    The key is the use of manyTill anyChar (try p) which will keep skipping characters until the parser p succeeds.

    (btw - I learned about this use of manyTill and try by reading an answer written by @Cactus.)
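
As a rough check of the pair0 / pair1 split described above, the sketch below runs each branch separately on three of the sample inputs. It assumes strict Text and parseOnly instead of the lazy parseTest setup, and matches is a hypothetical helper, not part of the answer's code.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text
  (Parser, anyChar, digit, many1', manyTill, parseOnly, string, try)
import Data.Either (isRight)
import Data.Text (Text)

natural :: Parser String
natural = many1' digit

-- Key at the very start of the input.
pair0 :: Text -> Parser String
pair0 w = string (w <> "\t") *> natural <* string "\n"

-- Key immediately after a newline: skip characters until "\n<key>\t" matches.
pair1 :: Text -> Parser String
pair1 w =
  manyTill anyChar (try (string ("\n" <> w <> "\t"))) *> natural <* string "\n"

-- Hypothetical helper: which of the two branches accepts the input?
matches :: Text -> (Bool, Bool)
matches t = (ok (pair0 "hello"), ok (pair1 "hello"))
  where
    ok p = isRight (parseOnly p t)

-- matches "hello\t999\nworld\t\900"        == (True, False)
-- matches "world\t\900\nhello\t999\n"      == (False, True)
-- matches "world hello\t999\nworld\t\900"  == (False, False)
```

Because the end parser includes the leading "\n", a "hello" preceded by a space can never satisfy it. In attoparsec, try exists for Parsec compatibility; attoparsec parsers already backtrack on failure, so it is harmless here.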