parsing haskell attoparsec

Recursively return all words from .txt file using attoparsec


I am fairly new to Haskell and am just starting to learn how to use attoparsec to parse large chunks of English text from a .txt file. I know how to count the words in a .txt file without attoparsec, but I'm stuck doing it with attoparsec. When I run my code below on, say,

"Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n"

I only get back:

Done " World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n" (Prose {word = "Hello"})

This is my current code:

{-# LANGUAGE OverloadedStrings #-}
import Control.Exception (catch, SomeException)
import System.Environment (getArgs)
import Data.Attoparsec.Text
import qualified Data.Text.IO as Txt
import Data.Char
import Control.Applicative ((<*>), (*>), (<$>), (<|>), pure)

{-
This is how I would usually get the length of the list of words in a .txt file normally.

countWords :: String -> Int
countWords input = sum $ map (length.words) (lines input)

-}

data Prose = Prose {
  word :: String
} deriving Show

prose :: Parser Prose
prose = do
  word <- many' $ letter
  return $ Prose word

main :: IO()
main = do
  input <- Txt.readFile "small.txt"
  print $ parse prose input

Also, how can I get the word count as an integer later on? And do you have any suggestions on how to get started with attoparsec?


Solution

  • You have a pretty good start already: you can parse a single word.
    What you need next is a Parser [Prose]. You can build it by combining your prose parser with a second parser that consumes the "not prose" parts, using sepBy or sepBy1, both of which are documented in Data.Attoparsec.Text.

    From there, the easiest way to get the word count is to take the length of the resulting [Prose].

    EDIT:

    Here is a minimal working example. The parser is now run with parseOnly instead of parse. parseOnly treats the input as complete and silently discards any residual input, so a trailing non-word won't trip up the parser.

    {-# LANGUAGE OverloadedStrings #-}
    
    module Atto where
    
    --import qualified Data.Text.IO as Txt
    import Data.Attoparsec.Text
    import Control.Applicative ((*>), (<$>), (<|>), pure)
    
    import qualified Data.Text as T
    
    data Prose = Prose {
      word :: String
    } deriving Show
    
    -- Run p if it matches and discard its result; otherwise consume nothing
    optional :: Parser a -> Parser ()
    optional p = option () (try p *> pure ())
    
    -- Modified to disallow empty words, switched to applicative style
    prose :: Parser Prose
    prose = Prose <$> many1' letter
    
    -- One or more characters that are whitespace or one of , . '
    separator :: Parser ()
    separator = many1 (space <|> satisfy (inClass ",.'")) >> pure ()
    
    wordParser :: String -> [Prose]
    wordParser str = case parseOnly wp (T.pack str) of
        Left err -> error err
        Right x -> x
        where
            wp = optional separator *> prose `sepBy1` separator
    
    main :: IO ()
    main = do
      let input = "Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n"
      let ws = wordParser input  -- named ws to avoid shadowing Prelude.words
      print ws
      print $ length ws
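
Since the question reads its input from a file, here is a minimal sketch of the same idea working on Text directly and returning the count. It is an illustration rather than a drop-in replacement: the names isSep, wordsP and countWords are made up for this sketch, the separator set is the same , . ' plus whitespace as above, and a parse failure is reported as a Left instead of calling error.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text (Parser, inClass, parseOnly, sepBy, skipWhile, takeWhile1)
import Data.Char (isAlpha, isSpace)
import Data.Text (Text)
import qualified Data.Text.IO as Txt
import System.Environment (getArgs)

-- Hypothetical helper: whitespace or one of , . ' counts as a separator.
isSep :: Char -> Bool
isSep c = isSpace c || inClass ",.'" c

-- Skip leading separators, then read runs of letters split by runs of separators.
-- Anything that is neither a letter nor a separator (a digit, say) ends the
-- parse, and parseOnly silently drops whatever input is left.
wordsP :: Parser [Text]
wordsP = skipWhile isSep *> (takeWhile1 isAlpha `sepBy` takeWhile1 isSep)

-- Number of words, or the attoparsec error message.
countWords :: Text -> Either String Int
countWords = fmap length . parseOnly wordsP

main :: IO ()
main = do
  args <- getArgs
  -- Read the file named on the command line, or fall back to the sample text.
  input <- case args of
    (path : _) -> Txt.readFile path
    []         -> pure "Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n"
  either (putStrLn . ("parse error: " ++)) print (countWords input)
```

Run without arguments, this prints 11 for the sample text. If leftover input should count as an error rather than be dropped, one option is to end wordsP with skipWhile isSep followed by endOfInput.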
    

    Note that this parser does not give exactly the same result as concatMap words . lines, since it also splits words on the characters . , and ' (for example, "I'm" becomes "I" and "m"). Changing this behaviour is left as a simple exercise.
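
One possible take on that exercise, as a sketch only: treat every maximal run of non-whitespace characters as a word, so punctuation stays attached to it. The names chunk, chunks and wordsOnWhitespace are invented for this illustration; skipSpace, takeWhile1 and many' come from Data.Attoparsec.Text.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text (Parser, many', parseOnly, skipSpace, takeWhile1)
import Data.Char (isSpace)
import Data.Text (Text)

-- A word is any non-empty run of non-whitespace characters,
-- so "World," and "Mr.Robot." are each kept as a single word.
chunk :: Parser Text
chunk = takeWhile1 (not . isSpace)

-- Skip leading whitespace, then read words, skipping the whitespace after each one.
chunks :: Parser [Text]
chunks = skipSpace *> many' (chunk <* skipSpace)

wordsOnWhitespace :: Text -> Either String [Text]
wordsOnWhitespace = parseOnly chunks

main :: IO ()
main = do
  let input = "Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n" :: Text
  case wordsOnWhitespace input of
    Left err -> putStrLn ("parse error: " ++ err)
    Right ws -> do
      print ws
      print (length ws)  -- prints 9 for this input
```

For plain whitespace splitting like this, Data.Text.words gives the same list without a parser; the attoparsec version mainly pays off once the notion of a word gets more specific.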

    Hope it helps! :)