I am fairly new to Haskell and am just starting to learn how to use attoparsec to parse large chunks of English text from a .txt file. I know how to count the words in a .txt file without attoparsec, but I'm stuck doing it with attoparsec. When I run my code below on, say,
"Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n"
I only get back:
Done " World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n" (Prose {word = "Hello"})
This is my current code:
{-# LANGUAGE OverloadedStrings #-}
import Control.Exception (catch, SomeException)
import System.Environment (getArgs)
import Data.Attoparsec.Text
import qualified Data.Text.IO as Txt
import Data.Char
import Control.Applicative ((<*>), (*>), (<$>), (<|>), pure)
{-
This is how I would usually get the length of the list of words in a .txt file normally.
countWords :: String -> Int
countWords input = sum $ map (length.words) (lines input)
-}
data Prose = Prose {
  word :: String
} deriving Show

prose :: Parser Prose
prose = do
  word <- many' letter
  return $ Prose word

main :: IO ()
main = do
  input <- Txt.readFile "small.txt"
  print $ parse prose input
Also, how can I get the word count as an integer later on? And are there any suggestions on how to get started with attoparsec?
You have a pretty good start already - you can parse a word.
What you need next is a Parser [Prose], which can be expressed by combining your prose parser with another one that consumes the "not prose" parts, using sepBy or sepBy1, which you can look up in the Data.Attoparsec.Text documentation.
From there, the easiest way to get the word count would be to simply take the length of the [Prose] you obtain.
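To see how sepBy behaves in isolation before applying it to words, here is a minimal sketch on a simpler input: comma-separated integers. The names numbers and countNumbers are made up for this illustration and are not part of attoparsec; decimal, char, sepBy and parseOnly are exported by Data.Attoparsec.Text.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (Parser, char, decimal, parseOnly, sepBy)
import Data.Text (Text)

-- Hypothetical helper: zero or more integers separated by commas,
-- e.g. "1,22,333".
numbers :: Parser [Int]
numbers = decimal `sepBy` char ','

-- Hypothetical helper: count the parsed items by taking the length
-- of the resulting list.
countNumbers :: Text -> Either String Int
countNumbers = fmap length . parseOnly numbers
```

Loaded in GHCi, countNumbers "1,22,333" should evaluate to Right 3. The same shape carries over to words: swap decimal for a word parser and char ',' for a parser that consumes whatever sits between words.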
EDIT:
Here is a minimal working example. The parser runner has been swapped from parse to parseOnly, which ignores residual input, so a trailing non-word won't trip up the parser.
{-# LANGUAGE OverloadedStrings #-}
module Atto where
--import qualified Data.Text.IO as Txt
import Data.Attoparsec.Text
import Control.Applicative ((*>), (<$>), (<|>), pure)
import qualified Data.Text as T
data Prose = Prose {
  word :: String
} deriving Show

optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())

-- Modified to disallow empty words, switched to applicative style
prose :: Parser Prose
prose = Prose <$> many1' letter

separator :: Parser ()
separator = many1 (space <|> satisfy (inClass ",.'")) >> pure ()

wordParser :: String -> [Prose]
wordParser str = case parseOnly wp (T.pack str) of
    Left err -> error err
    Right x  -> x
  where
    wp = optional separator *> prose `sepBy1` separator

main :: IO ()
main = do
  let input = "Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n"
  let words = wordParser input
  print words
  print $ length words
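On the switch from parse to parseOnly mentioned in the EDIT note: the sketch below contrasts the two runners on a single-word parser. The names wordP, describeParse and parseWord are invented for this illustration; parse, parseOnly and the IResult constructors come from Data.Attoparsec.Text.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (IResult (..), Parser, letter, many1, parse, parseOnly)
import Data.Text (Text)

-- Hypothetical single-word parser: one or more letters.
wordP :: Parser String
wordP = many1 letter

-- parse is incremental: it returns an IResult, which is Done (together
-- with the unconsumed input), Partial (waiting for more input) or Fail.
describeParse :: Text -> String
describeParse t = case parse wordP t of
  Done rest w -> "Done, parsed " ++ show w ++ ", leftover " ++ show rest
  Partial _   -> "Partial: the parser is waiting for more input"
  Fail _ _ e  -> "Fail: " ++ e

-- parseOnly cannot be resupplied with input, so it never yields Partial;
-- any residual input is discarded and the result is a plain Either.
parseWord :: Text -> Either String String
parseWord = parseOnly wordP
```

In GHCi, describeParse "Hello" should report Partial, because many1 letter cannot tell whether more letters are still to come, while parseWord "Hello" gives Right "Hello". That difference is why parseOnly is the more convenient runner when the whole input is already in memory.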
The wordParser above does not give exactly the same result as concatMap words . lines, since it also breaks words on the characters . , and '. Modifying this behaviour is left as a simple exercise.
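One possible take on that exercise, offered as a sketch rather than the definitive solution: treat every maximal run of non-whitespace characters as a word, which is the rule that words follows. The names token, tokens and countWords are chosen for this illustration; takeWhile1, skipSpace, many' and parseOnly are attoparsec functions.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (Parser, many', parseOnly, skipSpace, takeWhile1)
import Data.Char (isSpace)
import Data.Text (Text)

-- Hypothetical token parser: a maximal run of non-whitespace characters,
-- so punctuation stays attached to the word ("World," or "I'm").
token :: Parser Text
token = takeWhile1 (not . isSpace)

-- Skip leading whitespace, then read tokens, each followed by
-- optional whitespace.
tokens :: Parser [Text]
tokens = skipSpace *> many' (token <* skipSpace)

-- Hypothetical helper: the word count is the length of the token list.
countWords :: Text -> Either String Int
countWords = fmap length . parseOnly tokens
```

On the sample input "Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n" this should yield Right 9, since "I'm" and "Mr.Robot." each count as a single word here.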
Hope it helps! :)