haskell conduit

How to regex per line with Conduit


Based on the provided example, we can get the length of each line:

import Conduit
import Data.Text (Text, pack)
import Text.Regex.TDFA ((=~), getAllTextMatches)
import Control.Monad.IO.Class (liftIO)

wc :: IO ()
wc = runResourceT
       $ runConduit
       $ sourceFile "input.txt"
       .| decodeUtf8C
       .| peekForeverE (lineC lengthCE >>= liftIO . print)

However, how would I get all the regex matches on each line, and then write them to a file?

regex :: IO ()
regex = runResourceT
      $ runConduit
      $ sourceFile "input.txt"
      .| decodeUtf8C
      .| do
         line <- mapCE (\l -> getAllTextMatches (l =~ "^foo") :: [Text])
         liftIO $ print $ line

Update:

I figured out there's a built-in lines function, but is there a way to print a line and pass it along without consuming it?

grep :: IO ()
grep = runResourceT
    $ runConduit
    $ yield "foo\ndoo"
    .| decodeUtf8C
    .| Data.Conduit.Text.lines
    .| mapC (\a -> a =~ ("[fd]oo" :: Text))
    .| mapM_C (liftIO . (print :: Text -> IO ()))
    .| encodeUtf8C
    .| stdoutC

The above does print each line, but stdoutC never receives any input:

ghci> grep
"foo"
"doo"

Update 2: I figured out how to print in the middle of a pipeline:

grep :: IO ()
grep = runResourceT
    $ runConduit
    $ yieldMany ["foo\ndoo", "\nduh"]
    .| decodeUtf8C
    .| Data.Conduit.Text.lines
    .| mapC (\a -> a =~ ("[fd]oo" :: Text) :: Text)
    .| log1
    .| unlinesC
    .| encodeUtf8C
    .| stdoutC

But why does the order of await matter?

log1 :: ConduitT Text Text (ResourceT IO) ()
log1 = do
       Just l <- await -- <- has to be first
       liftIO $ print l
       yield l

Solution

  • It's not entirely clear from your question what you're trying to do, but if you want to copy all matching lines from "input.txt" to "output.txt", much like the grep command-line utility, then you probably want a conduit that looks something like this:

    sourceFile "input.txt" .| decodeUtf8C .| linesUnboundedC
      .| filterC (=~ ("[fd]oo" :: Text))
      .| unlinesC .| encodeUtf8C .| sinkFile "output.txt"
    

    Note that linesUnboundedC is a function in the "conduit" package that's equivalent to the deprecated lines function from "conduit-extra". Also, filterC is probably more natural here than your mapC: it drops the non-matching lines, whereas mapC with a Text result turns every non-matching line into an empty match.

    Operating on the text file:

    A famous linguist once said
    that of all the phrases in the English language,
    of all the endless combinations of words in all of history, that
    "cellar door"
    is the most beautiful.
    That's some food for thought.
    

    this conduit will copy the two matching lines to the output:

    "cellar door"
    That's some food for thought.
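
    On the filterC versus mapC point, a small pure sketch may help. The sample lines are made up for illustration, and it assumes a regex-tdfa version with Text support (which the code in the question already relies on):

    ```haskell
    {-# LANGUAGE OverloadedStrings #-}

    import Conduit
    import Data.Text (Text)
    import Text.Regex.TDFA ((=~))

    -- Made-up sample lines, for illustration only.
    sampleLines :: [Text]
    sampleLines = ["foo", "bar", "doo"]

    -- mapC keeps one output per input; with a Text result, (=~) returns
    -- the matched text, or an empty Text when the line does not match.
    mapped :: [Text]
    mapped = runConduitPure $
      yieldMany sampleLines
        .| mapC (\l -> l =~ ("[fd]oo" :: Text) :: Text)
        .| sinkList

    -- filterC uses the Bool result of (=~) and drops non-matching lines.
    filtered :: [Text]
    filtered = runConduitPure $
      yieldMany sampleLines
        .| filterC (=~ ("[fd]oo" :: Text))
        .| sinkList
    ```

    Here mapped should come out as ["foo", "", "doo"], while filtered is ["foo", "doo"].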
    

    If you want to write the matching lines to both standard output and output.txt simultaneously, the conduit-friendly method is probably to end your conduit with a sequenceSinks component. (The void call here is needed to get the return type right.)

    import Control.Monad (void)
    
    ... .| void (sequenceSinks [stdoutC, sinkFile "output.txt"])
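
    To see why void is needed, here is a minimal pure sketch with two unrelated sinks (sumC and lengthC, chosen only for illustration). sequenceSinks hands every input to each sink and returns the list of their results, so with stdoutC and sinkFile you would get a [()] rather than the () that the rest of the pipeline expects:

    ```haskell
    import Conduit

    -- Each sink sees the whole stream; the results are collected in a list.
    sumAndCount :: [Int]
    sumAndCount = runConduitPure $
      yieldMany [1, 2, 3, 4 :: Int] .| sequenceSinks [sumC, lengthC]
    ```

    Here sumAndCount should be [10, 4]: the sum of the stream followed by its length.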
    

    If you prefer a log conduit that you can insert in the middle to write a copy to stdout, then the following ought to work:

    log1 :: (MonadIO m) => ConduitT Text Text m ()
    log1 = passthroughSink (unlinesC .| encodeUtf8C .| stdoutC) pure
    

    or, if you're okay with having the Haskell quoted representations printed (i.e., surrounded by quotation marks with character escaping), then:

    log2 :: (MonadIO m, Show a) => ConduitT a a m ()
    log2 = passthroughSink printC pure
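
    As for the order of await in the question's log1: the await has to come first because it is what produces l, and there is nothing to print or yield before that. Note also that log1 awaits only once, so it forwards a single line and then stops. If you'd rather write the logger by hand, a sketch along these lines (logEach is a hypothetical name, not part of conduit) handles the whole stream:

    ```haskell
    import Conduit

    -- awaitForever awaits each upstream value and runs the body on it,
    -- repeating until the stream ends.
    logEach :: (MonadIO m, Show a) => ConduitT a a m ()
    logEach = awaitForever $ \x -> do
      liftIO (print x) -- log the value that was just awaited
      yield x          -- pass the same value downstream

    -- Collect what comes out the other side of logEach.
    passedThrough :: IO [Int]
    passedThrough =
      runConduit $ yieldMany [1, 2, 3 :: Int] .| logEach .| sinkList
    ```

    Running passedThrough prints each number and returns the unchanged list. The built-in iterMC (liftIO . print) should behave the same way.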
    

    Some code to play around with:

    {-# LANGUAGE OverloadedStrings #-}
    
    import Conduit
    import Control.Monad (void)
    import Data.ByteString (ByteString)
    import Data.Text (Text)
    import Text.Regex.TDFA
    
    c1, c2, c3, c4 :: ConduitT () Void (ResourceT IO) ()
    
    -- copy matching lines from input.txt to output.txt
    c1 = sourceFile "input.txt" .| decodeUtf8C .| linesUnboundedC
          .| filterC (=~ ("[fd]oo" :: Text))
          .| unlinesC .| encodeUtf8C .| sinkFile "output.txt" 
    
    -- copy final stream to both output.txt and stdout
    c2 = sourceFile "input.txt" .| decodeUtf8C .| linesUnboundedC
          .| filterC (=~ ("[fd]oo" :: Text))
          .| unlinesC .| encodeUtf8C
          .| void (sequenceSinks [stdoutC, sinkFile "output.txt"])
    
    -- log Text to stdout in the middle of a conduit
    c3 = sourceFile "input.txt" .| decodeUtf8C .| linesUnboundedC
          .| filterC (=~ ("[fd]oo" :: Text)) .| log1
          .| unlinesC .| encodeUtf8C .| sinkFile "output.txt"
      where log1 :: (MonadIO m) => ConduitT Text Text m ()
            log1 = passthroughSink (unlinesC .| encodeUtf8C .| stdoutC) pure
    
    -- log Haskell representations of stream in middle of a conduit
    c4 = sourceFile "input.txt" .| decodeUtf8C .| linesUnboundedC
          .| filterC (=~ ("[fd]oo" :: Text)) .| log2
          .| unlinesC .| encodeUtf8C .| sinkFile "output.txt"
      where log2 :: (MonadIO m, Show a) => ConduitT a a m ()
            log2 = passthroughSink printC pure
    
    main :: IO ()
    main = runResourceT $ runConduit $ c4  -- pick your conduit here
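
    If what you actually want is every match on each line rather than the matching lines themselves, one possible approach is concatMapC together with getAllTextMatches. This is only a sketch: allMatches is a hypothetical helper and the sample lines are made up.

    ```haskell
    {-# LANGUAGE OverloadedStrings #-}

    import Conduit
    import Data.Text (Text)
    import Text.Regex.TDFA ((=~), getAllTextMatches)

    -- Hypothetical helper: emits every match found on each incoming line
    -- as its own stream element; lines without a match emit nothing.
    allMatches :: Monad m => Text -> ConduitT Text Text m ()
    allMatches pat = concatMapC (\l -> getAllTextMatches (l =~ pat) :: [Text])

    demo :: [Text]
    demo = runConduitPure $
      yieldMany ["cellar door", "no match", "food and a fool" :: Text]
        .| allMatches "[fd]oo"
        .| sinkList
    ```

    Here demo should be ["doo", "foo", "foo"]. To write the matches to a file, allMatches "[fd]oo" could take the place of the filterC stage in c1.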