haskell, web-scraping, monads

How do I make a do block return early?


I'm trying to scrape a webpage using Haskell and compile the results into an object.

If, for whatever reason, I can't get all the items from the pages, I want to stop trying to process the page and return early.

For example:

scrapePage :: String -> IO ()
scrapePage url = do
  doc <- fromUrl url
  title <- liftM headMay $ runX $ doc >>> css "head.title" >>> getText
  when (isNothing title) (return ())
  date <- liftM headMay $ runX $ doc >>> css "span.dateTime" ! "data-utc"
  when (isNothing date) (return ())
  -- etc
  -- make page object and send it to db
  return ()

The problem is that when doesn't stop the do block or prevent the remaining actions from being executed.

What is the right way to do this?


Solution

  • return in Haskell does not do the same thing as return in other languages. Instead, return injects a value into a monad (in this case IO); it has no effect on control flow, so the rest of the do block still runs. You have a couple of options.
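
    As a quick illustration of why the return-based attempt can't work (a minimal standalone sketch, not part of the original answer):

    demo :: IO ()
    demo = do
      putStrLn "before"
      return ()          -- just produces (), nothing is skipped
      putStrLn "after"   -- this still runs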

    The simplest option is to use if:

    scrapePage :: String -> IO ()
    scrapePage url = do
      doc <- fromUrl url
      title <- liftM headMay $ runX $ doc >>> css "head.title" >>> getText
      if isNothing title then return () else do
        date <- liftM headMay $ runX $ doc >>> css "span.dateTime" ! "data-utc"
        if isNothing date then return () else do
          -- etc
          -- make page object and send it to db
          return ()
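
    A closely related variant uses case instead of isNothing checks, which also gives you the unwrapped values to use when building the page object (a sketch in the same untested style as the rest of this answer):

    scrapePage :: String -> IO ()
    scrapePage url = do
      doc    <- fromUrl url
      mTitle <- liftM headMay $ runX $ doc >>> css "head.title" >>> getText
      case mTitle of
        Nothing    -> return ()
        Just title -> do
          mDate <- liftM headMay $ runX $ doc >>> css "span.dateTime" ! "data-utc"
          case mDate of
            Nothing   -> return ()
            Just date -> do
              -- build the page object from title and date, send it to db
              return ()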
    

    Another option is to use unless:

    scrapePage :: String -> IO ()
    scrapePage url = do
      doc <- fromUrl url
      title <- liftM headMay $ runX $ doc >>> css "head.title" >>> getText
      unless (isNothing title) $ do
        date <- liftM headMay $ runX $ doc >>> css "span.dateTime" ! "data-utc"
        unless (isNothing date) $ do
          -- etc
          -- make page object and send it to db
          return ()
    

    The general problem here is that the IO monad doesn't have control effects (other than exceptions). You can, however, use the Maybe monad transformer, MaybeT:

    scrapePage :: String -> IO ()
    scrapePage url = liftM (maybe () id) . runMaybeT $ do
      doc <- liftIO $ fromUrl url
      title <- liftIO $ liftM headMay $ runX $ doc >>> css "head.title" >>> getText
      guard (isJust title)
      date <- liftIO $ liftM headMay $ runX $ doc >>> css "span.dateTime" ! "data-utc"
      guard (isJust date)
      -- etc
      -- make page object and send it to db
      return ()
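
    The MaybeT constructor can also wrap each lookup directly, so a missing element aborts the whole block without an explicit guard (same untested caveat, same combinators as above):

    scrapePage :: String -> IO ()
    scrapePage url = liftM (maybe () id) . runMaybeT $ do
      doc   <- liftIO $ fromUrl url
      title <- MaybeT $ liftM headMay $ runX $ doc >>> css "head.title" >>> getText
      date  <- MaybeT $ liftM headMay $ runX $ doc >>> css "span.dateTime" ! "data-utc"
      -- build the page object from title and date, send it to db
      return ()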
    

    If you really want full-blown control effects, you can use ContT:

    scrapePage :: String -> IO ()
    scrapePage url = evalContT $ callCC $ \earlyReturn -> do
      doc <- liftIO $ fromUrl url
      title <- liftIO $ liftM headMay $ runX $ doc >>> css "head.title" >>> getText
      when (isNothing title) $ earlyReturn ()
      date <- liftIO $ liftM headMay $ runX $ doc >>> css "span.dateTime" ! "data-utc"
      when (isNothing date) $ earlyReturn ()
      -- etc
      -- make page object and send it to db
      return ()
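
    For completeness, the snippets above roughly assume imports along these lines (from base, safe, and the transformers package; the scraping combinators fromUrl, runX, css, getText and ! come from whatever HXT/HandsomeSoup-style setup the question is already using):

    import Control.Monad             (liftM, when, unless, guard)
    import Control.Monad.IO.Class    (liftIO)
    import Control.Monad.Trans.Maybe (MaybeT (..), runMaybeT)
    import Control.Monad.Trans.Cont  (evalContT, callCC)
    import Data.Maybe                (isNothing, isJust)
    import Safe                      (headMay)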
    

    WARNING: none of the above code has been tested, or even type checked!