html haskell arrows hxt

Records from <tr>s in an HTML table using Arrows and HXT in Haskell


I'm looking to extract records from a very well-formed HTML table using HXT. I've reviewed a couple of examples on SO and the HXT documentation, such as:

My problem is:

I want to identify a table uniquely by a known id and then, for each tr within that table, create a record and return the results as a list of records.

Here's my HTML

<!DOCTYPE html>
<html>
<head>
  <title>FakeHTML</title>
</head>
<body>
  <table id="fakeout-dont-get-me">
    <thead><tr><td>Null</td></tr></thead>
    <tbody><tr><td>Junk!</td></tr></tbody>
  </table>
  <table id="Greatest-Table">
    <thead>
      <tr><td>Name</td><td>Favorite Rock</td></tr>
    </thead>
    <tbody>
      <tr id="rock1">
        <td>Fred</td>
        <td>Igneous</td>
      </tr>
      <tr id="rock2">
        <td>Bill</td>
        <td>Sedimentary</td>
      </tr>
    </tbody>
  </table>
</body>
</html>

Here's the code I'm trying, along with 2 different approaches to parsing this. First, imports ...

{-# LANGUAGE Arrows, OverloadedStrings, DeriveDataTypeable, FlexibleContexts  #-}
import Text.XML.HXT.Core
import Text.HandsomeSoup
import Text.XML.HXT.XPath.XPathEval
import Data.Tree.NTree.TypeDefs
import Text.XML.HXT.XPath.Arrows

What I want is a list of Rockrecs, e.g. built from...

recs = [("rock1", "Name", "Fred", "Favorite Rock", "Igneous"),
        ("rock2", "Name", "Bill", "Favorite Rock", "Sedimentary")]

data Rockrec = Rockrec { rockID   :: String,
                         rockName :: String,
                         rockFav  :: String } deriving Show

rocks = [Rockrec a b c | (a, _, b, _, c) <- recs]
-- [Rockrec {rockID = "rock1", rockName = "Fred", rockFav = "Igneous"},
--  Rockrec {rockID = "rock2", rockName = "Bill", rockFav = "Sedimentary"}]

Here's my first way, which uses a bind with runLA after I get back an [XmlTree]. That is, I do a first parse just to get the right table, then I process the rows of that tree after the first grab.

Attempt 1

getTab = do
  dt <- Prelude.readFile "fake.html"
  let html = parseHtml dt
  tab <- runX $ html //> hasAttrValue "id" (== "Greatest-Table")
  return tab
  -- hmm, now this gets tricky...

-- table <- getTab

node tag = multi (hasName tag)

-- a la https://stackoverflow.com/questions/3901492/running-haskell-hxt-outside-of-io?rq=1
getIt  :: ArrowXml cat => cat (Data.Tree.NTree.TypeDefs.NTree XNode) (String, String)
getIt = (node "tr" >>>
         (getAttrValue "id" &&& (node "td" //> getText)))

This kinda works. I need to massage a bit, but can get it to run...

-- table >>= runLA getIt
-- [("","Name"),("","Favorite Rock"),("rock1","Fred"),("rock1","Igneous"),("rock2","Bill"),("rock2","Sedimentary")]
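The massaging mentioned above could look roughly like this. It is only a minimal sketch using base, assuming the pairs arrive in document order, that header rows carry an empty id, and that every data row has exactly two cells. The names toRecs and pairs are made up for the illustration.

```haskell
import Data.Function (on)
import Data.List (groupBy)

data Rockrec = Rockrec { rockID   :: String,
                         rockName :: String,
                         rockFav  :: String } deriving (Show, Eq)

-- Flat (row id, cell text) pairs, copied from the output of
-- `table >>= runLA getIt` shown above.
pairs :: [(String, String)]
pairs = [ ("", "Name"), ("", "Favorite Rock")
        , ("rock1", "Fred"), ("rock1", "Igneous")
        , ("rock2", "Bill"), ("rock2", "Sedimentary") ]

-- Group consecutive pairs that share a row id, skip rows without an id
-- (the header), and keep only rows that have exactly two cells.
toRecs :: [(String, String)] -> [Rockrec]
toRecs ps =
  [ Rockrec rid name fav
  | grp@((rid, _) : _) <- groupBy ((==) `on` fst) ps
  , rid /= ""
  , [name, fav] <- [map snd grp]
  ]

main :: IO ()
main = mapM_ print (toRecs pairs)
```

Rows with a different number of cells are silently dropped here, which may or may not be what is wanted.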

This is a second approach, inspired by https://wiki.haskell.org/HXT/Practical/Simple1. Here, I think I'm relying on {-# LANGUAGE Arrows #-} (which incidentally breaks my list comprehension over rec above, presumably because the extension reserves rec as a keyword) to use proc notation and do this in a more readable do block. That said, I can't even get a minimal version of this to compile:

Attempt 2

 getR :: ArrowXml cat => cat XmlTree Rockrec
 getR = (hasAttrValue "id" (== "Greatest-Table")) >>>
   proc x -> do
     rockId <- getText -< x
     rockName <- getText -< x
     rockFav <- getText -< x
     returnA -< Rockrec rockId rockName rockFav

EDIT

Trouble with the types, in response to the comment below from Alec

λ> getR [table]

<interactive>:56:1-12: error:
    • Couldn't match type ‘NTree XNode’ with ‘[[XmlTree]]’
      Expected type: [[XmlTree]] -> Rockrec
        Actual type: XmlTree -> Rockrec
    • The function ‘getR’ is applied to one argument,
      its type is ‘cat0 XmlTree Rockrec’,
      it is specialized to ‘XmlTree -> Rockrec’
      In the expression: getR [table]
      In an equation for ‘it’: it = getR [table]
λ> getR table

<interactive>:57:1-10: error:
    • Couldn't match type ‘NTree XNode’ with ‘[XmlTree]’
      Expected type: [XmlTree] -> Rockrec
        Actual type: XmlTree -> Rockrec
    • The function ‘getR’ is applied to one argument,
      its type is ‘cat0 XmlTree Rockrec’,
      it is specialized to ‘XmlTree -> Rockrec’
      In the expression: getR table
      In an equation for ‘it’: it = getR table

END EDIT
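As far as I can tell, the errors above come from applying getR as if it were a plain function, whereas Attempt 1 ran its arrow with runLA. A minimal sketch of that pattern follows, assuming the hxt package is installed; cellTexts is a made-up stand-in for getR, and xread is used only to obtain a small [XmlTree] without IO.

```haskell
import Text.XML.HXT.Core

-- Stand-in for getR: a pure arrow from a tree to the text of each <td>.
cellTexts :: ArrowXml cat => cat XmlTree String
cellTexts = deep (isElem >>> hasName "td") >>> getChildren >>> getText

main :: IO ()
main = do
  -- xread parses a well-formed fragment and yields a list of trees,
  -- standing in for the [XmlTree] that getTab returns.
  let table = runLA xread "<tr id=\"rock1\"><td>Fred</td><td>Igneous</td></tr>"
  -- runLA turns the arrow into a function XmlTree -> [String];
  -- (>>=) on lists then applies it to every tree and concatenates.
  print (table >>= runLA cellTexts)
```

By the same reasoning, `table >>= runLA getR` should have type [Rockrec], provided getR itself compiles.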

Even if I'm not selecting elements, I can't get the above to run. I'm also a little puzzled about how to put the first td in rockName and the second td in rockFav, and how to iterate over these (supposing I have a lot of td fields instead of just 2).

Any further general tips on how to do this more painlessly would be appreciated.


Solution

  • From HXT/Practical/Google1 I think I am able to piece together a solution.

    {-# LANGUAGE Arrows #-}
    {-# LANGUAGE ScopedTypeVariables #-}
    module Hanzo where
    import Text.HandsomeSoup
    import Text.XML.HXT.Core
    
    atTag tag =
      deep (isElem >>> hasName tag)
    text =
      deep isText >>> getText
    
    data Rock = Rock String String String deriving Show    
    rocks =
      atTag "tbody" //> atTag "tr"
      >>> proc x -> do
            rowID <- x >- getAttrValue "id"
            name <- x >- atTag "td" >. (!! 0) >>> text
            kind <- x >- atTag "td" >. (!! 1) >>> text
            returnA -< Rock rowID name kind
    
    main = do
      dt <- readFile "html.html"
      result <- runX $ parseHtml dt
                       //> hasAttrValue "id" (== "Greatest-Table")
                       >>> rocks
      print result
    
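    The fixed (!! 0) and (!! 1) indexing works for two columns. For rows with many cells, one option might be to collect every td of a row with listA. The following is only a sketch under that assumption: rowCells and sample are made-up names, and xread stands in for parseHtml so that the example needs nothing beyond hxt.

    ```haskell
    {-# LANGUAGE Arrows #-}
    import Text.XML.HXT.Core

    -- Hypothetical helper: one (row id, cell texts) pair per <tr>.
    rowCells :: ArrowXml a => a XmlTree (String, [String])
    rowCells =
      deep (isElem >>> hasName "tr")
      >>> proc row -> do
            rowID <- getAttrValue "id" -< row
            -- listA gathers all results of the inner arrow into one list,
            -- so the number of <td> children no longer matters.
            cells <- listA (getChildren >>> isElem >>> hasName "td"
                            >>> deep getText) -< row
            returnA -< (rowID, cells)

    -- A cut-down, whitespace-free copy of the tbody from the question.
    sample :: String
    sample =
      "<tbody><tr id=\"rock1\"><td>Fred</td><td>Igneous</td></tr>"
      ++ "<tr id=\"rock2\"><td>Bill</td><td>Sedimentary</td></tr></tbody>"

    main :: IO ()
    main = mapM_ print (runLA (xread >>> rowCells) sample)
    ```

    A cell containing several text nodes would contribute several strings to the list, so real pages may need an extra step to join or trim them.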

    The key takeaways are these: