haskellhtml-tablehxt

Group html table rows with HXT in Haskell


I want to process a (very poorly defined) html, which has the information grouped in pairs of rows, like this:

<html>
<body>
<table>
 <tr>
     <td>
         <font >
         <a href="a">ABC</a></font>
     </td>
 </tr>
 <tr>
     <td height="50">
         <font>When:</font><font>19-1-2013</font>
          <b><font>&nbsp; </font></b>
         <font>Where:</font><font>Here</font>
         <font>Who:</font><font>Me</font>
     </td>
 </tr>
 <tr>
     <td>
        <font >
             <a href="b">EFG</a>
        </font>
     </td>
 </tr>
 <tr>
     <td height="50">
         <font>When:</font><font>19-2-2013</font>
         <b><font>&nbsp; </font></b>
         <font>Where:</font><font>There</font>
         <font>Who:</font><font>You</font>
     </td>
 </tr>
 <tr>
     <td>
        <font >
            <a href="c">HIJ</a>
        </font>
     </td>
 </tr>
 <tr>
     <td height="50">
         <font>When:</font><font>19-3-2013</font><b>
         <font>&nbsp; </font></b>
         <font>Where:</font><font>Far away</font>
         <font>Who:</font><font>Him</font>
     </td>
 </tr>
</table>
</body>
</html>

To this, after several iterations, I arrived at this code to achieve what I want:

import Data.List
import Control.Arrow.ArrowNavigatableTree
import Text.XML.HXT.Core
import Text.HandsomeSoup

group2 [] = []
group2 (x0:x1:xs) = [x0,x1]:(group2 xs)

countRows html = html >>> deep (hasName "tr") >. length

parsePage sz html = let
  n x = deep (hasName "tr") >. (( -> a !! x) . group2 ) >>> unlistA
  m = deep (hasName "td") >>> css "a" /> getText
  o = deep (hasName "td") >>> hasAttr "height" >>> (css "font" >. (take 1 . drop 4)) >>> unlistA /> getText
  p x = (((n x) >>> m) &&& ((n x) >>> o))
  in html >>> catA [p x | x <- [0..sz]]

main = do
    dt <- readFile "test.html"
    let html = parseHtml dt
    count <- (runX . countRows) html
    let cnt = ((head count) `div` 2) - 1
    prcssd <- (runX . (parsePage cnt)) html
    print prcssd

And the result is: [("ABC","Here"),("EFG","There"),("HIJ","Far away")]

However, I don't think this is a very good aproach, having to count the rows first. Is there a better way of doing this grouping using HXT? I've tried the &&& operator with little luck.

The question at extract multiples html tables with hxt, while useful, presents a simpler situation, I believe.


Solution

  • Here's a somewhat simpler implementation.

    import Text.XML.HXT.Core
    import Text.HandsomeSoup
    
    group2 :: [a] -> [(a, a)]
    group2 [] = []
    group2 (x0:x1:xs) = (x0, x1) : group2 xs
    
    parsePage :: ArrowXml a => a XmlTree (String, String)
    parsePage = let
        trPairs    = deep (hasName "tr") >>. group2
        insideLink = deep (hasName "a") /> getText
        insideFont = deep (hasName "font") >>. (take 1 . drop 4) /> getText
    
        in trPairs >>> (insideLink *** insideFont)
    
    
    main = do
        dt <- readFile "test.html"
        let html = parseHtml dt
        prcssd <- runX $ html >>> parsePage
        print prcssd
    

    The >>. operator can be used instead of >. so that you don't need to call unlistA afterwards.

    I changed the group2 function to return a list of pairs, because it maps better with what we are trying to achieve and it's easier to work with.

    The type of trPairs is

    trPairs :: ArrowXml a => a XmlNode (XmlNode, XmlNode)
    

    i.e. it's an arrow that takes in nodes and outputs a pair of nodes (i.e. the paired up <tr> nodes). Now we can use the *** operator from Control.Arrow to apply a transformation to either element of the pair, insideLink for the first one and insideFont for the second one. This way we can collect and group everything we need with a single traversal of the HTML tree.